LlamaIndex + Keeptrusts: Governed Document AI

LlamaIndex's OpenAI integration accepts custom API base URLs, so routing through the Keeptrusts gateway requires minimal configuration. This guide covers client setup, query engines, and governance patterns for document AI pipelines.

Use this page when

  • You are configuring LlamaIndex to route LLM and embedding calls through the Keeptrusts gateway.
  • You need to set up a governed query engine or document AI pipeline with LlamaIndex.
  • You want to apply output quality policies or response evaluation through the gateway.
  • You are handling policy blocks in LlamaIndex query engine workflows.

Primary audience

  • Primary: Python developers building document AI or RAG systems with LlamaIndex
  • Secondary: AI Engineers evaluating governance for indexing pipelines, MLOps Engineers testing document Q&A quality

Basic Client Configuration

from llama_index.llms.openai import OpenAI as LlamaOpenAI

llm = LlamaOpenAI(
    model="gpt-4o",
    api_base="http://localhost:41002/v1",
    api_key="sk-...",
    temperature=0,
)

response = llm.complete("What is AI governance?")
print(response.text)

Chat Interface

from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="system", content="You are a compliance analyst."),
    ChatMessage(role="user", content="Summarize GDPR data retention rules."),
]

response = llm.chat(messages)
print(response.message.content)

Embeddings Through the Gateway

from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_base="http://localhost:41002/v1",
    api_key="sk-...",
)

embedding = embed_model.get_text_embedding("AI governance compliance document")
print(f"Embedding dimension: {len(embedding)}")

Both LLM completions and embedding calls pass through the gateway's policy chain.
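Since both clients point at the same gateway, it can help to centralize the connection settings in one place. A minimal sketch, assuming nothing beyond the standard library — the `gateway_kwargs` helper and the `KEEPTRUSTS_GATEWAY` environment variable are illustrative conventions, not part of any SDK:

```python
import os

# Illustrative convention: read the gateway URL from the environment,
# falling back to the local default used throughout this guide.
GATEWAY_BASE = os.environ.get("KEEPTRUSTS_GATEWAY", "http://localhost:41002/v1")

def gateway_kwargs(**overrides) -> dict:
    """Shared constructor kwargs for any OpenAI-compatible LlamaIndex client."""
    kwargs = {
        "api_base": GATEWAY_BASE,
        "api_key": os.environ.get("OPENAI_API_KEY", "sk-..."),
    }
    kwargs.update(overrides)
    return kwargs

# Usage (assumes llama_index is installed):
# llm = LlamaOpenAI(model="gpt-4o", temperature=0, **gateway_kwargs())
# embed_model = OpenAIEmbedding(model="text-embedding-3-small", **gateway_kwargs())
```

With one helper, switching the gateway host between environments is a single environment-variable change rather than an edit in every constructor.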

Governed Query Engine

Build a standard LlamaIndex query engine with all LLM calls routed through the governed gateway:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI as LlamaOpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Configure global settings to route through the gateway
Settings.llm = LlamaOpenAI(
    model="gpt-4o",
    api_base="http://localhost:41002/v1",
    api_key="sk-...",
    temperature=0,
)

Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_base="http://localhost:41002/v1",
    api_key="sk-...",
)

# Load and index documents
documents = SimpleDirectoryReader("./compliance-docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query with governance enforcement
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What are the data retention requirements for healthcare?")
print(response)

Every LLM call the query engine makes — from synthesizing answers to refining results — passes through the gateway.

Streaming Query Responses

query_engine = index.as_query_engine(streaming=True, similarity_top_k=3)
streaming_response = query_engine.query("Summarize the incident response procedures.")

for text in streaming_response.response_gen:
    print(text, end="", flush=True)
print()

The gateway evaluates output policies on the assembled stream. Policy violations terminate the stream cleanly.
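When a policy violation cuts a stream short, the chunks already emitted are still in your hands and worth preserving for logging or a fallback message. A minimal sketch of the consumption pattern, using only the standard library — `PolicyBlockError` and `demo_stream` are stand-ins invented here to illustrate the control flow, not gateway APIs:

```python
class PolicyBlockError(Exception):
    """Stand-in for the error the gateway raises when it terminates a stream."""

def demo_stream():
    """Simulated token stream that gets blocked partway through."""
    yield "The incident response "
    yield "procedure begins with "
    raise PolicyBlockError("output_filter: block-pii-in-answers")

def consume(stream):
    """Collect streamed text; on a policy block, keep the partial output."""
    chunks, blocked_by = [], None
    try:
        for text in stream:
            chunks.append(text)
    except PolicyBlockError as e:
        blocked_by = str(e)
    return "".join(chunks), blocked_by

partial, blocked = consume(demo_stream())
# partial holds the text emitted before the block; blocked names the policy
```

In a real pipeline you would catch the gateway's actual error type (a 409 `APIStatusError`, as shown in "Handling Policy Blocks" below) in the same position.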

Sub-Question Query Engine

Complex queries that decompose into sub-questions generate multiple LLM calls — each one governed:

from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Create specialized query engines
hr_engine = hr_index.as_query_engine()
finance_engine = finance_index.as_query_engine()

query_engine_tools = [
    QueryEngineTool(
        query_engine=hr_engine,
        metadata=ToolMetadata(
            name="hr_policies",
            description="HR policy documents including hiring and termination",
        ),
    ),
    QueryEngineTool(
        query_engine=finance_engine,
        metadata=ToolMetadata(
            name="finance_policies",
            description="Financial compliance and audit procedures",
        ),
    ),
]

sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
)

response = sub_question_engine.query(
    "Compare HR onboarding requirements with financial audit timelines"
)
print(response)

The gateway sees each sub-question as an independent LLM call and applies the full policy chain to every one.

Handling Policy Blocks

from openai import APIStatusError

def safe_query(query_engine, question: str) -> str:
    try:
        response = query_engine.query(question)
        return str(response)
    except APIStatusError as e:
        if e.status_code == 409:
            error_body = e.response.json()
            policy = error_body.get("error", {}).get("policy", "unknown")
            return f"[Query blocked by policy: {policy}]"
        raise

Retry with Exponential Backoff

import time
from openai import APIStatusError

def resilient_query(query_engine, question: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            response = query_engine.query(question)
            return str(response)
        except APIStatusError as e:
            if e.status_code == 409:
                error_body = e.response.json()
                return f"[Blocked: {error_body.get('error', {}).get('message', 'Policy violation')}]"
            if e.status_code == 429 and attempt < max_retries - 1:
                wait = 2 ** attempt
                time.sleep(wait)
                continue
            raise
    raise RuntimeError("Exceeded max retries")

Quality Scoring with Governance

Combine LlamaIndex evaluation modules with the governed gateway for auditable quality scoring:

from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

# Evaluators use the governed LLM (set in Settings.llm)
faithfulness_evaluator = FaithfulnessEvaluator()
relevancy_evaluator = RelevancyEvaluator()

query = "What is the maximum data retention period?"
response = query_engine.query(query)

# Each evaluation call is a governed LLM call
faithfulness_result = faithfulness_evaluator.evaluate_response(response=response)
relevancy_result = relevancy_evaluator.evaluate_response(query=query, response=response)

print(f"Faithfulness: {'Pass' if faithfulness_result.passing else 'Fail'}")
print(f"Relevancy: {'Pass' if relevancy_result.passing else 'Fail'}")
print(f"Faithfulness feedback: {faithfulness_result.feedback}")

The evaluation calls themselves are governed — so even your quality-scoring pipeline is subject to PII filtering and content policies.
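For audit purposes you will usually want per-batch roll-ups rather than individual verdicts. A minimal sketch of aggregating results, assuming only that each result exposes a boolean `passing` attribute as the evaluators above do — the `EvalResult` dataclass is a stand-in for LlamaIndex's evaluation result type:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """Minimal stand-in for an evaluator result with a pass/fail verdict."""
    query: str
    passing: bool

def pass_rate(results: list) -> float:
    """Fraction of evaluated responses that passed; 0.0 for an empty batch."""
    if not results:
        return 0.0
    return sum(r.passing for r in results) / len(results)

batch = [
    EvalResult("retention period?", True),
    EvalResult("audit timeline?", True),
    EvalResult("PII handling?", False),
]
rate = pass_rate(batch)  # 2 of 3 passed
```

Tracking this rate over time gives you a quality trend line to pair with the gateway's per-call audit events.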

Policy Configuration for Document AI

policies:
  - name: block-pii-in-answers
    type: output_filter
    action: block
    pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b"
    message: "Response blocked: contains PII pattern"

  - name: enforce-source-citation
    type: output_quality
    action: warn
    check: must_contain_source_reference
    message: "Warning: response does not cite source documents"

  - name: log-all-queries
    type: observe
    action: log
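The `block-pii-in-answers` pattern above is a plain regular expression (matching the US SSN shape), so you can sanity-check it locally before deploying the policy:

```python
import re

# Same pattern as the block-pii-in-answers policy above
PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

blocked = "The employee's SSN is 123-45-6789."
allowed = "Retention period is 30-60 days per section 4-2."

assert PII_PATTERN.search(blocked) is not None   # would be blocked
assert PII_PATTERN.search(allowed) is None       # passes the filter
```

Checking candidate patterns against representative answers like this helps avoid false positives (dates, section numbers) before the policy starts blocking real traffic.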

Best Practices

  1. Set api_base in Settings globally — all LlamaIndex components inherit the gateway route.
  2. Route embeddings through the gateway too — get full observability on embedding calls.
  3. Wrap query engines with error handlers — 409 blocks should return fallback responses, not crash.
  4. Use evaluation modules through the gateway — quality scoring calls get the same policy enforcement.
  5. Test with observe-only policies — validate your document pipeline before enabling blocking.
  6. Monitor token usage via decision events — the gateway logs tokens per call for cost tracking.

Next steps

For AI systems

  • Canonical terms: LlamaIndex, LlamaOpenAI, api_base, OpenAIEmbedding, query engine, VectorStoreIndex, Settings.llm, evaluation module, policy block (409).
  • Key config: LlamaOpenAI(api_base="http://localhost:41002/v1"). Embeddings: OpenAIEmbedding(api_base=...).
  • Use Settings.llm and Settings.embed_model to route all LlamaIndex components through the gateway globally.
  • Best next pages: Function Calling, Streaming Patterns, Error Handling.

For engineers

  • Set api_base in Settings globally so all LlamaIndex components (LLM, embeddings, evaluators) inherit the gateway route.
  • Both complete() and chat() methods work through the gateway; use ChatMessage for multi-turn document Q&A.
  • Route embeddings through the gateway for full observability on indexing and retrieval calls.
  • Wrap query engines with error handlers — 409 policy blocks should return fallback responses, not crash the pipeline.
  • Use quality-scoring policies (output_quality type) to enforce citation requirements and source references.
  • Monitor token usage via decision events — the gateway logs tokens per call for cost tracking.

For leaders

  • LlamaIndex is a widely used framework for document AI — governed document Q&A requires only a URL change.
  • Output quality policies enforce citation and accuracy standards at the gateway level, not in application code.
  • All document AI queries produce audit events, satisfying compliance requirements for knowledge-based AI systems.
  • Test with observe-only policies before enforcement to avoid disrupting existing document pipelines.