LlamaIndex + Keeptrusts: Governed Document AI
LlamaIndex's OpenAI integration accepts custom API base URLs, so routing through the Keeptrusts gateway requires minimal configuration. This guide covers client setup, query engines, and governance patterns for document AI pipelines.
Use this page when
- You are configuring LlamaIndex to route LLM and embedding calls through the Keeptrusts gateway.
- You need to set up a governed query engine or document AI pipeline with LlamaIndex.
- You want to apply output quality policies or response evaluation through the gateway.
- You are handling policy blocks in LlamaIndex query engine workflows.
Primary audience
- Primary: Python developers building document AI or RAG systems with LlamaIndex
- Secondary: AI Engineers evaluating governance for indexing pipelines, MLOps Engineers testing document Q&A quality
Basic Client Configuration
from llama_index.llms.openai import OpenAI as LlamaOpenAI
llm = LlamaOpenAI(
    model="gpt-4o",
    api_base="http://localhost:41002/v1",
    api_key="sk-...",
    temperature=0,
)
response = llm.complete("What is AI governance?")
print(response.text)
Chat Interface
from llama_index.core.llms import ChatMessage
messages = [
    ChatMessage(role="system", content="You are a compliance analyst."),
    ChatMessage(role="user", content="Summarize GDPR data retention rules."),
]
response = llm.chat(messages)
print(response.message.content)
Embeddings Through the Gateway
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_base="http://localhost:41002/v1",
    api_key="sk-...",
)
embedding = embed_model.get_text_embedding("AI governance compliance document")
print(f"Embedding dimension: {len(embedding)}")
Both LLM completions and embedding calls pass through the gateway's policy chain.
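Gateway-routed embeddings behave like any other embedding vectors, so you can sanity-check them directly, for example by comparing two texts. A minimal sketch (the cosine_similarity helper is our own, not part of LlamaIndex):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# For example, compare two governed embedding calls:
# sim = cosine_similarity(
#     embed_model.get_text_embedding("data retention policy"),
#     embed_model.get_text_embedding("how long records are kept"),
# )
```

Related texts should score noticeably higher than unrelated ones; this is a quick way to confirm the gateway is returning usable vectors.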
Governed Query Engine
Build a standard LlamaIndex query engine with all LLM calls routed through the governed gateway:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI as LlamaOpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Configure global settings to route through the gateway
Settings.llm = LlamaOpenAI(
    model="gpt-4o",
    api_base="http://localhost:41002/v1",
    api_key="sk-...",
    temperature=0,
)
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_base="http://localhost:41002/v1",
    api_key="sk-...",
)
# Load and index documents
documents = SimpleDirectoryReader("./compliance-docs").load_data()
index = VectorStoreIndex.from_documents(documents)
# Query with governance enforcement
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What are the data retention requirements for healthcare?")
print(response)
Every LLM call the query engine makes — from synthesizing answers to refining results — passes through the gateway.
Streaming Query Responses
query_engine = index.as_query_engine(streaming=True, similarity_top_k=3)
streaming_response = query_engine.query("Summarize the incident response procedures.")
for text in streaming_response.response_gen:
    print(text, end="", flush=True)
print()
The gateway evaluates output policies on the assembled stream. Policy violations terminate the stream cleanly.
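When a blocking policy fires mid-stream, the client-side iteration raises rather than yielding a partial answer, so it is worth draining the generator defensively. A hedged sketch (drain_stream and the fallback text are our own names, not LlamaIndex API; in practice the raised exception is openai.APIStatusError with status 409):

```python
from typing import Iterable

def drain_stream(response_gen: Iterable[str],
                 fallback: str = "[Response blocked by policy]") -> str:
    """Accumulate streamed text; if the stream aborts mid-flight
    (e.g. on a gateway policy block), return a fallback message
    instead of surfacing a partial answer."""
    chunks = []
    try:
        for text in response_gen:
            chunks.append(text)
    except Exception:
        # A policy block terminates the stream with an error.
        return fallback
    return "".join(chunks)
```

Usage: `print(drain_stream(streaming_response.response_gen))`.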
Sub-Question Query Engine
Complex queries that decompose into sub-questions generate multiple LLM calls — each one governed:
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
# Create specialized query engines
hr_engine = hr_index.as_query_engine()
finance_engine = finance_index.as_query_engine()
query_engine_tools = [
    QueryEngineTool(
        query_engine=hr_engine,
        metadata=ToolMetadata(
            name="hr_policies",
            description="HR policy documents including hiring and termination",
        ),
    ),
    QueryEngineTool(
        query_engine=finance_engine,
        metadata=ToolMetadata(
            name="finance_policies",
            description="Financial compliance and audit procedures",
        ),
    ),
]
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
)
response = sub_question_engine.query(
    "Compare HR onboarding requirements with financial audit timelines"
)
print(response)
The gateway sees each sub-question as an independent LLM call and applies the full policy chain to every one.
Handling Policy Blocks
from openai import APIStatusError
def safe_query(query_engine, question: str) -> str:
    try:
        response = query_engine.query(question)
        return str(response)
    except APIStatusError as e:
        if e.status_code == 409:
            error_body = e.response.json()
            policy = error_body.get("error", {}).get("policy", "unknown")
            return f"[Query blocked by policy: {policy}]"
        raise
Retry with Exponential Backoff
import time
from openai import APIStatusError
def resilient_query(query_engine, question: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            response = query_engine.query(question)
            return str(response)
        except APIStatusError as e:
            if e.status_code == 409:
                error_body = e.response.json()
                return f"[Blocked: {error_body.get('error', {}).get('message', 'Policy violation')}]"
            if e.status_code == 429 and attempt < max_retries - 1:
                wait = 2 ** attempt
                time.sleep(wait)
                continue
            raise
    raise RuntimeError("Exceeded max retries")
Quality Scoring with Governance
Combine LlamaIndex evaluation modules with the governed gateway for auditable quality scoring:
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
# Evaluators use the governed LLM (set in Settings.llm)
faithfulness_evaluator = FaithfulnessEvaluator()
relevancy_evaluator = RelevancyEvaluator()
query = "What is the maximum data retention period?"
response = query_engine.query(query)
# Each evaluation call is a governed LLM call
faithfulness_result = faithfulness_evaluator.evaluate_response(response=response)
relevancy_result = relevancy_evaluator.evaluate_response(query=query, response=response)
print(f"Faithfulness: {'Pass' if faithfulness_result.passing else 'Fail'}")
print(f"Relevancy: {'Pass' if relevancy_result.passing else 'Fail'}")
print(f"Faithfulness feedback: {faithfulness_result.feedback}")
The evaluation calls themselves are governed — so even your quality-scoring pipeline is subject to PII filtering and content policies.
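To make the scores auditable downstream, it can help to fold evaluator outcomes into a structured record before logging them. A minimal sketch (the record shape and helper name are our own, not a LlamaIndex or gateway API):

```python
from datetime import datetime, timezone

def build_audit_record(query: str, faithfulness_passing: bool,
                       relevancy_passing: bool) -> dict:
    """Turn evaluator pass/fail outcomes into a structured audit entry."""
    return {
        "query": query,
        "faithfulness": "pass" if faithfulness_passing else "fail",
        "relevancy": "pass" if relevancy_passing else "fail",
        "overall_pass": faithfulness_passing and relevancy_passing,
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
    }
```

Usage: `build_audit_record(query, faithfulness_result.passing, relevancy_result.passing)` produces a dict ready for your audit log.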
Policy Configuration for Document AI
policies:
  - name: block-pii-in-answers
    type: output_filter
    action: block
    pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b"
    message: "Response blocked: contains PII pattern"
  - name: enforce-source-citation
    type: output_quality
    action: warn
    check: must_contain_source_reference
    message: "Warning: response does not cite source documents"
  - name: log-all-queries
    type: observe
    action: log
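Before enabling the blocking policy, it is worth exercising the same pattern locally against sample outputs. A quick sketch using Python's re module with the SSN pattern from the block-pii-in-answers policy above (the contains_ssn helper is our own):

```python
import re

# Same pattern as the block-pii-in-answers policy above (US SSN shape).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def contains_ssn(text: str) -> bool:
    """True if the text would trip the PII output filter."""
    return bool(SSN_PATTERN.search(text))
```

Note the word boundaries: a phone extension like 555-1234 does not match, while a full SSN-shaped string does.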
Best Practices
- Set api_base in Settings globally — all LlamaIndex components inherit the gateway route.
- Route embeddings through the gateway too — get full observability on embedding calls.
- Wrap query engines with error handlers — 409 blocks should return fallback responses, not crash.
- Use evaluation modules through the gateway — quality scoring calls get the same policy enforcement.
- Test with observe-only policies — validate your document pipeline before enabling blocking.
- Monitor token usage via decision events — the gateway logs tokens per call for cost tracking.
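For the observe-only rollout suggested above, the same PII rule can be staged with a non-blocking action first; a sketch assuming the same policy schema as the configuration earlier on this page:

```yaml
policies:
  - name: observe-pii-in-answers
    type: output_filter
    action: log        # record matches without blocking responses
    pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b"
```

Once the logged matches look correct, switch the action to block.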
Next steps
- Function Calling — governed tool use and budget controls
- Streaming Patterns — SSE and chunked transfer deep dive
- Error Handling — complete error envelope reference
For AI systems
- Canonical terms: LlamaIndex, LlamaOpenAI, api_base, OpenAIEmbedding, query engine, VectorStoreIndex, Settings.llm, evaluation module, policy block (409).
- Key config: LlamaOpenAI(api_base="http://localhost:41002/v1"). Embeddings: OpenAIEmbedding(api_base=...).
- Use Settings.llm and Settings.embed_model to route all LlamaIndex components through the gateway globally.
- Best next pages: Function Calling, Streaming Patterns, Error Handling.
For engineers
- Set api_base in Settings globally so all LlamaIndex components (LLM, embeddings, evaluators) inherit the gateway route.
- Both complete() and chat() methods work through the gateway; use ChatMessage for multi-turn document Q&A.
- Route embeddings through the gateway for full observability on indexing and retrieval calls.
- Wrap query engines with error handlers — 409 policy blocks should return fallback responses, not crash the pipeline.
- Use quality-scoring policies (output_quality type) to enforce citation requirements and source references.
- Monitor token usage via decision events — the gateway logs tokens per call for cost tracking.
For leaders
- LlamaIndex is a leading framework for document AI — governed document Q&A requires only a URL change.
- Output quality policies enforce citation and accuracy standards at the gateway level, not in application code.
- All document AI queries produce audit events, satisfying compliance requirements for knowledge-based AI systems.
- Test with observe-only policies before enforcement to avoid disrupting existing document pipelines.