LlamaIndex + Keeptrusts: Governed Document AI
LlamaIndex's OpenAI integration accepts custom API base URLs, so routing through the Keeptrusts gateway requires minimal configuration. This guide covers client setup, query engines, and governance patterns for document AI pipelines.
Use this page when
- You are configuring LlamaIndex to route LLM and embedding calls through the Keeptrusts gateway.
- You need to set up a governed query engine or document AI pipeline with LlamaIndex.
- You want to apply output quality policies or response evaluation through the gateway.
- You are handling policy blocks in LlamaIndex query engine workflows.
Primary audience
- Primary: Python developers building document AI or RAG systems with LlamaIndex
- Secondary: AI Engineers evaluating governance for indexing pipelines, MLOps Engineers testing document Q&A quality
Basic Client Configuration
from llama_index.llms.openai import OpenAI as LlamaOpenAI
llm = LlamaOpenAI(
    model="gpt-4o",
    api_base="http://localhost:41002/v1",
    api_key="sk-...",
    temperature=0,
)
response = llm.complete("What is AI governance?")
print(response.text)
Chat Interface
from llama_index.core.llms import ChatMessage
messages = [
    ChatMessage(role="system", content="You are a compliance analyst."),
    ChatMessage(role="user", content="Summarize GDPR data retention rules."),
]
response = llm.chat(messages)
print(response.message.content)
Embeddings Through the Gateway
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_base="http://localhost:41002/v1",
    api_key="sk-...",
)
embedding = embed_model.get_text_embedding("AI governance compliance document")
print(f"Embedding dimension: {len(embedding)}")
Both LLM completions and embedding calls pass through the gateway's policy chain.
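Gateway-routed embeddings behave like any other embedding vectors, so you can sanity-check them directly, for example by comparing two texts. A minimal sketch (the cosine_similarity helper is our own, not part of LlamaIndex):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# For example, compare two governed embedding calls:
# sim = cosine_similarity(
#     embed_model.get_text_embedding("data retention policy"),
#     embed_model.get_text_embedding("how long records are kept"),
# )
```

Related texts should score noticeably higher than unrelated ones; this is a quick way to confirm the gateway is returning usable vectors.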
Governed Query Engine
Build a standard LlamaIndex query engine with all LLM calls routed through the governed gateway:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI as LlamaOpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Configure global settings to route through the gateway
Settings.llm = LlamaOpenAI(
    model="gpt-4o",
    api_base="http://localhost:41002/v1",
    api_key="sk-...",
    temperature=0,
)
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_base="http://localhost:41002/v1",
    api_key="sk-...",
)
# Load and index documents
documents = SimpleDirectoryReader("./compliance-docs").load_data()
index = VectorStoreIndex.from_documents(documents)
# Query with governance enforcement
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What are the data retention requirements for healthcare?")
print(response)
Every LLM call the query engine makes — from synthesizing answers to refining results — passes through the gateway.
Streaming Query Responses
query_engine = index.as_query_engine(streaming=True, similarity_top_k=3)
streaming_response = query_engine.query("Summarize the incident response procedures.")
for text in streaming_response.response_gen:
    print(text, end="", flush=True)
print()
The gateway evaluates output policies on the assembled stream. Policy violations terminate the stream cleanly.
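When a blocking policy fires mid-stream, the client-side iteration raises rather than yielding a partial answer, so it is worth draining the generator defensively. A hedged sketch (drain_stream and the fallback text are our own names, not LlamaIndex API; in practice the raised exception is openai.APIStatusError with status 409):

```python
from typing import Iterable

def drain_stream(response_gen: Iterable[str],
                 fallback: str = "[Response blocked by policy]") -> str:
    """Accumulate streamed text; if the stream aborts mid-flight
    (e.g. on a gateway policy block), return a fallback message
    instead of surfacing a partial answer."""
    chunks = []
    try:
        for text in response_gen:
            chunks.append(text)
    except Exception:
        # A policy block terminates the stream with an error.
        return fallback
    return "".join(chunks)
```

Usage: `print(drain_stream(streaming_response.response_gen))`.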
Sub-Question Query Engine
Complex queries that decompose into sub-questions generate multiple LLM calls — each one governed:
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
# Create specialized query engines
hr_engine = hr_index.as_query_engine()
finance_engine = finance_index.as_query_engine()
query_engine_tools = [
    QueryEngineTool(
        query_engine=hr_engine,
        metadata=ToolMetadata(
            name="hr_policies",
            description="HR policy documents including hiring and termination",
        ),
    ),
    QueryEngineTool(
        query_engine=finance_engine,
        metadata=ToolMetadata(
            name="finance_policies",
            description="Financial compliance and audit procedures",
        ),
    ),
]
sub_question_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
)
response = sub_question_engine.query(
    "Compare HR onboarding requirements with financial audit timelines"
)
print(response)
The gateway sees each sub-question as an independent LLM call and applies the full policy chain to every one.
Handling Policy Blocks
from openai import APIStatusError
def safe_query(query_engine, question: str) -> str:
    try:
        response = query_engine.query(question)
        return str(response)
    except APIStatusError as e:
        if e.status_code == 409:
            error_body = e.response.json()
            policy = error_body.get("error", {}).get("policy", "unknown")
            return f"[Query blocked by policy: {policy}]"
        raise
Retry with Exponential Backoff
import time
from openai import APIStatusError
def resilient_query(query_engine, question: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            response = query_engine.query(question)
            return str(response)
        except APIStatusError as e:
            if e.status_code == 409:
                error_body = e.response.json()
                return f"[Blocked: {error_body.get('error', {}).get('message', 'Policy violation')}]"
            if e.status_code == 429 and attempt < max_retries - 1:
                wait = 2 ** attempt
                time.sleep(wait)
                continue
            raise
    raise RuntimeError("Exceeded max retries")
Quality Scoring with Governance
Combine LlamaIndex evaluation modules with the governed gateway for auditable quality scoring:
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
# Evaluators use the governed LLM (set in Settings.llm)
faithfulness_evaluator = FaithfulnessEvaluator()
relevancy_evaluator = RelevancyEvaluator()
query = "What is the maximum data retention period?"
response = query_engine.query(query)
# Each evaluation call is a governed LLM call
faithfulness_result = faithfulness_evaluator.evaluate_response(response=response)
relevancy_result = relevancy_evaluator.evaluate_response(query=query, response=response)
print(f"Faithfulness: {'Pass' if faithfulness_result.passing else 'Fail'}")
print(f"Relevancy: {'Pass' if relevancy_result.passing else 'Fail'}")
print(f"Faithfulness feedback: {faithfulness_result.feedback}")
The evaluation calls themselves are governed — so even your quality-scoring pipeline is subject to PII filtering and content policies.
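To make the scores auditable downstream, it can help to fold evaluator outcomes into a structured record before logging them. A minimal sketch (the record shape and helper name are our own, not a LlamaIndex or gateway API):

```python
from datetime import datetime, timezone

def build_audit_record(query: str, faithfulness_passing: bool,
                       relevancy_passing: bool) -> dict:
    """Turn evaluator pass/fail outcomes into a structured audit entry."""
    return {
        "query": query,
        "faithfulness": "pass" if faithfulness_passing else "fail",
        "relevancy": "pass" if relevancy_passing else "fail",
        "overall_pass": faithfulness_passing and relevancy_passing,
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
    }
```

Usage: `build_audit_record(query, faithfulness_result.passing, relevancy_result.passing)` produces a dict ready for your audit log.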
Policy Configuration for Document AI
policies:
  - name: block-pii-in-answers
    type: output_filter
    action: block
    pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b"
    message: "Response blocked: contains PII pattern"
  - name: enforce-source-citation
    type: output_quality
    action: warn
    check: must_contain_source_reference
    message: "Warning: response does not cite source documents"
  - name: log-all-queries
    type: observe
    action: log
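Before enabling the blocking policy, it is worth exercising the same pattern locally against sample outputs. A quick sketch using Python's re module with the SSN pattern from the block-pii-in-answers policy above (the contains_ssn helper is our own):

```python
import re

# Same pattern as the block-pii-in-answers policy above (US SSN shape).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def contains_ssn(text: str) -> bool:
    """True if the text would trip the PII output filter."""
    return bool(SSN_PATTERN.search(text))
```

Note the word boundaries: a phone extension like 555-1234 does not match, while a full SSN-shaped string does.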
Best Practices
- Set api_base in Settings globally — all LlamaIndex components inherit the gateway route.
- Route embeddings through the gateway too — get full observability on embedding calls.
- Wrap query engines with error handlers — 409 blocks should return fallback responses, not crash.
- Use evaluation modules through the gateway — quality scoring calls get the same policy enforcement.
- Test with observe-only policies — validate your document pipeline before enabling blocking.
- Monitor token usage via decision events — the gateway logs tokens per call for cost tracking.
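For the observe-only rollout suggested above, the same PII rule can be staged with a non-blocking action first; a sketch assuming the same policy schema as the configuration earlier on this page:

```yaml
policies:
  - name: observe-pii-in-answers
    type: output_filter
    action: log        # record matches without blocking responses
    pattern: "\\b\\d{3}-\\d{2}-\\d{4}\\b"
```

Once the logged matches look correct, switch the action to block.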
Next steps
- Function Calling — governed tool use and budget controls
- Streaming Patterns — SSE and chunked transfer deep dive
- Error Handling — complete error envelope reference
For AI systems
- Canonical terms: LlamaIndex, LlamaOpenAI, api_base, OpenAIEmbedding, query engine, VectorStoreIndex, Settings.llm, evaluation module, policy block (409).
- Key config: LlamaOpenAI(api_base="http://localhost:41002/v1"). Embeddings: OpenAIEmbedding(api_base=...).
- Use Settings.llm and Settings.embed_model to route all LlamaIndex components through the gateway globally.
- Best next pages: Function Calling, Streaming Patterns, Error Handling.
For engineers
- Set api_base in Settings globally so all LlamaIndex components (LLM, embeddings, evaluators) inherit the gateway route.
- Both complete() and chat() methods work through the gateway; use ChatMessage for multi-turn document Q&A.
- Route embeddings through the gateway for full observability on indexing and retrieval calls.
- Wrap query engines with error handlers — 409 policy blocks should return fallback responses, not crash the pipeline.
- Use quality-scoring policies (output_quality type) to enforce citation requirements and source references.
- Monitor token usage via decision events — the gateway logs tokens per call for cost tracking.
For leaders
- LlamaIndex is a leading framework for document AI — governed document Q&A requires only a URL change.
- Output quality policies enforce citation and accuracy standards at the gateway level, not in application code.
- All document AI queries produce audit events, satisfying compliance requirements for knowledge-based AI systems.
- Test with observe-only policies before enforcement to avoid disrupting existing document pipelines.