Google Vertex AI
Google Vertex AI provides enterprise-grade access to Gemini, Claude, and open models with Google Cloud IAM, VPC Service Controls, and audit logging. Keeptrusts adds a governance gateway layer on top of Vertex AI, enforcing policy before and after each LLM call. Authentication uses Application Default Credentials (ADC) or a service account key -- no separate API key is required in the client.
Use this page when
- You need the exact command, config, API, or integration details for Google Vertex AI.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
- A GCP project with the Vertex AI API enabled (`aiplatform.googleapis.com`)
- The IAM role `roles/aiplatform.user` granted on the project (or on specific model resources) -- setup commands for both follow this list
- The Google Cloud CLI authenticated locally, or a service account key file
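If the API or the role grant is still missing, both can be set up with standard gcloud commands; the project ID and service account name below are placeholders:
gcloud services enable aiplatform.googleapis.com --project my-gcp-project
gcloud projects add-iam-policy-binding my-gcp-project \
  --member "serviceAccount:kt-gateway@my-gcp-project.iam.gserviceaccount.com" \
  --role "roles/aiplatform.user"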
# Option A -- authenticate with user credentials (development)
gcloud auth application-default login
# Option B -- authenticate with a service account key (CI / production)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/sa-key.json
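To confirm that credentials resolve before starting the gateway, print an access token (standard gcloud; this succeeds only if ADC is configured):
gcloud auth application-default print-access-token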
Configuration
pack:
  name: vertex-ai-governance
  version: 1.0.0
  enabled: true
policies:
  chain:
    - prompt-injection
    - pii-detector
    - audit-logger
  policy:
    prompt-injection:
      threshold: 0.8
      action: block
    pii-detector:
      action: redact
      fields:
        - email
        - phone
        - ssn
    audit-logger:
      retention_days: 365
providers:
  targets:
    - id: vertex-gemini-flash
      provider: google-vertex:chat:gemini-2.0-flash
    - id: vertex-gemini-pro
      provider: google-vertex:chat:gemini-1.5-pro-002
    - id: vertex-claude
      provider: google-vertex:chat:claude-3-5-sonnet-v2@20241022
    - id: vertex-embeddings
      provider: google-vertex:embeddings:text-embedding-004
Start the gateway:
kt gateway run --policy-config policy-config.yaml
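Once the gateway is up, a quick smoke test against its OpenAI-compatible endpoint (port 41002, matching the client examples below) looks like this:
curl http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemini-2.0-flash","messages":[{"role":"user","content":"ping"}]}'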
Provider Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| provider | string | ✓ | -- | Provider ID. Use `google-vertex` for the base runtime, or `google-vertex:chat:<model>` / `google-vertex:embeddings:<model>` to pin the model inline. |
| gcp_project | string | ✓ | -- | Your GCP project ID (e.g., `my-gcp-project`). |
| gcp_region | string | -- | us-central1 | GCP region where the model is deployed (e.g., `us-central1`, `europe-west4`, `us-east5`). |
| model | string | -- | -- | Model ID when not encoded in the provider field. |
| format | string | -- | google-gemini | Request/response format. Keeptrusts auto-translates OpenAI-format requests to `google-gemini` format. |
| base_url | string | -- | auto | Overrides the Vertex AI endpoint. Normally derived automatically from `gcp_project` and `gcp_region`. |
| timeout_secs | integer | -- | 120 | Request timeout in seconds. |
| max_retries | integer | -- | 2 | Number of retries on transient errors (429, 503). |
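The inline-pinned targets above can also be written with explicit fields. A hypothetical expanded equivalent of the vertex-gemini-flash target, assuming the field names in this table (project and region values are placeholders):
providers:
  targets:
    - id: vertex-gemini-flash
      provider: google-vertex       # base runtime; model set separately
      model: gemini-2.0-flash
      gcp_project: my-gcp-project   # placeholder project ID
      gcp_region: us-central1       # table default, shown for clarity
      timeout_secs: 120
      max_retries: 2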
Supported Models
| Model ID | Type | Context Window | Notes |
|---|---|---|---|
| `gemini-2.0-flash` | Chat | 1M tokens | Default; fast and cost-efficient |
| `gemini-2.0-flash-lite` | Chat | 1M tokens | Lowest latency in the Gemini family |
| `gemini-1.5-pro-002` | Chat | 2M tokens | Highest context, multimodal |
| `gemini-1.5-flash-002` | Chat | 1M tokens | Balanced speed and quality |
| `claude-3-5-sonnet-v2@20241022` | Chat | 200K tokens | Via Vertex Model Garden; requires additional enablement |
| `text-embedding-004` | Embeddings | 2048 tokens input | 768-dimension output |
Model Garden models (Claude, Llama, Mistral) must be enabled individually in the Vertex AI Model Garden console before they can be called.
Client Examples
Python:
from openai import OpenAI
# Keeptrusts gateway -- no Vertex credentials needed in the client
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="any",  # gateway handles GCP auth
)
response = client.chat.completions.create(
    model="gemini-2.0-flash",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the risks of using LLMs in financial advice."},
    ],
    temperature=0.3,
    max_tokens=512,
)
print(response.choices[0].message.content)
Node.js:
import OpenAI from "openai";
// Keeptrusts gateway -- no Vertex credentials needed in the client
const client = new OpenAI({
  baseURL: "http://localhost:41002/v1",
  apiKey: "any", // gateway handles GCP auth
});
const response = await client.chat.completions.create({
  model: "gemini-2.0-flash",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Summarize the risks of using LLMs in financial advice." },
  ],
  temperature: 0.3,
  max_tokens: 512,
});
console.log(response.choices[0].message.content);
cURL:
curl http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer any" \
  -d '{
    "model": "gemini-2.0-flash",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Summarize the risks of using LLMs in financial advice." }
    ],
    "temperature": 0.3,
    "max_tokens": 512
  }'
Streaming
Vertex AI supports server-sent event (SSE) streaming. Keeptrusts passes chunks through after applying streaming-compatible policy checks.
Python:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:41002/v1", api_key="any")
# Request an SSE stream and print deltas as they arrive
stream = client.chat.completions.create(
    model="gemini-2.0-flash",
    messages=[{"role": "user", "content": "Write a short story about AI governance."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
Node.js:
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:41002/v1", apiKey: "any" });
const stream = await client.chat.completions.create({
  model: "gemini-2.0-flash",
  messages: [{ role: "user", content: "Write a short story about AI governance." }],
  stream: true,
});
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
}
cURL:
curl http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer any" \
  --no-buffer \
  -d '{
    "model": "gemini-2.0-flash",
    "messages": [{ "role": "user", "content": "Write a short story about AI governance." }],
    "stream": true
  }'
Advanced Configuration
Workload Identity (GKE)
When running inside GKE with Workload Identity, no key file is needed. Bind a Kubernetes service account to a GCP service account with roles/aiplatform.user, and Keeptrusts will pick up credentials automatically via the metadata server.
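The binding itself uses standard gcloud and kubectl commands; the project, namespace, and service account names here are placeholders:
# Allow the Kubernetes service account to impersonate the GCP service account
gcloud iam service-accounts add-iam-policy-binding \
  kt-gateway@my-gcp-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:my-gcp-project.svc.id.goog[keeptrusts/kt-gateway]"
# Annotate the Kubernetes service account so GKE maps it to the GCP one
kubectl annotate serviceaccount kt-gateway --namespace keeptrusts \
  iam.gke.io/gcp-service-account=kt-gateway@my-gcp-project.iam.gserviceaccount.com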
pack:
  name: google-vertex-ai-providers-2
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: vertex-gemini-flash
      provider: google-vertex:chat:gemini-2.0-flash
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Multi-Region Failover
Define multiple targets in different regions and enable priority-based routing (priority rules are covered in Provider routing strategies under Next steps). The per-target `gcp_region` values follow the Provider Fields table:
pack:
  name: google-vertex-ai-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: vertex-us
      provider: google-vertex:chat:gemini-2.0-flash
      gcp_region: us-central1
    - id: vertex-eu
      provider: google-vertex:chat:gemini-2.0-flash
      gcp_region: europe-west4
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Embeddings
from openai import OpenAI
client = OpenAI(base_url="http://localhost:41002/v1", api_key="any")
result = client.embeddings.create(
    model="text-embedding-004",
    input="Enterprise AI governance best practices",
)
print(result.data[0].embedding[:5])  # 768-dimension vector
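Embedding vectors returned through the gateway are typically compared with cosine similarity; a minimal sketch (the input strings are illustrative):
import math
from openai import OpenAI
client = OpenAI(base_url="http://localhost:41002/v1", api_key="any")
def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
query = client.embeddings.create(
    model="text-embedding-004",
    input="How should we govern AI usage?",
).data[0].embedding
doc = client.embeddings.create(
    model="text-embedding-004",
    input="Enterprise AI governance best practices",
).data[0].embedding
print(round(cosine_similarity(query, doc), 3))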
Best Practices
- Use ADC in development and a dedicated service account with minimum `roles/aiplatform.user` in production. Avoid broader roles such as `roles/editor`.
- Pin model versions using the full suffix (e.g., `gemini-1.5-pro-002`) rather than alias names to avoid unexpected behavior after Google updates a model alias.
- Enable VPC Service Controls around the `aiplatform.googleapis.com` API to restrict Vertex AI access to your corporate network.
- Set `timeout_secs` appropriately for long-context requests -- Gemini 1.5 Pro with 2M-token windows can take 30–60 seconds for large inputs.
- Monitor quota via Cloud Monitoring. Vertex AI imposes per-project QPM limits that vary by model and region; set `max_retries` to handle transient 429s gracefully (a client-side backoff sketch follows this list).
- Route Model Garden models (Claude, Llama) through a separate provider target so you can apply stricter policies or different retention rules to non-Google models.
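If you also want client-side resilience on top of the gateway's `max_retries`, a minimal backoff sketch (the helper name and retry budget are illustrative):
import time
from openai import OpenAI, APIStatusError
client = OpenAI(base_url="http://localhost:41002/v1", api_key="any")
def chat_with_backoff(messages, retries=3):
    # Retry transient 429/503 responses with exponential backoff
    for attempt in range(retries + 1):
        try:
            return client.chat.completions.create(
                model="gemini-2.0-flash", messages=messages
            )
        except APIStatusError as err:
            if err.status_code in (429, 503) and attempt < retries:
                time.sleep(2 ** attempt)
            else:
                raise
print(chat_with_backoff([{"role": "user", "content": "hello"}]).choices[0].message.content)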
For AI systems
- Canonical terms: Keeptrusts gateway, Google Vertex AI, Vertex AI, GCP, Model Garden, service account, OAuth2, provider target, policy-config.yaml, `provider: "google-vertex"`.
- Config field names: `provider`, `model`, `gcp_project`, `gcp_region`, `gcp_service_account_key_env`, `format`, `pricing`.
- Auth: OAuth2 via GCP service account key (JSON) or Application Default Credentials (ADC).
- Key behavior: Keeptrusts handles OAuth2 token refresh and Vertex AI endpoint construction for Gemini and Model Garden models.
- Best next pages: Google AI Studio integration (consumer tier), AWS Bedrock integration, Policy configuration.
For engineers
- Prerequisites: GCP project with Vertex AI API enabled, service account with the `aiplatform.endpoints.predict` permission, `kt` CLI installed.
- Required config: `gcp_project`, `gcp_region`, and either `gcp_service_account_key_env` or Application Default Credentials.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Monitor GCP per-project QPM quotas via Cloud Monitoring; set `max_retries` to handle transient 429s.
- For Model Garden models (Claude, Llama), configure separate provider targets with distinct policies.
- Validate: `curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"gemini-2.0-flash","messages":[{"role":"user","content":"hello"}]}'`.
For leaders
- Vertex AI provides enterprise GCP controls: VPC Service Controls, CMEK encryption, IAM-based access, and Cloud Audit Logs.
- Data residency is configurable per GCP region — traffic stays within your selected region for sovereignty compliance.
- Model Garden provides access to third-party models (Claude, Llama) under GCP's data handling agreements.
- GCP quotas (QPM per project/region) require capacity planning; Keeptrusts health probes and fallback routing help maintain availability.
Next steps
- Google AI Studio integration — simpler API key auth for non-enterprise workloads
- AWS Bedrock integration — alternative cloud-native LLM with data residency
- Provider routing strategies — multi-region failover
- Policy configuration — prompt-injection and PII policy reference
- Quickstart — install `kt` and run your first gateway