Google Vertex AI
Google Vertex AI provides enterprise-grade access to Gemini, Claude, and open models with Google Cloud IAM, VPC Service Controls, and audit logging. Keeptrusts adds a governance gateway layer on top of Vertex AI, enforcing policy before and after each LLM call. Authentication uses Application Default Credentials (ADC) or a service account key -- no separate API key is required in the client.
Use this page when
- You need the exact command, config, API, or integration details for Google Vertex AI.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
- A GCP project with the Vertex AI API enabled (`aiplatform.googleapis.com`)
- The IAM role `roles/aiplatform.user` granted on the project (or on specific model resources) -- setup commands for both follow this list
- The Google Cloud CLI authenticated locally, or a service account key file
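If the API or the role grant is still missing, both can be set up with standard gcloud commands; the project ID and service account name below are placeholders:
gcloud services enable aiplatform.googleapis.com --project my-gcp-project
gcloud projects add-iam-policy-binding my-gcp-project \
  --member "serviceAccount:kt-gateway@my-gcp-project.iam.gserviceaccount.com" \
  --role "roles/aiplatform.user"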
# Option A -- authenticate with user credentials (development)
gcloud auth application-default login
# Option B -- authenticate with a service account key (CI / production)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/sa-key.json
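To confirm that credentials resolve before starting the gateway, print an access token (standard gcloud; this succeeds only if ADC is configured):
gcloud auth application-default print-access-token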
Configuration
pack:
  name: vertex-ai-governance
  version: 1.0.0
  enabled: true
policies:
  chain:
    - prompt-injection
    - pii-detector
    - audit-logger
  policy:
    prompt-injection:
      threshold: 0.8
      action: block
    pii-detector:
      action: redact
      fields:
        - email
        - phone
        - ssn
    audit-logger:
      retention_days: 365
providers:
  targets:
    - id: vertex-gemini-flash
      provider: google-vertex:chat:gemini-2.0-flash
    - id: vertex-gemini-pro
      provider: google-vertex:chat:gemini-1.5-pro-002
    - id: vertex-claude
      provider: google-vertex:chat:claude-3-5-sonnet-v2@20241022
    - id: vertex-embeddings
      provider: google-vertex:embeddings:text-embedding-004
Start the gateway:
kt gateway run --policy-config policy-config.yaml
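Once the gateway is up, a quick smoke test against its OpenAI-compatible endpoint (port 41002, matching the client examples below) looks like this:
curl http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemini-2.0-flash","messages":[{"role":"user","content":"ping"}]}'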
Provider Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| provider | string | ✓ | -- | Provider ID. Use `google-vertex` for the base runtime, or `google-vertex:chat:<model>` / `google-vertex:embeddings:<model>` to pin the model inline. |
| gcp_project | string | ✓ | -- | Your GCP project ID (e.g., `my-gcp-project`). |
| gcp_region | string | -- | us-central1 | GCP region where the model is deployed (e.g., `us-central1`, `europe-west4`, `us-east5`). |
| model | string | -- | -- | Model ID when not encoded in the provider field. |
| format | string | -- | google-gemini | Request/response format. Keeptrusts auto-translates OpenAI-format requests to `google-gemini` format. |
| base_url | string | -- | auto | Overrides the Vertex AI endpoint. Normally derived automatically from `gcp_project` and `gcp_region`. |
| timeout_secs | integer | -- | 120 | Request timeout in seconds. |
| max_retries | integer | -- | 2 | Number of retries on transient errors (429, 503). |
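The inline-pinned targets above can also be written with explicit fields. A hypothetical expanded equivalent of the vertex-gemini-flash target, assuming the field names in this table (project and region values are placeholders):
providers:
  targets:
    - id: vertex-gemini-flash
      provider: google-vertex       # base runtime; model set separately
      model: gemini-2.0-flash
      gcp_project: my-gcp-project   # placeholder project ID
      gcp_region: us-central1       # table default, shown for clarity
      timeout_secs: 120
      max_retries: 2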
Supported Models
| Model ID | Type | Context Window | Notes |
|---|---|---|---|
| `gemini-2.0-flash` | Chat | 1M tokens | Default; fast and cost-efficient |
| `gemini-2.0-flash-lite` | Chat | 1M tokens | Lowest latency in the Gemini family |
| `gemini-1.5-pro-002` | Chat | 2M tokens | Highest context, multimodal |
| `gemini-1.5-flash-002` | Chat | 1M tokens | Balanced speed and quality |
| `claude-3-5-sonnet-v2@20241022` | Chat | 200K tokens | Via Vertex Model Garden; requires additional enablement |
| `text-embedding-004` | Embeddings | 2048 tokens input | 768-dimension output |
Model Garden models (Claude, Llama, Mistral) must be enabled individually in the Vertex AI Model Garden console before they can be called.
Client Examples
Python:
from openai import OpenAI
# Keeptrusts gateway -- no Vertex credentials needed in the client
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="any",  # gateway handles GCP auth
)
response = client.chat.completions.create(
    model="gemini-2.0-flash",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the risks of using LLMs in financial advice."},
    ],
    temperature=0.3,
    max_tokens=512,
)
print(response.choices[0].message.content)
Node.js:
import OpenAI from "openai";
// Keeptrusts gateway -- no Vertex credentials needed in the client
const client = new OpenAI({
  baseURL: "http://localhost:41002/v1",
  apiKey: "any", // gateway handles GCP auth
});
const response = await client.chat.completions.create({
  model: "gemini-2.0-flash",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Summarize the risks of using LLMs in financial advice." },
  ],
  temperature: 0.3,
  max_tokens: 512,
});
console.log(response.choices[0].message.content);
cURL:
curl http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer any" \
  -d '{
    "model": "gemini-2.0-flash",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Summarize the risks of using LLMs in financial advice." }
    ],
    "temperature": 0.3,
    "max_tokens": 512
  }'
Streaming
Vertex AI supports server-sent event (SSE) streaming. Keeptrusts passes chunks through after applying streaming-compatible policy checks.
Python:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:41002/v1", api_key="any")
# Request an SSE stream and print deltas as they arrive
stream = client.chat.completions.create(
    model="gemini-2.0-flash",
    messages=[{"role": "user", "content": "Write a short story about AI governance."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
Node.js:
import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:41002/v1", apiKey: "any" });
const stream = await client.chat.completions.create({
  model: "gemini-2.0-flash",
  messages: [{ role: "user", content: "Write a short story about AI governance." }],
  stream: true,
});
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) process.stdout.write(delta);
}
cURL:
curl http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer any" \
  --no-buffer \
  -d '{
    "model": "gemini-2.0-flash",
    "messages": [{ "role": "user", "content": "Write a short story about AI governance." }],
    "stream": true
  }'
Advanced Configuration
Workload Identity (GKE)
When running inside GKE with Workload Identity, no key file is needed. Bind a Kubernetes service account to a GCP service account with roles/aiplatform.user, and Keeptrusts will pick up credentials automatically via the metadata server.
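The binding itself uses standard gcloud and kubectl commands; the project, namespace, and service account names here are placeholders:
# Allow the Kubernetes service account to impersonate the GCP service account
gcloud iam service-accounts add-iam-policy-binding \
  kt-gateway@my-gcp-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:my-gcp-project.svc.id.goog[keeptrusts/kt-gateway]"
# Annotate the Kubernetes service account so GKE maps it to the GCP one
kubectl annotate serviceaccount kt-gateway --namespace keeptrusts \
  iam.gke.io/gcp-service-account=kt-gateway@my-gcp-project.iam.gserviceaccount.com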
pack:
  name: google-vertex-ai-providers-2
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: vertex-gemini-flash
      provider: google-vertex:chat:gemini-2.0-flash
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Multi-Region Failover
Define multiple targets in different regions and enable priority-based routing (priority rules are covered in Provider routing strategies under Next steps). The per-target `gcp_region` values follow the Provider Fields table:
pack:
  name: google-vertex-ai-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: vertex-us
      provider: google-vertex:chat:gemini-2.0-flash
      gcp_region: us-central1
    - id: vertex-eu
      provider: google-vertex:chat:gemini-2.0-flash
      gcp_region: europe-west4
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Embeddings
from openai import OpenAI
client = OpenAI(base_url="http://localhost:41002/v1", api_key="any")
result = client.embeddings.create(
    model="text-embedding-004",
    input="Enterprise AI governance best practices",
)
print(result.data[0].embedding[:5])  # 768-dimension vector
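Embedding vectors returned through the gateway are typically compared with cosine similarity; a minimal sketch (the input strings are illustrative):
import math
from openai import OpenAI
client = OpenAI(base_url="http://localhost:41002/v1", api_key="any")
def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
query = client.embeddings.create(
    model="text-embedding-004",
    input="How should we govern AI usage?",
).data[0].embedding
doc = client.embeddings.create(
    model="text-embedding-004",
    input="Enterprise AI governance best practices",
).data[0].embedding
print(round(cosine_similarity(query, doc), 3))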
Best Practices
- Use ADC in development and a dedicated service account with minimum `roles/aiplatform.user` in production. Avoid broader roles such as `roles/editor`.
- Pin model versions using the full suffix (e.g., `gemini-1.5-pro-002`) rather than alias names to avoid unexpected behavior after Google updates a model alias.
- Enable VPC Service Controls around the `aiplatform.googleapis.com` API to restrict Vertex AI access to your corporate network.
- Set `timeout_secs` appropriately for long-context requests -- Gemini 1.5 Pro with 2M-token windows can take 30–60 seconds for large inputs.
- Monitor quota via Cloud Monitoring. Vertex AI imposes per-project QPM limits that vary by model and region; set `max_retries` to handle transient 429s gracefully (a client-side backoff sketch follows this list).
- Route Model Garden models (Claude, Llama) through a separate provider target so you can apply stricter policies or different retention rules to non-Google models.
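If you also want client-side resilience on top of the gateway's `max_retries`, a minimal backoff sketch (the helper name and retry budget are illustrative):
import time
from openai import OpenAI, APIStatusError
client = OpenAI(base_url="http://localhost:41002/v1", api_key="any")
def chat_with_backoff(messages, retries=3):
    # Retry transient 429/503 responses with exponential backoff
    for attempt in range(retries + 1):
        try:
            return client.chat.completions.create(
                model="gemini-2.0-flash", messages=messages
            )
        except APIStatusError as err:
            if err.status_code in (429, 503) and attempt < retries:
                time.sleep(2 ** attempt)
            else:
                raise
print(chat_with_backoff([{"role": "user", "content": "hello"}]).choices[0].message.content)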
For AI systems
- Canonical terms: Keeptrusts gateway, Google Vertex AI, Vertex AI, GCP, Model Garden, service account, OAuth2, provider target, policy-config.yaml, `provider: "google-vertex"`.
- Config field names: `provider`, `model`, `gcp_project`, `gcp_region`, `gcp_service_account_key_env`, `format`, `pricing`.
- Auth: OAuth2 via GCP service account key (JSON) or Application Default Credentials (ADC).
- Key behavior: Keeptrusts handles OAuth2 token refresh and Vertex AI endpoint construction for Gemini and Model Garden models.
- Best next pages: Google AI Studio integration (consumer tier), AWS Bedrock integration, Policy configuration.
For engineers
- Prerequisites: GCP project with Vertex AI API enabled, service account with the `aiplatform.endpoints.predict` permission, `kt` CLI installed.
- Required config: `gcp_project`, `gcp_region`, and either `gcp_service_account_key_env` or Application Default Credentials.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Monitor GCP per-project QPM quotas via Cloud Monitoring; set `max_retries` to handle transient 429s.
- For Model Garden models (Claude, Llama), configure separate provider targets with distinct policies.
- Validate: `curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"gemini-2.0-flash","messages":[{"role":"user","content":"hello"}]}'`.
For leaders
- Vertex AI provides enterprise GCP controls: VPC Service Controls, CMEK encryption, IAM-based access, and Cloud Audit Logs.
- Data residency is configurable per GCP region — traffic stays within your selected region for sovereignty compliance.
- Model Garden provides access to third-party models (Claude, Llama) under GCP's data handling agreements.
- GCP quotas (QPM per project/region) require capacity planning; Keeptrusts health probes and fallback routing help maintain availability.
Next steps
- Google AI Studio integration — simpler API key auth for non-enterprise workloads
- AWS Bedrock integration — alternative cloud-native LLM with data residency
- Provider routing strategies — multi-region failover
- Policy configuration — prompt-injection and PII policy reference
- Quickstart — install `kt` and run your first gateway