github envoyproxy/ai-gateway v0.7.0

11 hours ago

Envoy AI Gateway v0.7.0

Envoy AI Gateway v0.7.0 adds hostname-based routing to AIGatewayRoute, enabling multi-tenant deployments where different hostnames expose different model sets through a single Gateway. A new Anthropic Messages → AWS Bedrock Converse translator lets Anthropic-native clients reach Bedrock without switching protocols. OpenAI audio transcription and translation endpoints arrive alongside Azure OpenAI Responses API support. Quota-aware rate limiting takes its first step with backend rate limit filter injection for QuotaPolicy. Claude Opus 4.7 gains full reasoning support including the display parameter and xhigh effort tier. Anthropic-to-OpenAI translation now handles reasoning blocks and images end-to-end. MCP tools/list responses respect authorization rules, and multimodal support grows with audio_url and video_url content types. Several SSE streaming and provider translation bugs are fixed.

✨ New Features

Multi-Tenant Hostname Routing

  • Hostname-based model scoping on AIGatewayRoute — Serve different model sets from a single Gateway by assigning hostnames to each AIGatewayRoute. The /v1/models endpoint automatically returns only the models declared by routes matching the request's Host header, so tenants on teamA.ai.example.com and teamB.ai.example.com each see their own catalog without separate Gateways. Wildcard hostnames (*.ai.example.com) are supported following the Gateway API hostname matching rules.

Provider Translation

  • Anthropic /v1/messages → AWS Bedrock Converse API — Send requests in Anthropic Messages format and have them translated to Bedrock's Converse and ConverseStream APIs automatically. Supports text, images, tool use, thinking blocks, and streaming — so Anthropic-native clients can reach any Bedrock model without changing their integration. Complements the existing OpenAI → Bedrock Converse and Anthropic → Bedrock InvokeModel paths.
  • Reasoning and image support for Anthropic-to-OpenAI translation — The Anthropic /v1/messages → OpenAI /v1/chat/completions path now handles thinking/reasoning content and image blocks end-to-end. Thinking config (enabled/disabled/adaptive) passes through, thinking and redacted_thinking blocks are preserved in multi-turn conversations, and image blocks (base64 and URL) convert to OpenAI image_url format. Previously these were silently dropped.
  • Claude Opus 4.7 and Mythos Preview reasoning — Full support for Claude Opus 4.7's reasoning features: the display parameter (summarized/omitted) controls thinking content visibility, and xhigh joins the reasoning effort tiers for long-horizon agentic and coding tasks. Both claude-opus-4-7 and claude-mythos-preview models are recognized for effort-based thinking control.
  • Custom request paths for Anthropic backends via prefix — The prefix field on VersionedAPISchema now works for Anthropic-schema backends, producing endpoints like /{prefix}/messages instead of the default /v1/messages. Useful for routing to Anthropic-compatible providers that use a non-standard path.
  • Anthropic anthropic-beta header forwarded to AWSAnthropic — The anthropic-beta request header is now mapped into the anthropic_beta body field when routing to AWSAnthropic backends, so beta features like extended thinking and token counting work through the gateway without manual body rewriting.

OpenAI API Compatibility

  • Audio transcription and translation endpoints — Full data-plane support for OpenAI's /v1/audio/transcriptions (Whisper transcription) and /v1/audio/translations (Whisper translation) endpoints. These accept multipart/form-data requests containing audio files, enabling speech-to-text workloads to flow through the gateway with the same auth, rate limiting, and observability as other traffic.
  • Azure OpenAI Responses API — The OpenAI-compatible /v1/responses endpoint now works with Azure OpenAI backends, routing requests to Azure's /openai/responses?api-version=... path while preserving existing request and response handling. Azure users get Responses API support without changing client code.
  • audio_url and video_url content types — OpenAI chat completion requests can now include audio_url and video_url content parts, enabling multimodal audio and video inputs for compatible backends like vLLM with phi-4-mm and Qwen 3.5 models.

Quota-Aware Routing

  • Backend quota rate limit filter injection — First step toward quota-aware routing: the controller now injects a backend rate limit filter when a QuotaPolicy is attached to an AIServiceBackend. The QuotaPolicy controller reconciles the policy, builds rate limit descriptor trees, and configures the rate limit service. This enables per-backend request throttling based on upstream provider quotas.

MCP Gateway

  • Authorization-filtered tools/list responses — MCP tools/list now applies the same authorization rules used by tools/call, omitting tools the caller isn't authorized to invoke. Prevents unauthorized callers from discovering tool names and avoids wasted LLM turns on tools that would fail at call time.

Observability

  • Smarter log redaction preserves developer-authored metadata — Debug log redaction (--enableRedaction) no longer masks developer-authored schema metadata that was previously over-redacted: tool definition description and parameters, tool call function.name, response_format.json_schema, and guided_json are now visible in debug logs. User-provided content and AI-generated text remain redacted, making debug logs significantly more useful without compromising privacy.

🔗 API Updates

  • AIGatewayRoute.spec.hostnames — New optional field accepting a list of hostnames for hostname-based request filtering. When specified, the generated HTTPRoute includes these hostnames, and the /v1/models endpoint scopes its response to models from matching routes. Follows Gateway API hostname semantics including wildcard support.
  • AIGatewayRoute.spec.rules capped at 15 — Maximum rules per AIGatewayRoute reduced from 128 to 15 to match the Gateway API HTTPRoute limit (one slot is reserved for a controller-injected catch-all rule). To configure more rules on the same Gateway, split them across multiple AIGatewayRoute resources.
  • VersionedAPISchema.prefix supported for Anthropic — The prefix field now applies to Anthropic-schema backends in addition to OpenAI. The version field is ignored for Anthropic; use prefix for custom paths. Note: prefix is ignored for AWSAnthropic and GCPAnthropic as these override paths internally.
  • QuotaPolicy rate limit filter injection (runtime enforcement) — The QuotaPolicy CRD (introduced as API-only in v0.6) now has its first runtime behavior: when attached to an AIServiceBackend, a backend rate limit filter is injected to enforce quota-based throttling. Full quota-aware routing across multiple backends is planned for future releases.

🐛 Bug Fixes

  • SSE parser handles fields without space after colon — The SSE event parser now correctly handles fields formatted as data:{json} (no space after the colon), in addition to the standard data: {json}. Fixes silent field drops when proxying responses from providers that omit the optional space.
  • Responses API streaming SSE buffering — OpenAI Responses API and speech streaming translators now buffer incomplete SSE events across response body chunks instead of treating each chunk as self-contained. Fixes dropped or mangled events when TCP segment boundaries split an SSE event mid-frame.
  • Responses API token usage from incomplete and failed streams — Token usage is now captured from response.incomplete and response.failed SSE events, not just response.completed. Streams that hit max_output_tokens or encounter post-generation failures no longer report zero tokens.
  • Nil output guard in AWS Bedrock response translator — Bedrock can return HTTP 200 with no output field (e.g. guardrail interventions or UnknownOperationException). Previously this caused a nil-pointer panic in the ext-proc; now it returns a clean error to the caller.
  • Comprehensive Gemini finish-reason mapping — Gemini finish reasons like SAFETY, BLOCKLIST, RECITATION, MALFORMED_FUNCTION_CALL, and others now map to their correct OpenAI equivalents instead of all falling through to content_filter. Unknown reasons map to error rather than silently misreporting as a content filter event.
  • Empty delta in GCP Vertex AI streaming chunks — Streaming response chunks from GCP Vertex AI that lack candidate content now emit an empty delta object instead of omitting the field, conforming to the OpenAI streaming contract and fixing parse errors in strict clients.
  • Typeless assistant output messages in Responses API — Multi-turn Responses API inputs that include assistant messages without an explicit type: "message" field (e.g. from OpenCode) now parse correctly. Previously these were treated as easy-input messages, causing unmarshalling failures on output_text content blocks.

📖 Upgrade Guidance

Using Hostname-Based Routing

To serve different model sets per hostname, add hostnames to your AIGatewayRoute:

apiVersion: aigateway.envoyproxy.io/v1beta1
kind: AIGatewayRoute
metadata:
  name: team-a-route
spec:
  hostnames:
    - "team-a.ai.example.com"
  rules:
    - matches:
        - headers:
            - name: x-ai-eg-model
              value: gpt-4o
      backendRefs:
        - name: openai-backend

Routes without hostnames remain accessible on all hosts. When at least one route uses hostname scoping, the /v1/models endpoint automatically returns only the models for the matching host.

Rules-Per-Route Limit

AIGatewayRoute.spec.rules is now capped at 15 (down from 128) to match the Gateway API HTTPRoute limit. If you have routes with more than 15 rules, split them across multiple AIGatewayRoute resources attached to the same Gateway.

Using Anthropic Prefix

If you route Anthropic traffic to a provider with a non-standard path, use the prefix field:

schema:
  name: Anthropic
  prefix: /custom/v2 # produces /custom/v2/messages

Note: prefix is ignored for AWSAnthropic and GCPAnthropic backends as they override paths internally.

Adopting Claude Opus 4.7 Reasoning

If you use Claude Opus 4.7 or Mythos Preview models, note that display defaults to omitted (unlike earlier Claude models which default to summarized). To receive summarized thinking content, set display: "summarized" explicitly. The new xhigh reasoning effort tier is available for long-horizon agentic tasks.

📦 Dependency Versions

Dependency Version
Go 1.26.2
Envoy Gateway v1.7.0
Envoy Proxy v1.37
Gateway API v1.4.1
Gateway API Inference Extension v1.0.2
MCP Go SDK v1.6.0

🙏 Acknowledgements

We extend our gratitude to all contributors who made this release possible. Special thanks to:

  • The growing community of adopters for their valuable feedback and production insights
  • Everyone who reported bugs, submitted PRs, and participated in design discussions
  • The Envoy Gateway team for their continued collaboration

🔮 What's Next

We're already working on features for future releases:

  • Full quota-aware routing — building on the rate limit filter injection landed in v0.7 to route around rate-limited upstreams automatically across multiple backends
  • MCPBackend CRD — a dedicated custom resource for MCP backend servers, decoupling MCP backend configuration from MCPRoute
  • Expanded multimodal support — additional audio, video, and image generation backends across cloud providers
  • Deeper MCP authorization — finer-grained policy across tools, resources, and prompts
  • More provider translation paths — filling coverage gaps across Anthropic, Bedrock, and Vertex AI

Don't miss a new ai-gateway release

NewReleases is sending notifications on new releases.