sgl-project/sglang gateway-v0.3.0 on GitHub

🚀 SGLang Model Gateway v0.3.0 Released!

We're thrilled to announce SGLang Model Gateway v0.3.0 – a major release with powerful new features, architectural improvements, and important breaking changes!

⚠️ Breaking Changes

📊 Metrics Architecture Redesigned

Complete overhaul with new 6-layer metrics architecture covering protocol (HTTP/gRPC), router, worker, streaming (TTFT/TPOT), circuit breaker, and policy metrics with unified error codes.
Action Required: Update your Prometheus dashboards and alerting rules. Metric names and structure have changed.

🔧 UUID-Based Worker Resource Management

Workers are now identified by UUIDs instead of endpoints for cleaner resource management.
Action Required: Update any tooling or scripts that interact with the worker API.

✨ New Features

🌐 Unified Inference Gateway Mode (IGW)

Single gateway, entire fleet. IGW now supports ALL router types in a single deployment with Kubernetes service discovery:

gRPC router (PD and regular mode)
HTTP router (PD and regular mode)
OpenAI router
Auto-enabled with service discovery. Deploy once, route everything - handle all traffic patterns across your entire inference fleet from a single gateway instance.

🔤 Tokenize/Detokenize HTTP Endpoints

Direct HTTP endpoints for tokenization operations
Dynamic tokenizer control plane: add, list, get, and remove tokenizers on-the-fly
TokenizerRegistry for efficient dynamic loading

🧠 Parser Endpoints

/parse/reasoning - Parse reasoning outputs
/parse/function_call - Parse function call responses
GLM-4 function call parser - Contributed directly by the GLM team for latest GLM models

📊 Embeddings Support

Native embeddings endpoint for gRPC router - expand beyond text generation to embedding workloads.

🔐 Server-Side TLS Support

Secure your gateway deployments with native TLS support.

🌐 Go Implementation, contributed by iFlytek MaaS team.

Complete Go SGLang Model Gateway with OpenAI-compatible API server - bringing SGLang to the Go ecosystem!

⚡ Major Enhancements

Control Plane - Workflow Engine

Intelligent lifecycle orchestration with:

DAG-based parallel execution with pre-computed dependency graphs
Concurrent event processing for maximum throughput
Modular add/remove/update workflows
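
To illustrate what DAG-based execution with a pre-computed dependency graph can look like, here is a minimal Python sketch. It is not the gateway's implementation (the gateway is not written in Python); the function names `precompute_graph` and `run_dag`, and the shape of the `deps` mapping, are assumptions made for this example. The idea is to build the reverse edges and in-degree counts once, then start every step as soon as its last dependency finishes.

```python
import asyncio
from collections import defaultdict


def precompute_graph(deps):
    """Build reverse edges and in-degree counts once, up front.

    `deps` maps each step name to the set of steps it depends on.
    Every step, including those without dependencies, must be a key.
    """
    dependents = defaultdict(list)
    indegree = {step: len(parents) for step, parents in deps.items()}
    for step, parents in deps.items():
        for parent in parents:
            dependents[parent].append(step)
    return dependents, indegree


async def run_dag(deps, action):
    """Run `action(step)` for every step, in parallel where the graph allows."""
    dependents, indegree = precompute_graph(deps)
    remaining = dict(indegree)  # unmet dependency count per step
    finished = []

    async def run_step(step):
        await action(step)
        return step

    # Steps with no dependencies can start immediately.
    running = {
        asyncio.ensure_future(run_step(step))
        for step, count in remaining.items()
        if count == 0
    }
    while running:
        done, running = await asyncio.wait(
            running, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            step = task.result()
            finished.append(step)
            # A finished step may unblock the steps that depend on it.
            for child in dependents[step]:
                remaining[child] -= 1
                if remaining[child] == 0:
                    running.add(asyncio.ensure_future(run_step(child)))

    if len(finished) != len(deps):
        raise ValueError("some steps never became ready (cycle or missing step)")
    return finished
```

With `deps = {"a": set(), "b": set(), "c": {"a", "b"}}`, steps `a` and `b` run concurrently and `c` starts only after both complete. A graph containing a cycle raises `ValueError` instead of hanging.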

Performance Optimization

Lock-free data structures: DashMap for policy lookups, lock-free router snapshots
Reduced CPU overhead: Optimized worker registry, gRPC client fetch, and worker selection
Optimized router management: Improved selection algorithms and state management

Resilience & Reliability

Retry and circuit breaker support for OpenAI and gRPC routers
Enhanced circuit breaker with better state management
Graceful shutdown for TLS and non-TLS servers
Unified error responses with error codes and X-SMG-Error-Code headers
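
For readers unfamiliar with the circuit breaker pattern, the following Python sketch shows the classic three-state version (closed, open, half-open) and a worker picker that answers 503 when every breaker is open, mirroring the behavior described in #15611. The class, its parameters, and the state names are assumptions for illustration; the gateway's actual state machine and thresholds may differ.

```python
import time


class CircuitBreaker:
    """Classic breaker: closed -> open -> half-open -> closed (illustrative)."""

    def __init__(self, failure_threshold=3, cooldown_secs=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_secs = cooldown_secs
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_secs:
            return "half_open"  # cooldown elapsed: allow a trial request
        return "open"

    def allows_request(self):
        return self.state() != "open"

    def record_success(self):
        # Any success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        # A failed trial request, or too many failures, (re)opens the breaker.
        if self.state() == "half_open" or self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def pick_worker(breakers):
    """Return (status, worker_id); status is 503 when every breaker is open."""
    for worker_id, breaker in breakers.items():
        if breaker.allows_request():
            return 200, worker_id
    return 503, None
```

Failing fast with a 503 lets clients and upstream load balancers back off instead of queueing requests against workers that are known to be unhealthy.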

Infrastructure

Multi-architecture Docker builds (Linux, macOS, Windows, ARM)
Custom Prometheus duration buckets
Improved logging across all modules
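
When choosing custom duration buckets, it helps to remember that Prometheus histograms are cumulative: each `le` bucket counts every observation less than or equal to its bound, and an implicit `+Inf` bucket counts everything. The Python sketch below reproduces that counting so you can check candidate boundaries against sample latencies. The function name and the sample bounds are illustrative assumptions, not gateway defaults.

```python
import bisect
import math


def cumulative_bucket_counts(bounds, observations):
    """Count observations per cumulative `le` bucket, Prometheus-style."""
    bounds = sorted(bounds) + [math.inf]  # +Inf bucket always exists
    per_bucket = [0] * len(bounds)
    for value in observations:
        # bisect_left finds the first bound that is >= value (le is inclusive).
        per_bucket[bisect.bisect_left(bounds, value)] += 1

    counts, running = {}, 0
    for bound, n in zip(bounds, per_bucket):
        running += n
        counts[bound] = running  # cumulative total up to this bound
    return counts
```

If most observations land in a single bucket, the boundaries are too coarse around the latencies that matter and quantile estimates derived from the histogram will be imprecise.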

🐛 Bug Fixes & Stability

Fixed cache-aware routing in gRPC mode
Resolved load metric tracking and double-decrease issues for cache-aware load balancing
Improved backward compatibility for GET endpoints
Fixed gRPC scheduler launcher issues
Fixed token bucket negative duration panics
Resolved MCP server initialization issues
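
As background on the token bucket fix: a wait time computed as "tokens still needed divided by refill rate" goes negative whenever the bucket already holds enough tokens, and duration types that cannot represent negative values may fail on such input. The Python sketch below shows the usual guard, clamping at zero. It is an assumption about the general technique, not a copy of the gateway's fix, and the function and parameter names are hypothetical.

```python
def wait_time_secs(tokens_available, tokens_needed, refill_rate_per_sec):
    """Seconds to wait until `tokens_needed` tokens are available."""
    if refill_rate_per_sec <= 0:
        raise ValueError("refill rate must be positive")
    deficit = tokens_needed - tokens_available
    # With a surplus the deficit is negative; clamp so the result is
    # never a negative duration.
    return max(0.0, deficit / refill_rate_per_sec)
```

For example, a bucket holding 1 token that needs 3 at a refill rate of 2 tokens per second must wait 1 second, while a bucket holding 5 tokens waits 0 seconds rather than -1.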

📚 Documentation

Major documentation update with comprehensive guides, examples, and best practices for SGLang Model Gateway.

⚠️ Migration checklist:

Update Prometheus dashboards for new metrics
Update worker API integrations for UUID-based management
Review new error response format
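
When reviewing the new error format, client code can start by reading the `X-SMG-Error-Code` response header. The helper below is a small Python sketch of that step: the function name is hypothetical, the set of possible code values is not listed in these notes, and the response body layout is deliberately not assumed here.

```python
def extract_error_code(status, headers):
    """Return the gateway error code for a failed response, or None."""
    if status < 400:
        return None  # successful responses carry no error code to report
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get("x-smg-error-code")
```

Logging or alerting on this value, rather than parsing free-form error messages, keeps client-side handling stable if message wording changes.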

⚡ Built for speed. Engineered for scale. Production-proven.

Gateway Changes (108 commits)

[model-gateway] release smg 0.3.0 (#15781) by @slin1237 in #15781
[model-gateway] Fix logging module name, parse endpoint context, and tokenizer factory (#15782) by @slin1237 in #15782
[model-gateway] Implement Zero-Copy Vision Tensor Access (#15750) by @ppraneth in #15750
[model-gateway] Fix IGW routing and optimize RouterManager (#15741) by @slin1237 in #15741
Fix smg_http_requests_total semantics (#15655) by @fzyzcjy in #15655
[model-gateway] Enable IGW mode with gRPC router and auto enable IGW when service discovery is turned on (#15459) by @YouNeedCryDear in #15459
[docs] major SGL Model Gateway documentation update (#15715) by @slin1237 in #15715
[model-gateway] add back router worker health metric and fix init state (#15622) by @fzyzcjy in #15622
[model-gateway] add back fixes of incorrect metrics after worker removal (#15624) by @fzyzcjy in #15624
[model-gateway] Add tokenize/detokenize HTTP endpoints and tokenizer management (#15702) by @slin1237 in #15702
[model-gateway] Fix tokenizer caching and improve error handling (#15695) by @slin1237 in #15695
[model-gateway]: add gRPC router embeddings endpoint implementation (#15273) by @Ratish1 in #15273
[model-gateway] Optimize router selection with lock-free snapshots (#15672) by @ppraneth in #15672
[model-gateway] Replace tokenizer with tokenizer registry for dynamic tokenizer loading in gRPC router (#12968) by @YouNeedCryDear in #12968
Improve engine customization interface (#15635) by @merrymercy in #15635
Tiny add back missing router per attempt response metric (#15621) by @fzyzcjy in #15621
Fix router gRPC mode launch error caused by async loading (#15368) by @fzyzcjy in #15368
[model-gateway] return 503 when all workers are circuit-broken (#15611) by @slin1237 in #15611
[model-gateway] add retry support to OpenAI router chat endpoint (#15589) by @slin1237 in #15589
Optimize Rust CI builds with proper sccache configuration (#15581) by @slin1237 in #15581
[model-gateway] add retry and circuit breaker support to gRPC routers (#15585) by @slin1237 in #15585
[model-gateway] refactor WorkerManager with fan_out helper and thin handlers (#15583) by @slin1237 in #15583
[model-gateway] add WorkerService abstraction for worker business logic (#15580) by @slin1237 in #15580
[model-gateway] minor code clean up (#15578) by @slin1237 in #15578
[model-gateway] Use UUIDs for router-managed worker resources (#15540) by @alphabetc1 in #15540
[model-gateway] /parse/reasoning and /parse/function_call for sgl-model-gateway (#15568) by @UbeCc in #15568
[model-gateway]: Tool parser for glm47 (#15520) by @UbeCc in #15520
[model-gateway] bugfix: backward compatibility for GET endpoints (#15413) by @alphabetc1 in #15413
[model-gateway] Optimize WASM Runtime with Instance Pooling and Component Caching (#15515) by @ppraneth in #15515
[model-gateway] add model gateway multi-arch docker build, test and document docker image (#15544) by @slin1237 in #15544
[model-gateway] Implement RAII load guard with response body attachment (#15507) by @slin1237 in #15507
[router] bugfix: cache_aware in grpc imbalance forward (#15473) by @llfl in #15473
[model-gateway] simplify workflow engine backoff and reduce duplicate reads (#15505) by @slin1237 in #15505
[model-gateway] Run workflow event subscribers concurrently (#15504) by @slin1237 in #15504
[model-gateway] Optimize workflow engine with pre-computed dependency graph (#15503) by @slin1237 in #15503
[model-gateway] Improve logging across core modules (#15497) by @slin1237 in #15497
[model-gateway] Improve logging in policies module (#15496) by @slin1237 in #15496
[model-gateway] Improve logging in data_connector module (#15495) by @slin1237 in #15495
[model-gateway] refactor: extract common graceful shutdown code before TLS branch (#15494) by @slin1237 in #15494
[model-gateway] fix graceful shutdown for TLS/Non-TLS server (#15491) by @slin1237 in #15491
[model-gateway] Replace PolicyRegistry RwLock with DashMap for lock-free policy lookups (#15361) by @slin1237 in #15361
[model-gateway] optimize worker registry and reduce lock contention in grpc client fetch (#15336) by @slin1237 in #15336
[model-gateway] reduce cpu overhead (#15316) by @slin1237 in #15316
Super tiny rename failure_count for consistency (#15186) by @fzyzcjy in #15186
[model-gateway] Remove legacy RouterMetrics and Rename SmgMetrics to Metrics and smg_labels to metrics_labels (#15160) by @slin1237 in #15160
Fix num running requests (load) wrong cleared for ongoing requests (#15116) by @fzyzcjy in #15116
[model-gateway] add mcp and discovery metrics (#15156) by @slin1237 in #15156
[model-gateway] Add streaming metrics for harmony gRPC router (#15147) by @slin1237 in #15147
[model-gateway] upgrade axum and axum server (#15146) by @slin1237 in #15146
[model-gateway] Add Layer 3 worker metrics (smg_worker_*) (#15130) by @slin1237 in #15130
Fix cache aware wrong routing caused by incorrect load tracking (#15101) by @fzyzcjy in #15101
[model-gateway] fix circuit breaker metrics (#15099) by @fzyzcjy in #15099
[model-gateway] extract circuit breaker state struct (#15098) by @fzyzcjy in #15098
[model-gateway] Parallelize metrics requests (#14953) by @ppraneth in #14953
feat(gateway): Add server-side TLS support (#15052) by @Ratish1 in #15052
[model-gateway] add streaming metrics (TTFT, TPOT, tokens, duration) for gRPC router (#15125) by @slin1237 in #15125
[model-gateway] feat(metrics): implement Layer 2 router metrics (smg_router_*) (#15124) by @slin1237 in #15124
[model-gateway] Implement Layer 1 HTTP metrics instrumentation (#15121) by @slin1237 in #15121
[model-gateway] Add new SMG metrics architecture with 6 layers (#15106) by @slin1237 in #15106
Avoid confusing zero value metric when worker is removed (#15096) by @fzyzcjy in #15096
Fix issue not reported when load decrement is incorrect (#15061) by @fzyzcjy in #15061
[model-gateway] optimize metric labels to avoid unnecessary allocations (#15095) by @slin1237 in #15095
[model-gateway] Add circuit breaker and discovery watcher metrics (#15094) by @slin1237 in #15094
[model-gateway] Fix metric emission gaps and name mismatch (#15093) by @slin1237 in #15093
[model-gateway] Remove unused TokenizerMetrics to reduce CPU overhead (#15087) by @slin1237 in #15087
[model-gateway] Refactor worker steps and add update workflow (#15085) by @slin1237 in #15085
[model-gateway] Avoid MCP Server Initialization Issue (#15065) by @xuwenyihust in #15065
[bug] fix grpc scheduler launcher breaking change (#15080) by @slin1237 in #15080
[model-gateway] Simplify error response creation (#15079) by @slin1237 in #15079
Fix double decrease load (#15060) by @fzyzcjy in #15060
Fix load metric not updated when using guard (#15059) by @fzyzcjy in #15059
Add sgl_router_attempt_http_responses_total for single attempt information (#15037) by @fzyzcjy in #15037
Add error code in prometheus metrics and add X-SMG-Error-Code header (#15036) by @fzyzcjy in #15036
Provide more fine grained error reason for reqwest error (#15032) by @fzyzcjy in #15032
Tiny change http router response format to unify (#15031) by @fzyzcjy in #15031
Tiny unify grpc existing error responses into new format (#15030) by @fzyzcjy in #15030
Add code field and unify error responses for router (#15028) by @fzyzcjy in #15028
Super tiny remove unused log_request (#15035) by @fzyzcjy in #15035
[model-gateway] refactor: unify worker management into modular workflow structure (#15010) by @slin1237 in #15010
Super tiny extract route_typed_request_once (#14951) by @fzyzcjy in #14951
[model-gateway] refactor: workflow engine cleanup and minor optimization (#15001) by @slin1237 in #15001
[model-gateway] fix: handle workflow deadlock and optimize cycle detection (#15000) by @slin1237 in #15000
[model-gateway] feat: add DAG parallel execution support and workflow optimization (#14999) by @slin1237 in #14999
[model-gateway] refactor: extract workflow engine to src/workflow module (#14996) by @slin1237 in #14996
Super tiny refactor error.rs logic (#14949) by @fzyzcjy in #14949
Super tiny move error.rs (#14944) by @fzyzcjy in #14944
Super tiny remove non-updated sgl_router_worker_load (#14888) by @fzyzcjy in #14888
Tiny add e2e http request arrival metric (#14893) by @fzyzcjy in #14893
Tiny add router e2e duration histogram (#14892) by @fzyzcjy in #14892
Fix negative duration panic in token bucket wait time calculation (#14941) by @xiaguan in #14941
[model-gateway] optimize worker selection (#14894) by @ppraneth in #14894
Super tiny remove sgl_router_active_workers (#14891) by @fzyzcjy in #14891
[SMG][DS32][fix] support dsv32, add role developer (#14307) by @jimmy-evo in #14307
[model-gateway] fix imports and delete unused code (#14911) by @slin1237 in #14911
[model-gateway] fix annotation error and code formatting (#14910) by @slin1237 in #14910
[model-gateway] code clean up on oai router in responses (#14852) by @slin1237 in #14852
Tiny clean router load report logic (#14889) by @fzyzcjy in #14889
[model-gateway] refactor cleanup WorkflowContext.get_or_err (#14890) by @fzyzcjy in #14890
[model-gateway] fix import order in oai conversation (#14851) by @slin1237 in #14851
[model-gateway] code clean up on oai router (#14850) by @slin1237 in #14850
[model-gateway] adds default implementations to RouterTrait in mod.rs (#14841) by @slin1237 in #14841
[model-gateway] Fix incompatible metric comparison in PowerOfTwo policy (#14823) by @ppraneth in #14823
[model-gateway] support engine response http status statistics in router (#14712) by @fzyzcjy in #14712
[model-gateway] support customizing Prometheus duration buckets (#14716) by @fzyzcjy in #14716
[model-gateway] add anthropic message api spec (#14834) by @slin1237 in #14834
Fix router keep nonzero metrics after worker is deleted (#14819) by @fzyzcjy in #14819
[model-gateway] Dynamically Populate Tool Call Parser Choices (#14807) by @xuwenyihust in #14807
[SMG-GO] implement a Go SGLang Model Gateway - OpenAI Compatible API Server (#14770) by @whybeyoung in #14770

New Contributors

@ppraneth made their first contribution in e99ee0c69
@xuwenyihust made their first contribution in d7f6320bb
@alphabetc1 made their first contribution in 1d90b194b
@Ratish1 made their first contribution in 0e4108ba2
@UbeCc made their first contribution in 26704c23c
@whybeyoung made their first contribution in 766476f52
@llfl made their first contribution in 3c116d5e5
@xiaguan made their first contribution in 22fe5da13
@YouNeedCryDear made their first contribution in dd620987d
@YouNeedCryDear made their first contribution in f65fa0474

Full Changelog: gateway-v0.2.4...gateway-v0.3.0
