🚀 SGLang Model Gateway v0.3.0 Released!
We're thrilled to announce SGLang Model Gateway v0.3.0 – a major release with powerful new features, architectural improvements, and important breaking changes!
⚠️ Breaking Changes
📊 Metrics Architecture Redesigned
Complete overhaul with a new 6-layer metrics architecture covering protocol (HTTP/gRPC), router, worker, streaming (TTFT/TPOT), circuit breaker, and policy metrics, with unified error codes.
Action Required: Update your Prometheus dashboards and alerting rules. Metric names and structure have changed.
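As a starting point for the dashboard audit, the sketch below scans rule or dashboard text for metric names that still carry the `sgl_router_` prefix, which appears on metrics this release's changelog lists as removed (for example `sgl_router_worker_load` and `sgl_router_active_workers`). Treating every `sgl_router_*` name as legacy is an assumption, and the function name and sample rules are hypothetical; check the gateway documentation for the authoritative list of new metric names.

```python
import re

# Assumption: metrics with this prefix predate the new smg_* naming.
LEGACY_PREFIX = "sgl_router_"
_METRIC_RE = re.compile(r"\b" + re.escape(LEGACY_PREFIX) + r"[A-Za-z0-9_]*")


def find_legacy_metrics(text: str) -> list:
    """Return the sorted, de-duplicated legacy metric names found in text."""
    return sorted(set(_METRIC_RE.findall(text)))


# Hypothetical alerting expressions, used only to exercise the scanner.
sample_rules = """
expr: sum(rate(sgl_router_active_workers[5m])) by (instance)
expr: sgl_router_worker_load > 100
expr: rate(smg_http_requests_total[1m])
"""

for name in find_legacy_metrics(sample_rules):
    print("legacy metric reference:", name)
```

Running the scanner over exported dashboard JSON or rule files gives a list of expressions to revisit before upgrading.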
🔧 UUID-Based Worker Resource Management
Workers are now identified by UUIDs instead of endpoints for cleaner resource management.
Action Required: Update any tooling or scripts that interact with the worker API.
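Scripts that stored worker endpoints as identifiers may need to tell the two forms apart during migration. The sketch below assumes worker IDs use the standard hyphenated textual UUID form, which these notes do not spell out; the function names and URL schemes checked are hypothetical.

```python
import uuid


def is_worker_uuid(identifier: str) -> bool:
    """True if identifier is a canonical hyphenated UUID string."""
    try:
        # uuid.UUID also accepts braces and un-hyphenated hex, so compare
        # against the canonical rendering to accept only one form.
        return str(uuid.UUID(identifier)) == identifier.lower()
    except ValueError:
        return False


def classify_worker_ref(identifier: str) -> str:
    """Classify a stored worker reference as 'uuid', 'endpoint', or 'unknown'."""
    if is_worker_uuid(identifier):
        return "uuid"
    # Assumption: legacy references were URL-style endpoints.
    if identifier.startswith(("http://", "https://", "grpc://")):
        return "endpoint"
    return "unknown"


print(classify_worker_ref("http://worker-0:30000"))
print(classify_worker_ref("123e4567-e89b-12d3-a456-426614174000"))
```

Any reference classified as `endpoint` is a candidate for replacement with the UUID the gateway assigns to that worker.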
✨ New Features
🌐 Unified Inference Gateway Mode (IGW)
Single gateway, entire fleet. IGW now supports ALL router types in a single deployment with Kubernetes service discovery:
- gRPC router (PD and regular mode)
- HTTP router (PD and regular mode)
- OpenAI router
Auto-enabled with service discovery. Deploy once, route everything - handle all traffic patterns across your entire inference fleet from a single gateway instance.
🔤 Tokenize/Detokenize HTTP Endpoints
- Direct HTTP endpoints for tokenization operations
- Dynamic tokenizer control plane: add, list, get, and remove tokenizers on-the-fly
- TokenizerRegistry for efficient dynamic loading
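To make the four control-plane operations concrete, here is a minimal in-memory registry sketch. It is not the gateway's `TokenizerRegistry` (which these notes do not show); the class name, method signatures, and the use of a plain callable as a "tokenizer" are all illustrative assumptions.

```python
import threading
from typing import Callable, Dict, List

# Stand-in type: a tokenizer is any callable from text to tokens.
Tokenizer = Callable[[str], List[str]]


class TokenizerRegistrySketch:
    """Thread-safe name -> tokenizer map with add/list/get/remove."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._tokenizers: Dict[str, Tokenizer] = {}

    def add(self, name: str, tokenizer: Tokenizer) -> None:
        with self._lock:
            if name in self._tokenizers:
                raise ValueError(f"tokenizer {name!r} already registered")
            self._tokenizers[name] = tokenizer

    def list(self) -> List[str]:
        with self._lock:
            return sorted(self._tokenizers)

    def get(self, name: str) -> Tokenizer:
        with self._lock:
            return self._tokenizers[name]  # raises KeyError if missing

    def remove(self, name: str) -> bool:
        """Remove a tokenizer; return True if it was present."""
        with self._lock:
            return self._tokenizers.pop(name, None) is not None


registry = TokenizerRegistrySketch()
registry.add("whitespace", str.split)  # toy tokenizer for illustration
print(registry.list())
print(registry.get("whitespace")("deploy once route everything"))
```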
🧠 Parser Endpoints
- `/parse/reasoning` - Parse reasoning outputs
- `/parse/function_call` - Parse function call responses
- GLM-4 function call parser - Contributed directly by the GLM team for latest GLM models
📊 Embeddings Support
Native embeddings endpoint for gRPC router - expand beyond text generation to embedding workloads.
🔐 Server-Side TLS Support
Secure your gateway deployments with native TLS support.
🌐 Go Implementation, contributed by iFlytek MaaS team.
Complete Go SGLang Model Gateway with OpenAI-compatible API server - bringing SGLang to the Go ecosystem!
⚡ Major Enhancements
Control Plane - Workflow Engine
Intelligent lifecycle orchestration with:
- DAG-based parallel execution with pre-computed dependency graphs
- Concurrent event processing for maximum throughput
- Modular add/remove/update workflows
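The idea of DAG-based parallel execution can be sketched with Python's standard `graphlib`: the dependency graph is prepared once, and every step whose prerequisites are finished is submitted to a thread pool. This is a generic illustration, not the gateway's workflow engine; `run_dag` and the step names are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(steps, deps, max_workers=4):
    """Run steps (name -> callable) respecting deps (name -> set of prerequisites).

    Every name in deps must also be a key of steps.
    Returns the order in which steps completed.
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in steps})
    sorter.prepare()  # builds the graph once; raises CycleError on a cycle
    order = []
    pending = {}  # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every step whose prerequisites have all completed.
            for name in sorter.get_ready():
                pending[pool.submit(steps[name])] = name
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                name = pending.pop(future)
                future.result()  # re-raise any exception from the step
                order.append(name)
                sorter.done(name)  # may unlock dependent steps
    return order


# Hypothetical worker-registration workflow.
steps = {name: (lambda: None) for name in
         ("validate", "connect", "load_tokenizer", "register")}
deps = {
    "connect": {"validate"},
    "load_tokenizer": {"validate"},
    "register": {"connect", "load_tokenizer"},
}
print(run_dag(steps, deps))
```

Here `connect` and `load_tokenizer` can run concurrently because both depend only on `validate`, while `register` waits for both.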
Performance Optimization
- Lock-free data structures: DashMap for policy lookups, lock-free router snapshots
- Reduced CPU overhead: Optimized worker registry, gRPC client fetch, and worker selection
- Optimized router management: Improved selection algorithms and state management
Resilience & Reliability
- Retry and circuit breaker support for OpenAI and gRPC routers
- Enhanced circuit breaker with better state management
- Graceful shutdown for TLS and non-TLS servers
- Unified error responses with error codes and X-SMG-Error-Code headers
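For readers unfamiliar with the pattern, a circuit breaker can be sketched as a small state machine with closed, open, and half-open states. The code below is a generic illustration under assumed thresholds, not the gateway's implementation; the class, parameters, and `pick_worker` helper are hypothetical. The changelog notes that the gateway returns 503 when all workers are circuit-broken, which corresponds to `pick_worker` returning `None` here.

```python
import time


class CircuitBreakerSketch:
    """Opens after consecutive failures; allows a probe after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_secs=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_secs = cooldown_secs
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_secs:
            return "half_open"  # cooldown elapsed: let a probe through
        return "open"

    def allow_request(self):
        return self.state != "open"

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        state = self.state
        if state == "open":
            return  # already rejecting traffic; nothing to update
        self._failures += 1
        # A failed half-open probe reopens immediately.
        if state == "half_open" or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


def pick_worker(breakers):
    """Return the first worker ID whose breaker allows traffic, else None."""
    for worker_id, breaker in breakers.items():
        if breaker.allow_request():
            return worker_id
    return None
```

Injecting the clock keeps the state transitions testable without sleeping.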
Infrastructure
- Multi-architecture Docker builds (Linux, macOS, Windows, ARM)
- Custom Prometheus duration buckets
- Improved logging across all modules
🐛 Bug Fixes & Stability
- Fixed cache-aware routing in gRPC mode
- Resolved load metric tracking and double-decrease issues for cache-aware load balancing
- Improved backward compatibility for GET endpoints
- Fixed gRPC scheduler launcher issues
- Fixed token bucket negative duration panics
- Resolved MCP server initialization issues
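The token bucket fix is a useful reminder of a general pitfall: when a bucket holds more tokens than a request needs, a naive wait-time formula yields a negative number, and duration types such as Rust's `std::time::Duration` panic when constructed from negative seconds. The sketch below shows the clamping idea in generic form; it is not the gateway's code, and the function and parameter names are hypothetical.

```python
def wait_time_secs(tokens_available: float, tokens_needed: float,
                   refill_per_sec: float) -> float:
    """Seconds to wait until enough tokens are available, never negative."""
    if refill_per_sec <= 0:
        raise ValueError("refill_per_sec must be positive")
    deficit = tokens_needed - tokens_available
    # Clamp at zero: a surplus of tokens means no wait, not a negative wait.
    return max(0.0, deficit / refill_per_sec)


print(wait_time_secs(tokens_available=5, tokens_needed=2, refill_per_sec=1.0))
print(wait_time_secs(tokens_available=1, tokens_needed=3, refill_per_sec=2.0))
```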
📚 Documentation
Major documentation update with comprehensive guides, examples, and best practices for SGLang Model Gateway.
⚠️ Migration checklist:
- Update Prometheus dashboards for new metrics
- Update worker API integrations for UUID-based management
- Review new error response format
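When reviewing the error format, client code can read the error code from the `X-SMG-Error-Code` header and fall back to the `code` field in the body. In the sketch below the header name comes from these notes, but the JSON body shape (a top-level `code` or one nested under `error`) is an assumption, and `example_code` is a placeholder value, not a real gateway error code.

```python
import json


def extract_error_code(headers: dict, body: str):
    """Return the error code from the header, else from the JSON body, else None."""
    # HTTP header names are case-insensitive.
    for key, value in headers.items():
        if key.lower() == "x-smg-error-code":
            return value
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        return None
    if not isinstance(payload, dict):
        return None
    # Assumption: the code is nested under "error" or sits at the top level.
    error = payload.get("error")
    if isinstance(error, dict) and "code" in error:
        return str(error["code"])
    if "code" in payload:
        return str(payload["code"])
    return None


print(extract_error_code({"X-SMG-Error-Code": "example_code"}, ""))
print(extract_error_code({}, '{"error": {"code": "example_code"}}'))
```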
⚡ Built for speed. Engineered for scale. Production-proven.
Gateway Changes (108 commits)
- [model-gateway] release smg 0.3.0 (#15781) by @slin1237 in #15781
- [model-gateway] Fix logging module name, parse endpoint context, and tokenizer factory (#15782) by @slin1237 in #15782
- [model-gateway] Implement Zero-Copy Vision Tensor Access (#15750) by @ppraneth in #15750
- [model-gateway] Fix IGW routing and optimize RouterManager (#15741) by @slin1237 in #15741
- Fix smg_http_requests_total semantics (#15655) by @fzyzcjy in #15655
- [model-gateway]Enable IGW mode with gRPC router and auto enable IGW when service discovery is turned on (#15459) by @YouNeedCryDear in #15459
- [docs] major SGL Model Gateway documentation update (#15715) by @slin1237 in #15715
- [model-gateway] add back router worker health metric and fix init state (#15622) by @fzyzcjy in #15622
- [model-gateway] add back fixes of incorrect metrics after worker removal (#15624) by @fzyzcjy in #15624
- [model-gateway] Add tokenize/detokenize HTTP endpoints and tokenizer management (#15702) by @slin1237 in #15702
- [model-gateway] Fix tokenizer caching and improve error handling (#15695) by @slin1237 in #15695
- [model-gateway]: add gRPC router embeddings endpoint implementation (#15273) by @Ratish1 in #15273
- [model-gateway] Optimize router selection with lock-free snapshots (#15672) by @ppraneth in #15672
- [model-gateway] Replace tokenizer with tokenizer registry for dynamic tokenizer loading in gRPC router (#12968) by @YouNeedCryDear in #12968
- Improve engine customization interface (#15635) by @merrymercy in #15635
- Tiny add back missing router per attempt response metric (#15621) by @fzyzcjy in #15621
- Fix router gRPC mode launch error caused by async loading (#15368) by @fzyzcjy in #15368
- [model-gateway] return 503 when all workers are circuit-broken (#15611) by @slin1237 in #15611
- [model-gateway] add retry support to OpenAI router chat endpoint (#15589) by @slin1237 in #15589
- Optimize Rust CI builds with proper sccache configuration (#15581) by @slin1237 in #15581
- [model-gateway] add retry and circuit breaker support to gRPC routers (#15585) by @slin1237 in #15585
- [model-gateway] refactor WorkerManager with fan_out helper and thin handlers (#15583) by @slin1237 in #15583
- [model-gateway] add WorkerService abstraction for worker business logic (#15580) by @slin1237 in #15580
- [model-gateway] minor code clean up (#15578) by @slin1237 in #15578
- [model-gateway] Use UUIDs for router-managed worker resources (#15540) by @alphabetc1 in #15540
- [model-gateway] /parse/reasoning and /parse/function_call for sgl-model-gateway (#15568) by @UbeCc in #15568
- [model-gateway]: Tool parser for glm47 (#15520) by @UbeCc in #15520
- [model-gateway] bugfix: backward compatibility for GET endpoints (#15413) by @alphabetc1 in #15413
- [model-gateway] Optimize WASM Runtime with Instance Pooling and Component Caching (#15515) by @ppraneth in #15515
- [model-gateway] add model gateway multi-arch docker build, test and document docker image (#15544) by @slin1237 in #15544
- [model-gateway] Implement RAII load guard with response body attachment (#15507) by @slin1237 in #15507
- [router] bugfix: cache_aware in grpc imbalance forward (#15473) by @llfl in #15473
- [model-gateway] simplify workflow engine backoff and reduce duplicate reads (#15505) by @slin1237 in #15505
- [model-gateway] Run workflow event subscribers concurrently (#15504) by @slin1237 in #15504
- [model-gateway] Optimize workflow engine with pre-computed dependency graph (#15503) by @slin1237 in #15503
- [model-gateway] Improve logging across core modules (#15497) by @slin1237 in #15497
- [model-gateway] Improve logging in policies module (#15496) by @slin1237 in #15496
- [model-gateway] Improve logging in data_connector module (#15495) by @slin1237 in #15495
- [model-gateway] refactor: extract common graceful shutdown code before TLS branch (#15494) by @slin1237 in #15494
- [model-gateway] fix graceful shutdown for TLS/Non-TLS server (#15491) by @slin1237 in #15491
- [model-gateway] Replace PolicyRegistry RwLock with DashMap for lock-free policy lookups (#15361) by @slin1237 in #15361
- [model-gateway] optimize worker registry and reduce lock contention in grpc client fetch (#15336) by @slin1237 in #15336
- [model-gateway] reduce cpu overhead (#15316) by @slin1237 in #15316
- Super tiny rename failure_count for consistency (#15186) by @fzyzcjy in #15186
- [model-gateway] Remove legacy RouterMetrics and Rename SmgMetrics to Metrics and smg_labels to metrics_labels (#15160) by @slin1237 in #15160
- Fix num running requests (load) wrong cleared for ongoing requests (#15116) by @fzyzcjy in #15116
- [model-gateway] add mcp and discovery metrics (#15156) by @slin1237 in #15156
- [model-gateway] Add streaming metrics for harmony gRPC router (#15147) by @slin1237 in #15147
- [model-gateway] upgrade axum and axum server (#15146) by @slin1237 in #15146
- [model-gateway] Add Layer 3 worker metrics (smg_worker_*) (#15130) by @slin1237 in #15130
- Fix cache aware wrong routing caused by incorrect load tracking (#15101) by @fzyzcjy in #15101
- [model-gateway] fix circuit breaker metrics (#15099) by @fzyzcjy in #15099
- [model-gateway] extract circuit breaker state struct (#15098) by @fzyzcjy in #15098
- [model-gateway] Parallelize metrics requests (#14953) by @ppraneth in #14953
- feat(gateway): Add server-side TLS support (#15052) by @Ratish1 in #15052
- [model-gateway] add streaming metrics (TTFT, TPOT, tokens, duration) for gRPC router (#15125) by @slin1237 in #15125
- [model-gateway] feat(metrics): implement Layer 2 router metrics (smg_router_*) (#15124) by @slin1237 in #15124
- [model-gateway] Implement Layer 1 HTTP metrics instrumentation (#15121) by @slin1237 in #15121
- [model-gateway] Add new SMG metrics architecture with 6 layers (#15106) by @slin1237 in #15106
- Avoid confusing zero value metric when worker is removed (#15096) by @fzyzcjy in #15096
- Fix issue not reported when load decrement is incorrect (#15061) by @fzyzcjy in #15061
- [model-gateway] optimize metric labels to avoid unnecessary allocations (#15095) by @slin1237 in #15095
- [model-gateway] Add circuit breaker and discovery watcher metrics (#15094) by @slin1237 in #15094
- [model-gateway] Fix metric emission gaps and name mismatch (#15093) by @slin1237 in #15093
- [model-gateway] Remove unused TokenizerMetrics to reduce CPU overhead (#15087) by @slin1237 in #15087
- [model-gateway] Refactor worker steps and add update workflow (#15085) by @slin1237 in #15085
- [model-gateway] Avoid MCP Server Initialization Issue (#15065) by @xuwenyihust in #15065
- [bug] fix grpc scheduler launcher breaking change (#15080) by @slin1237 in #15080
- [model-gateway] Simplify error response creation (#15079) by @slin1237 in #15079
- Fix double decrease load (#15060) by @fzyzcjy in #15060
- Fix load metric not updated when using guard (#15059) by @fzyzcjy in #15059
- Add sgl_router_attempt_http_responses_total for single attempt information (#15037) by @fzyzcjy in #15037
- Add error code in prometheus metrics and add X-SMG-Error-Code header (#15036) by @fzyzcjy in #15036
- Provide more fine grained error reason for reqwest error (#15032) by @fzyzcjy in #15032
- Tiny change http router response format to unify (#15031) by @fzyzcjy in #15031
- Tiny unify grpc existing error responses into new format (#15030) by @fzyzcjy in #15030
- Add `code` field and unify error responses for router (#15028) by @fzyzcjy in #15028
- Super tiny remove unused log_request (#15035) by @fzyzcjy in #15035
- [model-gateway] refactor: unify worker management into modular workflow structure (#15010) by @slin1237 in #15010
- Super tiny extract route_typed_request_once (#14951) by @fzyzcjy in #14951
- [model-gateway] refactor: workflow engine cleanup and minor optimization (#15001) by @slin1237 in #15001
- [model-gateway] fix: handle workflow deadlock and optimize cycle detection (#15000) by @slin1237 in #15000
- [model-gateway] feat: add DAG parallel execution support and workflow optimization (#14999) by @slin1237 in #14999
- [model-gateway] refactor: extract workflow engine to src/workflow module (#14996) by @slin1237 in #14996
- Super tiny refactor error.rs logic (#14949) by @fzyzcjy in #14949
- Super tiny move error.rs (#14944) by @fzyzcjy in #14944
- Super tiny remove non-updated sgl_router_worker_load (#14888) by @fzyzcjy in #14888
- Tiny add e2e http request arrival metric (#14893) by @fzyzcjy in #14893
- Tiny add router e2e duration histogram (#14892) by @fzyzcjy in #14892
- Fix negative duration panic in token bucket wait time calculation (#14941) by @xiaguan in #14941
- [model-gateway] optimize worker selection (#14894) by @ppraneth in #14894
- Super tiny remove sgl_router_active_workers (#14891) by @fzyzcjy in #14891
- [SMG][DS32][fix] support dsv32, add role developer (#14307) by @jimmy-evo in #14307
- [model-gateway] fix imports and delete unused code (#14911) by @slin1237 in #14911
- [model-gateway] fix annotation error and code formatting (#14910) by @slin1237 in #14910
- [model-gateway] code clean up on oai router in responses (#14852) by @slin1237 in #14852
- Tiny clean router load report logic (#14889) by @fzyzcjy in #14889
- [model-gateway] refactor cleanup WorkflowContext.get_or_err (#14890) by @fzyzcjy in #14890
- [model-gateway] fix import order in oai conversation (#14851) by @slin1237 in #14851
- [model-gateway] code clean up on oai router (#14850) by @slin1237 in #14850
- [model-gateway] adds default implementations to RouterTrait in mod.rs (#14841) by @slin1237 in #14841
- [model-gateway] Fix incompatible metric comparison in `PowerOfTwo` policy (#14823) by @ppraneth in #14823
- [model-gateway] support engine response http status statistics in router (#14712) by @fzyzcjy in #14712
- [model-gateway] support customizing Prometheus duration buckets (#14716) by @fzyzcjy in #14716
- [model-gateway] add anthropic message api spec (#14834) by @slin1237 in #14834
- Fix router keep nonzero metrics after worker is deleted (#14819) by @fzyzcjy in #14819
- [model-gateway] Dynamically Populate Tool Call Parser Choices (#14807) by @xuwenyihust in #14807
- [SMG-GO] implement a Go SGLang Model Gateway - OpenAI Compatible API Server (#14770) by @whybeyoung in #14770
New Contributors
- @ppraneth made their first contribution in e99ee0c69
- @xuwenyihust made their first contribution in d7f6320bb
- @alphabetc1 made their first contribution in 1d90b194b
- @Ratish1 made their first contribution in 0e4108ba2
- @UbeCc made their first contribution in 26704c23c
- @whybeyoung made their first contribution in 766476f52
- @llfl made their first contribution in 3c116d5e5
- @xiaguan made their first contribution in 22fe5da13
- @YouNeedCryDear made their first contribution in dd620987d
- @YouNeedCryDear made their first contribution in f65fa0474
Full Changelog: gateway-v0.2.4...gateway-v0.3.0