sgl-project/sglang gateway-v0.3.1 on GitHub

🚀 SMG v0.3.1 Released!

We're excited to announce SMG v0.3.1! Cache-aware routing is now 10-12x faster on cache inserts and uses ~99% less memory per tree node, and control plane APIs gain JWT/OIDC authentication.

🌲 Radix Tree / Cache-Aware Routing: 10-12x Faster + 99% Less Memory ⚡

Complete optimization overhaul of our cache-aware routing engine with stunning performance and memory gains:

Performance Improvements

Our cache-aware routing can now handle over 216,000 cache insertions per second (up from 18,900), with latency dropping from 52.9 microseconds to just 4.6 microseconds per operation.
For prefix matching across 10,000 tree entries, throughput jumped from 41,000 to 124,000 operations per second.
Under concurrent load with 64 threads, the system processes 474,000 operations per second – a 7.9x improvement over the previous 59,000 ops/sec.
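For readers new to cache-aware routing, the two operations benchmarked above (insert and prefix match) can be illustrated with a toy character trie. This is only a sketch of the idea, not SMG's Rust implementation: the `Node` and `PrefixTree` names are hypothetical, and a real radix tree compresses runs of single-child nodes and must handle concurrent access.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One character of a cached prefix; `workers` records who has served it."""
    children: dict[str, "Node"] = field(default_factory=dict)
    workers: set[str] = field(default_factory=set)


class PrefixTree:
    """Toy trie (hypothetical): maps request text prefixes to workers."""

    def __init__(self) -> None:
        self.root = Node()

    def insert(self, text: str, worker: str) -> None:
        """Record that `worker` has processed (and likely cached) `text`."""
        node = self.root
        for ch in text:
            node = node.children.setdefault(ch, Node())
            node.workers.add(worker)

    def match(self, text: str) -> tuple[str, set[str]]:
        """Return the longest cached prefix of `text` and the workers holding it."""
        node, depth = self.root, 0
        for ch in text:
            nxt = node.children.get(ch)
            if nxt is None:
                break
            node, depth = nxt, depth + 1
        return text[:depth], set(node.workers)


tree = PrefixTree()
tree.insert("You are a helpful assistant.", "worker-a")
tree.insert("You are a pirate.", "worker-b")
prefix, workers = tree.match("You are a helpful bot")
print(repr(prefix), workers)  # 'You are a helpful ' {'worker-a'}
```

A router built on this idea would send the new request to a worker in the returned set, since that worker is the most likely to already hold the matching KV cache.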

Data processing

INSERT operations now process 440 MB/s (up from 38 MB/s).
MATCH operations now handle 253 MB/s (up from 83 MB/s).

Memory Improvements

~99% memory reduction per tree node:
Before: ~180 KB per node (DashMap default config on 170-core machines)
After: ~1.4 KB per node
Result: Deploy 100x more cache entries in the same memory footprint!
For a typical deployment with 10,000 cached prefixes, memory usage drops from ~1.8 GB to just ~14 MB – freeing up resources for actual inference workloads.
Impact: Cache inserts are now 10-12x faster, prefix matching is about 3x faster, and each tree node uses ~99% less memory. This matters most for large-scale multi-tenant deployments.
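The headline ratios can be re-derived from the figures quoted above. The short script below does only that arithmetic, assuming decimal units (1 GB = 1,000,000 KB); the variable names are illustrative.

```python
# Figures copied from the release notes; each tuple is (before, after).
INSERT_OPS = (18_900, 216_000)      # cache insertions per second
INSERT_LATENCY_US = (52.9, 4.6)     # microseconds per insert
MATCH_OPS = (41_000, 124_000)       # prefix matches per second, 10,000 entries
NODE_KB = (180.0, 1.4)              # approximate memory per tree node
PREFIXES = 10_000                   # "typical deployment" from the notes

insert_speedup = INSERT_OPS[1] / INSERT_OPS[0]
latency_speedup = INSERT_LATENCY_US[0] / INSERT_LATENCY_US[1]
match_speedup = MATCH_OPS[1] / MATCH_OPS[0]
memory_reduction = 1 - NODE_KB[1] / NODE_KB[0]
density_gain = NODE_KB[0] / NODE_KB[1]
total_before_gb = PREFIXES * NODE_KB[0] / 1_000_000
total_after_mb = PREFIXES * NODE_KB[1] / 1_000

print(f"insert throughput: {insert_speedup:.1f}x")      # 11.4x
print(f"insert latency:    {latency_speedup:.1f}x")     # 11.5x
print(f"prefix match:      {match_speedup:.1f}x")       # 3.0x
print(f"memory reduction:  {memory_reduction:.1%}")     # 99.2%
print(f"entries per byte:  {density_gain:.0f}x")        # 129x
print(f"10k prefixes:      {total_before_gb:.1f} GB -> {total_after_mb:.0f} MB")
```

The numbers show that the 10-12x figure describes the insert path, while prefix matching improves by roughly 3x, and that "100x more cache entries" is a conservative rounding of the per-node ratio.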

🔐 JWT/OIDC Authentication

Production-grade security for control plane APIs, with native support for industry-standard OIDC providers: Google, Azure, Oracle, GitHub, and more. Protect tokenizer management, worker registration, and admin endpoints using the identity and access management systems you already run.

📊 Classification API Support

Native support for classification workloads! Deploy and serve classification models alongside your existing inference fleet with dedicated pipeline stages and protocol types.

✨ Additional Features

PrefixHash Load Balancing: New KV cache-aware load balancing policy using prefix hashing for improved cache hit rates in multi-tenant environments.
Nemotron Nano V3 Parser: Reasoning parser support for Nemotron Nano V3.
In-Flight Request Age Metrics: Track the age of in-flight requests for better observability and SLA monitoring.
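The idea behind prefix-hash load balancing can be sketched in a few lines: hash the leading part of the request text and use the hash to pick a worker, so requests sharing a prefix land on the same worker and can reuse its KV cache. The sketch below is an assumption about the general technique, not SMG's PrefixHash policy; `pick_worker`, the 64-character prefix length, and the modulo mapping are all illustrative choices.

```python
import hashlib


def pick_worker(prompt: str, workers: list[str], prefix_chars: int = 64) -> str:
    """Map the first `prefix_chars` characters of a prompt to one worker."""
    if not workers:
        raise ValueError("no workers available")
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer hash.
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


workers = ["worker-a", "worker-b", "worker-c"]
system_prompt = "You are a support assistant for tenant 42. " * 2  # longer than 64 chars
first = pick_worker(system_prompt + "How do I reset my password?", workers)
second = pick_worker(system_prompt + "Where is my invoice?", workers)
print(first == second)  # True: both requests share the same 64-character prefix
```

Plain modulo hashing remaps most prefixes when the worker list changes; a consistent-hashing ring limits that churn at the cost of a little more bookkeeping.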

🛠️ Enhancements

Developer Experience:

Organized CLI arguments into logical groups
Shortened logging targets (sgl_model_gateway → smg)
Comprehensive embedding correctness tests against HuggingFace
Auto-generate protobuf files during wheel build

Reliability:

Fix IGW routing for external OpenAI workers
Work around orphan process problems
Prevent potential hangs in subprocess handling
Use 504 Gateway Timeout for upstream timeouts (proper HTTP semantics)
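The 504 change follows standard HTTP semantics: 504 Gateway Timeout means the gateway did not get a timely response from the upstream server. A minimal sketch of that kind of mapping is shown below; the function name is hypothetical, and the 502 and 500 branches are assumptions added for contrast, not behavior described in these notes.

```python
from http import HTTPStatus


def status_for_upstream_error(exc: Exception) -> HTTPStatus:
    """Choose a response status for a failed call to an upstream worker."""
    if isinstance(exc, TimeoutError):
        # The upstream did not answer in time.
        return HTTPStatus.GATEWAY_TIMEOUT
    if isinstance(exc, ConnectionError):
        # Assumption: the upstream was unreachable or dropped the connection.
        return HTTPStatus.BAD_GATEWAY
    # Assumption: anything else is treated as a gateway-side failure.
    return HTTPStatus.INTERNAL_SERVER_ERROR


print(int(status_for_upstream_error(TimeoutError("worker timed out"))))  # 504
```

Distinguishing timeouts from other failures lets clients and dashboards tell a slow worker apart from a broken gateway.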

🐛 Bug Fixes

Fixed embedding worker health check crash
Fixed tokenizer to match transformers special token handling
Fixed age bucket rendering issue
Fixed non-PD router HTTP header whitelist
Fixed duplicate classify prefix in response ID
Fixed WASM test errors on machines with many cores

⚡ Built for speed. Engineered for scale. Production-proven.

Gateway Changes (120 commits)

[model-gateway] release 0.3.1 (#16254) by @slin1237 in #16254
[smg] cleanup router RAII guards (#16560) by @fzyzcjy in #16560
[smg] update gRPC proto to match upstream changes (#16764) by @slin1237 in #16764
[smg] Add Nemotron Nano V3 reasoning parser support (#16763) by @slin1237 in #16763
[smg] Work around sglang's notorious orphan process problem (#16756) by @slin1237 in #16756
fix(e2e): prevent potential hangs in model pool subprocess handling (#16752) by @slin1237 in #16752
[smg][ci] delete old chat completion integration tests and workflow step (#16751) by @slin1237 in #16751
[smg][ci] migrate function calling tests to new infrastructure (#16748) by @slin1237 in #16748
[smg][ci] migrate validation tests to new infrastructure (#16746) by @slin1237 in #16746
[grpc] Auto-generate protobuf files during wheel build (#16409) by @CatherineSue in #16409
[smg][ci] fix model pool GPU cleanup and add startup reliability improvements (#16745) by @slin1237 in #16745
[smg][ci] migrate reasoning_content tests to new infrastructure (#16741) by @slin1237 in #16741
[smg][ci] migrate enable_thinking tests to new infrastructure (#16739) by @slin1237 in #16739
Remove migrated e2e_grpc/basic tests (#16738) by @slin1237 in #16738
[smg][ci] migrate chat completions tests to new infrastructure and build wheel once and share via artifact (#16709) by @slin1237 in #16709
[smg][ci] delete old responses api ci (#16695) by @slin1237 in #16695
[smg][ci] rename 3rd models from cloud backend and delete dead code (#16692) by @slin1237 in #16692
[smg][ci] Migrate Response API e2e tests to shared infrastructure (#16680) by @slin1237 in #16680
[smg][ci] Add thread safety to ModelPool and GPUAllocator (#16674) by @slin1237 in #16674
Add reference counting to ModelInstance for parallel test safety (#16672) by @slin1237 in #16672
[model-gateway] Fix IGW routing for external OpenAI workers (#16633) by @zhaowenzi in #16633
refactor(e2e): unify RouterInstance into Gateway class, split conftest.py into modular fixtures (#16671) by @slin1237 in #16671
refactor(e2e_test): fix smg ci e2e test code quality (#16664) by @slin1237 in #16664
fix(e2e_test): remove dead code and fix type annotations (#16661) by @slin1237 in #16661
[smg][ci] preserve model launch order with test collected (#16618) by @slin1237 in #16618
[model-gateway] extract header extraction in policy and add (#16566) by @fzyzcjy in #16566
[smg][ci]: migrate benchmarks to e2e_test/benchmarks/, use parent conftest (#16597) by @slin1237 in #16597
[router][openai] Rename prepare_mcp_payload_for_streaming and patch_streaming_response_json (#16596) by @CatherineSue in #16596
[router][grpc] Replace Vec<(String, String, String)> with ExtractedToolCall (#16598) by @CatherineSue in #16598
[model-gateway][cleanup] Fix wrong comment in manager.rs (#16601) by @CatherineSue in #16601
refactor(e2e): keep only benchmark tests in e2e_http, remove redundant tests (#16594) by @slin1237 in #16594
refactor(e2e): remove old embedding tests migrated to e2e_test/embeddings (#16592) by @slin1237 in #16592
[model-gateway] add embedding tests (#16583) by @slin1237 in #16583
[smg] clean up logs in mcp (should be info instead warn) (#16591) by @slin1237 in #16591
[ci] fix url strips in smg ci (#16548) by @slin1237 in #16548
[grpc] Unify ResponsesContext and HarmonyResponsesContext (#16549) by @CatherineSue in #16549
[responses API] Add list_tools_for_servers and threading server_keys in routers (#16540) by @CatherineSue in #16540
[router] Remove deadcode and add note for unused API completeness methods (#16528) by @CatherineSue in #16528
[model-gateway] Add model scope support and LRU eviction for GPU-constrained environments (#16525) by @slin1237 in #16525
[model-gateway] Tighten visibility in modules and remove unused re-exports (#16524) by @CatherineSue in #16524
[model-gateway] refactor e2e test infrastructure and add router CI (#16513) by @slin1237 in #16513
[model-gateway] Tighten visibility across data_connector and grpc module (#16516) by @CatherineSue in #16516
[model-gateway] fix tokenizer encode in golang bindings (#16482) by @WeiLai5432 in #16482
[grpc] Refactor openai module (#16511) by @CatherineSue in #16511
[grpc] Refactor grpc/regular/responses (#16509) by @CatherineSue in #16509
[model-gateway][grpc] Refactor harmony/responses.rs (#16508) by @CatherineSue in #16508
Fix age bucket rendering issue (#16492) by @fzyzcjy in #16492
[model-gateway][e2e_test]: Create directory structure and backends config (#16469) by @slin1237 in #16469
[model-gateway] Optimize HTTP Router Fan-out: Replace Serial Execution with Concurrent Streams (#16042) by @ppraneth in #16042
[model-gateway] add GPU allocator and model pool infrastructure for parallel E2E tests (#16460) by @slin1237 in #16460
[model-gateway] rename py_test to e2e_test (#16454) by @slin1237 in #16454
[model-gateway]: move PD configuration conflict checks to model gateway (#16088) by @Ratish1 in #16088
Revert "[grpc] update api to scheduler in grpc request manager" (#16387) by @yhyang201 in #16387
[model-gateway] reorganize integration tests into logical subdirectories (#16451) by @slin1237 in #16451
[model-gateway] Delete Python integration_mock tests (#16448) by @slin1237 in #16448
[model-gateway] : Rust integration tests for integration_mock replacement (#16441) by @slin1237 in #16441
[model-gateway]: move unit tests to bindings/python/tests/ (#16430) by @slin1237 in #16430
[model-gateway] improve lock contention and allocation in middleware (#16405) by @slin1237 in #16405
[model-gateway] optimize Vec and HashMap allocations in responses api (#16406) by @slin1237 in #16406
Super tiny code cleanup (#16401) by @fzyzcjy in #16401
Tiny add manual policy benchmark to CI (#16392) by @fzyzcjy in #16392
[model-gateway] Clear architectual debt in responses API (#16359) by @CatherineSue in #16359
Tiny support forwarding SMG routing key to engine for dumping and avoid string allocation (#16352) by @fzyzcjy in #16352
Support in-flight request age metrics for router (#16341) by @fzyzcjy in #16341
[grpc] update api to scheduler in grpc request manager (#16350) by @CatherineSue in #16350
Tiny refactor router test contexts (#16340) by @fzyzcjy in #16340
Tiny fix non-PD router http header missing whitelist (#16339) by @fzyzcjy in #16339
fix(logging): use Display format for model_id instead of Debug (#16337) by @slin1237 in #16337
[model-gateway] bug fix on module name (#16332) by @CatherineSue in #16332
refactor(gateway): shorten logging targets from sgl_model_gateway to smg (#16328) by @CatherineSue in #16328
[model-gateway] address embedding similarity threshold in ut (#16321) by @slin1237 in #16321
[model-gateway] code clean up in tokenizer register step workflow (#16316) by @CatherineSue in #16316
refactor(core): remove get_by_model_fast alias in worker_registry (#16313) by @CatherineSue in #16313
refactor(steps): consolidate duplicate strip_protocol function (#16318) by @CatherineSue in #16318
fix(http): use 504 Gateway Timeout for upstream timeouts (#16320) by @CatherineSue in #16320
refactor(http): improve pd_types.rs documentation and style (#16319) by @CatherineSue in #16319
refactor(core): use idiomatic .min() in calculate_delay (#16314) by @CatherineSue in #16314
refactor(core): use thiserror for WorkerError (#16315) by @CatherineSue in #16315
chore(core): remove unused serde_json import in worker.rs (#16312) by @CatherineSue in #16312
[model-gateway] Add embedding correctness test comparing against HuggingFace (#16092) by @slin1237 in #16092
Support cache eviction for Manual Policy (#16263) by @fzyzcjy in #16263
perf(tree): optimize input_char_count calculation (#16238) by @slin1237 in #16238
ci(benchmark): disable sccache summary annotations and fix tree output (#16233) by @slin1237 in #16233
test(tree): add comprehensive unit tests and fix input_char_count bug (#16228) by @slin1237 in #16228
[model-gateway] cache_aware eliminate String allocations in hot path (#16209) by @slin1237 in #16209
[model-gateway]: ASCII byte comparison and probabilistic timestamp updates (#16181) by @slin1237 in #16181
[model-gateway][docs] Add Classification API documentation (#16182) by @slin1237 in #16182
[model-gateway]: optimize prefix_match with zero-copy tenant and deferred char count (#16099) by @slin1237 in #16099
[model-gateway] Fix duplicate classify prefix in response ID (#16101) by @slin1237 in #16101
[model-gateway] Generate UUID-based request IDs for embedding/classify (#16100) by @slin1237 in #16100
[model-gateway] Wire classify pipeline to gRPC router (#16098) by @slin1237 in #16098
[model-gateway] Optimize INSERT with leaf-only timestamp updates (#16097) by @slin1237 in #16097
[model-gateway] Add classify pipeline stages and protocol types (#16094) by @slin1237 in #16094
[model-gateway] Optimize radix tree timestamp updates for multi-tenant scaling (#16093) by @slin1237 in #16093
[model-gateway] Improve tree benchmark with realistic multi-tenant scenarios (#14838) by @slin1237 in #14838
[model-gateway] Add classification model support infrastructure (#16061) by @slin1237 in #16061
[model-gateway] fix tokenizer to match transformers special token handling (#16087) by @slin1237 in #16087
[model-gateway] update WorkerRegistryStats with connection mode and circuit breaker info (#16046) by @slin1237 in #16046
[model-gateway]: optimize metrics for minimal CPU and memory overhead (#16041) by @slin1237 in #16041
[model-gateway] perf: optimize observability logging for minimal CPU/memory overhead (#16039) by @slin1237 in #16039
[model-gateway][CI] Display benchmark results in GitHub Actions summary (#16037) by @slin1237 in #16037
[model-gateway] Organize CLI arguments into logical groups for better --help output (#16035) by @slin1237 in #16035
[model-gateway] Organize Rust CLI arguments into logical groups for better --help output (#16036) by @slin1237 in #16036
Tiny fix WASM test errors on machines with many cores (#15992) by @fzyzcjy in #15992
Add micro benchmarks for manual policy (#15991) by @fzyzcjy in #15991
Tiny extract PeriodicTask in router (#15988) by @fzyzcjy in #15988
Tiny add smg_manual_policy_cache_entries metric (#15987) by @fzyzcjy in #15987
[model-gateway]: remove unnecessary comment (#15947) by @Ratish1 in #15947
[model-gateway] Add PrefixHash load balancing policy for KV cache-aware routing (#15935) by @slin1237 in #15935
[model-gateway]: fix grpc embedding test (#15934) by @Ratish1 in #15934
[model-gateway] optimize radix tree memory and reduce allocations (#15933) by @slin1237 in #15933
[model-gateway] Add consistent hashing for ManualPolicy routing (#15907) by @slin1237 in #15907
[model-gateway] add JWT/OIDC authentication for control plane APIs (#15850) by @slin1237 in #15850
Revert embedding integration tests from 5f3a47d (#15914) by @slin1237 in #15914
[model-gateway]: fix crash in embedding worker health check (#15910) by @Ratish1 in #15910
[model-gateway] update ManualPolicy with header-based routing (#15847) by @slin1237 in #15847
Use X-SMG-Routing-Key header instead of json body and add tests (#15826) by @fzyzcjy in #15826
Tiny fix missing record_router_upstream_response (#15811) by @fzyzcjy in #15811
Add manual routing policy for router (#15586) by @fzyzcjy in #15586
Tiny refactor select_workers API for future passing more information (#15596) by @fzyzcjy in #15596

New Contributors

@Ratish1 made their first contribution in 886e03832
@zhaowenzi made their first contribution in b5a94f8a8
@yhyang201 made their first contribution in 10174e111
@WeiLai5432 made their first contribution in 23849eba7

Full Changelog: gateway-v0.3.0...gateway-v0.3.1
