🚀 SMG v0.3.1 Released!
We're excited to announce SMG v0.3.1, a game-changing release with a 10-12x performance improvement and a 99% memory reduction in cache-aware routing, plus enterprise-grade security!
🌲 Radix Tree / Cache-Aware Routing: 10-12x Faster + 99% Less Memory ⚡
Complete optimization overhaul of our cache-aware routing engine with stunning performance and memory gains:
Performance Improvements
- Our cache-aware routing can now handle over 216,000 cache insertions per second (up from 18,900), with latency dropping from 52.9 microseconds to just 4.6 microseconds per operation.
- For prefix matching across 10,000 tree entries, throughput jumped from 41,000 to 124,000 operations per second.
- Under concurrent load with 64 threads, the system processes 474,000 operations per second – a 7.9x improvement over the previous 59,000 ops/sec.
Data Processing
- INSERT operations now process 440 MB/s (up from 38 MB/s).
- MATCH operations handle 253 MB/s (up from 83 MB/s).
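As a quick sanity check, the speedups implied by the figures above can be recomputed directly. This is only an arithmetic sketch over the rounded numbers quoted in these notes; the dictionary keys and the helper name `speedup` are made up for illustration.

```python
# Before/after figures copied from the bullets above (rounded as published).
benchmarks = {
    "insert ops/sec": (18_900, 216_000),
    "prefix match ops/sec (10,000 entries)": (41_000, 124_000),
    "concurrent ops/sec (64 threads)": (59_000, 474_000),
    "INSERT MB/s": (38, 440),
    "MATCH MB/s": (83, 253),
}


def speedup(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


for name, (before, after) in benchmarks.items():
    print(f"{name}: {speedup(before, after):.1f}x")

# Per-operation latency is the reciprocal of single-threaded throughput.
for label, ops_per_sec in (("before", 18_900), ("after", 216_000)):
    print(f"insert latency {label}: {1e6 / ops_per_sec:.1f} microseconds")
```

With these rounded inputs the insert path lands inside the 10-12x headline range, prefix matching comes out closer to 3x, and the 64-thread ratio is about 8.0x, so the 7.9x quoted above presumably reflects the unrounded measurements.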
Memory Improvements
- ~99% memory reduction per tree node:
  - Before: ~180 KB per node (DashMap default config on 170-core machines)
  - After: ~1.4 KB per node
Result: Deploy 100x more cache entries in the same memory footprint!
For a typical deployment with 10,000 cached prefixes, memory usage drops from ~1.8 GB to just ~14 MB – freeing up resources for actual inference workloads.
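The deployment estimate above follows from multiplying the per-node sizes by the entry count. The sketch below reproduces that arithmetic, assuming one tree node per cached prefix and decimal units (1 KB = 1,000 bytes); both are simplifications chosen to match the rounded figures, and `tree_memory_bytes` is a hypothetical helper.

```python
KB = 1_000  # decimal kilobytes, an assumption that matches the rounded figures


def tree_memory_bytes(entries: int, kb_per_node: float) -> float:
    """Estimate tree memory, assuming one node per cached prefix."""
    return entries * kb_per_node * KB


entries = 10_000
before = tree_memory_bytes(entries, 180)  # ~180 KB per node
after = tree_memory_bytes(entries, 1.4)   # ~1.4 KB per node

reduction = 1 - after / before  # fraction of memory saved
density = before / after        # entries that now fit in the old footprint

print(f"before: {before / 1e9:.1f} GB, after: {after / 1e6:.0f} MB")
print(f"reduction: {reduction:.1%}, density gain: {density:.0f}x")
```

The per-node ratio works out to a little under 129x, which is where the rounded "100x more cache entries" claim comes from.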
Impact: Cache-aware routing is now 10-12x faster and uses 99% less memory. This is critical for large-scale multi-tenant deployments.
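For readers new to the idea, cache-aware routing sends a request to the worker that already holds the longest matching prompt prefix in its cache. The following is a minimal illustrative sketch, not SMG's implementation: it uses a plain character-level trie (a real radix tree compresses single-child chains into one edge), and the names `Node`, `PrefixTree`, and the worker labels are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict[str, "Node"] = field(default_factory=dict)
    workers: set[str] = field(default_factory=set)  # workers that cached this prefix


class PrefixTree:
    """Character-level trie used here as a stand-in for a radix tree."""

    def __init__(self) -> None:
        self.root = Node()

    def insert(self, text: str, worker: str) -> None:
        """Record that `worker` has cached every prefix of `text`."""
        node = self.root
        for ch in text:
            node = node.children.setdefault(ch, Node())
            node.workers.add(worker)

    def match(self, text: str) -> tuple[int, set[str]]:
        """Return the longest cached prefix length and the workers holding it."""
        node, depth, workers = self.root, 0, set()
        for ch in text:
            nxt = node.children.get(ch)
            if nxt is None:
                break
            node, depth, workers = nxt, depth + 1, nxt.workers
        return depth, set(workers)


tree = PrefixTree()
tree.insert("You are a helpful assistant.", "worker-a")
tree.insert("You are a pirate.", "worker-b")

# The request shares "You are a helpful " (18 characters) with worker-a only.
depth, workers = tree.match("You are a helpful bot")
print(depth, workers)
```

A router built on this would prefer a worker from the returned set and fall back to another policy when the match is too short to be worth it.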
🔐 JWT/OIDC Authentication
Production-grade security for control plane APIs with native support for industry-standard OIDC providers: Google, Azure, Oracle, GitHub, and more. Protect tokenizer management, worker registration, and admin endpoints with enterprise authentication infrastructure you already use. Critical for enterprise deployments – seamlessly integrate SMG into your existing identity and access management systems.
📊 Classification API Support
Native support for classification workloads! Deploy and serve classification models alongside your existing inference fleet with dedicated pipeline stages and protocol types.
✨ Additional Features
- PrefixHash Load Balancing: New KV cache-aware load balancing policy using prefix hashing for improved cache hit rates in multi-tenant environments.
- Nemotron Nano V3 Parser: Reasoning parser support for Nemotron Nano V3.
- In-Flight Request Age Metrics: Track the age of in-flight requests for better observability and SLA monitoring.
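The intuition behind prefix hashing can be sketched in a few lines: requests that share a leading prefix hash to the same worker, so that worker's KV cache is reused. This is an assumption-laden illustration and not the PrefixHash policy itself; the function `pick_worker`, the character-based prefix, the `prefix_len` parameter, and the modulo mapping are all hypothetical choices.

```python
import hashlib


def pick_worker(prompt: str, workers: list[str], prefix_len: int = 64) -> str:
    """Hash the first `prefix_len` characters and map the digest to a worker."""
    prefix = prompt[:prefix_len].encode("utf-8")
    digest = hashlib.sha256(prefix).digest()
    # Use the first 8 bytes of the digest as an integer bucket selector.
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


workers = ["w0", "w1", "w2"]
shared = "System: you are a support agent for ACME. "  # longer than 32 characters

# Two tenant requests with the same system prompt land on the same worker.
first = pick_worker(shared + "How do I reset my password?", workers, prefix_len=32)
second = pick_worker(shared + "Where is my invoice?", workers, prefix_len=32)
print(first, second, first == second)
```

Plain modulo mapping reshuffles most prefixes whenever the worker list changes; a production policy would more likely hash tokens rather than characters and use a scheme that limits that churn.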
🛠️ Enhancements
Developer Experience:
- Organized CLI arguments into logical groups
- Shortened logging targets (sgl_model_gateway → smg)
- Comprehensive embedding correctness tests against HuggingFace
- Auto-generate protobuf files during wheel build
Reliability:
- Fix IGW routing for external OpenAI workers
- Work around orphan process problems
- Prevent potential hangs in subprocess handling
- Use 504 Gateway Timeout for upstream timeouts (proper HTTP semantics)
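To illustrate the 504 change in the last item: a gateway that gives up waiting on an upstream worker should report 504 Gateway Timeout rather than a generic server error. The sketch below shows one way such a mapping could look; `status_for_upstream_error` is a hypothetical helper, and the 502 and 500 branches are assumptions added for contrast, not behavior described in these notes.

```python
from http import HTTPStatus


def status_for_upstream_error(exc: Exception) -> HTTPStatus:
    """Map an upstream failure to the status a gateway would return."""
    if isinstance(exc, TimeoutError):
        # The upstream did not answer in time: 504 Gateway Timeout.
        return HTTPStatus.GATEWAY_TIMEOUT
    if isinstance(exc, ConnectionError):
        # Assumed for contrast: the upstream refused or dropped the connection.
        return HTTPStatus.BAD_GATEWAY
    # Assumed fallback for anything else.
    return HTTPStatus.INTERNAL_SERVER_ERROR


for error in (TimeoutError(), ConnectionRefusedError(), ValueError()):
    status = status_for_upstream_error(error)
    print(type(error).__name__, int(status), status.phrase)
```

Distinguishing 504 from 500 lets clients and monitoring tell a slow backend apart from a bug in the gateway itself.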
🐛 Bug Fixes
- Fixed embedding worker health check crash
- Fixed tokenizer to match transformers special token handling
- Fixed age bucket rendering issue
- Fixed non-PD router HTTP header whitelist
- Fixed duplicate classify prefix in response ID
- Fixed WASM test errors on machines with many cores
⚡ Built for speed. Engineered for scale. Production-proven.
Gateway Changes (120 commits)
- [model-gateway] release 0.3.1 (#16254) by @slin1237 in #16254
- [smg] cleanup router RAII guards (#16560) by @fzyzcjy in #16560
- [smg] update gRPC proto to match upstream changes (#16764) by @slin1237 in #16764
- [smg] Add Nemotron Nano V3 reasoning parser support (#16763) by @slin1237 in #16763
- [smg] Work around sglang's notorious orphan process problem (#16756) by @slin1237 in #16756
- fix(e2e): prevent potential hangs in model pool subprocess handling (#16752) by @slin1237 in #16752
- [smg][ci] delete old chat completion integration tests and workflow step (#16751) by @slin1237 in #16751
- [smg][ci] migrate function calling tests to new infrastructure (#16748) by @slin1237 in #16748
- [smg][ci] migrate validation tests to new infrastructure (#16746) by @slin1237 in #16746
- [grpc] Auto-generate protobuf files during wheel build (#16409) by @CatherineSue in #16409
- [smg][ci] fix model pool GPU cleanup and add startup reliability improvements (#16745) by @slin1237 in #16745
- [smg][ci] migrate reasoning_content tests to new infrastructure (#16741) by @slin1237 in #16741
- [smg][ci] migrate enable_thinking tests to new infrastructure (#16739) by @slin1237 in #16739
- Remove migrated e2e_grpc/basic tests (#16738) by @slin1237 in #16738
- [smg][ci] migrate chat completions tests to new infrastructure and build wheel once and share via artifact (#16709) by @slin1237 in #16709
- [smg][ci] delete old responses api ci (#16695) by @slin1237 in #16695
- [smg][ci] rename 3rd models from cloud backend and delete dead code (#16692) by @slin1237 in #16692
- [smg][ci] Migrate Response API e2e tests to shared infrastructure (#16680) by @slin1237 in #16680
- [smg][ci] Add thread safety to ModelPool and GPUAllocator (#16674) by @slin1237 in #16674
- Add reference counting to ModelInstance for parallel test safety (#16672) by @slin1237 in #16672
- [model-gateway] Fix IGW routing for external OpenAI workers (#16633) by @zhaowenzi in #16633
- refactor(e2e): unify RouterInstance into Gateway class, split conftest.py into modular fixtures (#16671) by @slin1237 in #16671
- refactor(e2e_test): fix smg ci e2e test code quality (#16664) by @slin1237 in #16664
- fix(e2e_test): remove dead code and fix type annotations (#16661) by @slin1237 in #16661
- [smg][ci] preserve model launch order with test collected (#16618) by @slin1237 in #16618
- [model-gateway] extract header extraction in policy and add (#16566) by @fzyzcjy in #16566
- [smg][ci]: migrate benchmarks to e2e_test/benchmarks/, use parent conftest (#16597) by @slin1237 in #16597
- [router][openai] Rename prepare_mcp_payload_for_streaming and patch_streaming_response_json (#16596) by @CatherineSue in #16596
- [router][grpc] Replace Vec<(String, String, String)> with ExtractedToolCall (#16598) by @CatherineSue in #16598
- [model-gateway][cleanup] Fix wrong comment in manager.rs (#16601) by @CatherineSue in #16601
- refactor(e2e): keep only benchmark tests in e2e_http, remove redundant tests (#16594) by @slin1237 in #16594
- refactor(e2e): remove old embedding tests migrated to e2e_test/embeddings (#16592) by @slin1237 in #16592
- [model-gateway] add embedding tests (#16583) by @slin1237 in #16583
- [smg] clean up logs in mcp (should be info instead warn) (#16591) by @slin1237 in #16591
- [ci] fix url strips in smg ci (#16548) by @slin1237 in #16548
- [grpc] Unify ResponsesContext and HarmonyResponsesContext (#16549) by @CatherineSue in #16549
- [responses API] Add list_tools_for_servers and threading server_keys in routers (#16540) by @CatherineSue in #16540
- [router] Remove deadcode and add note for unused API completeness methods (#16528) by @CatherineSue in #16528
- [model-gateway] Add model scope support and LRU eviction for GPU-constrained environments (#16525) by @slin1237 in #16525
- [model-gateway] Tighten visibility in modules and remove unused re-exports (#16524) by @CatherineSue in #16524
- [model-gateway] refactor e2e test infrastructure and add router CI (#16513) by @slin1237 in #16513
- [model-gateway] Tighten visibility across data_connector and grpc module (#16516) by @CatherineSue in #16516
- [model-gateway] fix tokenizer encode in golang bindings (#16482) by @WeiLai5432 in #16482
- [grpc] Refactor openai module (#16511) by @CatherineSue in #16511
- [grpc] Refactor grpc/regular/responses (#16509) by @CatherineSue in #16509
- [model-gateway][grpc] Refactor harmony/responses.rs (#16508) by @CatherineSue in #16508
- Fix age bucket rendering issue (#16492) by @fzyzcjy in #16492
- [model-gateway][e2e_test]: Create directory structure and backends config (#16469) by @slin1237 in #16469
- [model-gateway] Optimize HTTP Router Fan-out: Replace Serial Execution with Concurrent Streams (#16042) by @ppraneth in #16042
- [model-gateway] add GPU allocator and model pool infrastructure for parallel E2E tests (#16460) by @slin1237 in #16460
- [model-gateway] rename py_test to e2e_test (#16454) by @slin1237 in #16454
- [model-gateway]: move PD configuration conflict checks to model gateway (#16088) by @Ratish1 in #16088
- Revert "[grpc] update api to scheduler in grpc request manager" (#16387) by @yhyang201 in #16387
- [model-gateway] reorganize integration tests into logical subdirectories (#16451) by @slin1237 in #16451
- [model-gateway] Delete Python integration_mock tests (#16448) by @slin1237 in #16448
- [model-gateway] : Rust integration tests for integration_mock replacement (#16441) by @slin1237 in #16441
- [model-gateway]: move unit tests to bindings/python/tests/ (#16430) by @slin1237 in #16430
- [model-gateway] improve lock contention and allocation in middleware (#16405) by @slin1237 in #16405
- [model-gateway] optimize Vec and HashMap allocations in responses api (#16406) by @slin1237 in #16406
- Super tiny code cleanup (#16401) by @fzyzcjy in #16401
- Tiny add manual policy benchmark to CI (#16392) by @fzyzcjy in #16392
- [model-gateway] Clear architectural debt in responses API (#16359) by @CatherineSue in #16359
- Tiny support forwarding SMG routing key to engine for dumping and avoid string allocation (#16352) by @fzyzcjy in #16352
- Support in-flight request age metrics for router (#16341) by @fzyzcjy in #16341
- [grpc] update api to scheduler in grpc request manager (#16350) by @CatherineSue in #16350
- Tiny refactor router test contexts (#16340) by @fzyzcjy in #16340
- Tiny fix non-PD router http header missing whitelist (#16339) by @fzyzcjy in #16339
- fix(logging): use Display format for model_id instead of Debug (#16337) by @slin1237 in #16337
- [model-gateway] bug fix on module name (#16332) by @CatherineSue in #16332
- refactor(gateway): shorten logging targets from sgl_model_gateway to smg (#16328) by @CatherineSue in #16328
- [model-gateway] address embedding similarity threshold in ut (#16321) by @slin1237 in #16321
- [model-gateway] code clean up in tokenizer register step workflow (#16316) by @CatherineSue in #16316
- refactor(core): remove get_by_model_fast alias in worker_registry (#16313) by @CatherineSue in #16313
- refactor(steps): consolidate duplicate strip_protocol function (#16318) by @CatherineSue in #16318
- fix(http): use 504 Gateway Timeout for upstream timeouts (#16320) by @CatherineSue in #16320
- refactor(http): improve pd_types.rs documentation and style (#16319) by @CatherineSue in #16319
- refactor(core): use idiomatic .min() in calculate_delay (#16314) by @CatherineSue in #16314
- refactor(core): use thiserror for WorkerError (#16315) by @CatherineSue in #16315
- chore(core): remove unused serde_json import in worker.rs (#16312) by @CatherineSue in #16312
- [model-gateway] Add embedding correctness test comparing against HuggingFace (#16092) by @slin1237 in #16092
- Support cache eviction for Manual Policy (#16263) by @fzyzcjy in #16263
- perf(tree): optimize input_char_count calculation (#16238) by @slin1237 in #16238
- ci(benchmark): disable sccache summary annotations and fix tree output (#16233) by @slin1237 in #16233
- test(tree): add comprehensive unit tests and fix input_char_count bug (#16228) by @slin1237 in #16228
- [model-gateway] cache_aware eliminate String allocations in hot path (#16209) by @slin1237 in #16209
- [model-gateway]: ASCII byte comparison and probabilistic timestamp updates (#16181) by @slin1237 in #16181
- [model-gateway][docs] Add Classification API documentation (#16182) by @slin1237 in #16182
- [model-gateway]: optimize prefix_match with zero-copy tenant and deferred char count (#16099) by @slin1237 in #16099
- [model-gateway] Fix duplicate classify prefix in response ID (#16101) by @slin1237 in #16101
- [model-gateway] Generate UUID-based request IDs for embedding/classify (#16100) by @slin1237 in #16100
- [model-gateway] Wire classify pipeline to gRPC router (#16098) by @slin1237 in #16098
- [model-gateway] Optimize INSERT with leaf-only timestamp updates (#16097) by @slin1237 in #16097
- [model-gateway] Add classify pipeline stages and protocol types (#16094) by @slin1237 in #16094
- [model-gateway] Optimize radix tree timestamp updates for multi-tenant scaling (#16093) by @slin1237 in #16093
- [model-gateway] Improve tree benchmark with realistic multi-tenant scenarios (#14838) by @slin1237 in #14838
- [model-gateway] Add classification model support infrastructure (#16061) by @slin1237 in #16061
- [model-gateway] fix tokenizer to match transformers special token handling (#16087) by @slin1237 in #16087
- [model-gateway] update WorkerRegistryStats with connection mode and circuit breaker info (#16046) by @slin1237 in #16046
- [model-gateway]: optimize metrics for minimal CPU and memory overhead (#16041) by @slin1237 in #16041
- [model-gateway] perf: optimize observability logging for minimal CPU/memory overhead (#16039) by @slin1237 in #16039
- [model-gateway][CI] Display benchmark results in GitHub Actions summary (#16037) by @slin1237 in #16037
- [model-gateway] Organize CLI arguments into logical groups for better --help output (#16035) by @slin1237 in #16035
- [model-gateway] Organize Rust CLI arguments into logical groups for better --help output (#16036) by @slin1237 in #16036
- Tiny fix WASM test errors on machines with many cores (#15992) by @fzyzcjy in #15992
- Add micro benchmarks for manual policy (#15991) by @fzyzcjy in #15991
- Tiny extract PeriodicTask in router (#15988) by @fzyzcjy in #15988
- Tiny add smg_manual_policy_cache_entries metric (#15987) by @fzyzcjy in #15987
- [model-gateway]: remove unnecessary comment (#15947) by @Ratish1 in #15947
- [model-gateway] Add PrefixHash load balancing policy for KV cache-aware routing (#15935) by @slin1237 in #15935
- [model-gateway]: fix grpc embedding test (#15934) by @Ratish1 in #15934
- [model-gateway] optimize radix tree memory and reduce allocations (#15933) by @slin1237 in #15933
- [model-gateway] Add consistent hashing for ManualPolicy routing (#15907) by @slin1237 in #15907
- [model-gateway] add JWT/OIDC authentication for control plane APIs (#15850) by @slin1237 in #15850
- Revert embedding integration tests from 5f3a47d (#15914) by @slin1237 in #15914
- [model-gateway]: fix crash in embedding worker health check (#15910) by @Ratish1 in #15910
- [model-gateway] update ManualPolicy with header-based routing (#15847) by @slin1237 in #15847
- Use X-SMG-Routing-Key header instead of json body and add tests (#15826) by @fzyzcjy in #15826
- Tiny fix missing record_router_upstream_response (#15811) by @fzyzcjy in #15811
- Add manual routing policy for router (#15586) by @fzyzcjy in #15586
- Tiny refactor select_workers API for future passing more information (#15596) by @fzyzcjy in #15596
New Contributors
- @Ratish1 made their first contribution in 886e03832
- @zhaowenzi made their first contribution in b5a94f8a8
- @yhyang201 made their first contribution in 10174e111
- @WeiLai5432 made their first contribution in 23849eba7
Full Changelog: gateway-v0.3.0...gateway-v0.3.1