github sgl-project/sglang gateway-v0.3.1
Release Gateway-v0.3.1

2 days ago

🚀 SMG v0.3.1 Released!

We're excited to announce SMG v0.3.1: cache-aware routing now inserts 10-12x faster and uses ~99% less memory per tree node, and control plane APIs gain JWT/OIDC authentication!

🌲 Radix Tree / Cache-Aware Routing: 10-12x Faster + 99% Less Memory ⚡

Complete optimization overhaul of our cache-aware routing engine with stunning performance and memory gains:

Performance Improvements

  • Our cache-aware routing can now handle over 216,000 cache insertions per second (up from 18,900), with latency dropping from 52.9 microseconds to just 4.6 microseconds per operation.
  • For prefix matching across 10,000 tree entries, throughput jumped from 41,000 to 124,000 operations per second.
  • Under concurrent load with 64 threads, the system processes 474,000 operations per second – a 7.9x improvement over the previous 59,000 ops/sec.

Data Processing

  • INSERT operations now process 440 MB/s (up from 38 MB/s).
  • MATCH operations handle 253 MB/s (up from 83 MB/s).
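
The multipliers behind these figures can be checked directly. The following is a minimal Python sketch that uses only the numbers quoted above; the dictionary keys and function names are illustrative labels, not SMG identifiers.

```python
# Before/after figures as quoted in these release notes.
BENCHMARKS = {
    "insert ops/s": (18_900, 216_000),
    "prefix match ops/s": (41_000, 124_000),
    "concurrent ops/s (64 threads)": (59_000, 474_000),
    "insert MB/s": (38, 440),
    "match MB/s": (83, 253),
}


def speedup(before, after):
    """Ratio for throughput-style metrics, where higher is better."""
    return after / before


def latency_speedup(before_us, after_us):
    """Ratio for latency-style metrics, where lower is better."""
    return before_us / after_us


for name, (before, after) in BENCHMARKS.items():
    print(f"{name}: {speedup(before, after):.1f}x")

# Per-operation insert latency in microseconds.
print(f"insert latency: {latency_speedup(52.9, 4.6):.1f}x")
```

With these inputs, the 10-12x range corresponds to the insertion path (ops/s, MB/s, and latency), while prefix matching improves by roughly 3x. Because the published figures are rounded, a ratio computed this way can differ slightly from the one quoted in the notes.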

Memory Improvements

  • ~99% memory reduction per tree node:
      • Before: ~180 KB per node (DashMap default config on 170-core machines)
      • After: ~1.4 KB per node
  • Result: Deploy roughly 100x more cache entries in the same memory footprint!
  • Example: For a typical deployment with 10,000 cached prefixes, memory usage drops from ~1.8 GB to just ~14 MB, freeing up resources for actual inference workloads.

Impact: Cache insertions are now 10-12x faster and each tree node uses ~99% less memory. This is critical for large-scale multi-tenant deployments.
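
The memory estimate can be reproduced with a few lines of arithmetic. This sketch assumes one tree node per cached prefix (as the 10,000-prefix example implies) and decimal units (1 KB = 1,000 bytes); the names are illustrative only.

```python
# Per-node sizes quoted in the release notes, in bytes (decimal KB).
BYTES_PER_NODE_BEFORE = 180_000  # ~180 KB
BYTES_PER_NODE_AFTER = 1_400     # ~1.4 KB


def tree_memory_bytes(num_nodes, bytes_per_node):
    """Total footprint, assuming every node costs the same amount."""
    return num_nodes * bytes_per_node


nodes = 10_000  # assumption: one node per cached prefix
before_bytes = tree_memory_bytes(nodes, BYTES_PER_NODE_BEFORE)
after_bytes = tree_memory_bytes(nodes, BYTES_PER_NODE_AFTER)

reduction = 1 - after_bytes / before_bytes  # fraction of memory saved
density = before_bytes / after_bytes        # entries per byte, relative

print(f"before: {before_bytes / 1e9:.1f} GB")
print(f"after:  {after_bytes / 1e6:.0f} MB")
print(f"reduction: {reduction:.1%}, density gain: {density:.0f}x")
```

The per-node figures give a reduction slightly above 99% and a density gain somewhat above 100x, consistent with the rounded claims above.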

🔐 JWT/OIDC Authentication

Production-grade security for control plane APIs with native support for industry-standard OIDC providers: Google, Azure, Oracle, GitHub, and more. Protect tokenizer management, worker registration, and admin endpoints with enterprise authentication infrastructure you already use. Critical for enterprise deployments – seamlessly integrate SMG into your existing identity and access management systems.

📊 Classification API Support

Native support for classification workloads! Deploy and serve classification models alongside your existing inference fleet with dedicated pipeline stages and protocol types.

✨ Additional Features

  • PrefixHash Load Balancing: New KV cache-aware load balancing policy using prefix hashing for improved cache hit rates in multi-tenant environments.
  • Nemotron Nano V3 Parser
  • In-Flight Request Age Metrics: Track the age of in-flight requests for better observability and SLA monitoring.
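
These notes do not describe how the PrefixHash policy listed above works internally. As a generic illustration of prefix-hash routing (not SMG's implementation), the sketch below hashes the leading characters of a prompt so that requests sharing a prefix land on the same worker. The function name, the `prefix_len` parameter, and the worker URLs are all hypothetical.

```python
import hashlib


def pick_worker(prompt: str, workers: list[str], prefix_len: int = 64) -> str:
    """Choose a worker from a hash of the prompt's leading characters.

    Hypothetical sketch: requests with the same first `prefix_len`
    characters always map to the same worker, so that worker's KV cache
    is more likely to already hold the shared prefix.
    """
    if not workers:
        raise ValueError("at least one worker is required")
    prefix = prompt[:prefix_len].encode("utf-8")
    digest = hashlib.sha256(prefix).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


# Hypothetical worker addresses, for illustration only.
workers = ["http://worker-0:8000", "http://worker-1:8000", "http://worker-2:8000"]
system_prompt = "You are a helpful assistant for tenant A. " * 4
first = pick_worker(system_prompt + "What is a radix tree?", workers)
second = pick_worker(system_prompt + "Summarize this document.", workers)
print(first == second)
```

A plain modulo mapping like this reshuffles most prefixes when the worker list changes; a real policy would likely need something more stable, such as consistent hashing, plus load awareness.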

🛠️ Enhancements

Developer Experience:

  • Organized CLI arguments into logical groups
  • Shortened logging targets (sgl_model_gateway → smg)
  • Comprehensive embedding correctness tests against HuggingFace
  • Auto-generate protobuf files during wheel build

Reliability:

  • Fix IGW routing for external OpenAI workers
  • Work around orphan process problems
  • Prevent potential hangs in subprocess handling
  • Use 504 Gateway Timeout for upstream timeouts (proper HTTP semantics)
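
The 504 change follows standard HTTP semantics: a gateway that does not receive a timely response from an upstream server answers 504 Gateway Timeout. A minimal sketch of that mapping is shown below; the function name is hypothetical, and the 502 and 500 branches are assumptions added for contrast, not behavior stated in these notes.

```python
from http import HTTPStatus


def status_for_upstream_error(exc: Exception) -> HTTPStatus:
    """Map an upstream failure to the status a gateway might return."""
    # TimeoutError is checked first so a timeout always yields 504.
    if isinstance(exc, TimeoutError):
        # The upstream worker did not answer in time.
        return HTTPStatus.GATEWAY_TIMEOUT  # 504
    if isinstance(exc, ConnectionError):
        # Assumption: unreachable or reset upstream is reported as 502.
        return HTTPStatus.BAD_GATEWAY
    # Assumption: anything else is treated as a gateway-side failure.
    return HTTPStatus.INTERNAL_SERVER_ERROR


print(status_for_upstream_error(TimeoutError("worker timed out")))
```

Distinguishing 504 from a generic 5xx lets clients and load balancers apply timeout-specific retry or alerting rules.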

🐛 Bug Fixes

  • Fixed embedding worker health check crash
  • Fixed tokenizer to match transformers special token handling
  • Fixed age bucket rendering issue
  • Fixed non-PD router HTTP header whitelist
  • Fixed duplicate classify prefix in response ID
  • Fixed WASM test errors on machines with many cores

⚡ Built for speed. Engineered for scale. Production-proven.

Gateway Changes (120 commits)

Full Changelog: gateway-v0.3.0...gateway-v0.3.1
