sgl-project/sglang gateway-v0.2.0 on GitHub

🚀 Release: SGLang Model Gateway v0.2.0 (formerly “SGLang Router”)

🔥 What’s new

🧠 Multi-Model Inference Gateway (IGW) Mode

IGW turns one router into many — letting you manage multiple models at once, each with its own routing policy, priorities, and metadata. Think of it as running several routers under one roof, with shared reliability, observability, and API surface.
You can dynamically register models via /workers, assign labels like tier or policy, and let the gateway handle routing, health checks, and load balancing.
Whether you’re mixing Llama, Mistral, and DeepSeek, or orchestrating per-tenant routing in enterprise setups, IGW gives you total control.
Your fleet, your rules. ⚡
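
To make the idea of several routers under one roof concrete, here is a minimal in-memory sketch of a per-model worker registry with label filtering and round-robin selection. It is an illustration only, not the gateway's Rust implementation: the `Worker` and `ModelRegistry` names, the `tier` label values, and the worker URLs are all hypothetical, and the real gateway's /workers payload and routing policies may look quite different.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Worker:
    """Hypothetical worker record; field names are illustrative only."""
    url: str
    model_id: str
    labels: dict = field(default_factory=dict)  # e.g. {"tier": "gold"}
    healthy: bool = True


class ModelRegistry:
    """Toy per-model worker registry with round-robin selection."""

    def __init__(self):
        self._workers = defaultdict(list)   # model_id -> [Worker]
        self._cursors = defaultdict(count)  # model_id -> round-robin counter

    def register(self, worker):
        self._workers[worker.model_id].append(worker)

    def pick(self, model_id, **required_labels):
        """Return the next healthy worker for model_id matching all labels."""
        candidates = [
            w for w in self._workers.get(model_id, [])
            if w.healthy
            and all(w.labels.get(k) == v for k, v in required_labels.items())
        ]
        if not candidates:
            raise LookupError(f"no healthy worker for model {model_id!r}")
        index = next(self._cursors[model_id]) % len(candidates)
        return candidates[index]


registry = ModelRegistry()
registry.register(Worker("http://10.0.0.1:30000", "llama", {"tier": "gold"}))
registry.register(Worker("http://10.0.0.2:30000", "llama", {"tier": "gold"}))
registry.register(Worker("http://10.0.0.3:30000", "mistral", {"tier": "silver"}))

first = registry.pick("llama", tier="gold")   # first llama worker
second = registry.pick("llama", tier="gold")  # rotates to the second one
```

Each model keeps its own worker list and its own cursor, so traffic for one model never affects selection for another. A real policy (cache-aware, load-based, and so on) would replace the round-robin step.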

⚡ gRPC Mode: Rust-Powered, Built for Throughput

This is the heart of 0.2.0. The new gRPC data plane runs entirely in Rust (tokenizer, reasoning parser, and tool parser included), giving you native-speed performance and lower latency.
You can connect to gRPC-based SGLang workers, stream tokens in real time, and even handle OpenAI-compatible APIs like

🌐 OpenAI-Compatible Gateway

Seamlessly proxy requests to OpenAI, while keeping data control local.
Conversation history, responses, and background jobs all flow through the gateway — same API, enterprise privacy.

💾 Pluggable History Storage

Choose between memory, none, or oracle for conversation and /v1/responses data.

memory: Fastest for ephemeral runs.
none: Zero persistence, zero latency overhead.
oracle: Full persistence via Oracle ATP with connection pooling and credentials support.

🧩 Pluggable MCP Integration

The gateway now natively speaks MCP across all transports (STDIO, HTTP, SSE, Streamable), so your tools can plug directly into reasoning and response loops — perfect for agentic workflows and cross-model orchestration.
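
Returning to the history backends described earlier, the pluggable-storage idea can be sketched as a small interface with interchangeable implementations. This is a hedged illustration rather than the gateway's actual code: `HistoryStore`, `MemoryStore`, `NoneStore`, and `make_store` are hypothetical names, and the oracle backend is left out because it would need a database driver and credentials.

```python
from abc import ABC, abstractmethod


class HistoryStore(ABC):
    """Hypothetical storage interface for response and conversation data."""

    @abstractmethod
    def save(self, response_id, payload):
        ...

    @abstractmethod
    def load(self, response_id):
        ...


class MemoryStore(HistoryStore):
    """Keeps everything in a dict; contents vanish when the process exits."""

    def __init__(self):
        self._items = {}

    def save(self, response_id, payload):
        self._items[response_id] = payload

    def load(self, response_id):
        return self._items.get(response_id)


class NoneStore(HistoryStore):
    """Discards writes and never returns stored data."""

    def save(self, response_id, payload):
        pass

    def load(self, response_id):
        return None


# Backend names mirror the options listed in the release notes.
BACKENDS = {"memory": MemoryStore, "none": NoneStore}


def make_store(name):
    """Build a store by backend name; unknown names raise ValueError."""
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown history backend: {name!r}") from None


store = make_store("memory")
store.save("resp_1", {"output": "hello"})
loaded = store.load("resp_1")
```

Because callers depend only on `save` and `load`, swapping memory for none (or for a persistent backend) is a configuration change rather than a code change.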

🛡️ Reliability & Observability Upgrades

Built-in:
Retries with exponential backoff + jitter
Per-worker circuit breakers
Token-bucket rate limiting & FIFO queuing
Prometheus metrics for latency, load, queue depth, PD pipelines, tokenizer speed, and MCP activity
Structured tracing & request-ID propagation
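
Two of the mechanisms listed above, exponential backoff with jitter and token-bucket rate limiting, are easy to sketch in a few lines. The code below is a generic illustration under assumed parameters, not the gateway's Rust implementation: `backoff_delay`, `TokenBucket`, and their default values are hypothetical, and the "full jitter" variant shown is only one common way to add jitter. Circuit breakers and FIFO queuing are not covered here.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Delay in seconds for a 0-based retry attempt, using full jitter.

    The ceiling doubles with each attempt up to `cap`; the returned delay
    is a random fraction of that ceiling.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_sec`."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self, tokens=1.0):
        """Take `tokens` if available; return whether the request may pass."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity,
                           self._tokens + elapsed * self.refill_per_sec)
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False


# A fake clock keeps the example deterministic.
now = [0.0]
bucket = TokenBucket(capacity=2, refill_per_sec=1.0, clock=lambda: now[0])
burst = [bucket.try_acquire() for _ in range(3)]  # third call finds it empty
now[0] += 1.0                                     # one token refills
after_refill = bucket.try_acquire()
```

Injecting the clock and the random source makes both pieces testable without sleeping. In a gateway, a rejected `try_acquire` would typically lead to queuing or an error response instead of an immediate retry.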

✨ SGLang Model Gateway v0.2.0 — built in Rust, designed for scale, ready for reasoning.

What's Changed in Gateway

Gateway Changes (238 commits)

[router] upgrade to 0.2.0 (#11642) by @slin1237 in #11642
[router] add worker self discovery for metadata (#11638) by @slin1237 in #11638
[router][grpc] add warm up to grpc server (#11627) by @slin1237 in #11627
[router] update router readme to latest features (#11619) by @slin1237 in #11619
[router] add chang and keyang to sgl router author (#11620) by @slin1237 in #11620
[router] cleanup app context and move to startup (#11617) by @slin1237 in #11617
[router] add py binding and readme for openai router and history backend (#11453) by @key4ng in #11453
[router] when given both local tokenizer and chat template, log all (#11601) by @slin1237 in #11601
[router] allow router launch server to use grpc mode (#11600) by @slin1237 in #11600
[router] delete useless table content comment in spec (#11597) by @slin1237 in #11597
[router] change worker api to async instead of sync (#11566) by @slin1237 in #11566
[router] update generate spec to align with sgl io struct (#11591) by @slin1237 in #11591
[router][protocols] Add Axum validate extractor and use it for /v1/chat/completions endpoint (#11588) by @CatherineSue in #11588
[router][grpc] Add serve_grpc to launch_server and log id for HealthCheck (#11564) by @CatherineSue in #11564
[router][grpc] Add error handling to generate_tool_constraints (#11562) by @CatherineSue in #11562
[router] Add Rust CLI flags for queue size, timeout, and rate limit for token bucket rate limiter (#11483) by @Jonahcb in #11483
[router] allow user to specify chat template path (#11549) by @slin1237 in #11549
[router][grpc] Further delegate non-stream processing to processing.rs (#11553) by @CatherineSue in #11553
[router][Fix] Include grpc reflection runtime dependency (#11419) by @ai-jz in #11419
[router] allow tokenizer path to be dir (#11530) by @slin1237 in #11530
[router] openai router: support grok model (#11511) by @key4ng in #11511
Fix the GPT function calling regex to allow dash in the name (#10577) by @antoine-roux in #10577
[Router]: Small Typo in a comment within tree.rs (#11489) by @xuwenyihust in #11489
Super tiny delete unused openai router in sgl-router (#11448) by @fzyzcjy in #11448
[router][grpc] Consolidate parser checks for chat completions (#11439) by @CatherineSue in #11439
[router] leverage RAII to actively cancel request during client disconnect (#11399) by @slin1237 in #11399
[router] disable rate limiter by default (#11435) by @slin1237 in #11435
[router] Fix ci nvcc not found error (#11411) by @key4ng in #11411
move more files under srt/utils (#11285) by @merrymercy in #11285
[router] conversation item API: create, retrieve and delete (#11369) by @key4ng in #11369
[router] change grpc client from mutable to clone (#11394) by @slin1237 in #11394
[router][grpc] Replace fake health check with correct ones (#11387) by @CatherineSue in #11387
[router][grpc] Fix streaming bugs: empty tool names, state pollution, and panics (#11373) by @CatherineSue in #11373
[router][lint] Add unused_qualifications to cargo lint warnings (#11366) by @CatherineSue in #11366
[router] Refactor OpenAI router: split monolithic file and move location (#11359) by @key4ng in #11359
[router][grpc] disable health check generation and increase timeout (#11353) by @slin1237 in #11353
[router][grpc] Add dependencies in Cargo.toml to support chat template rendering (#11342) by @CatherineSue in #11342
[router] Support history management using conversation (#11339) by @key4ng in #11339
[router] Fix all unused_qualifications (#11341) by @CatherineSue in #11341
[router][grpc] Cleanup debug logs in grpc_server and grpc_router (#11340) by @CatherineSue in #11340
[router] improve reasoning parser lock and reduce req cloning (#11336) by @slin1237 in #11336
[router] refactor generate to use new pipeline arch (#11323) by @slin1237 in #11323
[router][grpc] Refactor chat handler in grpc/ to use centralized orchestrator (#11314) by @CatherineSue in #11314
[router] cleanup worker health check to return early (#11310) by @slin1237 in #11310
[router] support Openai router conversation API CRUD (#11297) by @key4ng in #11297
[router][grpc] Fix error message format in grpc chat handler (#11307) by @CatherineSue in #11307
[router][grpc] Fix sampling_params.stop_strs is None (#11306) by @CatherineSue in #11306
[router] fix grpc connection conversion and add optimization (#11305) by @slin1237 in #11305
[router][grpc] Refactor chat template content format detection (#11288) by @CatherineSue in #11288
[router] add get server info and get model info in grpc server (#11303) by @slin1237 in #11303
[router] add reasoning and tool parser argument in router (#11290) by @slin1237 in #11290
[router][grpc] Fix proto3 default value mismatches and cleanup unused fields (#11283) by @CatherineSue in #11283
[router][grpc] Refine streaming processes (#11277) by @CatherineSue in #11277
[router][tool call] Clean up redundant detect_format and has_tool_markers (#11270) by @CatherineSue in #11270
[router] add ipv6 support across all components (#11219) by @slin1237 in #11219
[router] add grpc router pd mode for chat and generate (#11140) by @slin1237 in #11140
[router] fix get load response parsing (#11213) by @slin1237 in #11213
[router] Steaming support for MCP Tool Calls in OpenAI Router (#11173) by @key4ng in #11173
[router][grpc] Support streaming for v1/chat/completions (#11179) by @CatherineSue in #11179
Remove dp balance metadata and minimul token balance. (#11170) by @hnyls2002 in #11170
[grpc] style fix for grpc compilation. (#11175) by @hnyls2002 in #11175
[proto] Add script to compile python protos (#11171) by @CatherineSue in #11171
[router][grpc] Support tool call parser in streaming (#11160) by @CatherineSue in #11160
[router] Add multi-turn tool calling loop support for MCP integration (#11143) by @key4ng in #11143
[router] add pd service in grpc router for pd (#11120) by @slin1237 in #11120
[router] add mcp list and mcp call in output array (#11112) by @key4ng in #11112
[router][bugfix] Fix input_logprobs handling with None value and logprob_start_len = -1 (#11113) by @CatherineSue in #11113
[router][grpc-server] Fix gRPC server shutdown (#11094) by @slin1237 in #11094
[router][tool call] Full support for ToolChoice (#11085) by @CatherineSue in #11085
[router] grpc router generate endpoint support (#11070) by @slin1237 in #11070
[router][grpc] Add logprobs support to router (#11082) by @CatherineSue in #11082
[router] Use get_pooled in process_single_choice (#11079) by @CatherineSue in #11079
[router][tool call] Separate JsonParser and LlamaParser (#11073) by @CatherineSue in #11073
[router] add n to generate sampling params (#11069) by @slin1237 in #11069
[router][tool call] Improve normal content extraction and error handling (non-stream) (#11050) by @CatherineSue in #11050
[router] add harmony tool parser base structure and interface (#11036) by @slin1237 in #11036
[router][tool call] Support normal content extraction before tool call (streaming) (#11038) by @CatherineSue in #11038
[router] migrate to rust python module for pythonic parser (#11033) by @slin1237 in #11033
Update GLM-4.5 Model Doc (#11017) by @zRzRzRzRzRzRzR in #11017
[router] fix chat template loading and tokenizer path (#10999) by @slin1237 in #10999
[router] basic mcp support for openai router response api (#10978) by @key4ng in #10978
[router][tool parser] Modify tool parser to return both normal text and tool calls (non-stream) (#10995) by @CatherineSue in #10995
[router][grpc] Support E2E non-stream chat completions (#10980) by @CatherineSue in #10980
[router][grpc] Add helpfer functions for decoder in router.rs and fix specs (#10971) by @CatherineSue in #10971
[router] remove old/oudated/useless comments across code base (#10968) by @slin1237 in #10968
[router] remove old/oudated/useless comments (#10967) by @slin1237 in #10967
[router] grpc router regular mode import cleanup (#10963) by @slin1237 in #10963
[router] add move grpc worker management from router to worker manager (#10960) by @slin1237 in #10960
[router] move grpc client from router to worker and builder (#10958) by @slin1237 in #10958
[router] add grpc client get and set (#10955) by @slin1237 in #10955
router: Support parallel sampling num > 1 in grpc_server and non-stream handling (#10929) by @CatherineSue in #10929
[router][refactor] Clean up protobuf fields (#10923) by @CatherineSue in #10923
[router] change log level to warning (#10926) by @slin1237 in #10926
refactor: Move grpc/client.rs to grpc_client/sglang_scheduler.rs (#10924) by @CatherineSue in #10924
router: Fix constraint proto and build_constraint in grpc router (#10881) by @CatherineSue in #10881
[router] consolidate worker load monitoring (#10894) by @slin1237 in #10894
[router] simplify tokenizer dev doc (#10895) by @slin1237 in #10895
[router] Support Oracle DB(ATP) Data Connector (#10845) by @key4ng in #10845
[router] consolidate worker get loads (#10880) by @slin1237 in #10880
[router] consolidate health endpoints and flush cache (#10876) by @slin1237 in #10876
router-grpc: Add tools processing and other paramters for apply_chat_template (#10877) by @CatherineSue in #10877
[router] select first healthy worker on proxied get requests (#10827) by @lun-4 in #10827
router-grpc: Support jinja chat template content format detection (#10832) by @CatherineSue in #10832
[router] add auth middleware for api key auth (#10826) by @CatherineSue in #10826
[router] Support streaming for Openai Router Response api (#10822) by @key4ng in #10822
[router] fix axum default body limit (#10818) by @CatherineSue in #10818
router(grpc): Implement route for chat_cmpl endpoint (#10761) by @CatherineSue in #10761
[router] use dashmap for radix tree instead of hash for multi model (#10814) by @slin1237 in #10814
[router] responses api POST and GET with local storage (#10581) by @slin1237 in #10581
[router] fix cache aware routing strategy and lock contention (#10773) by @slin1237 in #10773
[router] fix logger type mismatch (#10774) by @CatherineSue in #10774
[router] remove pd router draining channel (#10767) by @slin1237 in #10767
[router] refactor router and worker management 4/n (#10756) by @slin1237 in #10756
[router] refactor router and worker management 3/n (#10727) by @slin1237 in #10727
bugfix: Fix get_worker_urls_for_model in http/router.rs (#10754) by @CatherineSue in #10754
[2/2] Support deterministic inference for temperature > 0 (#10678) by @Qiaolin-Yu in #10678
[Router]fix: fix get_load missing api_key (#10385) by @jinmingyi1998 in #10385
[router] refactor router and worker management 2.5/n (#10677) by @slin1237 in #10677
router-spec: Reorder ChatCompletionRequest and fix validation logic (#10675) by @CatherineSue in #10675
[router] refactor router and worker management 2/n (#10666) by @slin1237 in #10666
[router] refactor router and worker management 1/n (#10664) by @slin1237 in #10664
[router] preserve order of json params using preserve_order feature (#10661) by @fgebhart in #10661
[router] refactor worker to builder pattern 5/n (#10653) by @slin1237 in #10653
[router] refactor worker to builder pattern 4/n (#10650) by @slin1237 in #10650
[router] refactor worker to builder pattern 3/n (#10647) by @slin1237 in #10647
[router] refactor worker to builder pattern 2/n (#10633) by @slin1237 in #10633
[router] refactor worker to builder pattern 1/n (#10628) by @slin1237 in #10628
adjust import setuptools_rust (#10524) by @whybeyoung in #10524
[router] fix router manager and router init in server (#10499) by @CatherineSue in #10499
[router] add router db connector for responses api (#10487) by @slin1237 in #10487
[router] fix worker registration in multi model mode (#10486) by @CatherineSue in #10486
[router] multi model registration fix (#10481) by @CatherineSue in #10481
[bugfix] fix typo (#10471) by @1195343015 in #10471
[router] minor code clean up in server startup (#10470) by @CatherineSue in #10470
[router] fix logger ordering git ctx (#10457) by @CatherineSue in #10457
[router] add dependency for router (#10401) by @ooapex in #10401
[router] fix service discovery and mcp ut (#10449) by @slin1237 in #10449
[router]: Add Embedding routing logic (#10129) by @tao12345666333 in #10129
[router] add not implemented functions for multi model trait (#10394) by @slin1237 in #10394
[router] Add get and cancel method for response api (#10387) by @key4ng in #10387
[router] allow one router to support different model families and serving mode (#10244) by @slin1237 in #10244
[router] enable sccache in ci and local build (#10099) by @slin1237 in #10099
[router] Add Rerank Routing Logic in Regular Router (#10219) by @fangjian601 in #10219
Implement Standalone gRPC Server for SGLang Python Scheduler (#10283) by @CatherineSue in #10283
[router] Basic OAI Response api (#10346) by @key4ng in #10346
[router][ci] Add gpu utilization analyze with nvml (#10345) by @key4ng in #10345
[router][ci] add gpu process check and free port before start server (#10338) by @key4ng in #10338
[router] Add OpenAI backend support - core function (#10254) by @key4ng in #10254
[router] add benchmark for regular router and pd router (#10280) by @key4ng in #10280
[router] Add PD router mmlu test (#10256) by @key4ng in #10256
[router] Improve the router e2e tests (#10102) by @key4ng in #10102
[router] Introduce router integration tests (#10086) by @key4ng in #10086
[router] move to mcp sdk instead (#10057) by @slin1237 in #10057
[router] add rust cache in benchmark ci (#10080) by @slin1237 in #10080
[router] add rust cache for rust unit test (#10079) by @key4ng in #10079
[router] add py binding unit tests to coverage 80% (#10043) by @key4ng in #10043
Simplify Router arguments passing and build it in docker image (#9964) by @hnyls2002 in #9964
[router] fix grpc connection mode detection (#9999) by @slin1237 in #9999
[router] clean up dependency injector to use ctx (#10000) by @slin1237 in #10000
[router] move tokenizer, reasoning, tool initialization to server (#9996) by @slin1237 in #9996
[router] add chat_template_kwargs in ChatCompletionRequest (#9958) by @tonyluj in #9958
[router] Add Rerank API Specification (#9906) by @fangjian601 in #9906
Grpc client (#9939) by @CatherineSue in #9939
[router] include rust benchamrks (#9932) by @slin1237 in #9932
[router] fix FunctionCallResponse proto, support arguments is null (#9875) by @Bruce-x-1997 in #9875
[router] add grpc pd and regular router init (#9893) by @CatherineSue in #9893
[router] Fix short timeout for the prefill client (#9803) by @LukasBluebaum in #9803
[router] add tokenizer download support from hf hub (#9882) by @CatherineSue in #9882
[router] global tool parser registry (#9840) by @CatherineSue in #9840
Tool parser.benchmark (#9835) by @CatherineSue in #9835
[router] add reasoning parser readme (#9837) by @slin1237 in #9837
[router] grpc router bootstraps (#9759) by @slin1237 in #9759
[router] add llama3.2 multi json streaming parser (#9735) by @slin1237 in #9735
[router] additional llama32 parser unit test and multi json support (#9732) by @slin1237 in #9732
[router] additional pythonic parser unit test (#9730) by @slin1237 in #9730
[router] Add MCP Tool Handler (#9615) by @key4ng in #9615
[router] fix error response in pd_router (#9505) by @Bruce-x-1997 in #9505
[router] add gpt-oss and glm4 tool parser (#9703) by @slin1237 in #9703
[router] add kimi-k2 tool parser (#9702) by @slin1237 in #9702
[router] add step3 tool parser (#9695) by @slin1237 in #9695
[router] add deepseek tool parser (#9694) by @slin1237 in #9694
[router] restructure tool parser module folder (#9693) by @slin1237 in #9693
[router] add token bucket rate limiter (#9656) by @CatherineSue in #9656
[router] address worker load tracking consistency (#9523) by @slin1237 in #9523
Fix lint for router (#9636) by @hebiao064 in #9636
[router] add ut for mistral, llama, pythonic, and streaming tool parser (#9632) by @slin1237 in #9632
[router] add llama tool parser (#9629) by @slin1237 in #9629
[router] add pythonic parser (#9628) by @slin1237 in #9628
[router] add qwen tool parser (#9623) by @slin1237 in #9623
[router] add mistral tool parser (#9622) by @slin1237 in #9622
[Doc] add LWS(LeaderWorkerSet) use case in sgl-router README (#9568) by @Bruce-x-1997 in #9568
[router] add right rustls dependency in sgl-router cargo.toml (#9498) by @Bruce-x-1997 in #9498
[router] ignore client error when record failure in pd_router (#9503) by @Bruce-x-1997 in #9503
[router] Move all protocols to spec.rs file (#9519) by @key4ng in #9519
[router] add json tool parser (#9516) by @slin1237 in #9516
[router] tokenizer arch doc (#9513) by @slin1237 in #9513
[router] fix router load guard tracking for streaming (#9491) by @slin1237 in #9491
[router] add tool parser base structure and partial json parser (#9482) by @CatherineSue in #9482
[router] remove all tokenizer metrics for performance (#9474) by @CatherineSue in #9474
[router] add tokenizer benchmark (#9427) by @slin1237 in #9427
[router] add glm and step3 reasoning parser (#9415) by @CatherineSue in #9415
[router] add tokenizer integration test with real mini tokenizer (#9413) by @CatherineSue in #9413
[router] Add IGW (Inference Gateway) Feature Flag (#9371) by @key4ng in #9371
[router] Implement OpenAI Responses API specification (#9367) by @key4ng in #9367
[router] add tokenizer chat template support (#9370) by @slin1237 in #9370
[router] Implement gRPC SGLangSchedulerClient (#9364) by @CatherineSue in #9364
[router] adds reasoning parser pooling and thread-safe (#9360) by @slin1237 in #9360
[Router] Add validation module for API parameters (#9335) by @key4ng in #9335
[router] add tiktokenizer and sequence in router (#9354) by @slin1237 in #9354
[router] add dsr1, kimi, and qwen reasoning parser (#9353) by @slin1237 in #9353
[router]restructure protocol modules for better organization (#9321) by @key4ng in #9321
[router] Add spec for sglang scheduler (#9322) by @CatherineSue in #9322
[router] add reasoning parser base structure (#9310) by @slin1237 in #9310
[router] add tokenizer metrics (#9307) by @slin1237 in #9307
[router] tokenizer factory, hf tokenizer, and stop sequence detector (#9293) by @slin1237 in #9293
[router] introducing tokenizer trait (#9287) by @slin1237 in #9287
[router] introduce prefill response draining for http compliance (#9281) by @slin1237 in #9281
[router] add cargo clippy in CI and fix-up linting errors (#9242) by @jeffdn in #9242
[router] fix pd prefill http request complinace issue (#9237) by @slin1237 in #9237
[router] preserve original worker response header in router (#9236) by @slin1237 in #9236
[router] clean up lint warnings with clippy execution (#9201) by @jeffdn in #9201
[router] allow more health check configuration (#9198) by @slin1237 in #9198
[router] optimize Rust compilation and development workflow (#9133) by @slin1237 in #9133
router: Fix user guide link README.md (#9122) by @CatherineSue in #9122
[CI] migrate router to BM.A10.4 runner (#8992) by @key4ng in #8992
[router] Add Rust Binary Entrypoint for SGLang Router (#9089) by @slin1237 in #9089
refactor(pd-router): extract common patterns to reduce code duplication (#9081) by @slin1237 in #9081
[pd-router] add retry and circuit breakfor for pd router (#9051) by @slin1237 in #9051
[router] regular router circuit breaker (#8997) by @slin1237 in #8997
[router] update pyo3 version to 0.25.1 (#9022) by @slin1237 in #9022
[router] upgrade kube version to latest (#9018) by @slin1237 in #9018
[router] upgrade rand to latest version (#9017) by @slin1237 in #9017
[router] fix radix tree integration issues in PD router (#8982) by @slin1237 in #8982
[router] reduce contention, fix double-count race (#8978) by @slin1237 in #8978
[router] add metrics for worker and policy (#8971) by @tonyluj in #8971
[router] harden retries + metrics; fix streaming load; header filtering (#8972) by @slin1237 in #8972
[router] router circuit breaker core (#8941) by @slin1237 in #8941
[router] dedicated prefill HTTP client and request-path optimizations (#8923) by @slin1237 in #8923

New Contributors

@CatherineSue made their first contribution in ad359d1c7
@key4ng made their first contribution in 4093d460c
@Bruce-x-1997 made their first contribution in 446c8e4cd
@tonyluj made their first contribution in 36bfddecb
@jeffdn made their first contribution in d7e38b2f6
@fangjian601 made their first contribution in 788b19a53
@zRzRzRzRzRzRzR made their first contribution in abb678157
@whybeyoung made their first contribution in 0abb41c70
@xuwenyihust made their first contribution in 9b5efe346
@hebiao064 made their first contribution in cbc0e4d77
@Qiaolin-Yu made their first contribution in e2ac7888b
@ooapex made their first contribution in 957482c8f
@lun-4 made their first contribution in c3faf2d6e
@LukasBluebaum made their first contribution in 9d9fa9a53
@Jonahcb made their first contribution in f4aa78801
@tao12345666333 made their first contribution in f9ee6ae17
@jinmingyi1998 made their first contribution in 56321e9fc
@1195343015 made their first contribution in 57234d0c9
@fzyzcjy made their first contribution in d957177a2
@fgebhart made their first contribution in 68cdc1893
@antoine-roux made their first contribution in ec1cd90ac
@ai-jz made their first contribution in 9cc1e065f

Paths Included

sgl-router
python/sglang/srt/grpc
python/sglang/srt/entrypoints/grpc_server.py

Full Changelog: gateway-v0.1.9...gateway-v0.2.0
