What's Changed
🚀 Features
- FP8 kv cache quantization by @CUHKSZzxy in #4563
💥 Improvements
- Update turbomind modeling infrastructure by @lzhangzz in #4557
- refactor(turbomind): consolidate CUDA error handling and add manual stacktracing by @lzhangzz in #4565
- Add Qwen3.5 Moe lite awq by @43758726 in #4561
- [Improve]: Drain queues when putting the engine to sleep by @RunningLeon in #4577
- Extend chat completions by introducing token-in/out and returning routed experts by @lvhan028 in #4593
- Follow OpenAI's spec to add "AllowedToolChoice" and report 400 when request parsing fails by @lvhan028 in #4585
- Improve health endpoint by @lvhan028 in #4615
- Remove state init by @grimoire in #4604
- Include spec stats in metrics by @RunningLeon in #4625
🐞 Bug fixes
- fix the anthropic adapter by @lvhan028 in #4578
- Fix Structured Output for GPT-OSS Models by @windreamer in #4386
- Allow W8A8Linear to accept dtype during initialization instead of hard-coding it by @43758726 in #4586
- fix: compact split multimodal tensors by @CUHKSZzxy in #4583
- Fix legacy VLM preprocessors for normalized image data by @CUHKSZzxy in #4584
- fix Dockerfile that was missing common.txt by @lvhan028 in #4608
- fix: enable FA3 for SM80+ GPUs and fix CUDA version comparison by @windreamer in #4591
- flatten_kv_cache zero padding by @grimoire in #4613
- align streaming usage chunks with OpenAI spec by @lvhan028 in #4616
- fix(vl): reduce multimodal feature memory use by @CUHKSZzxy in #4603
- fix memory leak when input contains large image data by @grimoire in #4610
- fix(turbomind): map Intern-S1 HF checkpoint keys by @lvhan028 in #4617
- fix(serve): emit all stream_chunk deltas to fix concurrent tool-call streaming by @lvhan028 in #4622
- fix cp inference by @irexyc in #4619
- refactor(serve): avoid per-request tokenizer work in parsers by @lvhan028 in #4633
- Bring MixtralForCausalLM back to Turbomind by @43758726 in #4623
- fix model loading on windows by @irexyc in #4626
🌐 Other
- chore: gate request logs behind request level by @CUHKSZzxy in #4581
- add missing rdkit for intern-s models by @lvhan028 in #4587
- extract common deps into requirements/common.txt by @lvhan028 in #4595
- Remove stale CLI arg in vlmevalkit docs by @CUHKSZzxy in #4598
- log response for debugging by @lvhan028 in #4592
- cancel in-progress runs when PR is updated or merged by @lvhan028 in #4609
- TEST: update qwen3.5 397b test by @littlegy in #4607
- TEST: update video test by @littlegy in #4606
Full Changelog: v0.13.0...0.14.0a1