What's Changed
🚀 Features
- [Ascend] support qwen3.5 35BA3B by @wanfengcxz in #4485
- feat: Add TurboQuant (quant_policy=42) support for KV Cache Quantization by @windreamer in #4510
- [refactor] [api_server] [2/N] improve tool parsers by abstracting xml parser by @lvhan028 in #4548
- feat(turbomind): integrate cublasGemmGroupedBatchedEx for Qwen3.5 MoE inference on Blackwell GPUs with memory copy optimizations by @hd9568 in #4490
- feat: add Anthropic-compatible serving endpoints by @lvhan028 in #4538
- Support InternS2 Preview by @CUHKSZzxy in #4575
💥 Improvements
- lmdeploy support kernel block size by @Tsundoku958 in #4421
- Reject requests on stale session or sleeping engine by @lvhan028 in #4496
- Add modern logging utils by @lzhangzz in #4486
- refine dlinfer update_weights by @yao-fengchen in #4519
- feat(serve): expose repetition n-gram params on OpenAI routes by @lvhan028 in #4522
- Refactor step inputs by @grimoire in #4504
- fix lite module for transformers>=5.0 by @43758726 in #4488
- [refactor] [api_server] [1/N] Improve reasoning and tool-call parsers by @lvhan028 in #4468
- fix: prevent prefill starvation under high decode load by @grimoire in #4532
- Mixed modality by @CUHKSZzxy in #4531
- optimize get_sorted_idx in moe by @grimoire in #4529
- Map user-input session_id to internal session_id to maintain session identity by @lvhan028 in #4523
- support more message item types by @CUHKSZzxy in #4501
- add explicit trust_remote_code controls to resolve the security issue by @lvhan028 in #4511
🐞 Bug fixes
- [ascend] fix prefix caching by @yao-fengchen in #4448
- fix update params by @CUHKSZzxy in #4514
- fix ray mem leak by @grimoire in #4487
- Fix mtp by @RunningLeon in #4517
- fix kernel-block-size by @grimoire in #4521
- fix: use `is not None` check for seed to prevent seed=0 being silently ignored by @kuishou68 in #4526
- Fix qwen35 dp by @grimoire in #4535
- Fix mtp for rl by @RunningLeon in #4520
- cancel request and block new inputs when sleeping by @grimoire in #4541
- Fix mp engine by @RunningLeon in #4540
- Fix cache sizing and cache block layout edge cases by @grimoire in #4552
- Fix qwen3.5-moe mtp with tp>1 by @RunningLeon in #4568
- block_offsets padding 0 by @grimoire in #4569
- hotfix: resolve test issues for v0.13.0 by @lvhan028 in #4571
- ResponseParser forget to strip tag in non-stream mode by @lvhan028 in #4576
- yield error when prompt processing suffers exception by @lvhan028 in #4574
- Fix the reprefill of evicted seqs with invalid draft tokens by @RunningLeon in #4564
- Support mtp fp8 by @RunningLeon in #4572
🌐 Other
- Use env LMDEPLOY_FP32_MAMBA_SSM_DTYPE to control the dtype of recurrent state by @lvhan028 in #4518
- add tool and reasoning test by @littlegy in #4388
- update h config and add glm4.7 mtp test by @littlegy in #4424
- [ci] change test whl into python 312 and use test images by @zhulinJulia24 in #4513
- [Misc] fix typos in turbomind.py and model.py by @ZhijunLStudio in #4543
- [Misc] fix mutable default arguments by @ZhijunLStudio in #4544
- Add docker/Dockerfile_patch; minor tweaks in messages.py and setup.py. by @lvhan028 in #4546
- remove barely used skills and checkin docker-build skill by @lvhan028 in #4560
- bump version to v0.13.0 by @lvhan028 in #4549
New Contributors
- @kuishou68 made their first contribution in #4526
- @ZhijunLStudio made their first contribution in #4543
- @hd9568 made their first contribution in #4490
Full Changelog: v0.12.3...v0.13.0