What's Changed
🚀 Features
- support glm5 by @grimoire in #4355
- Online fp8 quantization for Qwen/InternLM/Llama dense and MoE models by @43758726 in #4324
- Qwen3.5 by @grimoire in #4351
- GLM-4.7-Flash Turbomind support by @lapy in #4362
- Support router replay and ignore quant layer for qwen3.5 by @RunningLeon in #4394
- [Feature] Add TurboMind support for Qwen3.5 models (dense + MoE) by @lapy in #4389
- support repetition ngram logits processor by @grimoire in #4288
💥 Improvements
- Compatible with transformers 5.0 at TurboMind side by @lvhan028 in #4304
- Support fp32 head for qwen and internlm models by @RunningLeon in #4160
- Reduce MLA kv-cache memory by @lzhangzz in #4373
- add recurrent_gated_delta_rule kernel by @grimoire in #4376
- [ascend] adapt for s1-pro dp*tp+ep by @yao-fengchen in #4380
- Support glm4.7 with mtp by @RunningLeon in #4346
- Faster MLA kernels by @lzhangzz in #4391
- Attention kernel self-registration and decoupled dispatching by @lzhangzz in #4396
🐞 Bug fixes
- fix: change debug log from ERROR to DEBUG in RepetitionPenaltyKernel by @murray-macdonald in #4363
- Fix quant config parsing for internvl awq model by @RunningLeon in #4369
- Fix XGrammar bitmask initialization and add null check for gen_config in generate method by @windreamer in #4349
- fix the logic of closing session by @lvhan028 in #4370
- Fix authorization by @lvhan028 in #4338
- Fix some minor issues and provide tests for Pipeline by @windreamer in #4365
- fix dllm mask on set_step by @grimoire in #4278
- fix models for transformers>=5 by @grimoire in #4381
- fix exception when aborting a request by @lvhan028 in #4403
- fix inference crash on V100 with qwen3.5-0.8b by @lvhan028 in #4420
🌐 Other
- ci(lint): skip flaky deadlink test for python wiki page by @windreamer in #4357
- fix fa3 install by @irexyc in #4361
- fix lint by @windreamer in #4375
- upgrade triton and torch by @grimoire in #4379
- Add speculative decoding test by @littlegy in #4377
- ci: integrate clang-format lint into pre-commit hooks by @windreamer in #4390
- Update dockerfile by removing cu11 and changing cu12.4 to cu12.6 by @lvhan028 in #4398
- manually build dev image instead of publishing it every version by @lvhan028 in #4409
- bump version to v0.12.2 by @lvhan028 in #4378
New Contributors
- @murray-macdonald made their first contribution in #4363
- @lapy made their first contribution in #4362
Full Changelog: v0.12.1...v0.12.2