What's Changed
🚀 Features
- support glm-4.7-flash by @RunningLeon in #4320
- [ascend] support ep by @yao-fengchen in #3696
💥 Improvements
- fix rotary embedding for transformers v5 by @grimoire in #4303
- Improve metrics log by @CUHKSZzxy in #4297
- Support ignore layers in quant config for qwen3 models by @RunningLeon in #4293
- add custom noaux kernel by @grimoire in #4345
- fix qwen3vl with transformers v5 by @grimoire in #4348
🐞 Bug fixes
- fix tool call parser's streaming cursor by @lvhan028 in #4333
- Fix data race for guided decoding in TP mode by @lzhangzz in #4341
- fa3 check by @grimoire in #4340
- Fix time series preprocess by @CUHKSZzxy in #4339
- Fix negative KV sequence length error in Attention op by @jinminxi104 in #4316
- fix qwen3-vl-moe long context by @grimoire in #4342
- fix: move quantized norm to CPU instead of stale q_linear reference in smooth_quant by @Mr-Neutr0n in #4352
- update noaux-kernel check by @grimoire in #4358
🌐 Other
- change INPUT_CUDA_VERSION to 12.6.2 by @lvhan028 in #4322
- add Qwen3-8B accuracy evaluation in llm_compressor.md by @43758726 in #4319
- [ci] refactor ete testcase by @zhulinJulia24 in #4274
- Set alias interns1_1 for interns1_pro by @lvhan028 in #4334
- build(docker): skip FA2 when use cu13 by @windreamer in #4356
- bump version to v0.12.1 by @lvhan028 in #4350
New Contributors
- @Mr-Neutr0n made their first contribution in #4352
Full Changelog: v0.12.0...v0.12.1