What's Changed
🚀 Features
- Add Gloo communication to turbomind by @irexyc in #3362
- [Feat] Support llm-compressor AWQ models in TurboMind by @43758726 in #4290
- Router replay for gpt oss by @RunningLeon in #4298
- Support llm-compressor symmetric quantized model inference in TurboMind by @43758726 in #4305
- Support Intern-S1-Pro by @CUHKSZzxy in #4318
💥 Improvements
- Configurable max CTAs and NVLS usage for CUDA IPC communicator by @lzhangzz in #4227
- Improve aborting all sessions by @lvhan028 in #4215
- Moe Reduce kernel by @grimoire in #4228
- Refactor attn by @grimoire in #4238
- Optimize exception raising and error process by @grimoire in #4236
- [AsyncEngine Refactor 1/N] define MultimodalProcessor to handle multimodal data processing by @lvhan028 in #4250
- [AsyncEngine Refactor 2/N] Remove deprecates from chat template by @lvhan028 in #4252
- Configurable uvicorn timeout by @CUHKSZzxy in #4255
- Adapt to dlslime v0.0.2 by @JimyMa in #4242
- [Fix] fix quant calibration dataset by @43758726 in #4256
- lmdeploy support parallel embedding by @Tsundoku958 in #4192
- Refactor turbomind engine by @lzhangzz in #4223
- Refactor Engine & ModelAgent interact by @grimoire in #4265
- Support sleep and destroy deepep buffer by @RunningLeon in #4246
- add yarn truncate by @grimoire in #4301
- [AsyncEngine Refactor 3/N] Introduce Session and SessionManager by @lvhan028 in #4253
- Add warning about NCCL 2.27 memory leaks by @lzhangzz in #4313
🐞 Bug fixes
- Fix fope cos/sin coef device type by @CUHKSZzxy in #4240
- Fix include_stop_str_in_output with output_logits Exception by @windreamer in #4244
- fix logit softcapping is None by @grimoire in #4247
- Fix performance regression for prefix caching by @lzhangzz in #4270
- convert float16 weight to bfloat16 for FP8 models by @lvhan028 in #4276
- [ascend] fix dp multinode rank_table mapping by @tangzhiyi11 in #4268
- [Fix] move calibrate load dataset location by @43758726 in #4289
- fix ignore-eos by @grimoire in #4282
- fix MPEngine poll by @grimoire in #4287
- Fix prefix caching by @lzhangzz in #4292
- Fix gemma chat template by @lvhan028 in #4280
- Fix scheduler metrics by @lzhangzz in #4294
- Fix NVLS init for mixed DP+TP by @lzhangzz in #4296
- [side-effect] The tool message dump is incomplete by @lvhan028 in #4299
- Fix mla with spec tokens by @RunningLeon in #4302
- fix stop long context by @grimoire in #4309
- fix crash on client disconnect (Ctrl+C) by @lvhan028 in #4308
- Ensure the pipe benchmark uses kwargs when calling pipe.stream_infer by @lvhan028 in #4312
- fix get_ppl for long context by @lvhan028 in #4314
- fix sleep engine for dp=1 by @RunningLeon in #4315
🌐 Other
- [ci] fix fail testcase and add generate testcase in pr test by @zhulinJulia24 in #4231
- Pin nvshmem version by @CUHKSZzxy in #4257
- fix: Pin timm version to avoid failed tests by @windreamer in #4258
- docs: add generated openapi spec documentation by @windreamer in #4251
- fix: get rid of buggy timm-1.0.23 by @windreamer in #4260
- [ascend] fix paged prefill by @tangzhiyi11 in #4254
- Fix ascend/maca/camb runtime_requirements by @jinminxi104 in #4262
- docs: refine the documents by @windreamer in #4259
- docs: add cli docs by @windreamer in #4264
- Drop support for Python 3.9 as it has reached end-of-life by @lvhan028 in #4281
- bump version to v0.12.0 by @lvhan028 in #4300
New Contributors
Full Changelog: v0.11.1...v0.12.0