InternLM/lmdeploy v0.12.0 on GitHub

What's Changed

Add Gloo communication to turbomind by @irexyc in #3362
[Feat] Support llm-compressor AWQ models in TurboMind by @43758726 in #4290
Router replay for gpt oss by @RunningLeon in #4298
Support llm-compressor symmetric quantized model inference in TurboMind by @43758726 in #4305
Support Intern-S1-Pro by @CUHKSZzxy in #4318

Configurable max CTAs and NVLS usage for CUDA IPC communicator by @lzhangzz in #4227
Improve aborting all sessions by @lvhan028 in #4215
MoE reduce kernel by @grimoire in #4228
Refactor attn by @grimoire in #4238
Optimize exception raising and error process by @grimoire in #4236
[AsyncEngine Refactor 1/N] define MultimodalProcessor to handle multimodal data processing by @lvhan028 in #4250
[AsyncEngine Refactor 2/N] Remove deprecates from chat template by @lvhan028 in #4252
Configurable uvicorn timeout by @CUHKSZzxy in #4255
Adapt to dlslime v0.0.2 by @JimyMa in #4242
[Fix] fix quant calibration dataset by @43758726 in #4256
lmdeploy support parallel embedding by @Tsundoku958 in #4192
Refactor turbomind engine by @lzhangzz in #4223
Refactor Engine & ModelAgent interact by @grimoire in #4265
Support sleep and destroy deepep buffer by @RunningLeon in #4246
add yarn truncate by @grimoire in #4301
[AsyncEngine Refactor 3/N] Introduce Session and SessionManager by @lvhan028 in #4253
Add warning about NCCL 2.27 memory leaks by @lzhangzz in #4313

Fix fope cos/sin coef device type by @CUHKSZzxy in #4240
Fix include_stop_str_in_output with output_logits Exception by @windreamer in #4244
fix logit softcapping is None by @grimoire in #4247
Fix performance regression for prefix caching by @lzhangzz in #4270
convert float16 weight to bfloat16 for FP8 models by @lvhan028 in #4276
[ascend] fix dp multinode rank_table mapping by @tangzhiyi11 in #4268
[Fix] move calibrate load dataset location by @43758726 in #4289
fix ignore-eos by @grimoire in #4282
fix MPEngine poll by @grimoire in #4287
Fix prefix caching by @lzhangzz in #4292
Fix gemma chat template by @lvhan028 in #4280
Fix scheduler metrics by @lzhangzz in #4294
Fix NVLS init for mixed DP+TP by @lzhangzz in #4296
[side-effect] The tool message dump is incomplete by @lvhan028 in #4299
Fix mla with spec tokens by @RunningLeon in #4302
fix stop long context by @grimoire in #4309
fix crash on client disconnect (Ctrl+C) by @lvhan028 in #4308
Ensure the pipe benchmark uses kwargs when calling pipe.stream_infer by @lvhan028 in #4312
fix get_ppl for long context by @lvhan028 in #4314
fix sleep engine for dp=1 by @RunningLeon in #4315

[ci] fix fail testcase and add generate testcase in pr test by @zhulinJulia24 in #4231
Pin nvshmem version by @CUHKSZzxy in #4257
fix: Pin timm version to avoid failed tests by @windreamer in #4258
docs: add generated openapi spec documentation by @windreamer in #4251
fix: get rid of buggy timm-1.0.23 by @windreamer in #4260
[ascend] fix paged prefill by @tangzhiyi11 in #4254
Fix ascend/maca/camb runtime_requirements by @jinminxi104 in #4262
docs: refine the documents by @windreamer in #4259
docs: add cli docs by @windreamer in #4264
Drop support for Python 3.9 as it has reached end-of-life by @lvhan028 in #4281
bump version to v0.12.0 by @lvhan028 in #4300
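
Because this release drops Python 3.9, an upgrade script may want to check the interpreter version before installing. The following is a minimal sketch, not part of lmdeploy: it assumes Python 3.10 is the new minimum (the notes only state that 3.9 is dropped), and the helper name `python_supported` is hypothetical.

```python
import sys

# Assumption: with Python 3.9 dropped, 3.10 is taken as the lowest supported version.
MIN_SUPPORTED = (3, 10)


def python_supported(version_info=sys.version_info, minimum=MIN_SUPPORTED):
    """Return True if the (major, minor) version meets the assumed minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    status = "ok" if python_supported() else "too old for v0.12.0 (assumed minimum 3.10)"
    print(f"Python {major}.{minor}: {status}")
```

Passing the version as an argument keeps the check testable without depending on the interpreter that runs it.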

Full Changelog: v0.11.1...v0.12.0