InternLM/lmdeploy 0.14.0a1 on GitHub

What's Changed

Update turbomind modeling infrastructure by @lzhangzz in #4557
refactor(turbomind): consolidate CUDA error handling and add manual stacktracing by @lzhangzz in #4565
Add Qwen3.5 Moe lite awq by @43758726 in #4561
[Improve]: Drain queues when the engine sleeps by @RunningLeon in #4577
Extend chat completions by introducing token-in/out and returning routed experts by @lvhan028 in #4593
Follow OpenAI's spec to add "AllowedToolChoice" and report 400 when request parsing fails by @lvhan028 in #4585
Improve health endpoint by @lvhan028 in #4615
Remove state init by @grimoire in #4604
Include spec stats in metrics by @RunningLeon in #4625

fix the Anthropic adapter by @lvhan028 in #4578
Fix Structured Output for GPT-OSS Models by @windreamer in #4386
Allow W8A8Linear to accept dtype during initialization instead of hard-coding it by @43758726 in #4586
fix: compact split multimodal tensors by @CUHKSZzxy in #4583
Fix legacy VLM preprocessors for normalized image data by @CUHKSZzxy in #4584
fix Dockerfile that was missing common.txt by @lvhan028 in #4608
fix: enable FA3 for SM80+ GPUs and fix CUDA version comparison by @windreamer in #4591
flatten_kv_cache zero padding by @grimoire in #4613
align streaming usage chunks with OpenAI spec by @lvhan028 in #4616
fix(vl): reduce multimodal feature memory use by @CUHKSZzxy in #4603
fix memory leak when input contains large image data by @grimoire in #4610
fix(turbomind): map Intern-S1 HF checkpoint keys by @lvhan028 in #4617
fix(serve): emit all stream_chunk deltas to fix concurrent tool-call streaming by @lvhan028 in #4622
fix cp inference by @irexyc in #4619
refactor(serve): avoid per-request tokenizer work in parsers by @lvhan028 in #4633
Bring MixtralForCausalLM back to Turbomind by @43758726 in #4623
fix model loading on Windows by @irexyc in #4626

Full Changelog: v0.13.0...0.14.0a1