InternLM/lmdeploy v0.10.1 on GitHub

What's Changed

Add ROCm support: installation guide and FlashAttention compatibility for AMD GPUs by @Vivicai1005 in #3925
support gpt-oss basic output by @irexyc in #3956
Add FP8*(B)F16 GEMM by @lzhangzz in #3960
Support GLM-4.5 by @CUHKSZzxy in #3863
[Refactor]: Remove tokenizer when building engine by @RunningLeon in #3978
Support InternVL3.5-Flash by @CUHKSZzxy in #3952
support gpt-oss function/reasoning in /v1/chat/completions by @irexyc in #3962
support returning stop_str in output by @lvhan028 in #3984
Support SDAR by @grimoire in #3922

fix bugs with Triton 3.4.0 by @grimoire in #3946
fix longrope by @grimoire in #3968
Fix TurboMind RL usage in XTuner by @irexyc in #3912
Disable prefix caching when serving a VLM model by @lvhan028 in #3990
remove NCCL_LAUNCH_MODE by @irexyc in #3994
return the last token's logprobs, logits and last_hidden_states if include_stop_str_in_output is requested by @lvhan028 in #4000
[Fix] device args in chat cli when using pytorch engine by @CyCle1024 in #3999
fix internvl by @CUHKSZzxy in #3997
fix not-returned iterator in SequenceManager::Erase by @irexyc in #4001
fix cudagraph without warmup by @grimoire in #4005
fix InternVL Flash long-context accuracy by @CUHKSZzxy in #4003

Full Changelog: v0.10.0...v0.10.1