InternLM/lmdeploy v0.10.2 on GitHub

What's Changed

Add /generate API by @irexyc in #4019
Guided decoding with xgrammar for TurboMind by @windreamer in #3965
Reimplement guided decoding with xgrammar for PyTorch Engine by @windreamer in #4028

[ascend] support aclgraph by @yao-fengchen in #4063
Leverage incremental output between the inference and async engines to improve performance by @lvhan028 in #4054
Optimize multinomial sampling by @grimoire in #4056

Restrict zmqrpc to localhost only by @grimoire in #4017
Fix dp+tp warmup bug by @Tsundoku958 in #3991
fix dllm long-context by @grimoire in #4012
Fix GPT-OSS streaming tool call parsing by @QwertyJack in #4023
Move resource release from async_engine to the inference engine by @lvhan028 in #4041
Fix tokenizer parsing bug for guided decoding by @windreamer in #4044
Fix message content field handling for tool calls and multimodal input by @QwertyJack in #4029
fix builder for kimi-k2 by @CUHKSZzxy in #4069
Skip unnecessary sampling and fix the random offset by @grimoire in #4068
fix duplicated stop_token_string when ignore_special_tokens is False by @irexyc in #4077

Drop CUDA 11.8 build support, upgrade CI/CD to CUDA 12.6/12.8 by @windreamer in #4013
remove profile_generation.py and its testcases by @lvhan028 in #4027
[ci] refactor eval into api eval and add h800 eval workflow by @zhulinJulia24 in #4008
Add Docker image for NVIDIA Jetson by @windreamer in #3834
[ci] refactor api evaluate test into llm judger evaluation by @littlegy in #4046
Check color logger by @grimoire in #4060
Update API testing with HLE and LCB datasets by @littlegy in #4061
update ascend requirements by @yao-fengchen in #4066
bump version to v0.10.2 by @lvhan028 in #4062

Full Changelog: v0.10.1...v0.10.2