What's Changed
🚀 Features
- add /generate api by @irexyc in #4019
- Guided decoding with xgrammar for TurboMind by @windreamer in #3965
- Reimplement guided decoding with xgrammar for PyTorch Engine by @windreamer in #4028
💥 Improvements
- [ascend] support aclgraph by @yao-fengchen in #4063
- Leverage incremental output between the inference and async engines to improve performance by @lvhan028 in #4054
- Optimize multinomial sampling by @grimoire in #4056
🐞 Bug fixes
- Restrict zmqrpc to localhost only by @grimoire in #4017
- Fix dp+tp warmup bug by @Tsundoku958 in #3991
- fix dllm long-context by @grimoire in #4012
- Fix GPT-OSS streaming tool call parsing by @QwertyJack in #4023
- Move resource release from async_engine to the inference engine by @lvhan028 in #4041
- Fix tokenizer parsing bug for guided decoding by @windreamer in #4044
- Fix message content field handling for tool calls and multimodal input by @QwertyJack in #4029
- fix builder for kimi-k2 by @CUHKSZzxy in #4069
- Skip unnecessary sampling and fix the random offset by @grimoire in #4068
- fix duplicated stop_token_string when ignore_special_tokens is False by @irexyc in #4077
🌐 Other
- Drop CUDA 11.8 build support, upgrade CI/CD to CUDA 12.6/12.8 by @windreamer in #4013
- remove profile_generation.py and its testcases by @lvhan028 in #4027
- [ci] refactor eval into api eval and add h800 eval workflow by @zhulinJulia24 in #4008
- Add Docker image for NVIDIA Jetson by @windreamer in #3834
- [ci] refactor api evaluate test into llm judger evaluation by @littlegy in #4046
- Check color logger by @grimoire in #4060
- Update API testing with HLE and LCB datasets by @littlegy in #4061
- update ascend requirements by @yao-fengchen in #4066
- bump version to v0.10.2 by @lvhan028 in #4062
Full Changelog: v0.10.1...v0.10.2