What's Changed
🚀 Features
- add endpoint /abort_request by @lvhan028 in #4092
- Support Qwen3-Next by @grimoire in #4039
- Support Qwen3-VL by @CUHKSZzxy in #4093
- Support sync weights with flattened bucket tensor by @RunningLeon in #4109
- Support group router for MoE models by @RunningLeon in #4120
- [Feature]: return routed experts to reuse by @RunningLeon in #4090
- support context parallel by @irexyc in #3951
- fope by @grimoire in #4043
- [Feature]: Support speculative decoding by @RunningLeon in #3945
- MoE bf16 EP by @grimoire in #4144
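The new `/abort_request` endpoint (#4092) lets a client cancel an in-flight generation. A minimal sketch of building such a call, assuming the server's common default port 23333 and a JSON body keyed by `session_id` — both the port and the payload field name are assumptions here, so check the api_server documentation for the exact schema:

```python
import json

# Build (but do not send) an abort call for the /abort_request endpoint.
# NOTE: the host/port and the "session_id" field are assumptions, not
# taken from the release notes -- verify against the api_server docs.
API_BASE = "http://localhost:23333"  # assumed default api_server address

def build_abort_request(session_id: int) -> tuple:
    """Return the URL and JSON body for aborting one session's request."""
    url = f"{API_BASE}/abort_request"
    body = json.dumps({"session_id": session_id}).encode("utf-8")
    return url, body

url, body = build_abort_request(7)
# A client would POST `body` to `url`, e.g. with urllib.request or requests.
```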
💥 Improvements
- Enlarge gc threshold by @grimoire in #4076
- remove num_tokens from EngineOutput by @lvhan028 in #4088
- revert masking vocab_size by @lvhan028 in #4089
- feat: add json_object support in response_format by @windreamer in #4080
- support image_data input to /generate endpoint by @irexyc in #4086
- [Fix] all RayEngineWorker actors created at node 0 in RL training by @CyCle1024 in #4107
- Optimize sleep level=1 for turbomind backend by @irexyc in #4074
- [Feat] enable ascend update_params by @CyCle1024 in #4111
- Enhance request checker by @lvhan028 in #4104
- Refactor dp tp by @grimoire in #4004
- fix kernel numerical error by @grimoire in #4133
- free ray put by @grimoire in #4137
- Reduce experts cache when resize by @RunningLeon in #4138
- support interleave text and image in messages by @lvhan028 in #4141
- optimize rms norm by @grimoire in #4153
- fix evict policy by @Tsundoku958 in #4127
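Among the improvements above, #4141 adds support for interleaving text and images within a single message. A sketch of what such a payload can look like, using the OpenAI-style content-part format that compatible chat endpoints accept — the model name is a placeholder and the base64 image data is elided:

```python
import json

# An interleaved text-and-image chat payload in OpenAI-style content parts.
# "Qwen3-VL" is a placeholder model name; the data URLs are truncated stubs.
payload = {
    "model": "Qwen3-VL",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the first image."},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64,..."}},
                {"type": "text", "text": "Now compare it with this one:"},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64,..."}},
            ],
        }
    ],
}
body = json.dumps(payload)  # what a client would POST to the chat endpoint
```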
🐞 Bug fixes
- fix type hint by @grimoire in #4078
- Fix inputs split by @RunningLeon in #4083
- add missing update_model_meta by @jinminxi104 in #4099
- Fix update_params for pytorch backend when loading vl model by @irexyc in #4101
- workaround for issue "TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'" by @lvhan028 in #4103
- fix bug: schedule ratio support prefix-caching by @Tsundoku958 in #4100
- remove prefill free ratio threshold by @grimoire in #4110
- fix key error: api_server node might be removed by @lvhan028 in #4112
- Fix requests being incorrectly judged as bad requests by @lvhan028 in #4121
- fix dist config keys by @grimoire in #4125
- Fix proxy server missing media_type in streaming mode by @lvhan028 in #4130
- Fix logprobs to_tensor by @RunningLeon in #4132
- Fix cli help by @RunningLeon in #4139
- fix and optimize fill_kv_cache_quant by @grimoire in #4140
- fix: fix package deprecation introduced by CUDA 13 by @windreamer in #4117
- yield an empty list for token_ids when it runs out of tokens by @lvhan028 in #4148
- Fix interns1 routed experts outputs by @RunningLeon in #4149
- fix qwen3-30-a3b lcb-code score by @yao-fengchen in #4142
- Fix ep deployment issues by @CUHKSZzxy in #4084
- Fix dllm to not use fa3 decoding by @RunningLeon in #4159
- fix: handle non-tuple decoder outputs during Qwen-2.5 quantization by @chengyuma in #4158
- fix cu11 docker build by @CUHKSZzxy in #4165
- Fix model config by @CUHKSZzxy in #4170
- fix lora by @grimoire in #4172
- fix CMake logic to detect sm70 and sm75 by @tuilakhanh in #4175
📚 Documentations
- Update model evaluation guide by @lvhan028 in #4094
- [Docs]: Add guide for update weights by @RunningLeon in #4151
🌐 Other
- add dockerfile to build dev image by @lvhan028 in #4091
- add ascend_a3 Dockerfile by @yao-fengchen in #4097
- [ci] refactor longtext benchmark by @zhulinJulia24 in #4087
- enable metrics by default by @lvhan028 in #4108
- Replace pynvml with nvidia-ml-py in requirements by @myhloli in #4118
- [ci] add free disk before build test whl package and add session_len args in benchmark script by @zhulinJulia24 in #4136
- Add prefix-cache functionality and performance testing by @littlegy in #4119
- [ci] modify pipeline.close and add more case into pr_test by @zhulinJulia24 in #4150
- bump version to v0.11.0 by @lvhan028 in #4155
New Contributors
- @myhloli made their first contribution in #4118
- @tuilakhanh made their first contribution in #4175
Full Changelog: v0.10.2...v0.11.0