Highlights
This release features important bug fixes and enhancements for:
- Pixtral models. (#8415, #8425, #8399, #8431)
  - Chunked scheduling has been turned off for vision models. Please replace --max_num_batched_tokens 16384 with --max-model-len 16384.
- Multistep scheduling. (#8417, #7928, #8427)
- Tool use. (#8423, #8366)
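For launch scripts that build the argument list programmatically, the flag swap described for vision models above could be sketched roughly as follows. The helper `migrate_vision_args` and the model name are hypothetical and not part of vLLM. The hyphenated spelling of the old flag is included on the assumption that the CLI accepts both forms. The value is carried over unchanged, as in the note.

```python
from typing import List

# Flag spellings follow the release note; the helper itself is hypothetical.
OLD_FLAGS = ("--max_num_batched_tokens", "--max-num-batched-tokens")
NEW_FLAG = "--max-model-len"


def migrate_vision_args(argv: List[str]) -> List[str]:
    """Return a copy of argv with the old flag renamed to --max-model-len.

    Handles both "--flag value" and "--flag=value"; values are left untouched.
    """
    migrated = []
    for arg in argv:
        name, sep, value = arg.partition("=")
        if name in OLD_FLAGS:
            # Rename the flag, preserving any "=value" suffix.
            migrated.append(NEW_FLAG + sep + value)
        else:
            migrated.append(arg)
    return migrated


# "some/vision-model" is a placeholder, not a real checkpoint.
old_args = ["--model", "some/vision-model", "--max_num_batched_tokens", "16384"]
print(migrate_vision_args(old_args))
# ['--model', 'some/vision-model', '--max-model-len', '16384']
```

If a script already passes --max-model-len, the sketch will leave two copies of the flag in the list, so such scripts are probably better adjusted by hand.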
Also:
- support multiple images for qwen-vl (#8247)
- remove engine_use_ray (#8126)
- add engine option to return only deltas or final output (#7381)
- add bitsandbytes support for Gemma2 (#8338)
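As a rough illustration of what the "only deltas or final output" option from #7381 distinguishes, the sketch below streams a list of generated text pieces in three ways. It does not use vLLM's API: the function `stream_outputs` and the mode names are hypothetical, and the "cumulative" mode is included for contrast on the assumption that it reflects the previous behaviour.

```python
from typing import Iterable, Iterator

MODES = ("cumulative", "delta", "final_only")  # hypothetical names


def stream_outputs(pieces: Iterable[str], mode: str = "cumulative") -> Iterator[str]:
    """Yield text for each generation step according to the chosen mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    text = ""
    for piece in pieces:
        text += piece
        if mode == "cumulative":
            yield text   # everything generated so far, on every step
        elif mode == "delta":
            yield piece  # only what is new since the previous step
    if mode == "final_only":
        yield text       # a single result once generation has finished


pieces = ["Hel", "lo", "!"]
print(list(stream_outputs(pieces, "cumulative")))  # ['Hel', 'Hello', 'Hello!']
print(list(stream_outputs(pieces, "delta")))       # ['Hel', 'lo', '!']
print(list(stream_outputs(pieces, "final_only")))  # ['Hello!']
```

Delta-style output keeps each streamed message small because the client does the concatenation, while final-only output suits callers that never look at intermediate text.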
What's Changed
- [MISC] Dump model runner inputs when crashing by @comaniac in #8305
- [misc] remove engine_use_ray by @youkaichao in #8126
- [TPU] Use Ray for default distributed backend by @WoosukKwon in #8389
- Fix the AMD weight loading tests by @mgoin in #8390
- [Bugfix]: Fix the logic for deciding if tool parsing is used by @tomeras91 in #8366
- [Gemma2] add bitsandbytes support for Gemma2 by @blueyo0 in #8338
- [Misc] Raise error when using encoder/decoder model with cpu backend by @kevin314 in #8355
- [Misc] Use RoPE cache for MRoPE by @WoosukKwon in #8396
- [torch.compile] hide slicing under custom op for inductor by @youkaichao in #8384
- [Hotfix][VLM] Fixing max position embeddings for Pixtral by @ywang96 in #8399
- [Bugfix] Fix InternVL2 inference with various num_patches by @Isotr0py in #8375
- [Model] Support multiple images for qwen-vl by @alex-jw-brooks in #8247
- [BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance by @lnykww in #8403
- [BugFix] Fix Duplicate Assignment of Class Variable in Hermes2ProToolParser by @vegaluisjose in #8423
- [Bugfix] Offline mode fix by @joerunde in #8376
- [multi-step] add flashinfer backend by @SolitaryThinker in #7928
- [Core] Add engine option to return only deltas or final output by @njhill in #7381
- [Bugfix] multi-step + flashinfer: ensure cuda graph compatible by @alexm-neuralmagic in #8427
- [Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models by @ywang96 in #8425
- [CI/Build] Disable multi-node test for InternVL2 by @ywang96 in #8428
- [Hotfix][Pixtral] Fix multiple images bugs by @patrickvonplaten in #8415
- [Bugfix] Fix weight loading issue by rename variable. by @wenxcs in #8293
- [Misc] Update Pixtral example by @ywang96 in #8431
- [BugFix] fix group_topk by @dsikka in #8430
- [Core] Factor out input preprocessing to a separate class by @DarkLight1337 in #7329
- [Bugfix] Mapping physical device indices for e2e test utils by @ShangmingCai in #8290
- [Bugfix] Bump fastapi and pydantic version by @DarkLight1337 in #8435
- [CI/Build] Update pixtral tests to use JSON by @DarkLight1337 in #8436
- [Bugfix] Fix async log stats by @alexm-neuralmagic in #8417
- [bugfix] torch profiler bug for single gpu with GPUExecutor by @SolitaryThinker in #8354
- bump version to v0.6.1.post1 by @simon-mo in #8440
New Contributors
- @blueyo0 made their first contribution in #8338
- @lnykww made their first contribution in #8403
- @vegaluisjose made their first contribution in #8423
Full Changelog: v0.6.1...v0.6.1.post1