vllm-project/vllm v0.6.1.post1 on GitHub

Highlights

This release features important bug fixes and enhancements for:

- Pixtral models. (#8415, #8425, #8399, #8431)
  - Chunked scheduling has been turned off for vision models. Please replace --max_num_batched_tokens 16384 with --max-model-len 16384.
- Multistep scheduling. (#8417, #7928, #8427)
- Tool use. (#8423, #8366)
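
For launch scripts that start vision models, the flag swap described above can be sketched as a small argument rewrite. This is an illustrative helper, not part of vLLM: `migrate_args`, `OLD_FLAGS`, and `NEW_FLAG` are hypothetical names, the hyphenated spelling of the old flag is included as an assumption, and `<model>` is a placeholder.

```python
from typing import List

# Flag spellings to rewrite. The hyphenated variant is an assumption;
# the release note only shows the underscore spelling.
OLD_FLAGS = ("--max_num_batched_tokens", "--max-num-batched-tokens")
NEW_FLAG = "--max-model-len"


def migrate_args(argv: List[str]) -> List[str]:
    """Return a copy of argv with the old flag name swapped for the new one."""
    migrated = []
    for arg in argv:
        # Handle both "--flag value" and "--flag=value" spellings.
        name, sep, value = arg.partition("=")
        if name in OLD_FLAGS:
            migrated.append(NEW_FLAG + sep + value)
        else:
            migrated.append(arg)
    return migrated


old = ["vllm", "serve", "<model>", "--max_num_batched_tokens", "16384"]
print(" ".join(migrate_args(old)))
```

The rewrite only renames the flag and leaves its value untouched, so it should be applied only to commands that launch vision models.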

Also:

- Support multiple images for qwen-vl (#8247)
- Remove engine_use_ray (#8126)
- Add engine option to return only deltas or final output (#7381)
- Add bitsandbytes support for Gemma2 (#8338)
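
To make the "only deltas or final output" option (#7381) concrete, here is a minimal, library-free sketch of how the three streaming styles differ. It does not use vLLM's API: `OutputKind` and `stream` are hypothetical names, and the cumulative mode is an assumption about what the default looks like when neither option is selected.

```python
from enum import Enum
from typing import Iterator, List


class OutputKind(Enum):
    """Hypothetical output modes for illustration only."""
    CUMULATIVE = "cumulative"   # assumed default: full text so far at every step
    DELTA = "delta"             # only the newly generated piece at every step
    FINAL_ONLY = "final_only"   # a single result once generation is finished


def stream(chunks: List[str], kind: OutputKind) -> Iterator[str]:
    """Yield what a client would receive for each generation step."""
    text = ""
    for i, chunk in enumerate(chunks):
        text += chunk
        is_last = i == len(chunks) - 1
        if kind is OutputKind.CUMULATIVE:
            yield text
        elif kind is OutputKind.DELTA:
            yield chunk
        elif is_last:
            # FINAL_ONLY stays silent until the last step.
            yield text


chunks = ["Hel", "lo", "!"]
for kind in OutputKind:
    print(kind.value, list(stream(chunks, kind)))
```

In this sketch, delta mode sends each piece once, so the client has to concatenate the pieces itself, while final-only mode sends nothing until the last step.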

What's Changed

- [MISC] Dump model runner inputs when crashing by @comaniac in #8305
- [misc] remove engine_use_ray by @youkaichao in #8126
- [TPU] Use Ray for default distributed backend by @WoosukKwon in #8389
- Fix the AMD weight loading tests by @mgoin in #8390
- [Bugfix]: Fix the logic for deciding if tool parsing is used by @tomeras91 in #8366
- [Gemma2] add bitsandbytes support for Gemma2 by @blueyo0 in #8338
- [Misc] Raise error when using encoder/decoder model with cpu backend by @kevin314 in #8355
- [Misc] Use RoPE cache for MRoPE by @WoosukKwon in #8396
- [torch.compile] hide slicing under custom op for inductor by @youkaichao in #8384
- [Hotfix][VLM] Fixing max position embeddings for Pixtral by @ywang96 in #8399
- [Bugfix] Fix InternVL2 inference with various num_patches by @Isotr0py in #8375
- [Model] Support multiple images for qwen-vl by @alex-jw-brooks in #8247
- [BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance by @lnykww in #8403
- [BugFix] Fix Duplicate Assignment of Class Variable in Hermes2ProToolParser by @vegaluisjose in #8423
- [Bugfix] Offline mode fix by @joerunde in #8376
- [multi-step] add flashinfer backend by @SolitaryThinker in #7928
- [Core] Add engine option to return only deltas or final output by @njhill in #7381
- [Bugfix] multi-step + flashinfer: ensure cuda graph compatible by @alexm-neuralmagic in #8427
- [Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models by @ywang96 in #8425
- [CI/Build] Disable multi-node test for InternVL2 by @ywang96 in #8428
- [Hotfix][Pixtral] Fix multiple images bugs by @patrickvonplaten in #8415
- [Bugfix] Fix weight loading issue by rename variable. by @wenxcs in #8293
- [Misc] Update Pixtral example by @ywang96 in #8431
- [BugFix] fix group_topk by @dsikka in #8430
- [Core] Factor out input preprocessing to a separate class by @DarkLight1337 in #7329
- [Bugfix] Mapping physical device indices for e2e test utils by @ShangmingCai in #8290
- [Bugfix] Bump fastapi and pydantic version by @DarkLight1337 in #8435
- [CI/Build] Update pixtral tests to use JSON by @DarkLight1337 in #8436
- [Bugfix] Fix async log stats by @alexm-neuralmagic in #8417
- [bugfix] torch profiler bug for single gpu with GPUExecutor by @SolitaryThinker in #8354
- bump version to v0.6.1.post1 by @simon-mo in #8440

New Contributors

- @blueyo0 made their first contribution in #8338
- @lnykww made their first contribution in #8403
- @vegaluisjose made their first contribution in #8423

Full Changelog: v0.6.1...v0.6.1.post1