Highlights
- Qwen2.5-VL is now supported in vLLM. Please note that it currently requires a source installation of the Hugging Face `transformers` library (#12604).
- Add `transformers` backend support via `--model-impl=transformers`. This allows vLLM to be run with arbitrary Hugging Face text models (#11330, #12785, #12727). A minimal usage sketch follows this list.
- Performance enhancements for DeepSeek models.
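As a rough illustration of the new backend, the sketch below loads a Hugging Face text model through vLLM's offline API. It assumes the `model_impl` keyword mirrors the `--model-impl` CLI flag; the model name is only a placeholder, not a recommendation.

```python
# Minimal sketch: run a Hugging Face text model via the transformers backend.
# Assumes `model_impl` mirrors the --model-impl CLI flag; the model name below
# is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",  # any HF text model (placeholder)
    model_impl="transformers",                 # select the transformers backend
)

outputs = llm.generate(
    ["Explain prefix caching in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```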
Core Engine
- Use `VLLM_LOGITS_PROCESSOR_THREADS` to speed up structured decoding in high-batch-size scenarios (#12368). A minimal sketch follows below.
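For reference, here is a hedged sketch of how the new environment variable might be applied in an offline run; the thread count of 8 and the model name are arbitrary examples, and the right value depends on your workload and CPU.

```python
# Minimal sketch: run logits processors for structured (guided) decoding in a
# thread pool to reduce latency at high batch sizes. The value 8 is an
# arbitrary example.
# Equivalent shell form: export VLLM_LOGITS_PROCESSOR_THREADS=8
import os

os.environ["VLLM_LOGITS_PROCESSOR_THREADS"] = "8"  # set before constructing the engine

from vllm import LLM  # noqa: E402

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")  # placeholder model
```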
Security Update
- Improve hash collision avoidance in prefix caching (#12621)
- Add SPDX-License-Identifier headers to python source files (#12628)
Other
- Enable FusedSDPA support for Intel Gaudi (HPU) (#12359)
What's Changed
- Apply torch.compile to fused_moe/grouped_topk by @mgoin in #12637
- doc: fixing minor typo in readme.md by @vicenteherrera in #12643
- [Bugfix] fix moe_wna16 get_quant_method by @jinzhen-lin in #12648
- [Core] Silence unnecessary deprecation warnings by @russellb in #12620
- [V1][Minor] Avoid frequently creating ConstantList by @WoosukKwon in #12653
- [Core][v1] Unify allocating slots in prefill and decode in KV cache manager by @ShawnD200 in #12608
- [Hardware][Intel GPU] add XPU bf16 support by @jikunshang in #12392
- [Misc] Add SPDX-License-Identifier headers to python source files by @russellb in #12628
- [doc][misc] clarify VLLM_HOST_IP for multi-node inference by @youkaichao in #12667
- [Doc] Deprecate Discord by @zhuohan123 in #12668
- [Kernel] port sgl moe_align_block_size kernels by @chenyang78 in #12574
- make sure mistral_common not imported for non-mistral models by @youkaichao in #12669
- Properly check if all fused layers are in the list of targets by @eldarkurtic in #12666
- Fix for attention layers to remain unquantized during moe_wn16 quant by @srikanthsrnvs in #12570
- [cuda] manually import the correct pynvml module by @youkaichao in #12679
- [ci/build] fix gh200 test by @youkaichao in #12681
- [Model]: Add `transformers` backend support by @ArthurZucker in #11330
- [Misc] Fix improper placement of SPDX header in scripts by @russellb in #12694
- [Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm by @tlrmchlsmth in #12696
- Squelch MLA warning for Compressed-Tensors Models by @kylesayrs in #12704
- [Model] Add Deepseek V3 fp8_w8a8 configs for B200 by @kushanam in #12707
- [MISC] Remove model input dumping when exception by @comaniac in #12582
- [V1] Revert `uncache_blocks` and support recaching full blocks by @comaniac in #12415
- [Core] Improve hash collision avoidance in prefix caching by @russellb in #12621
- Support Pixtral-Large HF by using llava multimodal_projector_bias config by @mgoin in #12710
- [Doc] Replace ibm-fms with ibm-ai-platform by @tdoublep in #12709
- [Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs by @kylesayrs in #12711
- [AMD][ROCm] Enable DeepSeek model on ROCm by @hongxiayang in #12662
- [Misc] Add BNB quantization for Whisper by @jeejeelee in #12381
- [VLM] Merged multi-modal processor for InternVL-based models by @DarkLight1337 in #12553
- [V1] Remove constraints on partial requests by @WoosukKwon in #12674
- [VLM] Implement merged multimodal processor and V1 support for idefics3 by @Isotr0py in #12660
- [Model] [Bugfix] Fix loading of fine-tuned models based on Phi-3-Small by @mgtk77 in #12689
- Avoid unnecessary multi-modal input data copy when len(batch) == 1 by @imkero in #12722
- [Build] update requirements of no-device for plugin usage by @sducouedic in #12630
- [Bugfix] Fix CI failures for InternVL and Mantis models by @DarkLight1337 in #12728
- [V1][Metrics] Add request_success_total counter, labelled with finish reason by @markmc in #12579
- [Perf] Mem align KV caches for CUDA devices (MLA perf improvement) by @LucasWilkinson in #12676
- [Core] add and implement `VLLM_LOGITS_PROCESSOR_THREADS` by @akeshet in #12368
- [ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling by @maleksan85 in #12713
- Refactor `Linear` handling in `TransformersModel` by @hmellor in #12727
- [VLM] Add MLA with pure RoPE support for deepseek-vl2 models by @Isotr0py in #12729
- [Misc] Bump the compressed-tensors version by @dsikka in #12736
- [Model][Quant] Fix GLM, Fix fused module mappings for quantization by @kylesayrs in #12634
- [Doc] Update PR Reminder with link to Developer Slack by @mgoin in #12748
- [Bugfix] Fix OpenVINO model runner by @hmellor in #12750
- [V1][Misc] Shorten `FinishReason` enum and use constant strings by @njhill in #12760
- [Doc] Remove performance warning for auto_awq.md by @mgoin in #12743
- [Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 by @Akashcodes732 in #12546
- [core][distributed] exact ray placement control by @youkaichao in #12732
- [Kernel] Use self.kv_cache and forward_context.attn_metadata in Attention.forward by @heheda12345 in #12536
- [Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU) by @SanjuCSudhakaran in #12359
- Add: Support for Sparse24Bitmask Compressed Models by @rahul-tuli in #12097
- [VLM] Use shared field to pass token ids to model by @DarkLight1337 in #12767
- [Docs] Drop duplicate [source] links by @russellb in #12780
- [VLM] Qwen2.5-VL by @ywang96 in #12604
- [VLM] Update compatibility with transformers 4.49 by @DarkLight1337 in #12781
- Quantization and MoE configs for GH200 machines by @arvindsun in #12717
- [ROCm][Kernel] Using the correct warp_size value by @gshtras in #12789
- [Bugfix] Better FP8 supported defaults by @LucasWilkinson in #12796
- [Misc][Easy] Remove the space from the file name by @houseroad in #12799
- [Model] LoRA Support for Ultravox model by @thedebugger in #11253
- [Bugfix] Fix the test_ultravox.py's license by @houseroad in #12806
- Improve `TransformersModel` UX by @hmellor in #12785
- [Misc] Remove duplicated DeepSeek V2/V3 model definition by @mgoin in #12793
- [Misc] Improve error message for incorrect pynvml by @youkaichao in #12809
New Contributors
- @vicenteherrera made their first contribution in #12643
- @chenyang78 made their first contribution in #12574
- @srikanthsrnvs made their first contribution in #12570
- @ArthurZucker made their first contribution in #11330
- @mgtk77 made their first contribution in #12689
- @sducouedic made their first contribution in #12630
- @akeshet made their first contribution in #12368
- @arvindsun made their first contribution in #12717
- @thedebugger made their first contribution in #11253
Full Changelog: v0.7.1...v0.7.2