Highlights
- Qwen2.5-VL is now supported in vLLM. Please note that it currently requires a source installation of the Hugging Face `transformers` library (#12604).
- Add `transformers` backend support via `--model-impl=transformers`. This allows vLLM to be run with arbitrary Hugging Face text models (#11330, #12785, #12727). A minimal usage sketch follows this list.
- Performance enhancements for DeepSeek models.
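As a rough illustration of the new backend, the sketch below loads a Hugging Face text model through vLLM's offline API. It assumes the `model_impl` keyword mirrors the `--model-impl` CLI flag; the model name is only a placeholder, not a recommendation.

```python
# Minimal sketch: run a Hugging Face text model via the transformers backend.
# Assumes `model_impl` mirrors the --model-impl CLI flag; the model name below
# is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",  # any HF text model (placeholder)
    model_impl="transformers",                 # select the transformers backend
)

outputs = llm.generate(
    ["Explain prefix caching in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```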
Core Engine
- Use `VLLM_LOGITS_PROCESSOR_THREADS` to speed up structured decoding in high-batch-size scenarios (#12368). A minimal sketch follows below.
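For reference, here is a hedged sketch of how the new environment variable might be applied in an offline run; the thread count of 8 and the model name are arbitrary examples, and the right value depends on your workload and CPU.

```python
# Minimal sketch: run logits processors for structured (guided) decoding in a
# thread pool to reduce latency at high batch sizes. The value 8 is an
# arbitrary example.
# Equivalent shell form: export VLLM_LOGITS_PROCESSOR_THREADS=8
import os

os.environ["VLLM_LOGITS_PROCESSOR_THREADS"] = "8"  # set before constructing the engine

from vllm import LLM  # noqa: E402

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")  # placeholder model
```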
Security Update
- Improve hash collision avoidance in prefix caching (#12621)
- Add SPDX-License-Identifier headers to python source files (#12628)
Other
- Enable FusedSDPA support for Intel Gaudi (HPU) (#12359)
What's Changed
- Apply torch.compile to fused_moe/grouped_topk by @mgoin in #12637
- doc: fixing minor typo in readme.md by @vicenteherrera in #12643
- [Bugfix] fix moe_wna16 get_quant_method by @jinzhen-lin in #12648
- [Core] Silence unnecessary deprecation warnings by @russellb in #12620
- [V1][Minor] Avoid frequently creating ConstantList by @WoosukKwon in #12653
- [Core][v1] Unify allocating slots in prefill and decode in KV cache manager by @ShawnD200 in #12608
- [Hardware][Intel GPU] add XPU bf16 support by @jikunshang in #12392
- [Misc] Add SPDX-License-Identifier headers to python source files by @russellb in #12628
- [doc][misc] clarify VLLM_HOST_IP for multi-node inference by @youkaichao in #12667
- [Doc] Deprecate Discord by @zhuohan123 in #12668
- [Kernel] port sgl moe_align_block_size kernels by @chenyang78 in #12574
- make sure mistral_common not imported for non-mistral models by @youkaichao in #12669
- Properly check if all fused layers are in the list of targets by @eldarkurtic in #12666
- Fix for attention layers to remain unquantized during moe_wn16 quant by @srikanthsrnvs in #12570
- [cuda] manually import the correct pynvml module by @youkaichao in #12679
- [ci/build] fix gh200 test by @youkaichao in #12681
- [Model]: Add `transformers` backend support by @ArthurZucker in #11330
- [Misc] Fix improper placement of SPDX header in scripts by @russellb in #12694
- [Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm by @tlrmchlsmth in #12696
- Squelch MLA warning for Compressed-Tensors Models by @kylesayrs in #12704
- [Model] Add Deepseek V3 fp8_w8a8 configs for B200 by @kushanam in #12707
- [MISC] Remove model input dumping when exception by @comaniac in #12582
- [V1] Revert `uncache_blocks` and support recaching full blocks by @comaniac in #12415
- [Core] Improve hash collision avoidance in prefix caching by @russellb in #12621
- Support Pixtral-Large HF by using llava multimodal_projector_bias config by @mgoin in #12710
- [Doc] Replace ibm-fms with ibm-ai-platform by @tdoublep in #12709
- [Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs by @kylesayrs in #12711
- [AMD][ROCm] Enable DeepSeek model on ROCm by @hongxiayang in #12662
- [Misc] Add BNB quantization for Whisper by @jeejeelee in #12381
- [VLM] Merged multi-modal processor for InternVL-based models by @DarkLight1337 in #12553
- [V1] Remove constraints on partial requests by @WoosukKwon in #12674
- [VLM] Implement merged multimodal processor and V1 support for idefics3 by @Isotr0py in #12660
- [Model] [Bugfix] Fix loading of fine-tuned models based on Phi-3-Small by @mgtk77 in #12689
- Avoid unnecessary multi-modal input data copy when len(batch) == 1 by @imkero in #12722
- [Build] update requirements of no-device for plugin usage by @sducouedic in #12630
- [Bugfix] Fix CI failures for InternVL and Mantis models by @DarkLight1337 in #12728
- [V1][Metrics] Add request_success_total counter, labelled with finish reason by @markmc in #12579
- [Perf] Mem align KV caches for CUDA devices (MLA perf improvement) by @LucasWilkinson in #12676
- [Core] add and implement `VLLM_LOGITS_PROCESSOR_THREADS` by @akeshet in #12368
- [ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling by @maleksan85 in #12713
- Refactor `Linear` handling in `TransformersModel` by @hmellor in #12727
- [VLM] Add MLA with pure RoPE support for deepseek-vl2 models by @Isotr0py in #12729
- [Misc] Bump the compressed-tensors version by @dsikka in #12736
- [Model][Quant] Fix GLM, Fix fused module mappings for quantization by @kylesayrs in #12634
- [Doc] Update PR Reminder with link to Developer Slack by @mgoin in #12748
- [Bugfix] Fix OpenVINO model runner by @hmellor in #12750
- [V1][Misc] Shorten `FinishReason` enum and use constant strings by @njhill in #12760
- [Doc] Remove performance warning for auto_awq.md by @mgoin in #12743
- [Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 by @Akashcodes732 in #12546
- [core][distributed] exact ray placement control by @youkaichao in #12732
- [Kernel] Use self.kv_cache and forward_context.attn_metadata in Attention.forward by @heheda12345 in #12536
- [Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU) by @SanjuCSudhakaran in #12359
- Add: Support for Sparse24Bitmask Compressed Models by @rahul-tuli in #12097
- [VLM] Use shared field to pass token ids to model by @DarkLight1337 in #12767
- [Docs] Drop duplicate [source] links by @russellb in #12780
- [VLM] Qwen2.5-VL by @ywang96 in #12604
- [VLM] Update compatibility with transformers 4.49 by @DarkLight1337 in #12781
- Quantization and MoE configs for GH200 machines by @arvindsun in #12717
- [ROCm][Kernel] Using the correct warp_size value by @gshtras in #12789
- [Bugfix] Better FP8 supported defaults by @LucasWilkinson in #12796
- [Misc][Easy] Remove the space from the file name by @houseroad in #12799
- [Model] LoRA Support for Ultravox model by @thedebugger in #11253
- [Bugfix] Fix the test_ultravox.py's license by @houseroad in #12806
- Improve `TransformersModel` UX by @hmellor in #12785
- [Misc] Remove duplicated DeepSeek V2/V3 model definition by @mgoin in #12793
- [Misc] Improve error message for incorrect pynvml by @youkaichao in #12809
New Contributors
- @vicenteherrera made their first contribution in #12643
- @chenyang78 made their first contribution in #12574
- @srikanthsrnvs made their first contribution in #12570
- @ArthurZucker made their first contribution in #11330
- @mgtk77 made their first contribution in #12689
- @sducouedic made their first contribution in #12630
- @akeshet made their first contribution in #12368
- @arvindsun made their first contribution in #12717
- @thedebugger made their first contribution in #11253
Full Changelog: v0.7.1...v0.7.2