vllm-project/vllm v0.7.3


Highlights

🎉 253 commits from 93 contributors, including 29 new contributors!

  • DeepSeek enhancements:
    • Support for DeepSeek Multi-Token Prediction, delivering a 1.69x speedup in low-QPS scenarios (#12755)
    • AMD support: DeepSeek tunings, yielding a 17% latency reduction (#13199)
    • Use FlashAttention-3 for MLA (#12807)
    • Align the expert selection code path with the official implementation (#13474)
    • Optimize moe_align_block_size for deepseek_v3 (#12850)
  • V1 Engine:
    • LoRA Support (#10957, #12883)
    • Logprobs and prompt logprobs support (#9880), min_p sampling support (#13191), and logit_bias in the v1 Sampler (#13079) (see the sampling sketch after this list)
    • Use msgpack for core request serialization (#12918)
    • Pipeline parallelism support (#12996, #13353, #13472, #13417, #13315)
    • Metrics enhancements: GPU prefix cache hit rate % gauge (#12592), iteration_tokens_total histogram (#13288), several request timing histograms (#12644)
    • Initial speculative decoding support with ngrams (#12193, #13365)
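
The new V1 sampling features plug into the existing `SamplingParams` API. A minimal sketch, assuming V1 is opted into via the `VLLM_USE_V1` environment variable (it is not yet the default in this release) and using an arbitrary small instruct model as a stand-in:

```python
import os

os.environ["VLLM_USE_V1"] = "1"  # opt into the V1 engine for this release

from vllm import LLM, SamplingParams

# Model name is illustrative; any supported causal LM works the same way.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")

params = SamplingParams(
    temperature=0.8,
    min_p=0.1,          # min_p sampling (#13191)
    logprobs=5,         # top-5 logprobs per generated token (#9880)
    prompt_logprobs=1,  # logprobs for the prompt tokens as well
)

for out in llm.generate(["The capital of France is"], params):
    completion = out.outputs[0]
    print(completion.text)
    print(completion.logprobs)   # per-token logprob dictionaries
    print(out.prompt_logprobs)   # prompt logprobs
```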

Model Support

  • Enhancements to Qwen2.5-VL: BNB support (#12944), LoRA (#13261), optimizations (#13155)
  • Support Unsloth Dynamic 4-bit BnB quantization (#12974) (loading sketch after this list)
  • IBM/NASA Prithvi Geospatial model (#12830)
  • Support Mamba2 (Codestral Mamba) (#9292), Bamba Model (#10909)
  • Ultravox Model: Support v0.5 Release (#12912)
  • transformers backend:
    • Enable quantization support for transformers backend (#12960)
    • Set torch_dtype in TransformersModel (#13088)
  • VLM:
    • Implement merged multimodal processor for Mllama (#11427), GLM4V (#12449), Molmo (#12966)
    • Separate text-only and vision variants of the same model architecture (#13157)
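
As a rough sketch of the Unsloth dynamic 4-bit path (the checkpoint name below is hypothetical; substitute any repo quantized this way), pre-quantized BnB weights are loaded by selecting the bitsandbytes quantization and load format:

```python
from vllm import LLM, SamplingParams

# Hypothetical Unsloth dynamic 4-bit BnB checkpoint name.
llm = LLM(
    model="unsloth/Llama-3.1-8B-Instruct-unsloth-bnb-4bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)

outputs = llm.generate(["Explain KV caching in one sentence."], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```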

Hardware Support

  • Pluggable platform-specific scheduler (#13161)
  • NVIDIA: Support nvfp4 quantization (#12784)
  • AMD:
    • Per-Token-Activation Per-Channel-Weight FP8 (#12501)
    • Tuning for Mixtral on MI325 and Qwen MoE on MI300 (#13503), and Mixtral 8x7B on MI300 (#13577)
    • Add initial ROCm support to V1 (#12790)
  • TPU: V1 Support (#13049)
  • Neuron: Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency (#12921)
  • Gaudi:
    • Support Contiguous Cache Fetch (#12139)
    • Enable long-contexts + LoRA support (#12812)

Engine Features

  • Add sleep and wake-up endpoints, with V1 support (#12987)
  • Add /v1/audio/transcriptions OpenAI API endpoint (#12909) (see the example below)
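
The transcription endpoint follows the OpenAI audio API shape, so the standard client can be pointed at a vLLM server. A sketch, assuming the server was started with a Whisper-style ASR model (the model name and audio file below are placeholders):

```python
from openai import OpenAI

# Assumes something like: vllm serve openai/whisper-large-v3
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",  # must match the model the server loaded
        file=audio,
    )

print(transcription.text)
```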

Performance

  • Reduce TTFT with concurrent partial prefills (#10235)
  • LoRA - Refactor sgmv kernels (#13110)

Others

  • Make vLLM compatible with veRL (#12824)
  • Fixes for FA2 illegal memory access errors (#12848)
  • Choice-based structured output with xgrammar (#12632) (see the example after this list)
  • Run v1 benchmark and integrate with PyTorch OSS benchmark database (#13068)
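
A sketch of choice-based structured output through the OpenAI-compatible server, using the `guided_choice` extension in `extra_body`; the model name is illustrative, and the per-request `guided_decoding_backend` override is assumed here to select the xgrammar path:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # illustrative; use whatever the server loaded
    messages=[{"role": "user", "content": "Is the sky green or blue?"}],
    extra_body={
        # Constrain the reply to exactly one of these strings.
        "guided_choice": ["green", "blue"],
        # Assumed per-request override to exercise the xgrammar backend.
        "guided_decoding_backend": "xgrammar",
    },
)

print(completion.choices[0].message.content)
```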

What's Changed

  • [Misc] Update w2 scale loading for GPTQMarlinMoE by @dsikka in #12757
  • [Docs] Add Google Cloud Slides by @simon-mo in #12814
  • [Attention] Use FA3 for MLA on Hopper by @LucasWilkinson in #12807
  • [misc] Reduce number of config file requests to HuggingFace by @khluu in #12797
  • [Misc] Remove unnecessary decode call by @DarkLight1337 in #12833
  • [Kernel] Make rotary_embedding ops more flexible with input shape by @Isotr0py in #12777
  • [torch.compile] PyTorch 2.6 and nightly compatibility by @youkaichao in #12393
  • [Doc] double quote cmake package in build.inc.md by @jitseklomp in #12840
  • [Bugfix] Fix unsupported FA version check for Turing GPU by @Isotr0py in #12828
  • [V1] LoRA Support by @varun-sundar-rabindranath in #10957
  • Add Bamba Model by @fabianlim in #10909
  • [MISC] Check space in the file names in the pre commit checks by @houseroad in #12804
  • [misc] Revert # 12833 by @khluu in #12857
  • [Bugfix] FA2 illegal memory access by @LucasWilkinson in #12848
  • Make vllm compatible with verl by @ZSL98 in #12824
  • [Bugfix] Missing quant_config in deepseek embedding layer by @SzymonOzog in #12836
  • Prevent unnecessary requests to huggingface hub by @maxdebayser in #12837
  • [MISC][EASY] Break check file names into entry and args in the pre-commit hooks by @houseroad in #12880
  • [Misc] Remove unnecessary detokenization in multimodal processing by @DarkLight1337 in #12868
  • [Model] Add support for partial rotary embeddings in Phi3 model by @garg-amit in #12718
  • [V1] Logprobs and prompt logprobs support by @afeldman-nm in #9880
  • [ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing by @tjtanaa in #12501
  • [V1] LM Eval With Streaming Integration Tests by @robertgshaw2-redhat in #11590
  • [Bugfix] Fix disagg hang caused by the prefill and decode communication issues by @houseroad in #12723
  • [V1][Minor] Remove outdated comment by @WoosukKwon in #12928
  • [V1] Move KV block hashes from Request to KVCacheManager by @WoosukKwon in #12922
  • [Bugfix] Fix Qwen2_5_VLForConditionalGeneration packed_modules_mapping by @jeejeelee in #12905
  • [Misc] Fix typo in the example file by @DK-DARKmatter in #12896
  • [Bugfix] Fix multi-round chat error when mistral tokenizer is used by @zifeitong in #12859
  • [bugfix] respect distributed_executor_backend in world_size=1 by @youkaichao in #12934
  • [Misc] Add offline test for disaggregated prefill by @Shaoting-Feng in #12418
  • [V1][Minor] Move cascade attn logic outside _prepare_inputs by @WoosukKwon in #12943
  • [Build] Make pypi install work on CPU platform by @wangxiyuan in #12874
  • [Hardware][Intel-Gaudi] Enable long-contexts + LoRA support for Intel Gaudi by @SanjuCSudhakaran in #12812
  • [misc] Add LoRA to benchmark_serving by @varun-sundar-rabindranath in #12898
  • [Misc] Log time consumption on weight downloading by @waltforme in #12926
  • [CI] Resolve transformers-neuronx version conflict by @liangfu in #12925
  • [Doc] Correct HF repository for TeleChat2 models by @waltforme in #12949
  • [Misc] Add qwen2.5-vl BNB support by @Isotr0py in #12944
  • [CI/Build] Auto-fix Markdown files by @DarkLight1337 in #12941
  • [Bugfix] Remove unused seq_group_metadata_list from ModelInputForGPU by @ShangmingCai in #12935
  • [bugfix] fix early import of flash attention by @youkaichao in #12959
  • [VLM] Merged multi-modal processor for GLM4V by @jeejeelee in #12449
  • [V1][Minor] Remove outdated comment by @WoosukKwon in #12968
  • [RFC] [Mistral] FP8 format by @patrickvonplaten in #10130
  • [V1] Cache uses_mrope in GPUModelRunner by @WoosukKwon in #12969
  • [core] port pynvml into vllm codebase by @youkaichao in #12963
  • [MISC] Always import version library first in the vllm package by @houseroad in #12979
  • [core] improve error handling when wake up from sleep mode by @youkaichao in #12981
  • [core][rlhf] add colocate example for RLHF by @youkaichao in #12984
  • [V1] Use msgpack for core request serialization by @njhill in #12918
  • [Bugfix][Platform] Check whether selected backend is None in get_attn_backend_cls() by @terrytangyuan in #12975
  • [core] fix sleep mode and pytorch checkpoint compatibility by @youkaichao in #13001
  • [Doc] Add link to tool_choice tracking issue in tool_calling.md by @terrytangyuan in #13003
  • [misc] Add retries with exponential backoff for HF file existence check by @khluu in #13008
  • [Bugfix] Clean up and fix multi-modal processors by @DarkLight1337 in #13012
  • Fix seed parameter behavior in vLLM by @SmartManoj in #13007
  • [Model] Ultravox Model: Support v0.5 Release by @farzadab in #12912
  • [misc] Fix setup.py condition to avoid AMD from being mistaken with CPU by @khluu in #13022
  • [V1][Minor] Move scheduler outputs to a separate file by @WoosukKwon in #13062
  • [Docs] Announce Meta Meetup by @simon-mo in #13065
  • [Bugfix] Support missing tool parameters in mistral tokenizer by @fgreinacher in #12884
  • [Benchmark] Add BurstGPT to benchmark_serving by @WoosukKwon in #13063
  • [Core] Don't do platform detection at import time by @russellb in #12933
  • [Misc] LoRA - Refactor Punica ops tests by @varun-sundar-rabindranath in #12970
  • [Bugfix]: Reasoning output bug according to the chat template change by @gaocegege in #13025
  • [V1][Metrics] Add GPU prefix cache hit rate % gauge by @comaniac in #12592
  • [executor] init local_rank as device index by @MengqingCao in #13027
  • [ROCm] Using a more precise memory profiling by @gshtras in #12624
  • [Build] Fix cuda link target of cumem_allocator in CPU env by @guoyuhong in #12863
  • [Platform] add pre_register_and_update function by @wangxiyuan in #12432
  • [Bugfix] fix flaky test by @SmartManoj in #13089
  • [V1][Metrics] Add several request timing histograms by @markmc in #12644
  • Set torch_dtype in TransformersModel by @hmellor in #13088
  • [Misc] Fix typo at comments at metrics.py by @je1lee in #13024
  • [Bugfix] Do not use resource module on Windows (#12858) by @MoonRide303 in #13029
  • [BugFix] Pop instead of del CUDA_VISIBLE_DEVICES by @HollowMan6 in #12962
  • Fix initializing GGUF weights for ColumnParallelLinear when using tensor parallel > 1 by @SzymonOzog in #13023
  • [CI/Build][Bugfix] Fix CPU backend default threads num by @bigPYJ1151 in #13077
  • [Doc] Improve OpenVINO installation doc by @hmellor in #13102
  • [Bugfix] Guided decoding falls back to outlines when fails to import xgrammar by @terrytangyuan in #12976
  • [Misc] Move pre-commit suggestion back to the end by @russellb in #13114
  • [RFC][vllm-API] Support tokenizer registry for customized tokenizer in vLLM by @youngkent in #12518
  • [Model] IBM/NASA Prithvi Geospatial model by @christian-pinto in #12830
  • [ci] Add more source file dependencies for some tests by @khluu in #13123
  • [Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency by @lingfanyu in #12921
  • Bump helm/kind-action from 1.10.0 to 1.12.0 by @dependabot in #11612
  • Bump actions/stale from 9.0.0 to 9.1.0 by @dependabot in #12462
  • Bump helm/chart-testing-action from 2.6.1 to 2.7.0 by @dependabot in #12463
  • Bump actions/setup-python from 5.3.0 to 5.4.0 by @dependabot in #12672
  • Further reduce the HTTP calls to huggingface.co by @maxdebayser in #13107
  • [Misc] AMD Build Improvements by @842974287 in #12923
  • [Bug] [V1] Try fetching stop_reason from EngineOutput before checking the request by @bnellnm in #13108
  • [Bugfix] Fix num video tokens calculation for Qwen2-VL by @DarkLight1337 in #13148
  • [Frontend] Generate valid tool call IDs when using tokenizer-mode=mistral by @rafvasq in #12332
  • [Misc] Delete unused LoRA modules by @jeejeelee in #13151
  • Introduce VLLM_CUDART_SO_PATH to allow users specify the .so path by @houseroad in #12998
  • [CI/Build] Use mypy matcher for pre-commit CI job by @russellb in #13162
  • [CORE] [QUANT] Support for GPTQModel's dynamic quantization per module override/control by @Qubitium in #7086
  • [Bugfix] Allow fallback to AWQ from AWQMarlin at per-layer granularity by @mgoin in #13119
  • [CI] Fix failing FP8 cpu offload test by @mgoin in #13170
  • [V1][Bugfix] Copy encoder input ids to fix set iteration issue during VLM abort by @andoorve in #13173
  • [CI/Build] Ignore ruff warning up007 by @russellb in #13182
  • [perf-benchmark] cleanup unused Docker images and volumes in H100 benchmark instance by @khluu in #12706
  • [NVIDIA] Support nvfp4 quantization by @kaixih in #12784
  • [Bugfix][Example] Fix GCed profiling server for TPU by @mgoin in #12792
  • [VLM] Implement merged multimodal processor for Mllama by @Isotr0py in #11427
  • Simplify logic of locating CUDART so file path by @houseroad in #13203
  • [Build] Automatically use the wheel of the base commit with Python-only build by @comaniac in #13178
  • [Bugfix] deepseek_r1_reasoning_parser put reason content in wrong field in certain edge case by @LikeSundayLikeRain in #13097
  • [Frontend] Move CLI code into vllm.cmd package by @russellb in #12971
  • Allow Unsloth Dynamic 4bit BnB quants to work by @danielhanchen in #12974
  • [CI/Build] Allow ruff to auto-fix some issues by @russellb in #13180
  • [V1][core] Implement pipeline parallel on Ray by @ruisearch42 in #12996
  • [VLM] Remove input processor from clip and siglip by @Isotr0py in #13165
  • [Frontend] Pass pre-created socket to uvicorn by @russellb in #13113
  • [V1] Clarify input processing and multimodal feature caching logic by @ywang96 in #13211
  • [VLM] Merged multi-modal processor for Molmo by @DarkLight1337 in #12966
  • [V1][Core] Add worker_base for v1 worker by @AoyuQC in #12816
  • [Misc] Qwen2.5-VL Optimization by @wulipc in #13155
  • [VLM] Separate text-only and vision variants of the same model architecture by @DarkLight1337 in #13157
  • [Bugfix] Missing Content Type returns 500 Internal Server Error by @vaibhavjainwiz in #13193
  • [Frontend] Add /v1/audio/transcriptions OpenAI API endpoint by @NickLucche in #12909
  • Add label if pre-commit passes by @hmellor in #12527
  • Optimize moe_align_block_size for deepseek_v3 by @mgoin in #12850
  • [Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels by @tlrmchlsmth in #13198
  • Revert "Add label if pre-commit passes" by @hmellor in #13242
  • [ROCm] Avoid using the default stream on ROCm as it is a performance killer by @gshtras in #13238
  • [Kernel] Fix awq error when n is not divisible by 128 by @jinzhen-lin in #13227
  • [V1] Consolidate MM cache size to vllm.envs by @ywang96 in #13239
  • [Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on by @tlrmchlsmth in #13250
  • [Bugfix][CI] Inherit codespell settings from pyproject.toml in the pre-commit-config by @tlrmchlsmth in #13237
  • [Bugfix] Offline example of disaggregated prefill by @XiaobingSuper in #13214
  • [Misc] Remove redundant statements in scheduler.py by @WrRan in #13229
  • Consolidate Llama model usage in tests by @hmellor in #13094
  • Expand MLA to support most types of quantization by @mgoin in #13181
  • [V1] LoRA - Enable Serving Usecase by @varun-sundar-rabindranath in #12883
  • [ROCm][V1] Add initial ROCm support to V1 by @SageMoore in #12790
  • [Bugfix][V1] GPUModelRunner._update_states should return True when there is a finished request in batch by @imkero in #13126
  • [WIP] TPU V1 Support Refactored by @alexm-redhat in #13049
  • [Frontend] Optionally remove memory buffer used for uploading to URLs in run_batch by @pooyadavoodi in #12927
  • [Bugfix] Fix missing parentheses by @xu-song in #13263
  • [Misc] Log time consumption of sleep and wake-up by @waltforme in #13115
  • [VLM] Keep track of whether prompt replacements have been applied by @DarkLight1337 in #13215
  • [V1] Simplify GPUModelRunner._update_states check by @njhill in #13265
  • Support logit_bias in v1 Sampler by @houseroad in #13079
  • [Core] choice-based structured output with xgrammar by @russellb in #12632
  • [Hardware][Gaudi][Bugfix] Fix error for guided decoding by @zhouyu5 in #12317
  • [Quant][Perf] Use moe_wna16 kernel by default for MoEs with many experts by @mgoin in #13236
  • [Core] Reduce TTFT with concurrent partial prefills by @joerunde in #10235
  • [V1][Core] min_p sampling support by @AoyuQC in #13191
  • [V1][CI] Fix failed v1-test because of min_p by @WoosukKwon in #13316
  • [V1][Sampler] Don't apply temp for greedy-only by @njhill in #13311
  • [V1][PP] Fix memory profiling in PP by @WoosukKwon in #13315
  • [Bugfix][AMD] Update torch_bindings so that scaled_fp4_quant isn't build on ROCm by @SageMoore in #13235
  • [Bugfix][Docs] Fix offline Whisper by @NickLucche in #13274
  • [Bugfix] Massage MLA's usage of flash attn for ROCm by @tlrmchlsmth in #13310
  • [BugFix] Don't scan entire cache dir when loading model by @njhill in #13302
  • [Bugfix]Fix search start_index of stop_checker by @xu-song in #13280
  • [Bugfix] Fix qwen2.5-vl image processor by @Isotr0py in #13286
  • [V1][Metrics] Add iteration_tokens_total histogram from V0 by @markmc in #13288
  • [AMD] [Model] DeepSeek tunings by @rasmith in #13199
  • [V1][PP] Run engine busy loop with batch queue by @comaniac in #13064
  • [ci/build] update flashinfer by @youkaichao in #13323
  • [Doc] [2/N] Add Fuyu E2E example for multimodal processor by @DarkLight1337 in #13331
  • [V1][Spec Decode] Ngram Spec Decode by @LiuXiaoxuanPKU in #12193
  • [Quant] Add SupportsQuant to phi3 and clip by @kylesayrs in #13104
  • [Bugfix] Pin xgrammar to 0.1.11 by @mgoin in #13338
  • [BugFix] Enhance test_pos_encoding to support execution on multi-devices by @wchen61 in #13187
  • [V1] Update doc and examples for H2O-VL by @ywang96 in #13349
  • [ci] skip failed tests for flashinfer by @youkaichao in #13352
  • [platform] add base class for communicators by @youkaichao in #13208
  • [Bugfix] Fix 2 Node and Spec Decode tests by @DarkLight1337 in #13341
  • [Docs] Change myenv to vllm. Update python_env_setup.inc.md by @arkylin in #13325
  • [V1][BugFix] Add __init__.py to v1/spec_decode/ by @WoosukKwon in #13359
  • [V1][PP] Cache Intermediate Tensors by @WoosukKwon in #13353
  • [Bugfix][Platform][CPU] Fix cuda platform detection on CPU backend edge case by @Isotr0py in #13358
  • [V1][BugFix] Clean up rejection sampler & Fix warning msg by @WoosukKwon in #13362
  • [V1][Misc] Avoid unnecessary log output by @jeejeelee in #13289
  • [Feature][Spec Decode] Simplify the use of Eagle Spec Decode by @ShangmingCai in #12304
  • Fix spelling error in index.md by @yankooo in #13369
  • Run v1 benchmark and integrate with PyTorch OSS benchmark database by @huydhn in #13068
  • [MISC] tiny fixes by @MengqingCao in #13378
  • [VLM] Check required fields before initializing field config in DictEmbeddingItems by @DarkLight1337 in #13380
  • [Model] Support Mamba2 (Codestral Mamba) by @tlrmchlsmth in #9292
  • [Bugfix] fix xpu communicator by @yma11 in #13368
  • [Bugfix] Fix VLLM_USE_MODELSCOPE issue by @r4ntix in #13384
  • [V1] Get input tokens from scheduler by @WoosukKwon in #13339
  • [V1][PP] Fix intermediate tensor values by @comaniac in #13417
  • [V1][Spec decode] Move drafter to model runner by @WoosukKwon in #13363
  • [Bugfix][CI][V1] Work around V1 + CUDA Graph + torch._scaled_mm fallback issue by @tlrmchlsmth in #13425
  • [Misc] Remove dangling references to SamplingType.BEAM by @hmellor in #13402
  • [Model] Enable quantization support for transformers backend by @Isotr0py in #12960
  • [ROCm] fix get_device_name for rocm by @divakar-amd in #13438
  • [v1] fix parallel config rank by @youkaichao in #13445
  • [Quant] Molmo SupportsQuant by @kylesayrs in #13336
  • [Quant] Arctic SupportsQuant by @kylesayrs in #13366
  • [Bugfix] Only print out chat template when supplied by @terrytangyuan in #13444
  • [core] fix sleep mode in pytorch 2.6 by @youkaichao in #13456
  • [Quant] Aria SupportsQuant by @kylesayrs in #13416
  • [V1][PP] Fix & Pin Ray version in requirements-cuda.txt by @WoosukKwon in #13436
  • Add outlines fallback when JSON schema has enum by @mgoin in #13449
  • [Bugfix] Ensure LoRA path from the request can be included in err msg by @terrytangyuan in #13450
  • [Bugfix] Fix failing transformers dynamic module resolving with spawn multiproc method by @Isotr0py in #13403
  • [Doc]: Improve feature tables by @hmellor in #13224
  • [Bugfix] Remove noisy error logging during local model loading by @Isotr0py in #13458
  • [ROCm] Make amdsmi import optional for other platforms by @DarkLight1337 in #13460
  • [Bugfix] Handle content type with optional parameters by @zifeitong in #13383
  • [Bugfix] Fix invalid rotary embedding unit test by @liangfu in #13431
  • [CI/Build] migrate static project metadata from setup.py to pyproject.toml by @dtrifiro in #8772
  • [V1][PP] Enable true PP with Ray executor by @WoosukKwon in #13472
  • [misc] fix debugging code by @youkaichao in #13487
  • [V1][Tests] Adding additional testing for multimodal models to V1 by @andoorve in #13308
  • [V1] Optimize handling of sampling metadata and req_ids list by @njhill in #13244
  • Pin Ray version to 2.40.0 by @WoosukKwon in #13490
  • [V1][Spec Decode] Optimize N-gram matching with Numba by @WoosukKwon in #13365
  • [Misc] Remove dangling references to --use-v2-block-manager by @hmellor in #13492
  • [Hardware][Gaudi][Feature] Support Contiguous Cache Fetch by @zhouyu5 in #12139
  • [perf-benchmark] Allow premerge ECR by @khluu in #13509
  • [ROCm][MoE configs] mi325 mixtral & mi300 qwen_moe by @divakar-amd in #13503
  • [Doc] Add clarification note regarding paligemma by @ywang96 in #13511
  • [1/n][CI] Load models in CI from S3 instead of HF by @khluu in #13205
  • [perf-benchmark] Fix ECR path for premerge benchmark by @khluu in #13512
  • Refactor GPUModelRunnerBase load_model method to include device param by @Zzhiter in #13037
  • [Bugfix] Fix Positive Feature Layers in Llava Models by @alex-jw-brooks in #13514
  • [Model][Speculative Decoding] DeepSeek MTP spec decode by @luccafong in #12755
  • [V1][Core] Generic mechanism for handling engine utility methods by @njhill in #13060
  • [Feature] Pluggable platform-specific scheduler by @yannicks1 in #13161
  • [CI/Build] force writing version file by @dtrifiro in #13544
  • [doc] clarify profiling is only for developers by @youkaichao in #13554
  • [VLM][Bugfix] Pass processor kwargs properly on init by @DarkLight1337 in #13516
  • [Bugfix] Fix device ordinal when initializing spec_decode_sampler under multi-node setup by @ShangmingCai in #13269
  • [doc] clarify multi-node serving doc by @youkaichao in #13558
  • Fix copyright year to auto get current year by @wilsonwu in #13561
  • [MISC] Logging the message about Ray teardown by @comaniac in #13502
  • [Misc] Avoid calling unnecessary hf_list_repo_files for local model path by @Isotr0py in #13348
  • [BugFix] Avoid error traceback in logs when V1 LLM terminates by @njhill in #13565
  • [3/n][CI] Load Quantization test models with S3 by @khluu in #13570
  • [Misc] Qwen2.5 VL support LoRA by @jeejeelee in #13261
  • [ci] Add AWS creds for AMD by @khluu in #13572
  • [ROCm][MoE] mi300 mixtral8x7B perf for specific BS by @divakar-amd in #13577
  • [core] add sleep and wake up endpoint and v1 support by @youkaichao in #12987
  • [bugfix] spec decode worker get tp group only when initialized by @simon-mo in #13578
  • [Misc] Warn if the vLLM version can't be retrieved by @alex-jw-brooks in #13501
  • [Misc] add mm_processor_kwargs to extra_body for Qwen2.5-VL by @wulipc in #13533
  • [ROCm] MI300A compile targets deprecation by @gshtras in #13560
  • [API Server] Add port number range validation by @terrytangyuan in #13506
  • [CI/Build] Use uv in the Dockerfile by @mgoin in #13566
  • [ci] Fix spec decode test by @khluu in #13600
  • [2/n][ci] S3: Use full model path by @khluu in #13564
  • [Kernel] LoRA - Refactor sgmv kernels by @varun-sundar-rabindranath in #13110
  • Merge similar examples in offline_inference into single basic example by @hmellor in #12737
  • [Bugfix] Fix deepseekv3 grouped topk error by @Chen-XiaoBing in #13474

New Contributors

Full Changelog: v0.7.2...v0.7.3
