vllm-project/vllm v0.8.0

v0.8.0 featured 523 commits from 166 total contributors (68 new contributors)!

Highlights

V1

We have now enabled the V1 engine by default (#13726) for supported use cases. Please refer to the V1 user guide for more details. We expect better performance for supported scenarios. If you'd like to disable V1 mode, set the environment variable VLLM_USE_V1=0 and send us a GitHub issue sharing the reason!
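
Below is a minimal sketch (not part of the release itself) of opting back into the V0 engine; it relies only on the documented VLLM_USE_V1=0 environment variable, and the model name is a placeholder.

```python
# Sketch: disable the V1 engine by setting VLLM_USE_V1=0 before vLLM is
# imported; the model name below is illustrative.
import os

os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM  # noqa: E402

llm = LLM(model="facebook/opt-125m")
print(llm.generate("Hello, world!")[0].outputs[0].text)
```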

DeepSeek Improvements

We observe state-of-the-art performance when running DeepSeek models on the latest version of vLLM:

  • MLA Enhancements:
  • Distributed Expert Parallelism (EP) and Data Parallelism (DP)
    • EP Support for DeepSeek Models (#12583)
    • Add enable_expert_parallel arg (#14305); see the sketch after this list
    • EP/TP MoE + DP Attention (#13931)
    • Set up data parallel communication (#13591)
  • MTP: Expand DeepSeek MTP code to support k > n_predict (#13626)
  • Pipeline Parallelism:
    • DeepSeek V2/V3/R1 only place lm_head on last pp rank (#13833)
    • Improve pipeline partitioning (#13839)
  • GEMM
    • Add streamK for block-quantized CUTLASS kernels (#12978)
    • Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (#13917)
    • Add more tuned configs for H20 and others (#14877)
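
As a rough illustration of the new enable_expert_parallel argument referenced above, here is a hedged sketch; the model id, parallel size, and trust_remote_code setting are placeholders rather than recommended values.

```python
# Sketch: enable expert parallelism (#14305) when serving a DeepSeek-style
# MoE model; model id and tensor_parallel_size are placeholders.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    tensor_parallel_size=2,
    enable_expert_parallel=True,  # shard MoE experts across the parallel ranks
    trust_remote_code=True,
)
```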

New Models

  • Gemma 3 (#14660)
    • Note: You have to install transformers from the main branch (pip install git+https://github.com/huggingface/transformers.git) to use this model. Also, there may be numerical instabilities with the float16/half dtype, so please use bfloat16 (preferred by HF) or float32; a usage sketch follows this list.
  • Mistral Small 3.1 (#14957)
  • Phi-4-multimodal-instruct (#14119)
  • Grok1 (#13795)
  • QwQ-32B and tool calling (#14479, #14478)
  • Zamba2 (#13185)
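
A minimal sketch of the Gemma 3 dtype note above, assuming transformers has already been installed from the main branch; the model id is illustrative.

```python
# Sketch: load Gemma 3 with bfloat16 (preferred) rather than float16 to avoid
# the numerical instabilities noted above; model id is illustrative.
from vllm import LLM

llm = LLM(model="google/gemma-3-4b-it", dtype="bfloat16")
print(llm.generate("The capital of France is")[0].outputs[0].text)
```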

NVIDIA Blackwell

  • Support nvfp4 cutlass gemm (#13571)
  • Add cutlass support for blackwell fp8 gemm (#13798)
  • Update the flash attn tag to support Blackwell (#14244)
  • Add ModelOpt FP4 Checkpoint Support (#12520)

Breaking Changes

  • The default value of seed is now None to align with PyTorch and Hugging Face. Please explicitly set seed for reproducibility; see the sketch after this list. (#14274)
  • The kv_cache and attn_metadata arguments have been removed from the model's forward method, as the attention backend can access these values via forward_context. (#13887)
  • vLLM now defaults to the model's generation_config for the chat template and sampling parameters such as temperature. (#12622)
  • Several request time metrics (vllm:time_in_queue_requests, vllm:model_forward_time_milliseconds, vllm:model_execute_time_milliseconds) have been deprecated and are subject to removal. (#14135)
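
Because seed now defaults to None, here is a hedged sketch of pinning it explicitly for reproducible sampling; the model name and values are placeholders.

```python
# Sketch: pass an explicit seed now that the default is None (#14274).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", seed=42)
params = SamplingParams(temperature=0.8, top_p=0.95, seed=42)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```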

Updates

  • Update to PyTorch 2.6.0 (#12721, #13860)
  • Update to Python 3.9 typing (#14492, #13971)
  • Update to CUDA 12.4 as default for release and nightly wheels (#12098)
  • Update to Ray 2.43 (#13994)
  • Upgrade aiohttp to include CVE fix (#14840)
  • Upgrade jinja2 to get 3 moderate CVE fixes (#14839)

Features

Frontend API

  • API Server
    • Support return_tokens_as_token_ids as a request param (#14066); see the request sketch after this list
    • Support image embeddings as input (#13955)
    • New /load endpoint for load statistics (#13950)
    • New API endpoint /is_sleeping (#14312)
    • Enable /score endpoint for embedding models (#12846)
    • Enable streaming for Transcription API (#13301)
    • Make model param optional in request (#13568)
    • Support SSL Key Rotation in HTTP Server (#13495)
  • Reasoning
    • Support reasoning output (#12955)
    • Support outlines engine with reasoning outputs (#14114)
    • Update reasoning with stream example to use OpenAI library (#14077)
  • CLI
    • Ensure out-of-tree quantization methods are recognized by CLI args (#14328)
    • Add vllm bench CLI (#13993)
  • Make LLM API compatible for torchrun launcher (#13642)
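
A hedged sketch of the return_tokens_as_token_ids request parameter noted above, passed through the OpenAI client's extra_body; it assumes a vLLM OpenAI-compatible server is already running, and the URL and model name are placeholders.

```python
# Sketch: request tokens as token ids (#14066) via extra_body against a
# running vLLM OpenAI-compatible server; URL and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="facebook/opt-125m",
    prompt="Hello, my name is",
    max_tokens=8,
    logprobs=1,
    extra_body={"return_tokens_as_token_ids": True},
)
# With the flag enabled, logprob token strings should come back as token ids.
print(completion.choices[0].logprobs.tokens)
```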

Disaggregated Serving

  • Support KV cache offloading and disagg prefill with LMCache connector (#12953)
  • Support chunked prefill for LMCache connector (#14505)

LoRA

  • Add LoRA support for TransformersModel (#13770)
  • Make the device profiler include LoRA memory (#14469)
  • Gemma3ForConditionalGeneration supports LoRA (#14797)
  • Retire SGMV and BGMV Kernels (#14685)

VLM

  • Generalized prompt updates for multi-modal processor (#13964)
  • Deprecate legacy input mapper for OOT multimodal models (#13979)
  • Refer code examples for common cases in dev multimodal processor (#14278)

Quantization

  • BaiChuan SupportsQuant (#13710)
  • BartModel SupportsQuant (#14699)
  • Bamba SupportsQuant (#14698)
  • Deepseek GGUF support (#13167)
  • GGUF MoE kernel (#14613)
  • Add GPTQAllSpark Quantization (#12931)
  • Better performance of gptq marlin kernel when n is small (#14138)

Structured Output

  • xgrammar: Expand list of unsupported jsonschema keywords (#13783)

Hardware Support

AMD

  • Faster Custom Paged Attention kernels (#12348)
  • Improved performance for V1 Triton (ROCm) backend (#14152)
  • Chunked prefill/paged attention in MLA on ROCm (#14316)
  • Perf improvement for DSv3 on AMD GPUs (#13718)
  • MoE fp8 block quant tuning support (#14068)

TPU

  • Integrate the new ragged paged attention kernel with vLLM v1 on TPU (#13379)
  • Support start_profile/stop_profile in TPU worker (#13988)
  • Add TPU v1 test (#14834)
  • TPU multimodal model support for ragged attention (#14158)
  • Add tensor parallel support via Ray (#13618)
  • Enable prefix caching by default (#14773)

Neuron

  • Add Neuron device communicator for vLLM v1 (#14085)
  • Add custom_ops for neuron backend (#13246)
  • Add reshape_and_cache (#14391)
  • Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth (#13245)

CPU

  • Upgrade CPU backend to torch-2.6 (#13381)
  • Support FP8 KV cache in CPU backend (#14741)

s390x

  • Add CPU inference with VXE ISA for s390x architecture (#12613)
  • Add documentation for s390x cpu implementation (#14198)

Plugins

  • Remove hard-coded CUDA in models and layers (#13658)
  • Move use allgather to platform (#14010)

Bugfix and Enhancements

  • Fix illegal memory access for MoE on H20 (#13693)
  • Fix FP16 overflow for DeepSeek V2 (#13232)
  • Fix illegal memory access in the blockwise CUTLASS FP8 GEMMs (#14396)
  • Pass all driver env vars to ray workers unless excluded (#14099)
  • Use xgrammar shared context to avoid copy overhead for offline engine (#13837)
  • Capture and log the time of loading weights (#13666)

Developer Tooling

Benchmarks

  • Consolidate performance benchmark datasets (#14036)
  • Update benchmarks README (#14646)

CI and Build

  • Add RELEASE.md (#13926)
  • Use env var to control whether to use S3 bucket in CI (#13634)

Documentation

  • Add RLHF document (#14482)
  • Add nsight guide to profiling docs (#14298)
  • Add K8s deployment guide (#14084)
  • Add developer documentation for torch.compile integration (#14437)

What's Changed

  • Update pre-commit's isort version to remove warnings by @hmellor in #13614
  • [V1][Minor] Print KV cache size in token counts by @WoosukKwon in #13596
  • fix neuron performance issue by @ajayvohra2005 in #13589
  • [Frontend] Add backend-specific options for guided decoding by @joerunde in #13505
  • [Bugfix] Fix max_num_batched_tokens for MLA by @mgoin in #13620
  • [Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth by @lingfanyu in #13245
  • Add llmaz as another integration by @kerthcet in #13643
  • [Misc] Adding script to setup ray for multi-node vllm deployments by @Edwinhr716 in #12913
  • [NVIDIA] Fix an issue to use current stream for the nvfp4 quant by @kaixih in #13632
  • Use pre-commit to update requirements-test.txt by @hmellor in #13617
  • [Bugfix] Add mm_processor_kwargs to chat-related protocols by @ywang96 in #13644
  • [V1][Sampler] Avoid an operation during temperature application by @njhill in #13587
  • Missing comment explaining VDR variable in GGUF kernels by @SzymonOzog in #13290
  • [FEATURE] Enables /score endpoint for embedding models by @gmarinho2 in #12846
  • [ci] Fix metrics test model path by @khluu in #13635
  • [Kernel]Add streamK for block-quantized CUTLASS kernels by @Hongbosherlock in #12978
  • [Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation by @Isotr0py in #13586
  • fix typo of grafana dashboard, with correct datasource by @johnzheng1975 in #13668
  • [Attention] MLA with chunked prefill by @LucasWilkinson in #12639
  • [Misc] Fix yapf linting tools etc not running on pre-commit by @Isotr0py in #13695
  • docs: Add a note on full CI run in contributing guide by @terrytangyuan in #13646
  • [HTTP Server] Make model param optional in request by @youngkent in #13568
  • [Bugfix][API Server] Fix invalid usage of 'ge' and 'le' in port valid… by @WangErXiao in #13672
  • [Misc] Capture and log the time of loading weights by @waltforme in #13666
  • [ROCM] fix native attention function call by @gongdao123 in #13650
  • [Bugfix][Model] OLMo 2: split qkv correctly for GQA and MQA by @2015aroras in #13687
  • [Misc] Bump compressed-tensors by @dsikka in #13619
  • [Bugfix] Fix benchmark script bug: inaccurate stats for vllm backend when max_model_len < input_len + output_len by @WangErXiao in #13691
  • [v1] Support allowed_token_ids in v1 Sampler by @houseroad in #13210
  • [Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler by @JenZhao in #13594
  • Correction to TP logic for Mamba Mixer 2 when Num Groups not divisible by TP Size by @fabianlim in #13660
  • [V1][Metrics] Support vllm:cache_config_info by @markmc in #13299
  • [Metrics] Add --show-hidden-metrics-for-version CLI arg by @markmc in #13295
  • [Misc] Reduce LoRA-related static variable by @jeejeelee in #13166
  • [CI/Build] Fix pre-commit errors by @DarkLight1337 in #13696
  • [core] set up data parallel communication by @youkaichao in #13591
  • [ci] fix linter by @youkaichao in #13701
  • Support SSL Key Rotation in HTTP Server by @youngkent in #13495
  • [NVIDIA] Support nvfp4 cutlass gemm by @kaixih in #13571
  • [V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths by @SageMoore in #13095
  • [ROCm] Apply FP8 weights padding to values not divisible by 512 bytes on ROCm by @gshtras in #13231
  • [Doc] Dockerfile instructions for optional dependencies and dev transformers by @DarkLight1337 in #13699
  • [Bugfix] Fix boolean conversion for OpenVINO env variable by @helena-intel in #13615
  • [XPU]fix setuptools version for xpu by @yma11 in #13548
  • [CI/Build] fix uv caching in Dockerfile by @dtrifiro in #13611
  • [CI/Build] Fix pre-commit errors from #13571 by @ywang96 in #13709
  • [BugFix] Minor: logger import in attention backend by @andylolu2 in #13706
  • [ci] Use env var to control whether to use S3 bucket in CI by @khluu in #13634
  • [Quant] BaiChuan SupportsQuant by @kylesayrs in #13710
  • [LMM] Implement merged multimodal processor for whisper by @Isotr0py in #13278
  • [Core][Distributed] Use IPC (domain socket) ZMQ socket for local comms by @njhill in #13688
  • [Misc] Deprecate --dataset from benchmark_serving.py by @ywang96 in #13708
  • [v1] torchrun compatibility by @youkaichao in #13642
  • [V1][BugFix] Fix engine core client shutdown hangs by @njhill in #13298
  • Fix some issues with benchmark data output by @huydhn in #13641
  • [ci] Add logic to change model to S3 path only when S3 CI env var is on by @khluu in #13727
  • [V1][Core] Fix memory issue with logits & sampling by @ywang96 in #13721
  • [model][refactor] remove cuda hard code in models and layers by @MengqingCao in #13658
  • [Bugfix] fix(logging): add missing opening square bracket by @bufferoverflow in #13011
  • [CI/Build] add python-json-logger to requirements-common by @bufferoverflow in #12842
  • Expert Parallelism (EP) Support for DeepSeek Models by @cakeng in #12583
  • [BugFix] Illegal memory access for MoE On H20 by @Abatom in #13693
  • [Misc][Docs] Raise error when flashinfer is not installed and VLLM_ATTENTION_BACKEND is set by @NickLucche in #12513
  • [V1] V1 engine implements parallel sampling (AsyncLLM and LLMEngine) by @afeldman-nm in #10980
  • Revert "[V1][Core] Fix memory issue with logits & sampling" by @ywang96 in #13775
  • Fix precommit fail in fused_moe intermediate_cache2 chunking by @mgoin in #13772
  • [Misc] Clean Up EngineArgs.create_engine_config by @robertgshaw2-redhat in #13734
  • [Misc][Chore] Clean Up AsyncOutputProcessing Logs by @robertgshaw2-redhat in #13780
  • Remove unused kwargs from model definitions by @hmellor in #13555
  • [Doc] arg_utils.py: fixed a typo by @eli-b in #13785
  • [Misc] set single whitespace between log sentences by @cjackal in #13771
  • [Bugfix][Quantization] Fix FP8 + EP by @tlrmchlsmth in #13784
  • [Misc][Attention][Quantization] init property earlier by @wangxiyuan in #13733
  • [V1][Metrics] Implement vllm:lora_requests_info metric by @markmc in #13504
  • [Bugfix] Fix deepseek-v2 error: "missing 1 required positional argument: 'residual'" by @LucasWilkinson in #13802
  • [Bugfix] Support MLA for CompressedTensorsWNA16 by @mgoin in #13725
  • Fix CompressedTensorsWNA16MoE with grouped scales by @mgoin in #13769
  • [Core] LoRA V1 - Add add/pin/list/remove_lora functions by @varun-sundar-rabindranath in #13705
  • [Misc] Check that the model can be inspected upon registration by @DarkLight1337 in #13743
  • [Core] xgrammar: Expand list of unsupported jsonschema keywords by @russellb in #13783
  • [Bugfix] Modify modelscope api usage in transformer_utils by @shen-shanshan in #13807
  • [misc] Clean up ray compiled graph type hints by @ruisearch42 in #13731
  • [Feature] Support KV cache offloading and disagg prefill with LMCache connector. by @YaoJiayi in #12953
  • [ROCm][Quantization][Kernel] Using HIP FP8 header by @gshtras in #12593
  • [CI/Build] Fix V1 LoRA failure by @jeejeelee in #13767
  • [Misc]Clarify Error Handling for Non-existent Model Paths and HF Repo IDs by @Chen-0210 in #13724
  • [Bugfix] Initialize attention bias on the same device as Query/Key/Value by @edwardzjl in #13468
  • [Bugfix] Flush TunableOp results before worker processes are destroyed. by @naromero77amd in #13623
  • [Bugfix] Fix deepseek-vl2 inference with more than 2 images by @Isotr0py in #13818
  • Fix /v1/audio/transcriptions Bad Request Error by @HermitSun in #13811
  • [Bugfix] Revert inspection code in #13743 by @DarkLight1337 in #13832
  • Fix string parsing error by @Chen-0210 in #13825
  • [Neuron] Add custom_ops for neuron backend by @liangfu in #13246
  • Fix failing MyGemma2Embedding test by @hmellor in #13820
  • [Model] Support Grok1 by @mgoin in #13795
  • DeepSeek V2/V3/R1 only place lm_head on last pp rank by @hmellor in #13833
  • [misc] Show driver IP info when Ray fails to allocate driver worker by @ruisearch42 in #13858
  • [V1][Spec Decode] Change Spec Decode Rejection Sampling API by @LiuXiaoxuanPKU in #13729
  • [Misc]Code Cleanup by @noemotiovon in #13859
  • [Kernel][Build/CI] Bump CUTLASS to 3.8 and add initializers for cutlass epilogues by @henrylhtsang in #13797
  • Improve pipeline partitioning by @hmellor in #13839
  • [Doc] fix the incorrect module path of tensorize_vllm_model by @tianyuzhou95 in #13863
  • [ROCm] Disable chunked prefill/prefix caching when running MLA on non-cuda platforms by @SageMoore in #13844
  • [v0][Core] Use xgrammar shared context to avoid copy overhead for offline engine by @sethkimmel3 in #13837
  • [Misc] Improve LoRA spelling by @jeejeelee in #13831
  • [Misc] Fix input processing for Ultravox by @ywang96 in #13871
  • [Bugfix] Add test example for Ultravox v0.5 by @DarkLight1337 in #13890
  • Add comments on accessing kv_cache and attn_metadata by @hmellor in #13887
  • [Bugfix] Handle None parameters in Mistral function calls. by @fgreinacher in #13786
  • [Misc]: Add support for goodput on guided benchmarking + TPOT calculation refactor by @b8zhong in #13736
  • [Bugfix] Do not crash V0 engine on input errors by @joerunde in #13101
  • [Bugfix] Update expected token counts for Ultravox tests by @DarkLight1337 in #13895
  • [TPU] use torch2.6 with whl package by @Chenyaaang in #13860
  • [Misc] fixed qwen_vl_utils parameter error by @chaunceyjiang in #13906
  • [Bugfix] Backend option to disable xgrammar any_whitespace by @wallashss in #12744
  • [BugFix] Make FP8 Linear compatible with torch.compile by @WoosukKwon in #13918
  • [Kernel] FlashMLA integration by @LucasWilkinson in #13747
  • [ROCm][Quantization][Kernel] Use FP8 FNUZ when OCP flag is 0 or undefined by @HollowMan6 in #13851
  • Use CUDA 12.4 as default for release and nightly wheels by @mgoin in #12098
  • [misc] Rename Ray ADAG to Compiled Graph by @ruisearch42 in #13928
  • [ROCm][V1] Update reshape_and_cache to properly work with CUDA graph padding by @SageMoore in #13922
  • [V1][Metrics] Handle preemptions by @markmc in #13169
  • [CI/Build] Add examples/ directory to be labelled by mergify by @b8zhong in #13944
  • [Misc] fixed 'required' is an invalid argument for positionals by @chaunceyjiang in #13948
  • [PP] Correct cache size check by @zhengy001 in #13873
  • Fix test_block_fp8.py test for MoE by @mgoin in #13915
  • [VLM] Support multimodal inputs for Florence-2 models by @Isotr0py in #13320
  • [Model] Deepseek GGUF support by @SzymonOzog in #13167
  • Update quickstart.md by @observerw in #13958
  • Deduplicate .pre-commit-config.yaml's exclude by @hmellor in #13967
  • [bugfix] Fix profiling for RayDistributedExecutor by @ruisearch42 in #13945
  • Update LMFE version to v0.10.11 to support new versions of transforme… by @noamgat in #13930
  • [Bugfix] Fix qwen2.5-vl overflow issue by @Isotr0py in #13968
  • [VLM] Generalized prompt updates for multi-modal processor by @DarkLight1337 in #13964
  • [Attention] MLA support for V1 by @chenyang78 in #13789
  • Bump azure/setup-helm from 4.2.0 to 4.3.0 by @dependabot in #13742
  • [VLM] Deprecate legacy input mapper for OOT multimodal models by @DarkLight1337 in #13979
  • [ROCm] Fix the Kernels, Core, and Prefix Caching AMD CI groups by @SageMoore in #13970
  • [V1][Minor] Minor cleanup for GPU Model Runner by @WoosukKwon in #13983
  • [core] Perf improvement for DSv3 on AMD GPUs by @qli88 in #13718
  • [Attention] Flash MLA for V1 by @LucasWilkinson in #13867
  • [Model][Speculative Decoding] Expand DeepSeek MTP code to support k > n_predict by @benchislett in #13626
  • [Misc] Print FusedMoE detail info by @jeejeelee in #13974
  • [V1]SupportsV0Only protocol for model definitions by @ywang96 in #13959
  • [Bugfix] Check that number of images matches number of <|image|> tokens with mllama by @tjohnson31415 in #13911
  • [Doc] Move multimodal Embedding API example to Online Serving page by @DarkLight1337 in #14017
  • [Bugfix][Disaggregated] patch the inflight batching on the decode node in SimpleConnector to avoid hangs in SimpleBuffer (nccl based) by @hasB4K in #13987
  • Use smaller embedding model when not testing model specifically by @hmellor in #13891
  • [Hardware][Intel-Gaudi] Regional compilation support by @Kacper-Pietkun in #13213
  • [V1][Minor] Restore V1 compatibility with LLMEngine class by @Ryp in #13090
  • Update AutoAWQ docs by @hmellor in #14042
  • [Bugfix] Fix MoeWNA16Method activation by @jeejeelee in #14024
  • [VLM][Bugfix] Enable specifying prompt target via index by @DarkLight1337 in #14038
  • [Bugfix] Initialize attention bias on the same device as Query/Key/Value for QwenVL Series by @LouieYang in #14031
  • [Doc] Fix ROCm documentation by @b8zhong in #14041
  • Fix entrypoint tests for embedding models by @hmellor in #14052
  • [V1][TPU] Integrate the new ragged paged attention kernel with vLLM v1 on TPU by @vanbasten23 in #13379
  • [v1] Cleanup the BlockTable in InputBatch by @heheda12345 in #13977
  • Add RELEASE.md by @atalman in #13926
  • [v1] Move block pool operations to a separate class by @heheda12345 in #13973
  • [core] Bump ray to 2.43 by @ruisearch42 in #13994
  • [torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 case, rename RedundantReshapesPass to NoopEliminationPass by @ProExpertProg in #10902
  • [Docs] Add pipeline_parallel_size to optimization docs by @b8zhong in #14059
  • [Bugfix] Add file lock for ModelScope download by @jeejeelee in #14060
  • [Misc][Kernel]: Add GPTQAllSpark Quantization by @wyajieha in #12931
  • [Bugfix][V1][Minor] Fix shutting_down flag checking in V1 MultiprocExecutor by @bigPYJ1151 in #14053
  • [Documentation] Add more deployment guide for Kubernetes deployment by @KuntaiDu in #13841
  • [Doc] Consolidate whisper and florence2 examples by @Isotr0py in #14050
  • [V1][Minor] Do not print attn backend twice by @WoosukKwon in #13985
  • [ROCm][V1][Bugfix] Add get_builder_cls method to the ROCmAttentionBackend class by @SageMoore in #14065
  • [v1][Bugfix] Only cache blocks that are not in the prefix cache by @heheda12345 in #14073
  • [v1] Add __repr__ to KVCacheBlock to avoid recursive print by @heheda12345 in #14081
  • [Model] Add LoRA support for TransformersModel by @jeejeelee in #13770
  • [Misc] Accurately capture the time of loading weights by @waltforme in #14063
  • [Doc] Source building add clone step by @qux-bbb in #14086
  • [v0][structured output] Support reasoning output by @gaocegege in #12955
  • Update deprecated Python 3.8 typing by @hmellor in #13971
  • [Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure by @realShengYao in #14051
  • [Misc] typo find in deepseek_v2 by @noooop in #14106
  • [Misc][Platform] Move use allgather to platform by @MengqingCao in #14010
  • [Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1 by @comaniac in #13921
  • [V1] Refactor parallel sampling support by @markmc in #13774
  • Improve the docs for TransformersModel by @hmellor in #14147
  • [ROCm] Faster Custom Paged Attention kernels by @tjtanaa in #12348
  • Fix head_dim not existing in all model configs (Transformers backend) by @hmellor in #14141
  • [V0][Metrics] Remove unimplemented vllm:tokens_total by @markmc in #14134
  • [V0][Metrics] Deprecate some KV/prefix cache metrics by @markmc in #14136
  • [V1] Simplify stats logging by @njhill in #14082
  • [WIP][V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics by @markmc in #14055
  • [Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3 by @mgoin in #14100
  • [Kernel] Optimize moe intermediate_cache usage by @mgoin in #13625
  • [Docs] Add GPTQModel by @Qubitium in #14056
  • [v1] Add comments to the new ragged paged attention Pallas kernel by @vanbasten23 in #14155
  • [Model] Add support for GraniteMoeShared models by @tjohnson31415 in #13313
  • [core] moe fp8 block quant tuning support by @divakar-amd in #14068
  • [Misc] Remove lru_cache in NvmlCudaPlatform by @comaniac in #14156
  • [core] Pass all driver env vars to ray workers unless excluded by @ruisearch42 in #14099
  • Use math.prod instead of np.prod for trivial ops by @zhanwenchen in #14142
  • Fix benchmark_moe.py tuning for CUDA devices by @mgoin in #14164
  • [platform] add debug logging during inferring the device type by @youkaichao in #14195
  • [sleep mode] error out with expandable_segments by @youkaichao in #14189
  • [doc] add "Failed to infer device type" to faq by @youkaichao in #14200
  • [Bugfix] Restrict MacOS CPU detection by @mgoin in #14210
  • [V1][BugFix] Fix remaining sync engine client shutdown errors/hangs by @njhill in #13869
  • [V0][Metrics] Deprecate some questionable request time metrics by @markmc in #14135
  • [V1][Molmo] Fix get_multimodal_embeddings() in molmo.py by @lk-chen in #14161
  • add cutlass support for blackwell fp8 gemm by @kushanam in #13798
  • [TPU][Profiler] Support start_profile/stop_profile in TPU worker by @lsy323 in #13988
  • Fix performance when --generation-config is not None by @hmellor in #14223
  • [Frontend] Do prompt_logprobs clamping for chat as well as completions by @hmellor in #14225
  • [Docs] Update Dockerfile dependency image by @mgoin in #14215
  • [v1][Metrics] Add design doc by @markmc in #12745
  • Serialize using safetensors for KV caches by @KuntaiDu in #14228
  • Clean up unused padding_idx variables across many model definitions by @tlrmchlsmth in #13240
  • [ROCm] Disable a few more kernel tests that are broken on ROCm by @SageMoore in #14145
  • [V1][TPU] TPU multimodal model support for ragged attention by @mgoin in #14158
  • [misc] announce china meetup by @youkaichao in #14248
  • Moved numba from common requirements to cuda/rocm specific requirements by @npanpaliya in #14199
  • Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0 by @mgoin in #14157
  • [Bugfix] Fix gptq_marlin for deepseek-v3 by @rainkert in #13750
  • [V1][Bugfix] Do not reset prefix caching metrics by @comaniac in #14235
  • [Model] New model support for Phi-4-multimodal-instruct by @congcongchen123 in #14119
  • [V1] EP/TP MoE + DP Attention by @tlrmchlsmth in #13931
  • [platforms] improve rocm debugging info by @youkaichao in #14257
  • Temporarily disable test_awq_gemm_opcheck by @mgoin in #14251
  • [Frontend] Allow return_tokens_as_token_ids to be passed as a request param by @benchislett in #14066
  • [Misc][V1] Avoid using envs.VLLM_USE_V1 in mm processing by @ywang96 in #14256
  • [Bugfix][V1] Fix allowed_token_ids for v1 Sampler by @houseroad in #14169
  • [Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID by @iacolippo in #14217
  • [Doc] [3/N] Refer code examples for common cases in dev multimodal processor by @DarkLight1337 in #14278
  • Small update for external_launcher backend docs by @zhe-thoughts in #14288
  • [V1][Frontend] Add Testing For V1 Runtime Parameters by @robertgshaw2-redhat in #14159
  • [LoRA] Remove linear hack outside transformers backend by @Isotr0py in #14177
  • [Misc] Add Qwen2MoeForCausalLM moe tuning support by @jeejeelee in #14276
  • [Doc] Fixed typo in prefix_caching.md by @DaividFrank in #14293
  • [Bugfix] Fix broken vision language example by @Isotr0py in #14292
  • [Docs] Add Meta Slides by @simon-mo in #14297
  • [V1][Minor] Remove obsolete FIXME comment by @njhill in #14304
  • Deprecate best_of Sampling Parameter in anticipation for vLLM V1 by @vincent-4 in #13997
  • [V1][BugFix] Fix for mixed top_k batch by @njhill in #14301
  • [misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env by @yangsijia-serena in #14267
  • [V1][Easy] Add empty allowed_token_ids in the v1 sampler test by @houseroad in #14308
  • [Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch by @pyc96 in #14237
  • [Bugfix] Remove num_tokens_across_dp by @tlrmchlsmth in #14302
  • [BugFix] Fix prefix caching V0 MLA by @LucasWilkinson in #14255
  • [CI/Build] Use spawn multiprocessing mode for V1 test pipeline by @russellb in #14243
  • Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM by @mgoin in #13917
  • [Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation by @terrytangyuan in #13850
  • [BugFix] MLA + V1, illegal memory access and accuracy issues by @LucasWilkinson in #14253
  • [misc] Mention ray list nodes command to troubleshoot ray issues by @ruisearch42 in #14318
  • [Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1 by @gaocegege in #14114
  • [V1] LoRA - Enable more V1 tests by @varun-sundar-rabindranath in #14315
  • [Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention by @NickLucche in #11301
  • [Hardware] Update the flash attn tag to support Blackwell by @pavanimajety in #14244
  • [Model] Update Paligemma multimodal processing with PromptUpdate by @kylehh in #14015
  • [VLM] Support Pixtral-HF on V1 by @lk-chen in #14275
  • [Core] Optimizing cross-attention QKVParallelLinear computation by @NickLucche in #12325
  • [Frontend][Docs] Transcription API streaming by @NickLucche in #13301
  • [Doc] Update reasoning with stream example to use OpenAI library by @liuyanyi in #14077
  • [Doc] Correct beam_search using in generative_models.md by @upayuryeva in #14363
  • [Kernel] [V1] Improved performance for V1 Triton (ROCm) backend by @tdoublep in #14152
  • [Bugfix][Core] fix abort_seq_group and memory leak when n>1 by @courage17340 in #14326
  • [Core] Don't use cache during multi-modal profiling by @DarkLight1337 in #14336
  • [Doc] Fix date typo in README.md by @jitseklomp in #14366
  • [RLHF] use worker_extension_cls for compatibility with V0 and V1 by @youkaichao in #14185
  • Reinstate best_of for V0 by @hmellor in #14356
  • Adding cpu inference with VXE ISA for s390x architecture by @dilipgb in #12613
  • Add authors to license header. by @tdoublep in #14371
  • Fix mla prefill context performance by @ZhongYingMatrix in #13897
  • [V1] Do not detokenize if sampling param detokenize is False by @hj-mistral in #14224
  • [Distributed] Add enable_expert_parallel arg by @tlrmchlsmth in #14305
  • [CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa by @mgoin in #13569
  • [CI] Disable spawn when running V1 Test by @tdoublep in #14345
  • [Kernel] Add needs_fixed_stride_order tag to most GEMMs by @tlrmchlsmth in #14306
  • [Bugfix] Fix use_direct_call condition in FusedMoE layer for by @tlrmchlsmth in #14382
  • [Bug] Fix Attention when ignored in by quant_method by @mgoin in #14313
  • [V1][Bugfix] Standardize quantized kv cache rejection for attention backends by @mgoin in #14221
  • [Docs] Add nsight guide to profiling docs by @mgoin in #14298
  • [Hardware][TPU]Enable ragged paged attention kernel and resolve recompilation issue by @yaochengji in #14310
  • [Doc] Fix a typo by @dyli-google in #14385
  • [Bugfix] Correctly call cudaProfilerStop in benchmarks script by @b8zhong in #14183
  • [Perf] Reduce MLA CPU overheads in V1 by @LucasWilkinson in #14384
  • [FP8] Refactor apply_fp8_linear and apply_fp8_linear_generic into an object by @ProExpertProg in #14390
  • [BugFix] Illegal Memory Access in the blockwise cutlass fp8 GEMMs by @LucasWilkinson in #14396
  • [Bugfix] Fix JambaForCausalLM LoRA by @jeejeelee in #14370
  • [Build] Add nightly wheel fallback when latest commit wheel unavailable by @Isotr0py in #14358
  • OpenVINO: added CPU-like conditions by @ilya-lavrenov in #14338
  • [GH] Auto-apply multi-modality label to relevant PRs by @DarkLight1337 in #14402
  • correct wrong markdown syntax by @vincent-pli in #14414
  • [Bugfix] Further clean up LoRA test by @jeejeelee in #14422
  • [Bugfix] Clean up multi-modal processors by @DarkLight1337 in #14417
  • [Misc] Set default value of seed to None by @SmartManoj in #14274
  • [BUGFIX] Skip tokenization support for throughput benchmark by @maleksan85 in #12712
  • Fix missing kv_caches and attn_metadata in OpenVINOCausalLM by @hmellor in #14271
  • Use the optimized block sizes after tuning the kernel. by @vanbasten23 in #14329
  • [V1][Core] Support for Structured Outputs by @aarnphm in #12388
  • [Doc] Update prefix_caching.md to match the example image by @York-RDWang in #14420
  • [Benchmarks] Make detokenization optional in benchmark scripts by @JArnoldAMD in #11697
  • [Kernel] optimize performance of gptq marlin kernel when n is small by @jinzhen-lin in #14138
  • [Misc] Add Phi4-MM example by @jeejeelee in #14343
  • [v1] torch.compile integration explanation by @youkaichao in #14437
  • [V1] Eagerly remove finished requests from the batch by @njhill in #14388
  • [V1][Metrics] Fix traceback with preemptions+LoRA by @markmc in #14220
  • [Bugfix] Fix torch_xla which can't handle None seed introduced in #14274 by @yarongmu-google in #14459
  • [V1] Prompt logprobs + APC compatibility; prompt logprobs reqs cannot fill APC by @afeldman-nm in #13949
  • [Bugfix][V1] Handle MLA in kv_cache_interface by @tlrmchlsmth in #14462
  • Revert "[Perf] Reduce MLA CPU overheads in V1 (#14384)" by @tlrmchlsmth in #14471
  • [Bugfix][Disaggregated] Add a check in send_kv_caches_and_hidden_states and fix the reshape of the KVCache by @hasB4K in #14369
  • [MISC][V1] Register process killing handler only in the main thread by @comaniac in #14380
  • [core] add extra_args to SamplingParams by @akeshet in #13300
  • [CI/Build] refactor: set timezone of container to UTC by @bufferoverflow in #12888
  • Default to generation_config from model by @hmellor in #12622
  • [Doc]add doc for Qwen models tool calling by @WangErXiao in #14478
  • [Doc] Added QwQ-32B to the supported models list in the reasoning out… by @WangErXiao in #14479
  • [Bugfix] Make the deviceprofiler include LoRA memory. by @jeejeelee in #14469
  • Add training doc signposting to TRL by @hmellor in #14439
  • [Build/BugFix] Fix hopper 12.8 build by @LucasWilkinson in #14354
  • Add RLHF document by @hmellor in #14482
  • [CI/Build] Use a fixed seed to avoid flaky tests by @DarkLight1337 in #14480
  • [V1] TPU - Add tensor parallel support via Ray by @alexm-redhat in #13618
  • [VLM] Add TP support for Phi-4-MM by @Isotr0py in #14453
  • [Misc] add use_tqdm_on_load to reduce logs by @aarnphm in #14407
  • [V1][Core] Fix memory issue with logits & sampling by @ywang96 in #13776
  • [benchmarks] Add option to use unique jsonschema for each request by @russellb in #14457
  • [Misc] Don't run ruff at all on 3rd party libs by @DarkLight1337 in #14493
  • Move requirements into their own directory by @hmellor in #12547
  • [Bugfix] DeepSeek Accuracy by @LucasWilkinson in #14476
  • [Bugfix] Fix profiling OOM and decouple encoder multimodal profiling by @Isotr0py in #14361
  • Update CODEOWNERS for structured output by @russellb in #14496
  • [Misc] Upgrade to Python 3.9 typing for additional directories by @DarkLight1337 in #14492
  • [V1] Support bad_words in sampler by @22quinn in #13376
  • Revert "[V1][Core] Fix memory issue with logits & sampling" by @robertgshaw2-redhat in #14504
  • [Attention] Default to FlashMLA backend for MLA by @LucasWilkinson in #14451
  • [V1][TPU] Remove unnecessary padding for running on TPU. by @vanbasten23 in #14467
  • [Feat] Support chunked prefill for LMCache connector by @YaoJiayi in #14505
  • [Bugfix] Fix tqdm progress bar when SamplingParams.n > 1 by @yanyc428 in #12428
  • [Bugfix] Revert QKVCrossParallelLinear usage in Mllama to keep BNB quantization work by @Isotr0py in #14498
  • [Hardware][TPU] Fix the recompiling issue in logits processor after warmup by @yaochengji in #14510
  • [Misc] Ensure out-of-tree quantization method recognize by cli args by @liuyanyi in #14328
  • [Bugfix] Wrong requirements path - rocm by @martinhoyer in #14527
  • [Feature] Consolidate performance benchmark datasets by @JenZhao in #14036
  • [Misc] Add log information for handle_process_request. by @chaunceyjiang in #14130
  • [Docs] Mention model_impl arg when explaining Transformers fallback by @hmellor in #14552
  • [Frontend] support image embeds by @chaunceyjiang in #13955
  • [Kernel] Add more dtype support for GGUF kernels by @SzymonOzog in #14043
  • [Doc] Update PaliGemma note to a warning by @DarkLight1337 in #14565
  • Correct capitalisation: Github -> GitHub by @hmellor in #14561
  • [V1][Bugfix] Fix handing of second_per_grid_ts for Qwen2-VL & Qwen2.5-VL by @ywang96 in #14548
  • Correct capitalisation: VLLM -> vLLM by @hmellor in #14562
  • [Docs] Make installation URLs nicer by @hmellor in #14556
  • [Bugfix][v1] fixed llava-hf/llava-1.5-7b-hf is broken on V1 by @chaunceyjiang in #14554
  • [Perf] Improve MLA on V1 by @simon-mo in #14540
  • [Minor] Update the tqdm bar for parallel sampling by @WoosukKwon in #14571
  • [V1] LoRA - Add triton kernels for V1 by @varun-sundar-rabindranath in #13096
  • Fix typo in benchmark_serving_structured_output.py by @russellb in #14566
  • [V1] Prevent xgrammar from breaking TPU support by @russellb in #14575
  • [Kernel] moe wna16 cuda kernel by @jinzhen-lin in #13321
  • [MISC][V1] Handle exception of current_platform.get_device_name() in arg_utils by @comaniac in #14379
  • [Neuron] Add Neuron device communicator for vLLM v1 by @gnovack in #14085
  • [neuron] add reshape_and_cache by @liangfu in #14391
  • [V1][PP] Do not block engine core when no requests to schedule by @comaniac in #14585
  • [Bugfix] Fix FP16 overflow for DeepSeek V2 by @Concurrensee in #13232
  • [V1][Core] Fix memory issue with logits & sampling by @ywang96 in #14508
  • [Misc] Correct deepseek-vl2 chat template by @Isotr0py in #14558
  • [Perf]:Optimize qwen2-vl to reduce cudaMemcpyAsync by @cynthieye in #14377
  • [VLM] Cleanup siglip legacy code and fix broken paligemma multimodal processor by @Isotr0py in #14602
  • benchmarks: simplify test jsonschema by @russellb in #14567
  • dynamic dispatch of fp8 kernels by @jeffdaily in #14245
  • [Bugfix] Update --hf-overrides for Alibaba-NLP/gte-Qwen2 by @DarkLight1337 in #14609
  • Uninstall dependencies before installing requirements/tpu.txt by @richardsliu in #14586
  • [V1] Add regex structured output support with xgrammar by @russellb in #14590
  • docs: Add documentation for s390x cpu implementation by @dilipgb in #14198
  • [BugFix/Build] Fix sparse kernels not getting built on hopper by @LucasWilkinson in #14572
  • [Hardware][Intel GPU] upgrade IPEX dependency to 2.6.10. by @jikunshang in #14564
  • [V1] Remove cache from StructuredOutputManager by @russellb in #14622
  • fix some typos : supported_head_sizes by @hackty in #14627
  • [V1] Delay all xgrammar usage until needed by @russellb in #14616
  • Fix run_tpu_test by @richardsliu in #14641
  • [V1][TPU] Pad the block_table.shape[1] so the ragged paged attention can handle correctly by @vanbasten23 in #14597
  • [Bugfix][V1][PP] Only warmup sampler at last PP rank by @comaniac in #14643
  • [release] Add commands to clean up logs on TPU release node by @khluu in #14642
  • [Feature] Add vllm bench CLI by @randyjhc in #13993
  • [core][V1] pluggable scheduler by @joerunde in #14466
  • [Doc] Update benchmarks README by @JenZhao in #14646
  • [Model] Extend Ultravox to accept audio longer than 30s by @farzadab in #13631
  • [V1][Core] Support MistralTokenizer for Structured Output by @aarnphm in #14625
  • [Core] Refactor QKVCrossParallelLinear implementation to support BNB 4-bit quantization by @Isotr0py in #14545
  • [Kernel] GGUF MoE kernel by @SzymonOzog in #14613
  • [V1][Bugfix][Spec Decode] Fix incorrect outputs in V1 speculative decoding due to batch indexing by @benchislett in #14645
  • [Kernel] Add ModelOpt FP4 Checkpoint Support by @pavanimajety in #12520
  • [CPU] Upgrade CPU backend to torch-2.6 by @bigPYJ1151 in #13381
  • [ROCm][Bugfix] Ensure that the moe_wna16_gemm kernel is not built on ROCm platforms. by @SageMoore in #14629
  • [Model] Add support for Gemma 3 by @WoosukKwon in #14660
  • [Bugfix] Missing thumbnail from NVLM-D processor by @ameyanjarlekar in #14633
  • [ROCm] Enable chunked prefill/paged attention in MLA on ROCm by @SageMoore in #14316
  • [FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. by @tjtanaa in #14664
  • [BugFix][V1] Fix parallel sampling finishing/aborts by @njhill in #14512
  • [V1] Allow sliding window + prefix caching by @WoosukKwon in #13069
  • [release] Add force remove for TPU logs by @khluu in #14697
  • [bugfix] fixup warning message for plugged schedulers for v1 by @joerunde in #14700
  • Add ray[data] as tpu dependency by @richardsliu in #14691
  • [ROCm][FP8] Fix for adjustments needed only for fnuz by @gshtras in #14689
  • [BugFix][TritonMLA] Process weights after model loading for GGUF by @tywuAMD in #14555
  • [Config][Disaggregated] Add timeout configuration for the torch.store and add KVTransferConfig.kv_connector_extra_config by @hasB4K in #14367
  • [V1][TPU] Add assertion on multi-step-scheduler by @lsy323 in #14707
  • [Quant] BartModel SupportsQuant by @kylesayrs in #14699
  • [Quant] Bamba SupportsQuant by @kylesayrs in #14698
  • [Bugfix] Fix chunked prefill for GGUF by @SzymonOzog in #14666
  • [CI/Build] Delete ultravox LoRA test by @jeejeelee in #14730
  • [Bugfix] fix benchmark moe by @jeejeelee in #14653
  • [VLM] Support pan-and-scan for Gemma3 multi-modal processor by @DarkLight1337 in #14672
  • [VLM] Support loading InternVideo2.5 models as original InternVLChatModel by @Isotr0py in #14738
  • [Bugfix] Fix prompt format of GLM4V by @DarkLight1337 in #14539
  • [V1][Minor] Minor enhancements on scheduler by @WoosukKwon in #14732
  • [Misc] Clean up processor tests by @DarkLight1337 in #14771
  • [V1][Core] using cached vocab_size for Structured Outputs by @aarnphm in #14630
  • [V1] Detokenizer: Respect Stop Tokens + not include_stop_str_in_output by @afeldman-nm in #14624
  • [Attention] Remove slow setattr in MLA by @LucasWilkinson in #14769
  • [Doc] Fix typo in documentation by @yasu52 in #14783
  • [Doc] Fix small typo in Transformers fallback by @heheda12345 in #14791
  • [V1] TPU - Enable prefix caching by default by @alexm-redhat in #14773
  • forward fix PR 14245, restore build on ROCm 6.2 by @jeffdaily in #14709
  • [V1] Move OOM check into sampler run by @ywang96 in #14728
  • [V1] Temporarily disable FlashInfer Rejection Sampler by @WoosukKwon in #14788
  • [Kernel] LoRA - Enable CUDAGraphs for V1 by @varun-sundar-rabindranath in #14626
  • [Kernel] [V1] Further optimizations to ROCm (Triton) Backend to better handle GQA. by @tdoublep in #14431
  • [Bugfix][IPEX] Add VLLM_CPU_MOE_PREPACK to allow disabling MoE prepack when CPU does not support it by @gau-nernst in #14681
  • [ci] Reduce number of tests in fastcheck by @khluu in #14782
  • [Misc][Minor] Simplify SamplingParams.__post_init__() by @njhill in #14772
  • [Neuron] flatten test parameterization for neuron attention kernels by @liangfu in #14712
  • [Feature] Add visionarena offline support for benchmark_throughput by @JenZhao in #14654
  • [CI] Fix missing example model id in processor test by @ywang96 in #14787
  • [Attention] MLA get rid of materialization by @LucasWilkinson in #14770
  • [Bugfix][Kernel][CPU] Fix num_tokens in CPU rotary embedding kernel by @gau-nernst in #14667
  • [BugFix]Fix performance serving benchmark when enable profiling by @Potabk in #14737
  • [Misc] Clean up type annotation for SupportsMultiModal by @DarkLight1337 in #14794
  • [Bugfix] Fix small typo in the example of Streaming delimiter by @bravo325806 in #14793
  • [Misc] Gemma3ForConditionalGeneration supports LoRA by @jeejeelee in #14797
  • [V1][Minor] Minor code cleanup for scheduling metrics by @WoosukKwon in #14800
  • [Bugfix][W8A8] fixed cutlass block fp8 binding by @DefTruth in #14796
  • [VLM] Various cleanup and fixes by @DarkLight1337 in #14806
  • [BugFix]: properly catch templating error when preprocess input by @gcalmettes in #13976
  • [Bugfix] Fix Aria test loading by @DarkLight1337 in #14823
  • [V1] Fix vocab size calculation for structured output by @russellb in #14826
  • [Frontend] Fix log message to use http vs https by @russellb in #14774
  • [V1][Metrics] Updated list of deprecated metrics in v0.8 by @markmc in #14695
  • [Frontend] track server_load by @daniel-salib in #13950
  • [Bugfix][Kernel]: Fix AllSpark kernel compilation errors and enable for CUDA < 12.0 by @wyajieha in #14430
  • [release] Remove log cleanup commands from TPU job by @khluu in #14838
  • Re-enable the AMD Entrypoints Test by @Alexei-V-Ivanov-AMD in #14711
  • [Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies by @cyang49 in #14778
  • [V1] Fix model parameterization for structured output tests by @russellb in #14833
  • Update to torch==2.6.0 by @mgoin in #12721
  • [CI] Add TPU v1 test by @richardsliu in #14834
  • [Build/CI] Move ninja to common deps by @russellb in #14835
  • [Build/CI] Upgrade aiohttp to include CVE fix by @russellb in #14840
  • [Doc] More neutral K8s deployment guide by @terrytangyuan in #14084
  • [Bugfix] Fix torch_xla in V0 which can't handle None seed introduced … by @yarongmu-google in #14844
  • [Neuron][CI] update docker run command by @liangfu in #14829
  • [Bugfix][V1] Fix flashinfer sampling by @DefTruth in #14815
  • Revert "[Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of U… by @tlrmchlsmth in #14848
  • Disable outlines cache by default by @russellb in #14837
  • [Misc] Remove misleading message in gemma2 and gemma3 by @Isotr0py in #14850
  • [Misc][Easy] Annotate unused vars in the csrc files by @houseroad in #14798
  • [V1] V1 Enablement Oracle by @robertgshaw2-redhat in #13726
  • [Docs] Add new East Coast vLLM Meetup slides to README and meetups.md by @simon-mo in #14852
  • [CPU] Support FP8 KV cache by @bigPYJ1151 in #14741
  • [Attention] Get rid of mla cache alignment by @LucasWilkinson in #14842
  • [CI/Build] Delete LoRA bias test by @jeejeelee in #14849
  • [V1][Structured Output] calculate vocab_size eagerly by @aarnphm in #14851
  • [Doc] V1 user guide by @JenZhao in #13991
  • [Build/CI] Upgrade jinja2 to get 3 moderate CVE fixes by @russellb in #14839
  • [Bugfix] EAGLE output norm bug by @luyuzhe111 in #14464
  • [VLM] Limit multimodal input cache by memory by @DarkLight1337 in #14805
  • [CI][Intel GPU] refine intel GPU ci docker build by @jikunshang in #14860
  • [Core] Expose API endpoint /is_sleeping by @waltforme in #14312
  • [VLM] Merged multi-modal processor for Pixtral by @Flechman in #12211
  • [Misc][Doc] Minor benchmark README update by @ywang96 in #14874
  • [VLM] Clean up Phi-4-MM ViT implementation by @Isotr0py in #14812
  • [V1] Remove V0 fallback for mistral-tokenizer by @ywang96 in #14873
  • [Kernel] Add more tuned configs by @simon-mo in #14877
  • [BugFix] Fix torch distributed stateless PG backend init by @njhill in #14870
  • [V1] [Spec Decode] Fix ngram tests by @LiuXiaoxuanPKU in #14878
  • [Bugfix] Limit profiling run sequence length by max_model_len by @kylesayrs in #14785
  • [Bugfix] Explicitly disable Phi-4-multimodal in V1 by @DarkLight1337 in #14889
  • Revert "[Bugfix] Limit profiling run sequence length by max_model_len (#14785) by @DarkLight1337 in #14892
  • [BugFix][V1] Fix overhead related to bad_words sampling when not in use by @njhill in #14894
  • [V1][BugFix] Detect interleaved sliding window attention by @WoosukKwon in #14896
  • [Misc] Catching Ray Compiled Graph PP test failures for V1 by @ruisearch42 in #14847
  • [Doc] Add guidance for using ccache with pip install -e . in doc by @vadiklyutiy in #14901
  • [V1] Enable Entrypoints Tests by @robertgshaw2-redhat in #14903
  • [CI] Fix Tool Calling Tests by @robertgshaw2-redhat in #14898
  • [CI/Build] Update defaults for test reproducibility by @DarkLight1337 in #14893
  • [V1] Optimize the overhead of rewinding by @WoosukKwon in #14905
  • [V1][Minor] Add repr to ConstantList by @WoosukKwon in #14907
  • [BugFix] Fix MLA + V1 + TP==1 causing reinitialization of cuda context by @LucasWilkinson in #14910
  • [Misc] Replace os environ to monkeypatch in test suite by @t-sibiraj in #14516
  • [Benchmark] Do not save detailed info to json by default by @simon-mo in #14879
  • [V1] [Spec Decode] Support random sampling for spec decode by @LiuXiaoxuanPKU in #13933
  • [V1] Remove input cache client by @DarkLight1337 in #14864
  • [Misc][XPU] Use None as device capacity for XPU by @yma11 in #14932
  • [Doc] Add vLLM Beijing meetup slide by @heheda12345 in #14938
  • setup.py: drop assumption about local main branch by @russellb in #14692
  • [MISC] More AMD unused var clean up by @houseroad in #14926
  • fix minor miscalled method by @kushanam in #14327
  • [V1][TPU] Apply the ragged paged attention kernel fix and remove the padding. by @vanbasten23 in #14846
  • [Bugfix] Fix Ultravox on V1 by @DarkLight1337 in #14929
  • [Misc] Add --seed option to offline multi-modal examples by @DarkLight1337 in #14934
  • [Bugfix][ROCm] running new process using spawn method for rocm in tests. by @vllmellm in #14810
  • [Doc] Fix misleading log during multi-modal profiling by @DarkLight1337 in #14955
  • Add patch merger by @patrickvonplaten in #14957
  • [V1] Default MLA to V1 by @simon-mo in #14921
  • [Bugfix] Fix precommit - line too long in pixtral.py by @tlrmchlsmth in #14960
  • [Bugfix][Model] Mixtral: use unused head_dim config argument by @qtrrb in #14961
  • [Fix][Structured Output] using vocab_size to construct matcher by @aarnphm in #14868
  • [Bugfix] Make Gemma3 MM V0 only for now by @ywang96 in #14971

New Contributors

  • @ajayvohra2005 made their first contribution in #13589
  • @Edwinhr716 made their first contribution in #12913
  • @Hongbosherlock made their first contribution in #12978
  • @johnzheng1975 made their first contribution in #13668
  • @JenZhao made their first contribution in #13594
  • @bufferoverflow made their first contribution in #13011
  • @cakeng made their first contribution in #12583
  • @eli-b made their first contribution in #13785
  • @YaoJiayi made their first contribution in #12953
  • @edwardzjl made their first contribution in #13468
  • @naromero77amd made their first contribution in #13623
  • @henrylhtsang made their first contribution in #13797
  • @tianyuzhou95 made their first contribution in #13863
  • @b8zhong made their first contribution in #13736
  • @Chenyaaang made their first contribution in #13860
  • @observerw made their first contribution in #13958
  • @qli88 made their first contribution in #13718
  • @benchislett made their first contribution in #13626
  • @hasB4K made their first contribution in #13987
  • @Kacper-Pietkun made their first contribution in #13213
  • @Ryp made their first contribution in #13090
  • @LouieYang made their first contribution in #14031
  • @vanbasten23 made their first contribution in #13379
  • @atalman made their first contribution in #13926
  • @wyajieha made their first contribution in #12931
  • @qux-bbb made their first contribution in #14086
  • @realShengYao made their first contribution in #14051
  • @zhanwenchen made their first contribution in #14142
  • @rainkert made their first contribution in #13750
  • @congcongchen123 made their first contribution in #14119
  • @iacolippo made their first contribution in #14217
  • @zhe-thoughts made their first contribution in #14288
  • @DaividFrank made their first contribution in #14293
  • @vincent-4 made their first contribution in #13997
  • @pyc96 made their first contribution in #14237
  • @upayuryeva made their first contribution in #14363
  • @courage17340 made their first contribution in #14326
  • @dilipgb made their first contribution in #12613
  • @ZhongYingMatrix made their first contribution in #13897
  • @hj-mistral made their first contribution in #14224
  • @yaochengji made their first contribution in #14310
  • @dyli-google made their first contribution in #14385
  • @vincent-pli made their first contribution in #14414
  • @York-RDWang made their first contribution in #14420
  • @yarongmu-google made their first contribution in #14459
  • @22quinn made their first contribution in #13376
  • @yanyc428 made their first contribution in #12428
  • @martinhoyer made their first contribution in #14527
  • @gnovack made their first contribution in #14085
  • @cynthieye made their first contribution in #14377
  • @jeffdaily made their first contribution in #14245
  • @hackty made their first contribution in #14627
  • @randyjhc made their first contribution in #13993
  • @ameyanjarlekar made their first contribution in #14633
  • @tywuAMD made their first contribution in #14555
  • @yasu52 made their first contribution in #14783
  • @gau-nernst made their first contribution in #14681
  • @Potabk made their first contribution in #14737
  • @bravo325806 made their first contribution in #14793
  • @daniel-salib made their first contribution in #13950
  • @cyang49 made their first contribution in #14778
  • @luyuzhe111 made their first contribution in #14464
  • @Flechman made their first contribution in #12211
  • @vadiklyutiy made their first contribution in #14901
  • @t-sibiraj made their first contribution in #14516
  • @vllmellm made their first contribution in #14810
  • @qtrrb made their first contribution in #14961

Full Changelog: v0.7.3...v0.8.0
