vLLM v0.8.5

Highlights

This release features important multi-modal bug fixes, day 0 support for Qwen3, and xgrammar's structural tag feature for tool calling.

Model Support

  • Day 0 support for Qwen3 and Qwen3MoE. This release fixes fp8 weight loading (#17318) and adds tuned MoE configs (#17328). A quick-start sketch follows this list.
  • Add ModernBERT (#16648)
  • Add Granite Speech Support (#16246)
  • Add PLaMo2 (#14323)
  • Add Kimi-VL model support (#16387)
  • Add Qwen2.5-Omni model support (thinker only) (#15130)
  • Snowflake Arctic Embed (Family) (#16649)
  • Accuracy fixes for Llama 4 Int4 (#16801), a chat template for Llama 4 models (#16428), and enhanced AMD support (#16674, #16847)
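
For a quick start with the new Qwen3 support, the sketch below uses vLLM's offline `LLM` API. The `Qwen/Qwen3-8B` checkpoint id is an assumption for illustration; any Qwen3 or Qwen3MoE checkpoint can be substituted.

```python
from vllm import LLM, SamplingParams

# Assumed checkpoint id; swap in any Qwen3 / Qwen3MoE model you have access to.
llm = LLM(model="Qwen/Qwen3-8B")
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

outputs = llm.generate(["Summarize what vLLM does in one sentence."], sampling)
print(outputs[0].outputs[0].text)
```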

V1 Engine

  • Add structural_tag support using xgrammar (#17085); see the request sketch after this list
  • Disaggregated serving:
    • KV Connector API V1 (#15960)
    • Adding LMCache KV connector for v1 (#16625)
  • Clean up: Remove Sampler from Model Code (#17084)
  • MLA: Simplification to batch P/D reordering (#16673)
  • Move usage stats to worker and start logging TPU hardware (#16211)
  • Support FlashInfer Attention (#16684)
  • Faster incremental detokenization (#15137)
  • EAGLE-3 Support (#16937)
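
The structural tag feature constrains generation so that text emitted between trigger tags must match a JSON schema, which is useful for tool calling. Below is a hedged sketch of a chat-completions request against a running vLLM OpenAI-compatible server; the begin/schema/end structure and the triggers field follow xgrammar's structural tag format, and the model id and tag strings are illustrative assumptions, so check the structured outputs documentation for the exact request shape.

```python
import requests

payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # assumed model id
    "messages": [
        {"role": "user", "content": "What is the weather in Paris? Call the tool."}
    ],
    # Assumed structural-tag request shape (per xgrammar's structural tag format):
    # any text between the begin/end tags must match the given JSON schema,
    # and constrained decoding kicks in when a trigger prefix is generated.
    "response_format": {
        "type": "structural_tag",
        "structures": [
            {
                "begin": "<function=get_weather>",
                "schema": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
                "end": "</function>",
            }
        ],
        "triggers": ["<function="],
    },
}

resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```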

Features

  • Validate urls object for multimodal content parts (#16990)
  • Prototype support for sequence parallelism using a compilation pass (#16155)
  • Add sampling params to v1/audio/transcriptions endpoint (#16591); see the sketch after this list
  • Enable vLLM to Dynamically Load LoRA from a Remote Server (#10546)
  • Add vllm bench [latency, throughput] CLI commands (#16508)
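
With sampling parameters now accepted by the transcriptions endpoint, a request can steer how the transcript is decoded. The sketch below is a hedged example using the OpenAI Python client against a local vLLM server; temperature is a standard field, while the names passed through extra_body (top_p, seed) are assumptions about what the endpoint accepts in this release, and the Whisper model id is illustrative.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",  # assumed model id served by vLLM
        file=audio,
        temperature=0.2,
        # Assumed extra sampling params; verify against the endpoint's docs.
        extra_body={"top_p": 0.9, "seed": 42},
    )
print(transcription.text)
```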

Performance

  • Attention:
    • FA3 decode perf improvement - single mma warp group support for head dim 128 (#16864)
    • Update to latest FA3 code (#13111)
    • Support Cutlass MLA for Blackwell GPUs (#16032)
  • MoE:
    • Add expert_map support to Cutlass FP8 MOE (#16861)
    • Add fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 on NVIDIA H20 (#16753)
  • Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS (#6036)
  • Optimize rotary_emb implementation to use Triton operator for improved performance (#16457)

Hardware

  • TPU:
    • Enable structured decoding on TPU V1 (#16499)
    • Capture multimodal encoder during model compilation (#15051)
    • Enable Top-P (#16843)
  • AMD:
    • AITER Fused MOE V1 Support (#16752)
    • Integrate Paged Attention Kernel from AITER (#15001)
    • Support AITER MLA (#15893)
    • Upstream prefix prefill speed up for vLLM V1 (#13305)
    • Adding fp8 and variable length sequence support to Triton FAv2 kernel (#12591)
    • Add skinny gemms for unquantized linear on ROCm (#15830)
    • Follow-ups for Skinny Gemms on ROCm (#17011)

Documentation

  • Add open-webui example (#16747)
  • Document Matryoshka Representation Learning support (#16770); an embeddings sketch follows this list
  • Add a security guide (#17230)
  • Add example to run DeepSeek with Ray Serve LLM (#17134)
  • Benchmarks for audio models (#16505)
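
For models trained with Matryoshka Representation Learning, embeddings can be truncated to a smaller dimension at request time. Below is a hedged sketch using the OpenAI-compatible embeddings endpoint; the dimensions field is the standard OpenAI parameter, while the Snowflake Arctic Embed model id and the chosen dimension are assumptions and depend on the model's Matryoshka support.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.embeddings.create(
    model="Snowflake/snowflake-arctic-embed-m-v1.5",  # assumed served model id
    input=["vLLM makes LLM serving fast and easy."],
    dimensions=256,  # requested Matryoshka output dimension
)
print(len(resp.data[0].embedding))  # expect 256 if truncation is supported
```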

Security and Dependency Updates

  • Don't bind tcp zmq socket to all interfaces (#17197)
  • Use safe serialization and fix zmq setup for mooncake pipe (#17192)
  • Bump Transformers to 4.51.3 (#17116)

Build and Testing

  • Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema (#16721)

What's Changed

  • Improve configs - SchedulerConfig by @hmellor in #16533
  • [Misc] remove warning if triton>=3.2.0 by @DefTruth in #16553
  • [Misc] refactor examples by @reidliu41 in #16563
  • [Misc] Update usage with mooncake lib for kv transfer by @ShangmingCai in #16523
  • [fix]: Dockerfile.ppc64le fixes for opencv-python and hf-xet by @Shafi-Hussain in #16048
  • [Bugfix] Multi-modal caches not acting like LRU caches by @DarkLight1337 in #16593
  • [TPU][V1] Fix exponential padding when max-num-batched-tokens is not a power of 2 by @NickLucche in #16596
  • Fix triton install condition on CPU by @hmellor in #16600
  • s390x: Fix PyArrow build and add CPU test script for Buildkite CI by @Nash-123 in #16036
  • [Model][VLM] Add Kimi-VL model support by @courage17340 in #16387
  • [Hardware][TPU] Add torchvision to tpu dependency file by @lsy323 in #16616
  • [DOC][TPU] Add core idea about avoiding recompilation after warmup by @yaochengji in #16614
  • config check sleep mode support oot platforms by @celestialli in #16562
  • [Core][Bugfix] Fix Offline MM Beam Search by @alex-jw-brooks in #16390
  • [Kernel] moe wna16 marlin kernel by @jinzhen-lin in #14447
  • [BugFix]: Update minimum pyzmq version by @taneem-ibrahim in #16549
  • [Bugfix] Fix tests/kernels/test_mamba_ssm_ssd.py by @tlrmchlsmth in #16623
  • [Bugfix] Fix broken GritLM model and tests (missing pooling_metadata) by @pooyadavoodi in #16631
  • Add vllm bench [latency, throughput] CLI commands by @mgoin in #16508
  • Fix vLLM x torch.compile config caching by @zou3519 in #16491
  • [Misc] refactor argument parsing in examples by @reidliu41 in #16635
  • [CI/Build] Fix LoRA OOM by @jeejeelee in #16624
  • Add "/server_info" endpoint in api_server to retrieve the vllm_config.  by @Cangxihui in #16572
  • [Kernel] Remove redundant Exp calculations by @DefTruth in #16123
  • [Misc] Update compressed-tensors WNA16 to support zero-points by @dsikka in #14211
  • [Misc] Enable vLLM to Dynamically Load LoRA from a Remote Server by @angkywilliam in #10546
  • [Model] Add PLaMo2 by @Alnusjaponica in #14323
  • [Bugfix] fix gpu docker image mis benchmarks dir by @lengrongfu in #16628
  • [Misc] Modify LRUCache touch by @jeejeelee in #16689
  • Disable remote caching when calling compile_fx by @zou3519 in #16611
  • [Feature] add model aware kv ops helper by @billishyahao in #16020
  • [ROCM] Bind triton version to 3.2 in requirements-built.txt by @SageMoore in #16664
  • [V1][Structured Output] Move xgrammar related utils to backend_xgrammar.py by @shen-shanshan in #16578
  • [CI] Cleanup additional_dependencies: [toml] for pre-commit yapf hook by @yankay in #16405
  • [Misc] refactor examples series by @reidliu41 in #16708
  • [Doc] Improve OOM troubleshooting by @DarkLight1337 in #16704
  • [Bugfix][Kernel] fix potential cuda graph broken for merge_attn_states kernel by @DefTruth in #16693
  • [Model] support modernbert by @xsank in #16648
  • [Hardware] Add processor inputs to platform validation by @joerunde in #16680
  • Improve error for structured output backend selection by @hmellor in #16717
  • [Misc] Remove redundant comment by @jianzs in #16703
  • Help user create custom model for Transformers backend remote code models by @hmellor in #16719
  • [V1][Performance] Implement custom serialization for MultiModalKwargs [Rebased] by @p88h in #16432
  • [V1][Spec Dec Bug Fix] Respect Spec Dec Method Specification by @luyuzhe111 in #16636
  • Adding vllm buildkite job for IBM Power by @AaruniAggarwal in #16679
  • [V1][Frontend] Improve Shutdown And Logs by @robertgshaw2-redhat in #11737
  • [rocm][V0] fix selection logic for custom PA in V0 by @divakar-amd in #16426
  • [Bugfix] Update Florence-2 tokenizer to make grounding tasks work by @Isotr0py in #16734
  • [Bugfix] Revert max_prompt_len validation for decoder-only models. by @davidheineman in #16741
  • [V1] Remove log noise when idle by @russellb in #16735
  • [Ray] Improve documentation on batch inference by @richardliaw in #16609
  • [misc] ignore marlin_moe_wna16 local gen codes by @DefTruth in #16760
  • [Doc] Add more tips to avoid OOM by @DarkLight1337 in #16765
  • [doc] add open-webui example by @reidliu41 in #16747
  • [Bugfix] Fix GLM4 model by @intervitens in #16618
  • [Doc] Fix a 404 link in installation/cpu.md by @windsonsea in #16773
  • [Misc] refactor examples series - lmcache by @reidliu41 in #16758
  • Improve configs - TokenizerPoolConfig + DeviceConfig by @hmellor in #16603
  • fix: hyperlink by @reidliu41 in #16778
  • [Doc] Make sure to update vLLM when installing latest code by @DarkLight1337 in #16781
  • [Doc] Document Matryoshka Representation Learning support by @noooop in #16770
  • [Doc] Changed explanation of generation_tokens_total and prompt_tokens_total counter type metrics to avoid confusion by @insukim1994 in #16784
  • [V1][Perf] Faster incremental detokenization by @njhill in #15137
  • [Bugfix]Fix index out of range error in api server log by @WangErXiao in #16787
  • [Kernel] Add fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 on NVIDIA H20 by @Ximingwang-09 in #16753
  • [Model] use AutoWeightsLoader for olmoe,opt,orion,persimmon,phi3_small by @lengrongfu in #16548
  • [TPU][V1] Fix padding recompilation when max-num-batched-tokens is not even by @NickLucche in #16726
  • [V1][TPU] Enable Top K by @NickLucche in #15489
  • [ROCM] enable aiter fused moe kernel for llama4 bf16 checkpoints by @sijiac in #16674
  • [V1][Metrics] Fix http metrics middleware by @markmc in #15894
  • [MLA] Simplification to batch P/D reordering by @njhill in #16673
  • [P/D][V1] KV Connector API V1 by @ApostaC in #15960
  • [Attention] Update to latest FA3 code by @LucasWilkinson in #13111
  • Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema by @tarukumar in #16721
  • [Doc] Improve help examples for --compilation-config by @DarkLight1337 in #16729
  • [Misc] Update outdated note: LMCache now supports chunked prefill by @chaunceyjiang in #16697
  • [V1][Structured Output] Minor modification to _validate_structured_output() by @shen-shanshan in #16748
  • Add hardware print to TPU V1 test by @mgoin in #16792
  • [BugFix] Accuracy fix for llama4 int4 - improperly casted scales by @LucasWilkinson in #16801
  • Improve configs - MultiModalConfig + PoolerConfig + DecodingConfig by @hmellor in #16789
  • [Misc] add collect_env to cli and docker image by @lengrongfu in #16759
  • [ROCm] [Attention] Cleanup ROCm output passing by @ProExpertProg in #16431
  • [Bugfix] fix pp for llama4 by @luccafong in #16746
  • [Doc] add podman setup instructions for official image by @nathan-weinberg in #16796
  • [Docs] Fix a link and grammar issue in production-stack.md by @windsonsea in #16809
  • [Model] use AutoWeightsLoader for BigCode, GPT-J by @jonghyunchoe in #16823
  • [Misc] Clean up Kimi-VL by @DarkLight1337 in #16833
  • Fix nullable_kvs fallback by @hmellor in #16837
  • [New Model]: Snowflake Arctic Embed (Family) by @noooop in #16649
  • [Misc] refactor examples series - Chat Completion Client With Tools by @reidliu41 in #16829
  • [Doc] Updated Llama section in tool calling docs to have llama 3.2 config info by @jmho in #16857
  • publish neuron docker image by @omrishiv in #16733
  • [Model][VLM] Add Qwen2.5-Omni model support (thinker only) by @fyabc in #15130
  • [rocm][MI300] llama4 maverick fp8 moe config tp8 by @divakar-amd in #16847
  • [Frontend] Add sampling params to v1/audio/transcriptions endpoint by @NickLucche in #16591
  • [Misc] Benchmarks for audio models by @NickLucche in #16505
  • [V1][Misc] stop update prefix cache stats when logs_stats is disabled by @vie-serendipity in #16460
  • [Model] Refactor Phi-4-multimodal to use merged processor and support V1 by @Isotr0py in #15477
  • [Model] Qwen2.5-Omni Cleanup by @ywang96 in #16872
  • [VLM] Clean up models by @DarkLight1337 in #16873
  • [doc] update hyperlink by @reidliu41 in #16877
  • Log how much time loading a compiled artifact takes by @zou3519 in #16848
  • Serialize tensors using int8 views by @p88h in #16866
  • Improve configs - CacheConfig by @hmellor in #16835
  • [easy] Pass compile_fx only the config patches by @zou3519 in #16845
  • [Bugfix] Fix v1/spec_decode/test_ngram.py by @zixi-qi in #16895
  • [CI/CD][V1] Add spec decode tests to CI by @WoosukKwon in #16900
  • [Bugfix] Fix distributed bug in Qwen2.5-VL & Qwen2.5-Omni by @fyabc in #16907
  • [Doc] Split dummy_processor_inputs() in Multimodal Docs by @alex-jw-brooks in #16915
  • Restore buffers when wake up from level 2 sleep (#16564) by @fingertap in #16889
  • [Misc] fix collect_env version parse by @wangxiyuan in #15267
  • [Misc] Refactor platform to get device specific stream and event by @shen-shanshan in #14411
  • [Bugfix] Fix GLM rotary_dim issue and support v1 by @Isotr0py in #16912
  • Raise error for data-parallel with benchmark_throughput by @kartikx in #16737
  • [XPU][Bugfix] minor fix for XPU by @yma11 in #15591
  • [doc] install required python3-dev apt package by @davidxia in #16888
  • [Doc] mention how to install in CPU editable mode by @davidxia in #16923
  • [Core] Speed up decode by remove synchronizing operation in sampler by @chanh in #16436
  • [V1][Spec Decode] Handle draft tokens beyond max_model_len by @WoosukKwon in #16087
  • [TPU][V1] Implicitly adjust page size when there's SMEM OOM by @yaochengji in #16871
  • Update Qwen1.5-MoE-W4A16-compressed-tensors.yaml by @mgoin in #16946
  • [TPU][V1] Capture multimodal encoder during model compilation by @NickLucche in #15051
  • [V1] V1 FlashInfer Attention by @mgoin in #16684
  • [TPU][V1] Enable Top-P by @NickLucche in #16843
  • [Doc] Remove unnecessary V1 flag by @DarkLight1337 in #16924
  • [BugFix][Spec Decode] No in-place update to draft probs by @WoosukKwon in #16952
  • [Bugfix]: fix issue with n>1 sampling on v1 requests overriding each other by @jeffrey-dot-li in #16863
  • [ROCm] Add aiter tkw1 kernel for Llama4 fp8 by @kliuae in #16727
  • [Misc] Remove the chunked prefill warning for LoRA by @jeejeelee in #16925
  • [Kernel] Add expert_map support to Cutlass FP8 MOE by @varun-sundar-rabindranath in #16861
  • [V1] Remove additional_config check by @wangxiyuan in #16710
  • [Performance][ROCm] Add skinny gemms for unquantized linear on ROCm by @charlifu in #15830
  • Support S3 Sharded loading with RunAI Model Streamer by @omer-dayan in #16317
  • [Bugfix] Fix f-string for Python 3.9-3.11 by @DarkLight1337 in #16962
  • [Doc] Update ai_accelerator/hpu-gaudi.inc.md by @windsonsea in #16956
  • [Perf] Optimize _update_states for GPU model runner by @SnowCharmQ in #16910
  • [Bugfix] Fix the issue where llm.generate cannot be called repeatedly after setting GuidedDecodingParams by @chaunceyjiang in #16767
  • [Model] Use autoweightloader for mamba by @sfeng33 in #16950
  • [V1] Remove pre-allocation for KV cache by @WoosukKwon in #16941
  • [Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS by @LeiWang1999 in #6036
  • [BugFix] Fix incremental detokenization perf issue by @njhill in #16963
  • [Doc] Improve documentation for multimodal CLI args by @DarkLight1337 in #16960
  • [FEAT][ROCm] Integrate Paged Attention Kernel from AITER by @vllmellm in #15001
  • [Misc] refactor example series by @reidliu41 in #16972
  • [Bugfix] Fix distributed bug again in Qwen2.5-VL & Qwen2.5-Omni by @fyabc in #16974
  • Improve configs - SpeculativeConfig by @hmellor in #16971
  • [BugFix] Pass in correct VLLM config in FlashInfer backend (#13207) by @timzsu in #16973
  • [Misc] Add S3 environment variables for better support of MinIO. by @chaunceyjiang in #16977
  • [frontend] enhance tool_calls type check by @reidliu41 in #16882
  • [FEAT][ROCm]: Support AITER MLA by @vllmellm in #15893
  • Add assertion for no objects while hashing hf_config by @zou3519 in #16930
  • Fencing Kernels Tests for enabling on AMD by @Alexei-V-Ivanov-AMD in #16929
  • [BugFix] Remove default multiproc executor collective_rpc timeout by @njhill in #17000
  • [Core][V1][TPU] Enable structured decoding on TPU V1 by @Chenyaaang in #16499
  • [Bugfix] validate urls object for multimodal content parts by @gcalmettes in #16990
  • add Dockerfile build vllm against torch nightly by @yangw-dev in #16936
  • [Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1 by @maleksan85 in #13305
  • [V1][DP] More robust DP/EP dummy request coordination by @njhill in #16277
  • [BugFix] Revert ROCm Custom Paged Attention Env Flag Check by @vllmellm in #17022
  • Revert "[Misc] Add S3 environment variables for better support of MinIO." by @chaunceyjiang in #17021
  • [misc] tune some env vars for GB200 by @youkaichao in #16992
  • [INTEL-HPU][v0] Port delayed sampling to upstream by @xuechendi in #16949
  • [doc] add download path tips by @reidliu41 in #17013
  • [Bugfix] Triton FA function takes no keyword arguments by @vllmellm in #16902
  • [V1] Avoid socket errors during shutdown when requests are in flight by @njhill in #16807
  • [BugFix] llama4 fa3 fix - RuntimeError: scheduler_metadata must have shape (metadata_size) by @LucasWilkinson in #16998
  • [Misc] Improve readability of get_open_port function. by @gitover22 in #17024
  • [Bugfix] Fix AssertionError: skip_special_tokens=False is not supported for Mistral tokenizers by @chaunceyjiang in #16964
  • [CI] Run v1/test_serial_utils.py in CI by @russellb in #16996
  • Mistral-format support for compressed-tensors by @mgoin in #16803
  • Categorize tests/kernels/ based on kernel type by @mgoin in #16799
  • [Doc] Add top anchor and a note to quantization/bitblas.md by @windsonsea in #17042
  • Ensure that pid passed to kill_process_tree is int for mypy by @hmellor in #17051
  • [CI] Update structured-output label automation by @russellb in #17055
  • Improve Transformers backend model loading QoL by @hmellor in #17039
  • CacheConfig.block_size should always be int when used by @hmellor in #17052
  • Use @property and private field for data_parallel_rank_local by @hmellor in #17053
  • [Frontend] Support guidance:no-additional-properties for compatibility with xgrammar by @tjohnson31415 in #15949
  • [BugFix][V1] Fix int32 token index overflow when preparing input ids by @sarckk in #16806
  • [V1][Spec Decode] Always use argmax for sampling draft tokens by @WoosukKwon in #16899
  • [CI/Build] workaround for CI build failure by @csy1204 in #17070
  • [Quantization]add prefix for commandA quantized model by @CXIAAAAA in #17017
  • [Minor] Use larger batch sizes for A100/B100/B200/MI300x by @WoosukKwon in #17073
  • [Bugfix] Enable V1 usage stats by @mgoin in #16986
  • More informative error when using Transformers backend by @hmellor in #16988
  • Addendum Fix to support FIPS enabled machines with MD5 hashing by @sydarb in #17043
  • [Bugfix][Core] add seq_id_to_seq_group clearing to avoid memory leak when s… by @zhangyuygss in #16472
  • [V1] Update structured output by @reidliu41 in #16812
  • [doc] update to hyperlink by @reidliu41 in #17096
  • Add docs for runai_streamer_sharded by @omer-dayan in #17093
  • [Chore] Remove Sampler from Model Code by @WoosukKwon in #17084
  • Disable enforce_eager for V1 TPU sampler and structured output tests by @mgoin in #17016
  • Simplify TokenizerGroup by @hmellor in #16790
  • Fix OOT registration test by @hmellor in #17099
  • [V1][PP] Optimization: continue scheduling prefill chunks by @ruisearch42 in #17080
  • [Misc] Remove OLMo2 config copy by @Isotr0py in #17066
  • Improve static type checking in LoRAModelRunnerMixin by @hmellor in #17104
  • [V1][Structured Output] Clear xgrammar compiler object when engine core shut down to avoid nanobind leaked warning by @shen-shanshan in #16954
  • [Frontend] Using matryoshka_dimensions control the allowed output dimensions. by @noooop in #16970
  • Add missing rocm_skinny_gemms kernel test to CI by @mgoin in #17060
  • [Misc] refactor example series - structured outputs by @reidliu41 in #17040
  • [V1][Spec Decoding] Add num_drafts and num_accepted_tokens_per_position metrics by @markmc in #16665
  • [CI] Add automation for the tool-calling github label by @russellb in #17118
  • Updating buildkite job for IBM Power by @AaruniAggarwal in #17111
  • existing torch installation pip command fix for docs by @atilla00 in #17059
  • Molmo Requirements by @Eyshika in #17026
  • Add :markdownhelp: to EngineArgs docs so markdown docstrings render properly by @hmellor in #17124
  • Improve configs - LoRAConfig + PromptAdapterConfig by @hmellor in #16980
  • [Docs] Generate correct github links for decorated functions by @russellb in #17125
  • Add collective_rpc to llm engine by @yinghai in #16999
  • Add chat template for Llama 4 models by @maxdebayser in #16428
  • [Misc] Add example to run DeepSeek with Ray Serve LLM by @ruisearch42 in #17134
  • Better error message for missing mistral params.json by @mgoin in #17132
  • Use custom address for listening socket by @jglaser in #15988
  • [FEAT] [ROCm]: AITER Fused MOE V1 Support by @vllmellm in #16752
  • [Attention] FA3 decode perf improvement - single mma warp group support for head dim 128 by @LucasWilkinson in #16864
  • fix float16 support for kimi-vl by @zhouzaida in #17156
  • [Doc] V1 : Update LoRA status by @varun-sundar-rabindranath in #17133
  • [Docs] Fix True->true in supported_models.md by @mgoin in #17141
  • Move missed SchedulerConfig args into scheduler config group in EngineArgs by @hmellor in #17131
  • [Misc] Clean up redundant code in uniproc_executor.py by @lifuhuang in #16762
  • [Bugfix][Misc] Use TritonPlaceholderModule to defensively import triton by @MengqingCao in #15099
  • [Misc] Benchmark Serving Script Support Appending Results by @LucasWilkinson in #17028
  • [Perf] Optimize rotary_emb implementation to use Triton operator for improved inference performance by @cynthieye in #16457
  • [Bugfix] remove fallback in guided_json (int range, patterns) by @csy1204 in #16725
  • [Quantization][FP8] Add support for FP8 models with input_scale for output projection and QK quantization by @rasmith in #15734
  • [Doc] Add headings to improve gptqmodel.md by @windsonsea in #17164
  • Only turn on FastIncrementalDetokenizer when tokenizers >= 0.21.1 by @houseroad in #17158
  • [Doc] Add two links to disagg_prefill.md by @windsonsea in #17168
  • [Doc] Move todo out of beam search docstring by @alex-jw-brooks in #17183
  • [Bugfix] Fix mistral model tests by @DarkLight1337 in #17181
  • [Bugfix] Fix Mistral ChatCompletionRequest Body Exception by @JasmondL in #16769
  • Bump Transformers to 4.51.3 by @hmellor in #17116
  • Use Transformers helper get_text_config() instead of checking for text_config by @hmellor in #17105
  • [doc] update wrong hf model links by @reidliu41 in #17184
  • [Misc] Inline Molmo requirements by @DarkLight1337 in #17190
  • [Security] Use safe serialization and fix zmq setup for mooncake pipe by @russellb in #17192
  • [V1] Move usage stats to worker and start logging TPU hardware by @dyli-google in #16211
  • [Bugfix] Fix hybrid model tests by @DarkLight1337 in #17182
  • Fix Python packaging edge cases by @tiran in #17159
  • [BugFix][Frontend] Fix LLM.chat() tokenization by @njhill in #16081
  • [V1][Spec Decode] EAGLE-3 Support by @benchislett in #16937
  • [Misc] Refine ray_serve_deepseek example by @ruisearch42 in #17204
  • [Bugfix] gemma[2,3] interleaved attention when sliding window is disabled by @heheda12345 in #17180
  • [AMD][FP8][BugFix] Remove V1 check in arg_utils.py for FP8 since it is not necessary by @rasmith in #17215
  • [v1] [P/D] Adding LMCache KV connector for v1 by @ApostaC in #16625
  • [Bugfix] [pytorch] Patch AOTAutogradCache._get_shape_env by @jamesjwu in #17142
  • [MISC][AMD] Add unused annotation to rocm kernel file by @houseroad in #17097
  • [doc] add Anything LLM integration by @reidliu41 in #17216
  • [Minor][Spec Decode] Add use_eagle to SpeculativeConfig by @WoosukKwon in #17213
  • [Doc] Minor fix for the vLLM TPU setup page by @yarongmu-google in #17206
  • [Minor][Models] Fix Return Types of Llama & Eagle by @WoosukKwon in #17220
  • Allocate kv_cache with stride order by @wenscarl in #16605
  • [ROCm][Misc] Follow-ups for Skinny Gemms on ROCm. by @charlifu in #17011
  • [V1][Metrics] Allow V1 AsyncLLM to use custom logger by @liuzijing2014 in #14661
  • [BugFix] Avoid race conditions in zero-copy tensor transmission by @njhill in #17203
  • [CI/test] Fix Eagle Correctness Test by @WoosukKwon in #17209
  • [Core] Remove prompt string from engine core data structures by @njhill in #17214
  • [Bugfix] Fix missing int type for -n in multi-image example by @Isotr0py in #17223
  • [Bugfix] Fix standard models tests by @DarkLight1337 in #17217
  • [Hardware][Intel-Gaudi] Update hpu-extension and update bucketing system for HPU device by @adobrzyn in #17186
  • [V1] Add structural_tag support using xgrammar by @russellb in #17085
  • [BUGFIX] use random for NONE_HASH only when PYTHONHASHSEED not set by @andyxning in #17088
  • [Chore] added stubs for vllm_flash_attn during development mode by @aarnphm in #17228
  • [Docs] Update structured output doc for V1 by @russellb in #17135
  • [Bugfix] fix error due to an uninitialized tokenizer when using skip_tokenizer_init with num_scheduler_steps by @junstar92 in #9276
  • Disable the torch.compile cache checks when VLLM_DISABLE_COMPILE_CACHE=1 by @houseroad in #16573
  • [MISC] rename interval to max_recent_requests by @andyxning in #14285
  • [Bugfix] Fix Qwen2.5-Omni M-RoPE position ids generation by @imkero in #16878
  • [Minor] Fix lint error in main branch by @WoosukKwon in #17233
  • [CI/Build] remove -t for run-lm-eval-gsm-hf-baseline.sh by @reidliu41 in #16271
  • Update test_flash_attn.py by @ShuaibinLi in #17102
  • [Kernel][Triton][FP8] Adding fp8 and variable length sequence support to Triton FAv2 kernel by @rasmith in #12591
  • [Misc] Make cached tokenizer pickle-compatible by @DarkLight1337 in #17048
  • [Bugfix] Fix QWen2 VL multimodal mapping by @jeejeelee in #17240
  • [Bugfix] Get a specific type of layer from forward context by @heheda12345 in #17222
  • [MISC] Use string annotation types for class definitions by @jianzs in #17244
  • [Misc] Change buckets of histogram_iteration_tokens to [1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8096] to represent number of tokens by @sfc-gh-zhwang in #17033
  • [Bugfix] Fix Lora Name Parsing by @alex-jw-brooks in #17196
  • [NVIDIA] Support Cutlass MLA for Blackwell GPUs by @kaixih in #16032
  • [Feature] support sequence parallelism using compilation pass by @cascade812 in #16155
  • [doc] Add feature status legend by @reidliu41 in #17257
  • [Metrics] Fix minor inconsistencies in bucket progression by @DarkLight1337 in #17262
  • [V1][Spec Decode] Make eagle compatible with prefix caching. by @LiuXiaoxuanPKU in #17137
  • [BugFix] Fix vllm_flash_attn install issues by @LucasWilkinson in #17267
  • [Bugfix] Fix missing ARG in Dockerfile for arm64 platforms by @lkm-schulz in #17261
  • [Bugfix] Fix cutlass dispatch for fp8/int8 to properly invoke M<=16 c… by @Ther-LF in #16751
  • [Bugfix] Fix Mistral3 spatial merge error by @mgoin in #17270
  • [Doc] Fix wrong github link in LMCache examples by @KuntaiDu in #17274
  • [Doc] small fix by @reidliu41 in #17277
  • [Misc] Validate stop_token_ids contents by @njhill in #17268
  • [Minor][Models] Pass partial_rotary_factor parameter to rope by @Eviannn in #17266
  • [Core] Remove legacy input mapper/processor from V0 by @DarkLight1337 in #15686
  • [Model] Add Granite Speech Support by @alex-jw-brooks in #16246
  • Update tpu_worker.py 's typo by @idouba in #17288
  • Add missing class docstring for PromptAdapterConfig by @hmellor in #17302
  • [Bugfix] Add missing get_language_model to new MLLMs by @DarkLight1337 in #17300
  • [doc] update wrong model id by @reidliu41 in #17287
  • [Misc] Minor typo/grammar in platforms/interface.py by @NickLucche in #17307
  • [Misc] Clean up Qwen2.5-Omni code by @DarkLight1337 in #17301
  • [Docs] Add a security guide by @russellb in #17230
  • Improve conversion from dataclass configs to argparse arguments by @hmellor in #17303
  • Make name of compressed-tensors quant method consistent across vLLM by @hmellor in #17255
  • Explicitly explain quant method override ordering and ensure all overrides are ordered by @hmellor in #17256
  • [Security] Don't bind tcp zmq socket to all interfaces by @russellb in #17197
  • [Chore] cleanup license indicators in light of SPDX by @aarnphm in #17259
  • [BugFix] Fix cascade attention - RuntimeError: scheduler_metadata must have shape (metadata_size) by @LucasWilkinson in #17283
  • [Bugfix] Fix moe weight losing all extra attrs after process_weights_after_loading. by @charlifu in #16854
  • [Model] Qwen3 Dense FP8 Compat Fixes by @simon-mo in #17318

New Contributors

  • @Nash-123 made their first contribution in #16036
  • @celestialli made their first contribution in #16562
  • @taneem-ibrahim made their first contribution in #16549
  • @Cangxihui made their first contribution in #16572
  • @angkywilliam made their first contribution in #10546
  • @Alnusjaponica made their first contribution in #14323
  • @xsank made their first contribution in #16648
  • @jianzs made their first contribution in #16703
  • @p88h made their first contribution in #16432
  • @AaruniAggarwal made their first contribution in #16679
  • @davidheineman made their first contribution in #16741
  • @richardliaw made their first contribution in #16609
  • @intervitens made their first contribution in #16618
  • @windsonsea made their first contribution in #16773
  • @insukim1994 made their first contribution in #16784
  • @Ximingwang-09 made their first contribution in #16753
  • @sijiac made their first contribution in #16674
  • @tarukumar made their first contribution in #16721
  • @nathan-weinberg made their first contribution in #16796
  • @jmho made their first contribution in #16857
  • @vie-serendipity made their first contribution in #16460
  • @zixi-qi made their first contribution in #16895
  • @fingertap made their first contribution in #16889
  • @kartikx made their first contribution in #16737
  • @davidxia made their first contribution in #16888
  • @chanh made their first contribution in #16436
  • @jeffrey-dot-li made their first contribution in #16863
  • @sfeng33 made their first contribution in #16950
  • @LeiWang1999 made their first contribution in #6036
  • @timzsu made their first contribution in #16973
  • @yangw-dev made their first contribution in #16936
  • @gitover22 made their first contribution in #17024
  • @csy1204 made their first contribution in #17070
  • @sydarb made their first contribution in #17043
  • @zhangyuygss made their first contribution in #16472
  • @atilla00 made their first contribution in #17059
  • @Eyshika made their first contribution in #17026
  • @yinghai made their first contribution in #16999
  • @jglaser made their first contribution in #15988
  • @zhouzaida made their first contribution in #17156
  • @lifuhuang made their first contribution in #16762
  • @JasmondL made their first contribution in #16769
  • @tiran made their first contribution in #17159
  • @jamesjwu made their first contribution in #17142
  • @wenscarl made their first contribution in #16605
  • @liuzijing2014 made their first contribution in #14661
  • @adobrzyn made their first contribution in #17186
  • @andyxning made their first contribution in #17088
  • @junstar92 made their first contribution in #9276
  • @ShuaibinLi made their first contribution in #17102
  • @cascade812 made their first contribution in #16155
  • @lkm-schulz made their first contribution in #17261
  • @Ther-LF made their first contribution in #16751
  • @Eviannn made their first contribution in #17266
  • @idouba made their first contribution in #17288

Full Changelog: v0.8.4...v0.8.5
