Highlights
This release features 538 commits from 207 contributors (65 of them new)!
- This release completes the removal of the V0 engine. All V0 engine code, including AsyncLLMEngine, LLMEngine, MQLLMEngine, all V0 attention backends, and related components, has been removed; V1 is now the only engine in the codebase.
- This release makes FULL_AND_PIECEWISE the default CUDA graph mode. This should provide better out-of-the-box performance for most models, particularly fine-grained MoEs, while preserving compatibility with models that only support PIECEWISE mode.
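For deployments that need the previous behavior, the CUDA graph mode can still be pinned back to PIECEWISE through the compilation config. The following is a minimal sketch rather than a definitive recipe: the `cudagraph_mode` key and the dict form of `compilation_config` are assumptions based on #25444, so check `vllm.config.CompilationConfig` in your build for the exact field name.

```python
# Hedged sketch: opt back into PIECEWISE-only CUDA graphs.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",                  # any supported model
    compilation_config={"cudagraph_mode": "PIECEWISE"},  # assumed key name
)
print(llm.generate("Hello, ")[0].outputs[0].text)
```

When serving, the same override is typically passed as a JSON value to `--compilation-config`.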
Model Support
- New architectures: DeepSeek-V3.2-Exp (#25896), Qwen3-VL series (#24727), Qwen3-Next (#24526), OLMo3 (#24534), LongCat-Flash (#23991), Dots OCR (#24645), Ling2.0 (#24627), CWM (#25611).
- Encoders: RADIO encoder support (#24595), Transformers backend support for encoder-only models (#25174).
- Task expansion: BERT token classification/NER (#24872), multimodal models for pooling tasks (#24451).
- Data parallel for vision encoders: InternVL (#23909), Qwen2-VL (#25445), Qwen3-VL (#24955).
- Speculative decoding: EAGLE3 for MiniCPM3 (#24243) and GPT-OSS (#25246); a config sketch follows this list.
- Features: Qwen3-VL text-only mode (#26000), EVS video token pruning (#22980), Mamba2 TP+quantization (#24593), MRoPE + YaRN (#25384), Whisper on XPU (#25123), LongCat-Flash-Chat tool calling (#24083).
- Performance: GLM-4.1V 916ms TTFT reduction via fused RMSNorm (#24733), GLM-4 MoE SharedFusedMoE optimization (#24849), Qwen2.5-VL CUDA sync removal (#24741), Qwen3-VL Triton MRoPE kernel (#25055), FP8 checkpoints for Qwen3-Next (#25079).
- Reasoning: SeedOSS reasoning parser (#24263).
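As a rough illustration of the EAGLE3 speculative-decoding entries above (#24243, #25246), here is a minimal offline sketch. The `speculative_config` keys shown (`method`, `model`, `num_speculative_tokens`) and the draft-checkpoint path are assumptions, not verified names; consult the speculative decoding docs for the exact schema.

```python
# Hedged sketch: EAGLE3 speculative decoding with an assumed config schema.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-20b",                        # target model
    speculative_config={
        "method": "eagle3",                            # assumed method name
        "model": "<path-to-eagle3-draft-checkpoint>",  # placeholder draft model
        "num_speculative_tokens": 3,
    },
)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```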
Engine Core
- KV cache offloading: CPU offloading with LRU management (#19848, #20075, #21448, #22595, #24251).
- V1 features: Prompt embeddings (#24278), sharded state loading (#25308), FlexAttention sliding window (#24089), LLM.apply_model (#18465; sketched after this list).
- Hybrid allocator: Pipeline parallel (#23974), varying hidden sizes (#25101).
- Async scheduling: Uniprocessor executor support (#24219).
- Architecture: Tokenizer group removal (#24078), shared memory multimodal caching (#20452).
- Attention: Hybrid SSM/Attention in Triton (#21197), FlashAttention 3 for ViT (#24347).
- Performance: FlashInfer RoPE 2x speedup (#21126), fused Q/K RoPE 11% improvement (#24511, #25005), 8x spec decode overhead reduction (#24986), FlashInfer spec decode with 1.14x speedup (#25196), model info caching (#23558), inputs_embeds copy avoidance (#25739).
- LoRA: Optimized weight loading (#25403).
- Defaults: CUDA graph mode FULL_AND_PIECEWISE (#25444), Inductor standalone compile disabled (#25391).
- torch.compile: CUDA graph Inductor partition integration (#24281).
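The `LLM.apply_model` hook listed above runs a callable against the underlying `torch.nn.Module` on each worker, which is handy for offline inspection. A minimal sketch follows; the return shape (one entry per worker) is an assumption.

```python
# Hedged sketch: inspect the loaded model via LLM.apply_model (#18465).
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")

def count_params(model):
    # Receives the raw nn.Module on the worker process.
    return sum(p.numel() for p in model.parameters())

print(llm.apply_model(count_params))  # assumed to return one value per worker
```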
Hardware & Performance
- NVIDIA: FP8 FlashInfer MLA decode (#24705), BF16 fused MoE for Hopper/Blackwell expert parallel (#25503).
- DeepGEMM: Enabled by default (#24462), 5.5% throughput improvement (#24783).
- New architectures: RISC-V 64-bit (#22112), ARM non-x86 CPU (#25166), ARM 4-bit fused MoE (#23809).
- AMD: ROCm 7.0 (#25178), GLM-4.5 MI300X tuning (#25703).
- Intel XPU: MoE DP accuracy fix (#25465).
Large Scale Serving & Performance
- Dual-Batch Overlap (DBO): mechanism that splits batches to overlap computation (#23693), DeepEP high throughput + prefill support (#24845).
- Data Parallelism: torchrun launcher (#24899), Ray placement groups (#25026), Triton DP/EP kernels (#24588).
- EPLB: Hunyuan V1 (#23078), Mixtral (#22842), static placement (#23745), reduced overhead (#24573).
- Disaggregated serving: KV transfer metrics (#22188), NIXL MLA latent dimension (#25902).
- MoE: Shared expert overlap optimization (#24254), SiLU kernel for DeepSeek-R1 (#24054), Enable Allgather/ReduceScatter backend for NaiveAllToAll (#23964).
- Distributed: NCCL symmetric memory with 3-4% throughput improvement (#24532), enabled by default for TP (#25070).
Quantization
- FP8: Per-token-group quantization (#24342), hardware-accelerated instructions (#24757), torch.compile KV cache (#22758), paged attention update (#22222); usage sketch after this list.
- FP4: NVFP4 for dense models (#25609), Gemma3 (#22771), Llama 3.1 405B (#25135).
- W4A8: Faster preprocessing (#23972).
- Compressed tensors: Blocked FP8 for MoE (#25219).
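As a quick orientation for the FP8 work listed above, here is a minimal sketch of online (on-the-fly) FP8 quantization of an unquantized checkpoint. Which kernel path is taken (per-token-group, DeepGEMM, etc.) still depends on hardware and build; pre-quantized FP8/NVFP4/compressed-tensors checkpoints are instead detected from their own config.

```python
# Hedged sketch: dynamic FP8 weight quantization of a BF16 checkpoint.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any unquantized model
    quantization="fp8",                        # online FP8 path
)
print(llm.generate("FP8 says hi:")[0].outputs[0].text)
```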
API & Frontend
- OpenAI: Prompt logprobs for all tokens (#24956), logprobs=-1 for full vocab (#25031; client sketch after this list), reasoning streaming events (#24938), Responses API MCP tools (#24628, #24985), health 503 on dead engine (#24897).
- Multimodal: Media UUID caching (#23950), image path format (#25081).
- Tool calling: XML parser for Qwen3-Coder (#25028), Hermes-style tokens (#25281).
- CLI: --enable-logging (#25610), improved --help (#24903).
- Config: Speculative model engine args (#25250), env validation (#24761), NVTX profiling (#25501), guided decoding backward compatibility (#25615, #25422).
- Metrics: V1 TPOT histogram (#24015), hidden deprecated gpu_ metrics (#24245), KV cache GiB units (#25204, #25479).
- UX: Removed misleading quantization warning (#25012).
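The logprobs=-1 extension called out in the OpenAI bullet above returns log-probabilities over the full vocabulary from the OpenAI-compatible server. A minimal client sketch; the server URL and model name are placeholders, and this is a vLLM extension that a stock OpenAI endpoint would reject.

```python
# Hedged sketch: request full-vocabulary logprobs via the OpenAI-compatible API (#25031).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="The quick brown fox",
    max_tokens=1,
    logprobs=-1,  # vLLM extension: -1 means every vocab entry
)
print(resp.choices[0].logprobs)
```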
Security
Dependencies
- PyTorch 2.8 for CPU (#25652), FlashInfer 0.3.1 (#24470), CUDA 13 (#24599), ROCm 7.0 (#25178).
- Build requirements: C++17 now enforced globally (#24823).
- TPU: Deprecated `xm.mark_step` in favor of `torch_xla.sync` (#25254).
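The TPU migration above is a one-line change in user code; a minimal sketch, assuming a torch_xla version where the top-level `torch_xla.sync()` and `torch_xla.device()` helpers are available.

```python
# Hedged sketch of the xm.mark_step -> torch_xla.sync migration (#25254).
import torch
import torch_xla

t = torch.randn(2, 2, device=torch_xla.device())  # lazy tensor on the XLA device
y = t @ t
# Previously: torch_xla.core.xla_model.mark_step()
torch_xla.sync()  # materialize pending lazy ops
print(y)
```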
V0 Deprecation
- Engines: AsyncLLMEngine (#25025), LLMEngine (#25033), MQLLMEngine (#25019), core (#25321), model runner (#25328), MP executor (#25329).
- Components: Attention backends (#25351), encoder-decoder (#24907), output processor (#25320), sampling metadata (#25345), Sequence/Sampler (#25332).
- Interfaces: LoRA (#25686), async output processor (#25334), MultiModalPlaceholderMap (#25366), seq group methods (#25330), placeholder attention (#25510), input embeddings (#25242), multimodal registry (#25362), max_seq_len_to_capture (#25543), attention classes (#25541), hybrid models (#25400), backend suffixes (#25489), compilation fallbacks (#25675), default args (#25409).
What's Changed
- [Qwen3-Next] MoE configs for H20 TP=1,2,4,8 by @jeejeelee in #24707
- [DOCs] Update ROCm installation docs section by @gshtras in #24691
- Enable conversion of multimodal models to pooling tasks by @maxdebayser in #24451
- Fix implementation divergence for BLOOM models between vLLM and HuggingFace when using prompt embeds by @qthequartermasterman in #24686
- [Bugfix] Fix MRoPE dispatch on CPU by @bigPYJ1151 in #24712
- [BugFix] Fix Qwen3-Next PP by @njhill in #24709
- [CI] Fix flaky test v1/worker/test_gpu_model_runner.py::test_kv_cache_stride_order by @heheda12345 in #24640
- [CI] Add ci_envs for convenient local testing by @noooop in #24630
- [CI/Build] Skip prompt embeddings tests on V1-only CPU backend by @bigPYJ1151 in #24721
- [Misc][gpt-oss] Add gpt-oss label to PRs that mention harmony or related to builtin tool call by @heheda12345 in #24717
- [Bugfix] Fix BNB name match by @jeejeelee in #24735
- [Kernel] [CPU] refactor `cpu_attn.py:_run_sdpa_forward` for better memory access by @ignaciosica in #24701
- [sleep mode] save memory for on-the-fly quantization by @youkaichao in #24731
- [Multi Modal] Add FA3 in VIT by @wwl2755 in #24347
- [Multimodal] Remove legacy multimodal fields in favor of MultiModalFeatureSpec by @sfeng33 in #24548
- [Doc]: fix typos in various files by @didier-durand in #24726
- [Docs] Fix warnings in mkdocs build (continued) by @Zerohertz in #24740
- [Bugfix] Fix MRoPE dispatch on XPU by @yma11 in #24724
- [Qwen3-Next] MoE configs for H100 TP=1,2 and TP2/EP by @elvircrn in #24739
- [Core] Shared memory based object store for Multimodal data caching and IPC by @dongluw in #20452
- [Bugfix][Frontend] Fix `--enable-log-outputs` does not match the documentation by @kebe7jun in #24626
- [Models] Optimise and simplify `_validate_and_reshape_mm_tensor` by @lgeiger in #24742
- [Models] Prevent CUDA sync in Qwen2.5-VL by @lgeiger in #24741
- [Model] Switch to Fused RMSNorm in GLM-4.1V model by @SamitHuang in #24733
- [UX] Remove AsyncLLM torch profiler disabled log by @mgoin in #24609
- [CI] Speed up model unit tests in CI by @afeldman-nm in #24253
- [Bugfix] Fix incompatibility between #20452 and #24548 by @DarkLight1337 in #24754
- [CI] Trigger BC Linter when labels are added/removed by @zhewenl in #24767
- [Benchmark] Allow arbitrary headers to be passed to benchmarked endpoints by @smarterclayton in #23937
- [Compilation Bug] Fix Inductor Graph Output with Shape Issue by @yewentao256 in #24772
- Invert pattern order to make sure that out_proj layers are identified by @anmarques in #24781
- [Attention][FlashInfer] Enable FP8 FlashInfer (TRTLLM) MLA decode by @MatthewBonanni in #24705
- Add FLASHINFER_MLA to backend selector test by @MatthewBonanni in #24753
- [Qwen3Next] Fixes the cuda graph capture conditions under large batch sizes (#24660) by @sighingnow in #24667
- [Core] Support async scheduling with uniproc executor by @njhill in #24219
- [Frontend][Multimodal] Allow skipping media data when UUIDs are provided. by @huachenheli in #23950
- [Model] Add Olmo3 model implementation by @2015aroras in #24534
- [Bugfix] Fix GPUModelRunner has no attribute lora_manager by @jeejeelee in #24762
- [Chore] Remove unused batched RoPE op & kernel by @WoosukKwon in #24789
- [Docs] Fix warnings in mkdocs build (continued) by @Zerohertz in #24791
- [Docs] Remove Neuron install doc as backend no longer exists by @hmellor in #24396
- [Doc]: Remove 404 hyperlinks by @rozeappletree in #24785
- [Perf] Use NVIDIA hardware-accelerated instruction for float to fp8_e4m3 quantization by @elvischenv in #24757
- [Kernels][DP/EP] Optimize Silu Kernel for R1 by @elvircrn in #24054
- [Core][Multimodal] Cache `supports_kw` by @lgeiger in #24773
- [CI Failure] Fix test_flashinfer_cutlass_mxfp4_mxfp8_fused_moe by @mgoin in #24750
- [Misc] Correct an outdated comment. by @russellb in #24765
- [Doc]: fix typos in various files by @didier-durand in #24798
- [CI][Spec Decode] Adjust threshold for flaky ngram spec decoding test again by @wwl2755 in #24771
- Remove redundant assignment in xfer_buffers, This is a little fix by @ChenTaoyu-SJTU in #24732
- [Minor] Simplify duplicative device check for cuda by @ziliangpeng in #24793
- [Chore] Minor simplification for non-PP path by @WoosukKwon in #24810
- [Multi Modal][Performance] Fused Q,K's apply_rope into one by @wwl2755 in #24511
- [Misc] Improve `s3_utils` type hints with `BaseClient` by @Zerohertz in #24825
- [Perf] Fix DeepGEMM Contiguous Layout Issue, 5.5% Throughput Improvement by @yewentao256 in #24783
- fix type of sampling rate for encode_base64 by @co63oc in #24826
- [Benchmarks] Throw usage error when using dataset-name random and dataset-path together by @yeqcharlotte in #24819
- Force use C++17 globally to avoid compilation error by @chenfengjin in #24823
- [Chore] Remove ipex_ops warning by @robertgshaw2-redhat in #24835
- [Spec Decoding]Support Spec Decoding Metrics in DP Mode by @wuhang2014 in #24049
- [Hybrid Allocator] Support Pipeline Parallel by @heheda12345 in #23974
- [Docs] Have a try to improve frameworks/streamlit.md by @windsonsea in #24841
- [kv cache] update num_free_blocks in the end by @andyxning in #24228
- [Frontend] Skip `stop` in reasoning content by @gaocegege in #14550
- [Bugfix] MiDashengLM model contact error under concurrent testing by @bingchen-mi in #24738
- [Doc]: fix typos in various files by @didier-durand in #24821
- [Misc] rename interval to max_recent_requests by @andyxning in #24229
- [Misc] Own KVConnectors installation by @NickLucche in #24867
- [P/D] `kv_output_aggregator` support heterogeneous by @LCAIZJ in #23917
- [UT] enhance free kv cache block queue popleft_n by @andyxning in #24220
- [XPU] Set consistent default KV cache layout by @NickLucche in #24745
- [Misc] Fix examples openai_pooling_client.py by @noooop in #24853
- [Model]: support Ling2.0 by @ant-yy in #24627
- [Bugfix] Fix GLM4.1V multimodal processor with compatability for Transformers v4.56 by @Isotr0py in #24822
- Fp8 paged attention update by @xiao-llm in #22222
- Reinstate existing torch script by @hmellor in #24729
- [USAGE] Improve error handling for weight initialization in Unquantized… by @koiker in #20321
- Move `MultiModalConfig` from `config/__init__.py` to `config/multimodal.py` by @hmellor in #24659
- [Transform] Deterministic Hadacore Transforms by @kylesayrs in #24106
- Update num_tokens_across_dp to use nccl instead of gloo by @SageMoore in #24105
- Bump Flashinfer to 0.3.1 by @bbartels in #24868
- [gpt-oss] Add IncompleteDetails to ResponsesRepsonse by @qandrew in #24561
- [gpt-oss][1a] create_responses stream outputs BaseModel type, api server is SSE still by @qandrew in #24759
- [Performance] Remove redundant clone() calls in cutlass_mla by @alexm-redhat in #24891
- [Bug] Fix Cutlass Scaled MM Compilation Error by @yewentao256 in #24887
- [ci] fix wheel names for arm wheels by @simon-mo in #24898
- [Tests] fix initialization of kv hash in tests by @mickaelseznec in #24273
- [Compile] Fix noop_elimination pass and add tests for noop_elimination by @ZJY0516 in #24880
- `HuggingFace` -> `Hugging Face` in `Integration with Hugging Face` docs by @sergiopaniego in #24889
- Updated CODEOWNERS for flashinfer, mla, fused_moe by @mgoin in #24906
- [Deprecation] Remove DeepGEMM Old Symbol Wrapper by @yewentao256 in #24902
- [ROCm][Bugfix] Fix the case where there's bias by @gshtras in #24895
- Add pytest-cov and .coveragerc by @rzabarazesh in #24778
- [Bug] Fix `is_flashmla_supported` Check Error by @yewentao256 in #24774
- [CI] Small Accuracy Eval Test for Deepseek Model by @yewentao256 in #24259
- [Metrics] Hide deprecated metrics with gpu_ prefix by @markmc in #24245
- [Docs] Update instructions for how to using existing torch binary by @zou3519 in #24892
- Upgrade flashinfer to 0.3.1 by @houseroad in #24470
- [XPU] Fix circular import error. by @jikunshang in #24927
- Remove V0 Encoder-Decoder Support by @WoosukKwon in #24907
- [Bugfix] Fix sequence parallelism bug when enable pipeline parallelism by @cascade812 in #24021
- [Bug] [Spec Dec]: Fix kv_cache dtype mismatch for Eagle3 drafter on FP8 target by @vllmellm in #24505
- [QWEN NEXT] Fused MoE kernels Optimization configs by @samanamp in #24924
- [benchmark] Add triton version in the moe tuned config by @jeejeelee in #24769
- [Bugfix] remove duplicate tokens streamed in required tool choice streaming by @Jason-CKY in #23312
- [Mamba] Support TP>1 with quantization for mamba2 mixer in case `n_groups % tp_size == 0` by @tomeras91 in #24593
- [Feat][EPLB] A novel static EPLB placement strategy for MoE models. by @cboss6 in #23745
- Move `SpeculativeConfig` from `config/__init__.py` to `config/speculative.py` by @hmellor in #24904
- [Docs] move benchmarks README to contributing guides by @yeqcharlotte in #24820
- feat: Add Grafana and Perces monitoring dashboards for vLLM by @liangwen12year in #23498
- (doc): set cmake c++ compatible standard when building on MacOS CPU. by @teekenl in #23483
- [CI] Add Decode Context Parallelism (DCP) test to CI by @minosfuture in #24487
- [Model] Clean up and simplify Mamba2 Metadata Usage in both V0 and V1 by @cyang49 in #24331
- [Core][MultiModalHasher] Don't convert memoryviews to bytes during hashing by @lgeiger in #24925
- [Core/DBO][1/N] Add Dual-Batch Overlap mechanism to VLLM by @SageMoore in #23693
- [Bugfix] Fix unable to run encoder model when disable_hybrid_kv_cache_manager is true by @lianyiibo in #24571
- [Misc] Add removed encoder-decoder models to previously supported models list by @Isotr0py in #24961
- Directly get max encoder len from VLLM config in V1 by @Sugar-zsg in #24866
- [gpt-oss][1b] streaming add item id, content id by @qandrew in #24788
- [MISC] Add code owners of vllm/v1 to vllm/v1/core by @heheda12345 in #24928
- [ROCm] Add dependencies for ROCm by @Concurrensee in #24900
- [gpt-oss][1][bugfix] fix streaming final output by @qandrew in #24466
- Use kwargs for long lists of `EngineCoreRequest` arguments in tests and fix extra kwargs by @qthequartermasterman in #24987
- fp8 kv cache support fix for torch.compile by @maleksan85 in #22758
- [Perf] Reuse workspace for FP8+FP4 Marlin MoE by @mgoin in #20500
- [CI][Bugfix] Fix failing Blackwell test by @MatthewBonanni in #24993
- [CI] GPT-OSS GPQA eval test for Blackwell by @mgoin in #24920
- [FP8] Extend per-token-group quantization support to QuantFP8 by @tahsintunan in #24342
- Removes source compilation of nixl dependency by @bbartels in #24874
- [Doc] Add --force-overwrite option to generate_cmake_presets.py by @elvischenv in #24375
- [Core] Use `CpuGpuBuffer` for block table tensors by @njhill in #24795
- [Benchmarks] Add MMVU video dataset support and clean up deprecated datasets by @Isotr0py in #24719
- [UX] Enforce valid choices for envs like VLLM_ATTENTION_BACKEND, etc by @mgoin in #24761
- [Docs] fix invalid doc link by @yyzxw in #25017
- [UX] Remove "quantization is not fully optimized yet" log by @mgoin in #25012
- [misc] fix typo in value error by @prashantgupta24 in #24995
- [Core] Get num_encoder_tokens from scheduler config by @russellb in #24989
- [V0 Deprecation] Remove MQLLMEngine by @WoosukKwon in #25019
- [Model] Support Qwen3-VL Model Series by @ywang96 in #24727
- [Rocm] [quantization] Fix quark ptpc moe and add test case by @haoyangli-amd in #24649
- Add more documentation and improve usability of lognormal dist (benchmark_serving_multi_turn) by @pliops-daniels in #23255
- [XPU] Fix xpu model runner call torch.cuda APIs by @jikunshang in #25011
- [EPLB] Support EPLB for Mixtral Model by @rouchenzi in #22842
- [Core][MultiModalHasher] Hash images without converting image mode by @lgeiger in #24969
- [Model] Pass param prefix to LLMHead by @whx-sjtu in #24862
- [Model] Apply SharedFusedMoE to glm4_moe. by @whx-sjtu in #24849
- [Core] Remove tokenizer group in vLLM by @zhuohan123 in #24078
- [Docs] Fix griffe warning in base_static_graph.py by @windsonsea in #25018
- [DP] Create placement groups by ray_device_key by @xinyu-intel in #25026
- [Frontend] Support returning all prompt logprobs by @chaunceyjiang in #24956
- [BugFix] enable DOTALL to match multi-line tool_call parameters in extract_tool_call_required_streaming by @shijun-yin in #24668
- [Misc] Avoid use of deprecated `AutoModelForVision2Seq` by @DarkLight1337 in #25065
- Add RADIO Vision Encoder Support to vLLM by @danielafrimi in #24595
- [Bugfix] Fix Stream usage in CPU model runner and OneDNN kernel check by @bigPYJ1151 in #25046
- Apply fixes for CUDA 13 by @Aidyn-A in #24599
- [fix] lora benchmarks pass no_lora_flag_cpu by @dolpm in #23774
- [Bugfix][Qwen3-Next] fixes the varlen issue in qwen3-next's MTP implementation. by @sighingnow in #24957
- [Docs] improve code formatting and comments for eliminate griffe build warning. by @samzong in #25010
- Remove old cutlass mla by @MatthewBonanni in #23961
- [Docs] vllm/benchmarks/datasets.py fix docstring param format. by @samzong in #24970
- [CI Bugfix] Fix failing test_invalid_env by @mgoin in #25078
- [V0 Deprecation] Remove V0 Core tests by @WoosukKwon in #25082
- cleanup: remove adapter commons by @simon-mo in #25045
- Remove unused find_cuda_init helper script by @simon-mo in #25044
- [V0 Deprecation] Remove unused output processor util by @WoosukKwon in #25023
- Change log level from info to debug for IOProcessor by @mgoin in #24999
- [CI] Revert back prepare_prompts and check_answers by @WoosukKwon in #25087
- [V0 Deprecation] Remove V0 tests in test_sequence.py by @WoosukKwon in #25088
- [CI Bugfix] Fix failing test_model_load_with_params tests due to tokenizer refactor by @mgoin in #25086
- [V1] Logits processor docs by @afeldman-nm in #22919
- [Misc] Update owners for KV connector and V1 offloading by @ApostaC in #25041
- [Bugfix] Update import path for bc_linter_include by @mmangkad in #24766
- [BUG] Exclude .pth files when pulling remote files by @ahao-anyscale in #25092
- [Kernel] Faster pre-processing time for W4A8 by @czhu-cohere in #23972
- [gpt-oss][2] fix types for streaming by @qandrew in #24556
- [Bugfix][B200] Fix `cutlass_mla` hang by @alexm-redhat in #24966
- [ROCm][Bugfix] Aiter mha fp8 fix by @dllehr-amd in #24991
- Disable failing GPT-OSS Eval (Blackwell) for now by @mgoin in #25107
- [Bugfix] Refactor Flashinfer TRTLLM attention kernel selection logic by @elvischenv in #24600
- Add a batched auto tune script by @karan in #25076
- [Bugfix] Fix accuracy issue for silu_mul + nvfp4 quant fusion kernel by @elvischenv in #24833
- [Kernel] Delegate construction of FusedMoEQuantConfig to FusedMoEMethodBase subclasses by @bnellnm in #22537
- [V0 Deprecation] Remove V0 Engine tests by @WoosukKwon in #25114
- [V0 Deprecation] Remove V0 Tracing & Metrics tests by @WoosukKwon in #25115
- [V0 Deprecation] Remove misc V0 tests by @WoosukKwon in #25118
- [V0 Deprecation] Skip PP test by @WoosukKwon in #25128
- [Kernels] Enable DeepGEMM by default by @bnellnm in #24462
- [MM Encoder] Apply DP ViT for Qwen3-VL model series by @ywang96 in #24955
- [Docs] Clean up the contributing README by @hmellor in #25099
- [Core][MM] Cleanup `MultiModalCache` by @lgeiger in #25006
- [Bugfix][Qwen3-Next] add prefixes to shared_expert in qwen3-next and mlp in qwen2moe to successfully load ignored params in quantized models by @toncao in #24960
- [Kernels] Overlap shared experts with combine instead of dispatch by @bnellnm in #24254
- [Model] enable data parallel for InternVL vision encoder by @666even666 in #23909
- Mark prompt logprobs as incompatible with prompt embeds at API level by @qthequartermasterman in #25077
- [XPU] Whisper model support on XPU Platform by @chaojun-zhang in #25123
- [EPLB] Add EPLB support for hunyuan_v1 by @666even666 in #23078
- [V0 Deprecation] Remove more V0 tests by @WoosukKwon in #25117
- [Spec Decode] Efficient padded speculation by @benchislett in #24539
- [benchmark] add peak throughput metrics and plot by @simon-mo in #23867
- [CLI] Use streaming in CLI chat and completion commands by @simon-mo in #23769
- [Kernel] Better inf handling for grouped topk cu by @lumina37 in #24886
- [Docs] Fix API Reference by @hmellor in #25140
- Retrieve `sliding_window` from text config in Gemma3 MM by @hmellor in #25085
- [Bugfix] when use s3 model cannot use default load_format by @lengrongfu in #24435
- [Qwen] Add fp8 checkpoint support for qwen3-next. by @sighingnow in #25079
- Add 'path' option to ImagePrompt data_format by @gfinol in #25081
- [Doc] Fix cross-reference warnings by @punitvara in #25058
- [Chore] Cleanup guided namespace, move to structured outputs config by @aarnphm in #22772
- Fix: Add explicit #include <omp.h> for OpenMP compatibility on certain toolchains by @ihb2032 in #24951
- silu-v1: Fix EPS not being used during max-reduction by @elvircrn in #25069
- [Frontend] Support setting logprobs to -1 by @chaunceyjiang in #25031
- [Model] Improve Pooling Model by @jeejeelee in #25149
- Move `StructuredOutputsConfig` from `config/__init__.py` to `config/structured_outputs.py` by @hmellor in #25153
- [Docs] Fix pooling-params doc references in openai_compatible_server.md by @yankay in #24939
- [Docs] add the parallel sampling usage in LLMEngine and AsyncLLM by @gigit0000 in #24222
- Fix forward reference warning in documentation by @hmellor in #25150
- Fix `validate-config` pre-commit check by @hmellor in #25157
- [Bugfix][Mamba] - Fix Conv State Kernel FP32 Support by @Josephasafg in #24883
- [Misc] Clean up flags in `vllm bench serve` by @ywang96 in #25138
- [Structured Output][Refactor] Move `apply_grammar_bitmask()` method from `ModelRunner` to structured output utils by @shen-shanshan in #21999
- Refactor dense FP8 tensor/channel/block utils and add CT FP8 block by @mgoin in #21404
- [Misc] Add kv-connector label by @NickLucche in #25156
- [Kernel] Enable Hybrid Model Support in Triton Unified Attention Kernel by @jvlunteren in #21197
- [PERF] Add `conv1d` metadata to GDN attn by @vadiklyutiy in #25105
- feat(api): Return 503 on /health when engine is dead by @dongbo910220 in #24897
- [New Model] Support BertForTokenClassification / Named Entity Recognition (NER) task by @noooop in #24872
- [Docs] Fix warnings in mkdocs build (continued) by @Zerohertz in #25163
- Enable Allgather/ReduceScatter backend for NaiveAllToAll by @wenscarl in #23964
- [Misc] Add codeowner for Transformers backend by @hmellor in #25180
- [spec decode] Fix MTP inference path for MiMo-7B model by @zixi-qi in #25136
- [ROCm][CI/Build] Use ROCm7.0 as the base by @gshtras in #25178
- [ROCm][AITER][Bugfix] Switch AITER to use PIECEWISE_AND_FULL compilation by @Rohan138 in #25104
- [KV offload][1/N] Introduce an offloading component by @orozery in #19848
- [V0 Deprecation] Remove AsyncLLMEngine by @WoosukKwon in #25025
- [fix]: remove data type hardcoding from gptoss model implementation by @nikhil-arm in #23807
- [feat]: Create interface for model-specific M-RoPE by @AzizCode92 in #24194
- [Bug] Fix `returned_lse` not Defined issue by @yewentao256 in #25106
- [Bug] Fix torch Compilation Cache Hit Error by @yewentao256 in #25093
- [V0 Deprecation] Remove unused async_timeout.py by @WoosukKwon in #25190
- [KV offload][1b/N] rename offloading to kv_offload by @orozery in #25191
- [BugFix] Fix DeepGEMM warmup, no m.weight_scale_inv by @LucasWilkinson in #25206
- [CORE] Prompt Embeddings Support for v1 Engine by @qthequartermasterman in #24278
- [KV offload][2/N] Introduce LRU-based CPU offloading management by @orozery in #20075
- [gpt-oss] Add ResponseReasoningPartAddedEvent, ResponseReasoningPartDoneEvent for streaming by @qandrew in #24938
- [Perf] Optimize memory peak during EAGLE model loading. by @candyzone in #24585
- [Misc] Clean up MM profiling warnings by @ywang96 in #25222
- [Docs] Fix griffe warnings in vllm/multimodal by @windsonsea in #25216
- [OOT] Support sync_model_loading for OOT by @xuechendi in #25126
- [Build] Update Xgrammar to 0.1.24 to get a CVE fix by @russellb in #25188
- [CPU] Disable oneDNN linear on non-x86 platforms by @bigPYJ1151 in #25166
- [Bugfix][CPU] Add placeholder to avoid import errors when using fused_moe ops on platforms without triton by @bigPYJ1151 in #25137
- [Misc] Cleanup test conftest for deprecated encoder-decoder models by @Isotr0py in #25231
- [bugfix] fix MHA for models like OpenGVLab/InternVL3_5-38B by @yma11 in #25146
- [Kernel][Performance] Add Triton kernel for Qwen3-VL interleaved MRoPE by @Isotr0py in #25055
- [Bugfix][Perf] Misc fixes for Qwen3 VL by @ywang96 in #25238
- Move `PoolerConfig` from `config/__init__.py` to `config/pooler.py` by @hmellor in #25181
- [P/D][Nixl] Introduce `KVTransferMetrics` and aggregation strategy by @NickLucche in #22188
- [V0 Deprecation] Remove V0 logic from `get_input_embeddings` interface by @DarkLight1337 in #25242
- [Qwen] Remove cuda hard-code in qwen3 next by @wxsIcey in #25243
- Update CODEOWNERS by @hmellor in #25269
- Move `ModelConfig` from `config/__init__.py` to `config/model.py` by @hmellor in #25252
- refactor(benchmarks): add type annotations to wait_for_endpoint parameters by @samzong in #25218
- [KV offload][3/N] Add worker-side CPU support by @orozery in #21448
- [Frontend] Pass API server count to each process by @DarkLight1337 in #23717
- [Core] Modify the initialization parameters of the lora manager by @jeejeelee in #25249
- Remove Redundant Assignment in Qwen3_VisionPatchMerger by @LJH-LBJ in #25224
- Encoder model support for the Transformers backend by @hmellor in #25174
- [CI/Build] fix test function_calling by @chaunceyjiang in #25072
- [Core][Prefix Hash] Fix prefix hash metrics sliding window maintainance by @Jialin in #24990
- [Docs] add init.py to vllm/model_executor/layers/quantization/compressed_tensors/transform by @samzong in #24974
- [bugfix] fix structured outputs key missing issue from #24929 by @luccafong in #25195
- [KV offload][4/N] Offloading KV connector by @orozery in #22595
- Optimize triton unified attention performance for sliding window attention by @zixi-qi in #24390
- [Bugfix] GPT OSS Attritbute error on H100 by @varun-sundar-rabindranath in #25228
- [Bugfix] Fix chunked a2_scales in modular kernels by @bnellnm in #25264
- Specify platform in `pip-compile` `pre-commit` hook so it runs on MacOS by @hmellor in #25273
- [Perf] Use FlashInfer RoPE for RotaryEmbedding.forward_cuda when available by @mgoin in #21126
- [BugFix] Make FlashInferMetadataBuilder non-blocking by @nvjullin in #25040
- Fix: Correct FusedMoE layer reference in auto_round quantization by @David-Wen2025 in #24818
- [Frontend] Responses API messages out, just harmony for now by @alecsolder in #24985
- [Compile] Fix Compile Warning for Ignoring `MIN_BLOCK_PER_SM` by @yewentao256 in #25193
- Enable modelopt gemma3 nvfp4/fp8, make workflow more robust by @Edwardf0t1 in #22771
- allow disable flashinfer prefill by @luccafong in #25276
- [BugFix] Fix async scheduling CPU tensor race take 2 by @njhill in #25279
- [Bugfix] Remove VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE #2969 by @Lucaskabela in #25090
- Don't skip special tokens with hermes-style tool calling by @maxdebayser in #25281
- test: Remove vestigial skip for prompt embeds tests after landing v1 Prompt Embeds support by @qthequartermasterman in #25291
- [docs] Prompt Embedding feature support by @qthequartermasterman in #25288
- [torch.compile] CUDAGraph Inductor partition integration by @BoyuanFeng in #24281
- [BugFix] Ensure appropriate guards in destructors by @njhill in #25284
- [Misc] Support more collective_rpc return types by @njhill in #25294
- Improve weight loading for encoder models in Transformers backend by @hmellor in #25289
- [BUGFIX] GPTQ quantization compatibility for Qwen3 Next MOE models (AutoGPTQ and AutoRound-GPTQ) by @JartX in #25268
- [BugFix] Exclude self when checking for port collision by @njhill in #25286
- [BUG FIX][NON-CUDA]quick fix to avoid call cudagraph_unsafe in attention by @xuechendi in #25298
- [Bugfix] fix tool call arguments is empty by @chaunceyjiang in #25223
- [Optimization] Avoid repeated model architecture conversion for pooling models by @DarkLight1337 in #25261
- [Hybrid Allocator] Support full attention with different hidden size by @heheda12345 in #25101
- [Bugfix] Fix Qwen3-VL-MoE weight loading for EP by @ywang96 in #25300
- [V1] Support `LLM.apply_model` by @DarkLight1337 in #18465
- [CI Failure] Disable FlashInfer RoPE to unblock CI by @mgoin in #25299
- [Docs] Fix warnings in mkdocs build (continued) by @wwl2755 in #25042
- Generate _ModelInfo properties file when loading to improve loading speed by @manoelmarques in #23558
- [Model] Cleanup InternViT's data parallel implementation by @Isotr0py in #25306
- [Core] Enable sharded state loader for V1 engine and enhance test coverage by @lirong-lirong in #25308
- [V0 Deprecation] Enable the remaining multimodal tests in V1 by @DarkLight1337 in #25307
- [Docs] Fix warnings in vllm/profiler and vllm/transformers_utils by @windsonsea in #25220
- [V0 Deprecation] Remove LLMEngine by @WoosukKwon in #25033
- [V0 Deprecation] Remove V0 Output Processor by @WoosukKwon in #25320
- [Chore] Remove unused sampler in models by @WoosukKwon in #25324
- [CI] Skip tests failing on main by @WoosukKwon in #25326
- [V0 Deprecation] Remove V0 core by @WoosukKwon in #25321
- [Doc] improve test-pipeline.yaml documentation by @hl475 in #25305
- [V0 Deprecation] Remove V0 model runner base & simplify worker base by @WoosukKwon in #25328
- [Multi Modal][Performance] Fused Q,K's apply_rope in more models by @wwl2755 in #25005
- [V0 Deprecation] Remove from_seq_group methods by @WoosukKwon in #25330
- [V0 Deprecation] Remove V0 MP executor by @WoosukKwon in #25329
- [V1] Add sliding window support to Flex Attention backend by @Isotr0py in #24089
- [MM][Perf] Minor Optimization on Qwen3-VL `fast_pos_embed_interpolate` by @ywang96 in #25337
- [Bugfix] Typos in error message for missing model config file by @simondanielsson in #25339
- [Optimization] Cache chat template result when processor fails to be loaded by @DarkLight1337 in #25341
- [V0 Deprecation] Remove V0 Sequence class & Sampler by @WoosukKwon in #25332
- [V0 Deprecation] Remove async_output_proc, preemption mode, delay factor by @WoosukKwon in #25334
- feat: Enable engine-level arguments with speculators models by @rahul-tuli in #25250
- [V0 Deprecation] Remove V0 sampling metadata by @WoosukKwon in #25345
- [Perf] Further optimization for Qwen3-VL `fast_pos_embed_interpolate` by @Isotr0py in #25347
- Remove V0 attention backends by @WoosukKwon in #25351
- [Bugfix][V0 Deprecation][CI] use async mock and await for async method by @KKSK-DON in #25325
- Multimodal - audio tests by @debroy-rh in #25285
- [Model] Support Dots OCR by @ywang96 in #24645
- [Docs] GSM8K Accuracy Evaluation doc update by @david6666666 in #25360
- [Bugfix] Fix hermes tool parser handling of non-string argument types by @david6666666 in #22002
- [V0 Deprecation] Remove V0-only methods in multi-modal registry by @DarkLight1337 in #25362
- [V0 Deprecation] Remove `MultiModalPlaceholderMap` by @DarkLight1337 in #25366
- Enable Eagle3 speculative decoding for GPT-OSS model by @eldarkurtic in #25246
- [TPU][Bugfix][CI] Fix broken tests/build dependency by @NickLucche in #25255
- [TPU] Deprecate `xm.mark_step` in favor of `torch_xla.sync` by @NickLucche in #25254
- refactor: abstract graph mode support into platform interface by @yiz-liu in #25161
- [Misc] Remove unused encoder-decoder error strings by @DarkLight1337 in #25374
- Make pickle import check fast by @hmellor in #25379
- Make `mypy` behave like a proper pre-commit hook by @hmellor in #25313
- MI-300X triton moe configs by @Sara-KS in #23445
- [Bugfix] Fix several issues with p2p xPyD in GET type by @Csrayz in #23993
- [V1][Attention] Split triton_attn in triton-only and rocm specific backends by @bringlein in #24648
- [EPLB] Reduce EPLB Inference Overhead by @abmfy in #24573
- [CLI env var] Add VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH in env variables by @Daisy-Ma-coder in #25274
- [Compiler] Disable Inductor standalone compile by default by @ElizaWszola in #25391
- [CI Failure] Fix fp8 kv cache on <SM90 by @mgoin in #25396
- [DP] support torchrun external launcher with Data Parallelism by @luccafong in #24899
- Remove RFC review hours reference by @simon-mo in #25416
- [torch.compile] Cleanup compilation tests and custom passes, add debug utils, fix DCE bug (#23091), fix test (#24376), and prep for custom op matching (#24604) by @ProExpertProg in https://github.com/vllm-project/vllm/pull/24542
- [KV offload][5/N] Add `CPUOffloadingSpec` by @orozery in https://github.com/vllm-project/vllm/pull/24251
- [CI/Build] Skip Qwen3-VL initialization tests until models are actually released by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25394
- [TPU] update torch_xla dependency for PyPI compatibility by @jcyang43 in https://github.com/vllm-project/vllm/pull/25278
- [Frontend] Responses API MCP tools for built in tools and to pass through headers by @alecsolder in https://github.com/vllm-project/vllm/pull/24628
- [Bugfix] fix custom op test by @ProExpertProg in https://github.com/vllm-project/vllm/pull/25429
- [Core] Drop overly aggressive whisper assertion by @russellb in https://github.com/vllm-project/vllm/pull/25408
- [Bugfix] Fix missing `clear_connector_metadata` by @NickLucche in https://github.com/vllm-project/vllm/pull/25397
- [BugFix] [DP/EP] Fix slow execution when BS <= DP by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/25407
- [Performance] Remove input pads in cutlass_mla and optimize v_proj output handling by @alexm-redhat in https://github.com/vllm-project/vllm/pull/25184
- [Perf] Apply torch.compile for `per_block_cast_to_fp8` by @yewentao256 in https://github.com/vllm-project/vllm/pull/24611
- [V0 deprecation] Remove platform v1 controling interface by @Isotr0py in https://github.com/vllm-project/vllm/pull/25410
- [V0 deprecation] Remove `_set_default_args_v0` function by @Isotr0py in https://github.com/vllm-project/vllm/pull/25409
- [Bug] Fix Long Context OOM Issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/25290
- [feat] Support MRoPE + YaRN by @JJJYmmm in https://github.com/vllm-project/vllm/pull/25384
- [XPU] Fix `compile_size` is `None` case. by @jikunshang in https://github.com/vllm-project/vllm/pull/25433
- [benchmarks] allow skip ready check for bench serve by @luccafong in https://github.com/vllm-project/vllm/pull/25420
- [Bugfix] Remove contiguous output req for context parallel MLA by @mgoin in https://github.com/vllm-project/vllm/pull/25414
- [Docs] Fix griffe warnings in vllm/lora/ops by @windsonsea in https://github.com/vllm-project/vllm/pull/25369
- [DP/EP][GPTOSS] Use triton matmul-ogs kernels for GPTOSS DP/EP by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/24588
- [NIXL][OOT platform] support nixl_connector with oot platform and other nixl_backend by @xuechendi in https://github.com/vllm-project/vllm/pull/25121
- [Model] Enable DP for ViT in Qwen2-VL by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25445
- Handle triton kernel import exception by @minosfuture in https://github.com/vllm-project/vllm/pull/25319
- [Frontend] Add a new xml-based tool parser for qwen3-coder by @Zhikaiiii in https://github.com/vllm-project/vllm/pull/25028
- [Misc] Move DP for ViT code inside model executor dir by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25459
- [Test]: Hermes tool parser stream output error in Qwen3 case by @ahartel in https://github.com/vllm-project/vllm/pull/25203
- [Bugfix] Fix idefics3 `tie_word_embeddings` by @Isotr0py in https://github.com/vllm-project/vllm/pull/25454
- [Core] Optimize LoRA weight loading by @jeejeelee in https://github.com/vllm-project/vllm/pull/25403
- [docs] Benchmark Serving Incorrect Arg by @vllmellm in https://github.com/vllm-project/vllm/pull/25474
- [CI/Build] Fix disabled v1 attention backend selection test by @Isotr0py in https://github.com/vllm-project/vllm/pull/25471
- [BugFix] Register expert_map as named buffer for wake_up and sleep by @wuxibin89 in https://github.com/vllm-project/vllm/pull/25458
- [P/D] Support NIXL connector to disconnect during a clean shutdown by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24423
- [test/doc] make NixlConnector example more clear by @panpan0000 in https://github.com/vllm-project/vllm/pull/24249
- [XPU] Fix MOE DP accuracy issue on XPU by @faaany in https://github.com/vllm-project/vllm/pull/25465
- [UX] Change kv-cache-memory log level to debug by @mgoin in https://github.com/vllm-project/vllm/pull/25479
- [V1] Remove V0 code paths for Hybrid models by @tdoublep in https://github.com/vllm-project/vllm/pull/25400
- [Core/DBO][2/N] Dual-Batch Overlap add DeepEP High Throughput support and Prefill support by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/24845
- Add backward compatibility for `GuidedDecodingParams` by @hmellor in https://github.com/vllm-project/vllm/pull/25422
- [Kernels] Support blocked fp8 quantization for compressed tensors MoE by @bnellnm in https://github.com/vllm-project/vllm/pull/25219
- [BugFix] Fix UB in per_token_group_quant.cu by @rivos-shreeasish in https://github.com/vllm-project/vllm/pull/24913
- [Log] Optimize kv cache memory log from Bytes to GiB by @yewentao256 in https://github.com/vllm-project/vllm/pull/25204
- Use macro guard CUDA functions for back compatibility in grouped_topk_kernel.cu by @minosfuture in https://github.com/vllm-project/vllm/pull/25346
- [V1][Kernel] Add triton implementation for `reshape_and_cache_flash` by @bringlein in https://github.com/vllm-project/vllm/pull/24503
- [Misc] Reduce initialization time of auto_tune by @wdhongtw in https://github.com/vllm-project/vllm/pull/23682
- [Spec Decode][CI] Add e2e test for `examples/spec_decode.py` and prevent breaking Acceptance Length by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/24531
- [Core] Ensure LoRA linear respect the base_layer's tp_size and tp_rank by @jeejeelee in https://github.com/vllm-project/vllm/pull/25487
- [ROCm] Add skinny gemm bias support for dtypes fp16,bf16,fp8 by @amd-hhashemi in https://github.com/vllm-project/vllm/pull/24988
- [core] add nccl symmetric memory for all reduce by @Amir-19 in https://github.com/vllm-project/vllm/pull/24532
- [Performance] Move apply_w8a8_block_fp8_linear to an op class by @ElizaWszola in https://github.com/vllm-project/vllm/pull/24666
- [Perf] Change default CUDAGraphMode from PIECEWISE to FULL_AND_PIECEWISE by @mgoin in https://github.com/vllm-project/vllm/pull/25444
- [Speculators][Speculative Decoding] Fix gpt-oss eagle3 accuracy issue by @jiahanc in https://github.com/vllm-project/vllm/pull/25406
- [Bugfix] Lower gpt-oss max cudagraph size to 992 to be compatible with FA3 by @mgoin in https://github.com/vllm-project/vllm/pull/25508
- Enable symmetric memory all reduce by default only enabling for TP by @ilmarkov in https://github.com/vllm-project/vllm/pull/25070
- [CI] Fix Pre-commit Issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/25497
- [Bugfix] gpt-oss container tool output bug by @alecsolder in https://github.com/vllm-project/vllm/pull/25485
- [Build] Update Xgrammar to 0.1.25 by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/25467
- [Bugfix] Fix for the import error from #24588 by @gshtras in https://github.com/vllm-project/vllm/pull/25481
- [CI/Build] Fix and re-enable v1 PP test on CI by @Isotr0py in https://github.com/vllm-project/vllm/pull/25496
- [Core] Use KVCacheBlock as much as possible instead of dict[block_id, KVCacheBlock] by @Jialin in https://github.com/vllm-project/vllm/pull/24830
- [V0 Deprecation] Remove placeholder attn by @tdoublep in https://github.com/vllm-project/vllm/pull/25510
- Add VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE & VLLM_ENABLE_INDUCTOR_COORDINA… by @rouchenzi in https://github.com/vllm-project/vllm/pull/25493
- Fix triton_reshape_and_cache_flash.py triton import by @mgoin in https://github.com/vllm-project/vllm/pull/25522
- [gpt-oss][bugfix] remove logic to require resp_ in ResponseAPI by @qandrew in https://github.com/vllm-project/vllm/pull/25428
- Remove redundant mutates_args and dispatch_key for direct_register_custom_op by @mgoin in https://github.com/vllm-project/vllm/pull/25512
- [BugFix] Fix OOM in vLLM replicas by ensuring consistent NCCL memory accounting by @kouroshHakha in https://github.com/vllm-project/vllm/pull/25359
- Add `VLLM_NVTX_SCOPES_FOR_PROFILING=1` to enable `nvtx.annotate` scopes by @coreylowman in https://github.com/vllm-project/vllm/pull/25501
- [Kernel] [Mamba] Remove BLOCK_H=1 from list of tuneable configurations for `_chunk_cumsum_fwd_kernel` by @tdoublep in https://github.com/vllm-project/vllm/pull/25197
- [ROCm] Small functional changes for gptoss by @jpvillam-amd in https://github.com/vllm-project/vllm/pull/25201
- [Perf] Increase default max splits for FA3 full cudagraphs by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25495
- [Bugfix] [B200] cutlass_mla - ensure kv_split == 1 for batch size > 1 by @alexm-redhat in https://github.com/vllm-project/vllm/pull/25509
- [BugFix] AssertionError: Do not capture num_reqs > max_num_reqs for uniform batch by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25505
- Improve output when failing json.loads() on structured output test by @dougbtv in https://github.com/vllm-project/vllm/pull/25483
- Add CUTLASS FP8 MOE benchmark scripts and kernel config by @chenxi-yang in https://github.com/vllm-project/vllm/pull/25302
- [Bug] Fix AttributeError: 'FusedMoE' object has no attribute 'w13_weight_scale'. Did you mean: 'w13_weight_scale_inv' by @yewentao256 in https://github.com/vllm-project/vllm/pull/25519
- [BUG] Allows for RunAI Streamer and Torch.compile cache to be used together by @ahao-anyscale in https://github.com/vllm-project/vllm/pull/24922
- [Model] Support SeedOss Reason Parser by @LuYanFCP in https://github.com/vllm-project/vllm/pull/24263
- [V1][Metrics] Add per-request TPOT histogram by @baxingpiaochong in https://github.com/vllm-project/vllm/pull/24015
- [Bugfix] Use a separate FlashInfer workspace buffer for trtllm-gen by @benchislett in https://github.com/vllm-project/vllm/pull/25520
- [Core] Support weight_loader_v2 for `UnquantizedLinearMethod` by @kylesayrs in https://github.com/vllm-project/vllm/pull/23036
- [Compile] Fix AMD Compile Error by @yewentao256 in https://github.com/vllm-project/vllm/pull/25518
- [BugFix] Fix MLA assert with CUTLASS MLA by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25478
- [fix]: add Arm 4bit fused moe support by @nikhil-arm in https://github.com/vllm-project/vllm/pull/23809
- [KV sharing] Re-land Gemma3n model changes from #22628 by @sarckk in https://github.com/vllm-project/vllm/pull/24357
- [Spec Decode] Enable FlashInfer Spec Decoding by @benchislett in https://github.com/vllm-project/vllm/pull/25196
- [Perf] Fix jit compiles at runtime of fla gated delta rule by @coreylowman in https://github.com/vllm-project/vllm/pull/25432
- [Bugfix] [Frontend] Cleanup gpt-oss non-streaming chat tool calls by @bbrowning in https://github.com/vllm-project/vllm/pull/25514
- [TPU][Bugfix] fix the missing apply_model in tpu worker by @yaochengji in https://github.com/vllm-project/vllm/pull/25526
- [Misc] Retry HF processing if "Already borrowed" error occurs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25535
- [Bugfix][CPU] Skip unsupported custom op register on CPU by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/25534
- [CI/Build] Fix v1 OOT registration test by @Isotr0py in https://github.com/vllm-project/vllm/pull/25547
- [Misc]] Move processing context to multimodal directory by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25548
- [CI/Build] add nightly prime-rl integration tests by @Jackmin801 in https://github.com/vllm-project/vllm/pull/25207
- [V0 Deprecation] Remove max_seq_len_to_capture by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25543
- [BugFix] Potential Fix for FA3 full-cudagraph IMA by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25490
- [misc] update the warning message by @youkaichao in https://github.com/vllm-project/vllm/pull/25566
- [Bugfix] Fix dummy video number of frames calculation by @ywang96 in https://github.com/vllm-project/vllm/pull/25553
- [Bug] fix import and unit test by @jmkuebler in https://github.com/vllm-project/vllm/pull/25558
- [Benchmark] Fix regression in structured output benchmark by @russellb in https://github.com/vllm-project/vllm/pull/25500
- [docs] fix nixl kv_connector_extra_config.backends key by @panpan0000 in https://github.com/vllm-project/vllm/pull/25565
- [Bugfix] Fix DeepSeekV31ToolParser to correctly parse multiple tools in non-streaming output by @taohui in https://github.com/vllm-project/vllm/pull/25405
- Move `DeviceConfig`, `ObservabilityConfig`, `SpeechToTextConfig` to their own files by @hmellor in https://github.com/vllm-project/vllm/pull/25564
- [Misc] Improve type annotations for jsontree by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25577
- [ROCm][Bugfix] Only enable +rms_norm based on aiter if not explicitly disabled by @gshtras in https://github.com/vllm-project/vllm/pull/25275
- [ROCm][Build][Bugfix] Fix ROCm base docker whls installation order by @gshtras in https://github.com/vllm-project/vllm/pull/25415
- Fixes and updates to bench_per_token_quant_fp8 by @mgoin in https://github.com/vllm-project/vllm/pull/25591
- [Bugfix] add cache model when from object storage get model by @lengrongfu in https://github.com/vllm-project/vllm/pull/24764
- Support mnnvl all2allv from Flashinfer by @wenscarl in https://github.com/vllm-project/vllm/pull/21003
- Suppress benign cuBLAS warning when capturing cudagraphs with DBO by @SageMoore in https://github.com/vllm-project/vllm/pull/25596
- [Docs] Enable `fail_on_warning` for the docs build in CI by @hmellor in https://github.com/vllm-project/vllm/pull/25580
- [V0 Deprecation] Remove unused classes in attention by @WoosukKwon in https://github.com/vllm-project/vllm/pull/25541
- [Logging] Improve log for when DeepEP HT disables CUDA Graphs by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/25531
- feat: BF16 FlashInfer Fused Cutlass MOE for Hopper and Blackwell Expert Parallel by @djmmoss in https://github.com/vllm-project/vllm/pull/25503
- [Refactor] Use DeepGEMM Col Major TMA Aligned Tensor by @yewentao256 in https://github.com/vllm-project/vllm/pull/25517
- Improve `--help` for enhanced user experience by @hmellor in https://github.com/vllm-project/vllm/pull/24903
- [MISC] replace c10::optional with std::optional by @842974287 in https://github.com/vllm-project/vllm/pull/25602
- [Model] Improve DotsOCRForCausalLM by @jeejeelee in https://github.com/vllm-project/vllm/pull/25466
- [Kernel] Support DCP for Triton backend by @frank-wei in https://github.com/vllm-project/vllm/pull/25132
- [Bug] Dynamo Unsupported due to `BasevLLMParameter.torch_function` calling disabled super() by @yewentao256 in https://github.com/vllm-project/vllm/pull/25613
- Enable Fbgemm NVFP4 on Dense models by @samanamp in https://github.com/vllm-project/vllm/pull/25609
- [Model] Add LongCat-Flash by @OftenDream in https://github.com/vllm-project/vllm/pull/23991
- optimize: eliminate duplicate split_enc_dec_inputs calls by @nicole-lihui in https://github.com/vllm-project/vllm/pull/25573
- [Bugfix] fix apply_temperature to avoid nan in probs by @courage17340 in https://github.com/vllm-project/vllm/pull/24734
- [Misc] Simplify PoolerOutput and move to `v1/outputs` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25629
- Map CwmForCausalLM to llama and LlamaForCausalLM by @jacobkahn in https://github.com/vllm-project/vllm/pull/25611
- typo: remove duplicate `is` by @nicole-lihui in https://github.com/vllm-project/vllm/pull/25641
- Revert "[Performance] Move apply_w8a8_block_fp8_linear to an op class… by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/25607
- [fix] Update torch version in cpu-build.txt for AArch64/ppc64le and Darwin by @fadara01 in https://github.com/vllm-project/vllm/pull/25579
- [Misc] Fix Qwen3-VL `video_grid_thw` typing by @ywang96 in https://github.com/vllm-project/vllm/pull/25646
- [Bugfix] Add triton.language.tensor placeholder by @adobrzyn in https://github.com/vllm-project/vllm/pull/25649
- [Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for video profiling by @Isotr0py in https://github.com/vllm-project/vllm/pull/25648
- [mypy] Further improve MM type annotations by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25654
- [Bugfix] Parse SpeculativeConfig Error by @yyzxw in https://github.com/vllm-project/vllm/pull/25142
- [V0 deprecation] Remove unreachable model_config.supported_tasks by @noooop in https://github.com/vllm-project/vllm/pull/25642
- Add backward compatibility for `guided_...` API by @hmellor in https://github.com/vllm-project/vllm/pull/25615
- [CI/Build] Fix flaky entrypoints test by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25663
- [XPU][Triton]add xpu config in triton_reshape_and_cache_flash by @jikunshang in https://github.com/vllm-project/vllm/pull/25643
- [Hardware][RISC-V] Add riscv64 support for vLLM with scalar by @langc23 in https://github.com/vllm-project/vllm/pull/22112
- [mypy] Fix wrong type annotations related to tuple by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25660
- [misc] warning by default for hanging / busy / idle by @youkaichao in https://github.com/vllm-project/vllm/pull/25627
- [torch.compile] Make Query Quantization Fusable by @jmkuebler in https://github.com/vllm-project/vllm/pull/24914
- [CPU] update torch 2.8 and fix missing fields in TorchSDPAMetadata by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/25652
- [ux] Switch a warning to debug about a pytorch fallback by @russellb in https://github.com/vllm-project/vllm/pull/23750
- [Bugfix] Fix InternS1 video processing after Transformers v4.56 by @Isotr0py in https://github.com/vllm-project/vllm/pull/25644
- [Misc] Remove cruft file in repo by @NickLucche in https://github.com/vllm-project/vllm/pull/25678
- [Logging] Remove TORCH_NCCL_AVOID_RECORD_STREAMS to squash a warning by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/25532
- [BUGFIX] Fix crash in Eagle Speculative Decoding models when exceedin… by @AlonKejzman in https://github.com/vllm-project/vllm/pull/24662
- Revert "[Bug] Dynamo Unsupported due to
BasevLLMParameter.torch_function
calling disabled super()" by @mgoin in https://github.com/vllm-project/vllm/pull/25681 - [BugFix] Fix DBO hang by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25625
- [Model] Add optional parameter to reasoning parser constructor by @taohui in https://github.com/vllm-project/vllm/pull/25554
- [Model] Define `merge_by_field_config` MM interface by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25676
- [V0 deprecation] Clean up V0 fallback in compilation config by @Isotr0py in https://github.com/vllm-project/vllm/pull/25675
- [V0 deprecation] Remove _VLLM_V1 suffixes from attention backend names by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/25489
- [V0 deprecation] Clean up LoRA by @jeejeelee in https://github.com/vllm-project/vllm/pull/25686
- [Misc] Simplify `test_argsort_mm_positions` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25690
- [Optimization] Streamline `InputPreprocessor` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25702
- [Optimization] Use a cheaper cache key in `get_model_architecture` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25682
- [Spec Decode] Add Batch Parallel Ngram. Upto 8x lower overhead. by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/24986
- [Core] Enable command line logging for LLMEngine by @zhuohan123 in https://github.com/vllm-project/vllm/pull/25610
- [Model] rename NemotronH_Nano_VL -> NemotronH_Nano_VL_V2 by @tomeras91 in https://github.com/vllm-project/vllm/pull/25708
- Fix routing_bias dtype by @wenscarl in https://github.com/vllm-project/vllm/pull/25711
- [Refactor] Remove DeepGEMM OP Register by @yewentao256 in https://github.com/vllm-project/vllm/pull/25710
- [Misc] Don't log shm dequeue delay warning on worker side by @njhill in https://github.com/vllm-project/vllm/pull/25720
- Llamas 3.1 405B fp4 changes upstreaming from 355_wip by @maleksan85 in https://github.com/vllm-project/vllm/pull/25135
- [Core] Force PIECEWISE CUDAGraph mode for encoder-decoder by @russellb in https://github.com/vllm-project/vllm/pull/25701
- [Misc] Remove unnecessary memoryviews in shm_broadcast.py by @njhill in https://github.com/vllm-project/vllm/pull/25721
- EVS Support (Video tokens pruning) by @BloodAxe in https://github.com/vllm-project/vllm/pull/22980
- [CI/Build] fix doc build warning: Failed to get 'name: description' pair by @yitingdc in https://github.com/vllm-project/vllm/pull/25733
- fix: revert cast to cpu in `MsgpackEncoder._encode_tensor` to avoid hidden performance regressions by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/25738
- perf: Avoid copying inputs_embeds tensors to GPU unless prompt_embeds is enabled by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/25739
- [Harware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI300X by @xaguilar-amd in https://github.com/vllm-project/vllm/pull/25703
- fix: print outputt offline_inference/base/chat.py example by @Iceber in https://github.com/vllm-project/vllm/pull/25744
- [Qwen3-Next][GDN] fixes cuda graph capturing bug in GDN metadata and a stride bug in causal_conv_1d. by @sighingnow in https://github.com/vllm-project/vllm/pull/25743
- Remove cuda hard-code in compute_causal_conv1d_metadata by @wxsIcey in https://github.com/vllm-project/vllm/pull/25555
- [misc] refactor speculative config by @yyzxw in https://github.com/vllm-project/vllm/pull/25657
- [Bugfix] Fix Shared Expert/Zero expert code in FusedMoE.process_chunk by @SageMoore in https://github.com/vllm-project/vllm/pull/25698
- Support LongCat-Flash-Chat tool call by @Xu-Wenqing in https://github.com/vllm-project/vllm/pull/24083
- [Doc] Update Batch-level DP docs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25757
- [Model] Mamba2 varlen and metadata refactor by @cyang49 in https://github.com/vllm-project/vllm/pull/21467
- [CI] Fix test_shared_storage_connector_hashes by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/25748
- [Bugfix] Properly abort pooling request. by @noooop in https://github.com/vllm-project/vllm/pull/25734
- [CI/Build] Split up Distributed Tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25572
- [CI/Build] Fix some V1 tests not being run by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/25569
- [Quantization] Add field to skip unquantized modules for GPTQ config by @Isotr0py in https://github.com/vllm-project/vllm/pull/25455
- [BugFix] Fix using `dbo_decode_token_threshold` always (and ignoring `dbo_prefill_token_threshold`) by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/25622
- [ray][metrics] Replace ':' with '_' for OpenTelemetry compatibility in Ray by @eicherseiji in https://github.com/vllm-project/vllm/pull/25439
- [Fix][torch.compile] fix unique_filepath by @ZJY0516 in https://github.com/vllm-project/vllm/pull/25732
- Eagle3 that supports the Minicpm3 model by @LDLINGLINGLING in https://github.com/vllm-project/vllm/pull/24243
- [Doc]: improve CPU(x86) build-wheel-from-source section by @brokedba in https://github.com/vllm-project/vllm/pull/25617
New Contributors
- @SamitHuang made their first contribution in #24733
- @rozeappletree made their first contribution in #24785
- @ChenTaoyu-SJTU made their first contribution in #24732
- @ziliangpeng made their first contribution in #24793
- @chenfengjin made their first contribution in #24823
- @LCAIZJ made their first contribution in #23917
- @xiao-llm made their first contribution in #22222
- @koiker made their first contribution in #20321
- @cboss6 made their first contribution in #23745
- @liangwen12year made their first contribution in #23498
- @lianyiibo made their first contribution in #24571
- @tahsintunan made their first contribution in #24342
- @haoyangli-amd made their first contribution in #24649
- @rouchenzi made their first contribution in #22842
- @xinyu-intel made their first contribution in #25026
- @shijun-yin made their first contribution in #24668
- @Aidyn-A made their first contribution in #24599
- @dolpm made their first contribution in #23774
- @samzong made their first contribution in #25010
- @mmangkad made their first contribution in #24766
- @ahao-anyscale made their first contribution in #25092
- @karan made their first contribution in #25076
- @toncao made their first contribution in #24960
- @666even666 made their first contribution in #23909
- @lumina37 made their first contribution in #24886
- @gfinol made their first contribution in #25081
- @punitvara made their first contribution in #25058
- @gigit0000 made their first contribution in #24222
- @Rohan138 made their first contribution in #25104
- @candyzone made their first contribution in #24585
- @wxsIcey made their first contribution in #25243
- @LJH-LBJ made their first contribution in #25224
- @David-Wen2025 made their first contribution in #24818
- @alecsolder made their first contribution in #24985
- @Lucaskabela made their first contribution in #25090
- @manoelmarques made their first contribution in #23558
- @lirong-lirong made their first contribution in #25308
- @KKSK-DON made their first contribution in #25325
- @debroy-rh made their first contribution in #25285
- @Sara-KS made their first contribution in #23445
- @Daisy-Ma-coder made their first contribution in #25274
- @jcyang43 made their first contribution in https://github.com/vllm-project/vllm/pull/25278
- @Zhikaiiii made their first contribution in https://github.com/vllm-project/vllm/pull/25028
- @ahartel made their first contribution in https://github.com/vllm-project/vllm/pull/25203
- @wuxibin89 made their first contribution in https://github.com/vllm-project/vllm/pull/25458
- @rivos-shreeasish made their first contribution in https://github.com/vllm-project/vllm/pull/24913
- @Amir-19 made their first contribution in https://github.com/vllm-project/vllm/pull/24532
- @coreylowman made their first contribution in https://github.com/vllm-project/vllm/pull/25501
- @LuYanFCP made their first contribution in https://github.com/vllm-project/vllm/pull/24263
- @baxingpiaochong made their first contribution in https://github.com/vllm-project/vllm/pull/24015
- @Jackmin801 made their first contribution in https://github.com/vllm-project/vllm/pull/25207
- @taohui made their first contribution in https://github.com/vllm-project/vllm/pull/25405
- @OftenDream made their first contribution in https://github.com/vllm-project/vllm/pull/23991
- @nicole-lihui made their first contribution in https://github.com/vllm-project/vllm/pull/25573
- @jacobkahn made their first contribution in https://github.com/vllm-project/vllm/pull/25611
- @fadara01 made their first contribution in https://github.com/vllm-project/vllm/pull/25579
- @langc23 made their first contribution in https://github.com/vllm-project/vllm/pull/22112
- @AlonKejzman made their first contribution in https://github.com/vllm-project/vllm/pull/24662
- @BloodAxe made their first contribution in https://github.com/vllm-project/vllm/pull/22980
- @yitingdc made their first contribution in https://github.com/vllm-project/vllm/pull/25733
- @xaguilar-amd made their first contribution in https://github.com/vllm-project/vllm/pull/25703
- @Iceber made their first contribution in https://github.com/vllm-project/vllm/pull/25744
- @LDLINGLINGLING made their first contribution in https://github.com/vllm-project/vllm/pull/24243
- @brokedba made their first contribution in https://github.com/vllm-project/vllm/pull/25617
Full Changelog: v0.10.2...v0.11.0rc5