vllm-project/vllm v0.10.2rc3 on GitHub

Highlights

This release contains 740 commits from 266 contributors (97 new)!

Breaking Changes: This release includes PyTorch 2.8.0 upgrade, V0 deprecations, and API changes - please review the changelog carefully.

Model Support

New model families and enhancements: Apertus (#23068), LFM2 (#22845), MiDashengLM (#23652), Motif-1-Tiny (#23414), Seed-Oss (#23241), Google EmbeddingGemma-300m (#24318), GTE sequence classification (#23524), Donut OCR model (#23229), KeyeVL-1.5-8B (#23838), R-4B vision model (#23246), Ernie4.5 VL (#22514), MiniCPM-V 4.5 (#23586), Ovis2.5 (#23084), Qwen3-Next with hybrid attention (#24526), InternVL3.5 with video support (#23658), Qwen2Audio embeddings (#23625), NemotronH Nano VLM (#23644), BLOOM V1 engine support (#23488), and Whisper encoder-decoder for V1 (#21088).
Pipeline parallelism expansion: Added PP support for Hunyuan (#24212), Ovis2.5 (#23405), GPT-OSS (#23680), and Kimi-VL-A3B-Thinking-2506 (#23114).
Data parallelism for vision models: Enabled DP for ViT across Qwen2.5VL (#22742), MiniCPM-V (#23948, #23327), Kimi-VL (#23817), and GLM-4.5V (#23168).
LoRA ecosystem expansion: Added LoRA support to Voxtral (#24517), Qwen-2.5-Omni (#24231), and DeepSeek models V2/V3/R1-0528 (#23971), with significantly faster LoRA startup performance (#23777).
Classification and pooling enhancements: Multi-label classification support (#23173), logit bias and sigmoid normalization (#24031), and FP32 precision heads for pooling models (#23810).
Performance optimizations: Removed unnecessary CUDA sync from GLM-4.1V (#24332) and Qwen2VL (#24334) preprocessing, eliminated redundant all-reduce in Qwen3 MoE (#23169), optimized InternVL CPU threading (#24519), and GLM4.5-V video frame decoding (#24161).

Engine Core

V1 engine maturation: Extended V1 support to compute capability < 8.0 (#23614, #24022), added cross-attention KV cache for encoder-decoder models (#23664), request-level logits processor integration (#23656), and KV events from connectors (#19737).
Backend expansion: Terratorch backend integration (#23513) enabling non-language model tasks like semantic segmentation and geospatial applications with --model-impl terratorch support.
Hybrid and Mamba model improvements: Enabled full CUDA graphs by default for hybrid models (#22594), disabled prefix caching for hybrid/Mamba models (#23716), added FP32 SSM kernel support (#23506), full CUDA graph support for Mamba1 (#23035), and V1 as default for Mamba models (#23650).
Performance core improvements: --safetensors-load-strategy for NFS based file loading acceleration (#24469), critical CUDA graph capture throughput fix (#24128), scheduler optimization for single completions (#21917), multi-threaded model weight loading (#23928), and tensor core usage enforcement for FlashInfer decode (#23214).
Multimodal enhancements: Multimodal cache tracking with mm_hash (#22711), UUID-based multimodal identifiers (#23394), improved V1 video embedding estimation (#24312), and simplified multimodal UUID handling (#24271).
Sampling and structured outputs: Support for all prompt logprobs (#23868), final logprobs (#22387), grammar bitmask optimization (#23361), and user-configurable KV cache memory size (#21489).

Hardware & Performance

NVIDIA Blackwell/SM100 generation: FP8 MLA support with CUTLASS backend (#23289), DeepGEMM Linear with 1.5% E2E throughput improvement (#23351), Hopper DeepGEMM E8M0 for DeepSeekV3.1 (#23666), SM100 FlashInfer CUTLASS MoE FP8 backend (#22357), MXFP4 fused CUTLASS MoE (#23696), default MXFP4 MoE on Blackwell (#23008), and GPT-OSS DP/EP support with 52,003 tokens/s throughput (#23608).
Breaking change: FlashMLA disabled on Blackwell GPUs due to compatibility issues (#24521).
Kernel and attention optimizations: FlashAttention MLA with CUDA graph support (#14258, #23958), V1 cross-attention support (#23297), FP8 support for FlashMLA (#22668), fused grouped TopK for MoE (#23274), Flash Linear Attention kernels (#24518), and W4A8 support on Hopper (#23198).
Performance improvements: 13.7x speedup for token conversion (#20413), TTIT/TTFT improvements for disaggregated serving (#22760), symmetric memory all-reduce by default (#24111), FlashInfer warmup during startup (#23439), V1 model execution overlap (#23569), and various Triton configuration tuning (#23748, #23939).
Platform expansion: Apple Silicon bfloat16 support for M2+ (#24129), IBM Z V1 engine support (#22725), Intel XPU torch.compile (#22609), XPU MoE data parallelism (#22887), XPU Triton attention (#24149), XPU FP8 quantization (#23148), and ROCm pipeline parallelism with Ray (#24275).
Model-specific optimizations: Hardware-tuned MoE configurations for Qwen3-Next on B200/H200/H100 (#24698, #24688, #24699, #24695), GLM-4.5-Air-FP8 B200 configs (#23695), Kimi K2 optimization (#24597), and QWEN3 Coder/Thinking configs (#24266, #24330).

Quantization

New quantization capabilities: Per-layer quantization routing (#23556), GGUF quantization with layer skipping (#23188), NFP4+FP8 MoE support (#22674), W4A8 channel scales (#23570), and AMD CDNA2/CDNA3 FP4 support (#22527).
Advanced quantization infrastructure: Compressed tensors transforms for linear operations (#22486) enabling techniques like SpinQuantR1R2R4 and QuIP quantization methods.
FlashInfer quantization integration: FP8 KV cache for TRTLLM prefill attention (#24197), FP8-qkv attention kernels (#23647), and FP8 per-tensor GEMMs (#22895).
Platform-specific quantization: ROCm TorchAO quantization enablement (#24400) and TorchAO module swap configuration (#21982).
Performance optimizations: MXFP4 MoE loading cache optimization (#24154) and compressed tensors version updates (#23202).
Breaking change: Removed original Marlin quantization format (#23204).

API & Frontend

OpenAI API enhancements: Gemma3n audio transcription/translation endpoints (#23735), transcription response usage statistics (#23576), and return_token_ids parameter (#22587).
Response API improvements: Streaming support for non-harmony responses (#23741), non-streaming logprobs (#23319), MCP tool background mode (#23494), MCP streaming+background support (#23927), and tool output token reporting (#24285).
Frontend optimizations: Error stack traces with --log-error-stack (#22960), collective RPC endpoint (#23075), beam search concurrency optimization (#23599), unnecessary detokenization skipping (#24236), and custom media UUIDs (#23449).
Configuration enhancements: Formalized --mm-encoder-tp-mode flag (#23190), VLLM_DISABLE_PAD_FOR_CUDAGRAPH environment variable (#23595), EPLB configuration parameter (#20562), embedding endpoint chat request support (#23931), and LM Format Enforcer V1 integration (#22564).

Dependencies

Major updates: PyTorch 2.8.0 upgrade (#20358) - breaking change requiring environment updates, FlashInfer v0.3.0 upgrade (#24086), and FlashInfer 0.2.14.post1 maintenance update (#23537).
Supporting updates: XGrammar 0.1.23 (#22988), TPU core dump fix with tpu_info 0.4.0 (#23135), and compressed tensors version bump (#23202).
Deployment improvements: FlashInfer cubin directory environment variable (#22675) for offline environments and pre-cached CUDA binaries.

V0 Deprecation

Backend removals: V0 Neuron backend deprecation (#21159), V0 pooling model support removal (#23434), V0 FlashInfer attention backend removal (#22776), and V0 test cleanup (#23418, #23862).
API breaking changes: prompt_token_ids fallback removal from LLM.generate and LLM.embed (#18800), LoRA extra vocab size deprecation warning (#23635), LoRA bias parameter deprecation (#24339), and metrics naming change from TPOT to ITL (#24110).

Breaking Changes

PyTorch 2.8.0 upgrade - Environment dependency change requiring updated CUDA versions
FlashMLA Blackwell restriction - FlashMLA disabled on Blackwell GPUs due to compatibility issues
V0 feature removals - Neuron backend, pooling models, FlashInfer attention backend
Quantizations - Removed quantized Mixtral hack implementation, and original Marlin format.
Metrics renaming - TPOT deprecated in favor of ITL

What's Changed

[Misc] Minor code cleanup for _get_prompt_logprobs_dict by @WoosukKwon in #23064
[Misc] enhance static type hint by @andyxning in #23059
[Bugfix] fix Qwen2.5-Omni processor output mapping by @DoubleVII in #23058
[Bugfix][CI] Machete kernels: deterministic ordering for more cache hits by @andylolu2 in #23055
[Misc] refactor function name by @andyxning in #23029
[Misc] Fix backward compatibility from #23030 by @ywang96 in #23070
[XPU] Fix compile size for xpu by @jikunshang in #23069
[XPU][CI]add xpu env vars in CI scripts by @jikunshang in #22946
[Refactor] Define MultiModalKwargsItems separate from MultiModalKwargs by @DarkLight1337 in #23053
[Bugfix] fix IntermediateTensors equal method by @andyxning in #23027
[Refactor] Get prompt updates earlier by @DarkLight1337 in #23097
chore: remove unnecessary patch_padding_side for the chatglm model by @carlory in #23090
[Bugfix] Support compile for Transformers multimodal by @zucchini-nlp in #23095
[CI Bugfix] Pin openai<1.100 to unblock CI by @mgoin in #23118
fix: OpenAI SDK compat (ResponseTextConfig) by @h-brenoskuk in #23126
Use Blackwell FlashInfer MXFP4 MoE by default if available by @mgoin in #23008
Install tpu_info==0.4.0 to fix core dump for TPU by @xiangxu-google in #23135
[Misc] Minor refactoring for prepare_inputs by @WoosukKwon in #23116
[Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT by @WoosukKwon in #23041
[Misc] Add @tdoublep as a maintainer of hybrid model and Triton-attention related code by @tdoublep in #23122
[CI][V0 Deprecation] Removed V0 Only Chunked Prefill and Prefix Caching Tests by @robertgshaw2-redhat in #22871
[V0 Deprecation] Remove V0 FlashInfer attention backend by @WoosukKwon in #22776
chore: disable enable_cpp_symbolic_shape_guards by @xiszishu in #23048
[TPU] make ptxla not imported when using tpu_commons by @yaochengji in #23081
[Hardware][IBM Z]Enable v1 for s390x and s390x dockerfile fixes by @nikheal2 in #22725
Migrate InternVLImagePixelInputs (in nemotron_vl.py) to TensorSchema by @bbeckca in #22023
[Log] Warning Once for Cutlass MLA by @yewentao256 in #23137
[Model] Support Pipeline Parallelism for moonshotai/Kimi-VL-A3B-Thinking-2506 by @ZJY0516 in #23114
[misc] split engine_model into json file for nsys profile tool by @gracehonv in #23117
[Benchmark] Add flag --served-model-name to benchmark_serving_multi_turn by @pliops-daniels in #22889
Fix GLM-4.5V-FP8 numerical issue by @zixi-qi in #22949
[Misc] Add request_id into benchmark_serve.py by @hustxiayang in #23065
[Bugfix] Fix broken Minimax-01-VL model by @Isotr0py in #22116
[bug fix] Fix llama4 spec decoding by @zixi-qi in #22691
[Misc] Avoid accessing req_ids inside a loop by @WoosukKwon in #23159
[Doc] use power of 2 by @Tialo in #23172
[Misc] Fix seq_lens for graph capture by @WoosukKwon in #23175
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel by @elvischenv in #21716
[Model] Add transformers problem_type (e.g. multi_label_classification) support by @noooop in #23173
[Model] support new model ovis2.5 by @myselvess in #23084
[Bugfix] Fix benchmark_moe.py by @jeejeelee in #23177
[FEAT] [Performance] Enable DP for ViT in Qwen2.5VL by @tjtanaa in #22742
[Model] Removes redundant all-reduce operation in Qwen3MoeSparseMoeBlock by @yiz-liu in #23169
Add return_token_ids parameter to OpenAI API endpoints by @ultmaster in #22587
Migrate LlavaOnevisionMultiInputs to TensorSchema by @bbeckca in #21844
[CI/Build] Update transformers to v4.55.2 by @Isotr0py in #23093
[Misc] Fix the benchmark's README and improve the error messages for the benchmark's argument checks by @tanruixiang in #22654
[Frontend] Add /collective_rpc API endpoint by @22quinn in #23075
[Misc] Enable yapf for FlashInfer backend by @WoosukKwon in #23193
[Bugfix] Fix accuracy issue when using flashinfer cutlass moe, TP=1 and modelopt. by @bnellnm in #23125
fix: use cache_salt for gpt-oss by @dr75 in #23186
[Misc] Minor refactoring for FlashInfer backend by @WoosukKwon in #23147
[CI/Build] Add support for Python 3.13 by @mgoin in #13164
[NVIDIA] Add SM100 Flashinfer Cutlass MoE fp8 backend by @amirkl94 in #22357
[CI/Build] Replace lm-eval gsm8k tests with faster implementation by @mgoin in #23002
[BugFix] fix CUTLASS MLA full cudagraph by @LucasWilkinson in #23200
[Benchmarks] Add video inputs to ShareGPTDataset. by @huachenheli in #23199
[Quantization] Bump Compressed Tensors Version by @kylesayrs in #23202
[Core] Optimize scheduler request removal for single completions by @chi2liu in #21917
[CI Perf] Only test bfloat16 for tests/compile/test_fusion_all_reduce.py by @mgoin in #23132
[Core] Add torch profiler CPU traces for AsyncLLM. by @huachenheli in #21794
[Doc] Update V1 status of various pooling models by @DarkLight1337 in #23189
[Attention] Optimize make_local_attention_virtual_batches for Flash Attention by @linzebing in #23185
Fix a performance comparison issue in Benchmark Suite by @louie-tsai in #23047
chore: support pytorch format in lora by @KilJaeeun in #22790
[CI/Build] Also check DP in benchmarks throughput script by @zhewenl in #23038
[CI/Build] Sync multimodal tests by @DarkLight1337 in #23181
[BugFix] Fix stuck stats/metrics after requests are aborted by @njhill in #22995
fix cuda graph by @fsx950223 in #22721
[Model] use autoWeightsLoader for gptoss by @calvin0327 in #22446
Fix missing quotes by @wzshiming in #23242
[Model] Support deepseek with eagle by @xyang16 in #21086
[Bugfix] Ensure correctness of Cohere2Vision processing by @DarkLight1337 in #23245
Update to flashinfer-python==0.2.12 and disable AOT compile for non-release image by @mgoin in #23129
[Model][V1] Support Ernie MTP by @xyxinyang in #22169
[Model] Improve olmo and olmo2 by @jeejeelee in #23228
[Fix] fix offline env use local mode path by @lengrongfu in #22526
[Bugfix] Ensure correctness of HCXVision processing by @DarkLight1337 in #23254
[Kernel] CUTLASS MoE FP8: Integrate cuda moe permute/unpermute by @shixianc in #23045
[CLI][Doc] Formalize --mm-encoder-tp-mode by @DarkLight1337 in #23190
[Misc] Add max_seq_len to CommonAttentionMetadata by @WoosukKwon in #23216
[FIXBUG ] Allow disabling rocm_aiter_fa backend for ROCm GPUs not compatible with AITER by @JartX in #22795
Support conditional torch.compile per module by @sarckk in #22269
Migrate Mistral3ImagePixelInputs to TensorSchema by @bbeckca in #21945
Limit HTTP header count and size by @russellb in #23267
Small fix for Command-A-Vision by @dongluw in #23268
[Kernel/Quant] Remove the original marlin format and qqq by @mgoin in #23204
[Fix] correct tool_id for kimi-k2 when use tool_choice=required by @MoyanZitto in #21259
[Frontend] improve error logging of chat completion by @heheda12345 in #22957
[Optimization] Speed up function _convert_tokens_to_string_with_added_encoders by 13.7x by @misrasaurabh1 in #20413
Do not use eval() to convert unknown types by @russellb in #23266
[Feature] use --eplb_config to set eplb param by @lengrongfu in #20562
[misc] fix multiple arch wheels for the nightly index by @youkaichao in #23110
Remove chunked_prefill_enabled flag in V1 MLA by @MatthewBonanni in #23183
Feature/mla tests by @MatthewBonanni in #23195
[Fix] remove is_marlin param in benchmark_moe by @shixianc in #23286
[EP] Add logging for experts map by @22quinn in #22685
Remove duplicate entry in vllm.attention.all by @russellb in #23296
[CI Bugfix] Fix CI by fully removing --enable-prompt-adapter by @mgoin in #23284
[Optimization] Make new_block_ids None if empty by @WoosukKwon in #23262
[CPU] Refactor CPU W8A8 scaled_mm by @bigPYJ1151 in #23071
[CI/Build] Split out mm processor tests by @DarkLight1337 in #23260
[V1][Mamba1] - Full CUDA and Piecewise CUDA Graphs Support by @Josephasafg in #23035
[Compile] Fix Compile Warning SM100 Cutlass MLA by @yewentao256 in #23287
[Model][VLM] Support R-4B Model by @yannqi in #23246
Delete images older than 24h. by @QiliangCui in #23291
[CI] Block the cu126 wheel build while broken by @mgoin in #23285
[Sampler] Support returning final logprobs by @22quinn in #22387
[Bugfix] Fix extra whitespace in strings caused by newline by @DarkLight1337 in #23272
[BugFix] Fix Python 3.9 Support by @jaredoconnell in #23306
[Model] Add LFM2 architecture by @paulpak58 in #22845
[Refactor] Simplify code for MM budget by @DarkLight1337 in #23310
[Doc] Fix batch-level DP example by @DarkLight1337 in #23325
[Performance] V1 Pooling Models E2E Performance Optimization by @noooop in #23162
[V1] Remove unnecessary check for main thread by @robertgshaw2-redhat in #23298
[Bugfix] set system_message in phi4mini chat template by @zhuangqh in #23309
[Multimodal] Always enable hashing mm data by @ywang96 in #23308
[ci/build] Fix abi tag for aarch64 by @youkaichao in #23329
Migrate MolmoImageInputs to TensorSchema by @bbeckca in #22022
Fix nvfp4 swizzling by @yiliu30 in #23140
add tg-mxfp4-moe-test by @IwakuraRein in #22540
[Bug] Fix R1 Accuracy 0 Bug by @yewentao256 in #23294
[Bugfix] Fix port conflict by obtaining a list of open ports upfront by @minosfuture in #21894
[Misc] Misc code cleanup/simplification by @njhill in #23304
[BugFix][gpt-oss] Fix Chat Completion with Multiple Output Message by @heheda12345 in #23318
[Misc] fix VLLM_TORCH_PROFILER_DIR to absolute path by @andyxning in #23191
[Core] Always use tensor cores for Flashinfer Decode Wrapper by @pavanimajety in #23214
Make sure that vectorize_with_alignment produced vectorized global loads by @elvircrn in #23182
[Structured Outputs] Refactor bitmask construction into get_grammar_bitmask by @WoosukKwon in #23361
[CI] Clean up actions: remove helm, publish workflows and improve pr … by @simon-mo in #23377
[CI] improve pr comments bot by @simon-mo in #23380
[Perf] Small optimizations for silu_mul_fp8_quant_deep_gemm by @mgoin in #23265
Always use cache mounts when installing vllm to avoid populating pip cache in the image. Also remove apt cache. by @tvalentyn in #23270
[Feature][Responses API] Support logprobs(non-stream) by @kebe7jun in #23319
[Core] Support custom executor qualname by @22quinn in #23314
[Kernel] Add FP8 support with FlashMLA backend by @MatthewBonanni in #22668
[Deprecation] Remove prompt_token_ids arg fallback in LLM.generate and LLM.embed by @DarkLight1337 in #18800
Migrate MllamaImagePixelInputs to TensorSchema by @bbeckca in #22020
[CI/Build] Skip Idefics3 and SmolVLM generation test again by @Isotr0py in #23356
[Feature] Enable DeepGEMM Linear on B200; 1.5% E2E throughput improvement by @yewentao256 in #23351
[CI] Add end-to-end V1 min_tokens test coverage by @arjunbreddy22 in #22495
[Misc] Add gemma3 chat template with pythonic-style function calling by @philipchung in #17149
[New Model] Add Seed-Oss model by @FoolPlayer in #23241
[Attention] Refactor AttentionMetadata Preparation for Encoder-only Models by @heheda12345 in #23154
[P/D][Nixl] Make kv cache register compatible with hybrid memory allocator by @sfeng33 in #23079
[gpt-oss] add input/output usage in responses api when harmony context is leveraged by @gcalmettes in #22667
Migrate MiniCPMOAudioInputs to TensorSchema by @bbeckca in #21847
[Bugfix] Fix pooling models on non-CUDA devices by @bigPYJ1151 in #23392
[V0 Deprecation] Remove V0 LoRA test by @jeejeelee in #23418
[Misc] Move M-RoPE init logic to _init_mrope_positions by @WoosukKwon in #23422
[Attention] Allow V1 flash_attn to support cross-attention by @russellb in #23297
[misc] Remove outdate comment about runai_model_streamer by @carlory in #23421
[Doc] Update the doc for log probs + prefix caching by @heheda12345 in #23399
[Misc] local import code clean by @andyxning in #23420
[Bug fix] Dynamically setting the backend variable for genai_perf_tests in the run-nightly-benchmark script by @namanlalitnyu in #23375
[Fix] Bump triton version in rocm-build requirements by @bringlein in #21630
[Bugfix]: Installing dev environment due to pydantic incompatible version by @hickeyma in #23353
[Speculators][Speculative Decoding] Fix Qwen 2 Eagle3 Support by @PapaGoose in #23337
[BugFix] Fix the issue where image embeddings were incorrectly split.… by @bppps in #23366
fix(tests): Ensure reliable CUDA cache clearing in MoE test by @AzizCode92 in #23416
Add unit tests for batched guided and non-guided requests by @sarckk in #23389
[Doc]: fix various typos in multiple files by @didier-durand in #23179
[Model] Add Ovis2.5 PP support by @Isotr0py in #23405
[Bugfix] Fix broken Florence-2 model by @Isotr0py in #23426
[Quantization] Allow GGUF quantization to skip unquantized layer by @Isotr0py in #23188
add an env var for path to pre-downloaded flashinfer cubin files by @842974287 in #22675
[CI/Build] add EP dependencies to docker by @zhewenl in #21976
[PERF] PyTorch Symmetric Memory All-Reduce by @ilmarkov in #20759
[BugFix][AMD][Quantization] Fix torch.compile issue where wvSplitKQ not being called when it should when using quantized FP8 model by @rasmith in #22281
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv NVFP4-out Attention Kernel by @elvischenv in #22703
[BugFix] Fix batch updates for pooling models by @njhill in #23398
[BugFix] Fix MinPLogitsProcessor.update_states() by @njhill in #23401
[Model] Support DP for ViT on MiniCPM-V-4 by @david6666666 in #23327
[UX] Move Dockerfile DeepGEMM install to tools/install_deepgemm.sh by @mgoin in #23360
Quantization: support FP4 quantized models on AMD CDNA2/CDNA3 GPUs by @fengli1702 in #22527
Add glm4.5v tp2,4 fp8 config on H100_80GB by @chenxi-yang in #23443
Revert "[PERF] Use faster way of decode in tokenizer: avoid useless list-to-list conversion (#20000)" by @DarkLight1337 in #23396
fix(tests): Correct unreachable assertion in truncation test by @AzizCode92 in #23425
Support DeepSeek-V3.1 tool call by @Xu-Wenqing in #23454
[Misc] Modify CacheConfig import by @jeejeelee in #23459
[gpt-oss] Streaming Output for Python Tool by @ZJY0516 in #23409
Migrate Pixtral inputs to TensorSchema by @bbeckca in #23472
[Bugfix] Add strong reference to CUDA pluggable allocator callbacks by @22quinn in #23477
Migrate Paligemma inputs to TensorSchema by @bbeckca in #23470
[kernel] Support W4A8 on Hopper by @czhu-cohere in #23198
[Misc] update dict parse to EPLBConfig from json dumps to dict unpacking by @lengrongfu in #23305
(Misc): add missing test for zero truncation size. by @teekenl in #23457
[New Model]Donut model by @princepride in #23229
[Model] Enable BLOOM on V1 by @DarkLight1337 in #23488
[Misc] Remove unused slot_mapping buffer by @WoosukKwon in #23502
fix incompatibililty with non cuda platform for nvfp4 by @luccafong in #23478
[Doc: ]fix various typos in multiple files by @didier-durand in #23487
[Perf] Add Triton config for DeepSeek V3 FP8 EP32 H200 by @minosfuture in #23504
Frontend: Adding LM Format Enforcer support to V1 engine by @noamgat in #22564
[Bugfix] Fix Qwen2.5-VL quantized model weights loading by @zifeitong in #23512
[Misc] Unified linear print info by @jeejeelee in #23516
Migrate tarsier inputs to TensorSchema by @bbeckca in #23500
Migrate skyworkr1v inputs to TensorSchema by @bbeckca in #23499
Migrate DonutImagePixelInputs to TensorSchema by @bbeckca in #23509
[Bugfix] Fix Dense module loading for sentence-transformers embedding models (simplified V2) by @FFFfff1FFFfff in #23408
[gpt-oss] use reasoning channel for reasoning text in serving_chat by @yuguo68 in #22920
[Refactor] Dynamic target and content for prompt updates by @DarkLight1337 in #23411
[Core][Multimodal] Track encode cache entries by mm_hash and enable embedding sharing between requests by @fake0fan in #22711
[Fix] DeepSeek V3.1 tool parser error message by @skyloevil in #23492
Feature/benchmark/random mm data/images by @h-brenoskuk in #23119
[Bugfix] Allow dynamic number of patches for llava_onevision by @DarkLight1337 in #23525
[misc] add shanghai meetup by @youkaichao in #23535
[Attention] Unify mamba and attention backend selection by @ayushsatyam146 in #23171
[Doc] Add caution for API server scale-out by @DarkLight1337 in #23550
[Refactor] Pass tokenizer explicitly instead of binding to prompt update by @DarkLight1337 in #23542
Updates to Flex + VLLm integration by @drisspg in #21416
[Bugfix] Fix Qwen3 MoE GPTQ inference by @Isotr0py in #23490
[Refactor] Refactor persistent buffers with CpuGpuBuffer by @WoosukKwon in #23515
[test][RL] Add sleep level 2 test and fix reload with sleep mode by @22quinn in #23521
[Kernel] Add fused grouped_topk kernel for MoE by @xyang16 in #23274
[Bugfix][V1][P/D]Fix the issue where repeated requests for the same input produce abnormal outputs for P2pNcclConnector by @Abatom in #23403
[XPU] Delay BF16 check to worker init for spawn compatibility by @chaojun-zhang in #22979
[TPU][Bugfix] Fixes prompt_token_ids error in tpu tests. by @patemotter in #23574
[Docs] Update Documentation of Cohere Command-A Models by @Terrencezzj in #23584
[Misc] Simplify FlashInfer attention metadata by @WoosukKwon in #23585
[Misc] Add release note draft to PR template by @simon-mo in #23598
[CI Fix] Pin deepep and pplx tags in tools/ep_kernels/, gate multigpu tests by @mgoin in #23568
Update Flashinfer to 0.2.14.post1 by @weireweire in #23537
[Bug] Fix DeepGEMM Env Control by @yewentao256 in #23591
[CI/Build] Use vLLM client's user agent to fetch images by @DarkLight1337 in #23561
Remove graph_pool as member of VllmBackend and argument to CUDAGraphWrapper by @Copilot in #23385
[Disagg][Perf] Use CUDA event sync instead of blocking tolist to avoid unintentional copy ops blocking across different CUDA streams, improving disagg TTIT/TTFT by @liuzijing2014 in #22760
[CI/Build] Fix typo in #23561 by @DarkLight1337 in #23616
[fix] fix seed-oss-parser by @FoolPlayer in #23560
[mypy] Fix incorrect type hint for EAGLE3 support by @DarkLight1337 in #23617
[Benchmarks] add benchmark for embedding models by @ZJY0516 in #23000
[Docs] Fix titles for multi-file examples that are rendered in the docs by @hmellor in #23573
Fix CLI parameter documentation inconsistency in pooling_models.md by @oneraghavan in #23630
[Bugfix] Fix Qwen25VL packed_modules_mapping by @jeejeelee in #23604
[Bugfix] Fix scheduling when repeated images in one request by @ywang96 in #23544
[V1] Enable V1 for compute capability < 8.0 + FP32 by @DarkLight1337 in #23614
Fix nits from #20059 by @hmellor in #23548
Fix writing benchmark results with tuple keys by @huydhn in #23633
[Perf] Remove duplicated NVFP4 blockscales to save memory by @mgoin in #23379
[Model] fix DeepSeek e_score_correction_bias dtype to fp32 by @jeejeelee in #23640
[Bugfix] Add missing enable_log_outputs parameter to init_app_state function by @lordmathis in #23634
feat: add usage to TranscriptionResponse (text and json response_format) by @gcalmettes in #23576
Support FlashAttention Backend for Hybrid SSM Models by @heheda12345 in #23299
[Docs] Fix broken links to docs/api/summary.md by @hmellor in #23637
[Hardware][Mac] Fix the installation fail for Apple Silicon (CPU) by @OYE93 in #23565
[Kernel] Added flashinfer fp8 per-tensor gemms by @nvjullin in #22895
[Doc]: fix various spelling issues in multiple files by @didier-durand in #23636
[CPU] add cpu fused moe pytorch native implementation by @TianyuLi0 in #23146
[ROCm] Starting to add AMD code reviewers for ROCm components by @hongxiayang in #23496
[Docs] Reduce requirements for docs build by @hmellor in #23651
[Bugfix] fix bf16 multimodal model hash by @yuekaizhang in #23623
[model] support qwen2audio embedding input by @yuekaizhang in #23625
[Misc] Add override for allreduce fusion thresholds by @nvjullin in #23639
[CI] [Doc]: Add GH Action for auto labeling issues with rocm tag by @vllmellm in #20988
[Bugfix] Fix cuda event usage with CPU model runner by @bigPYJ1151 in #23643
[Docs] Fix warnings in mkdocs build by @Zerohertz in #23649
[Docs] [V1] [Hybrid] Update docs to remove FlashInfer constraint for hybrid models by @tdoublep in #23665
[v1] Add cross-attention KV cache support for encoder-decoder models by @russellb in #23664
[Bugfix] Fix incorrect original shape in hashing by @DarkLight1337 in #23672
[Misc] Fix comments in tests/kernels/quantization by @ZJY0516 in #23675
[Model] Enable video support for InternVL3.5 models by @Isotr0py in #23658
[doc] Hybrid KV Cache Manager design doc by @heheda12345 in #22688
Enhance the pre-notification policy by @sidhpurwala-huzaifa in #23532
[Docs] Move quant supported hardware table to README by @hmellor in #23663
[V1][P/D]P2pNcclConnector supports flashinfer by @Abatom in #23536
[V1] [Hybrid] Enable Full CUDA graph by default for hybrid models in V1 by @tdoublep in #22594
[Compile] Fix Cmake Warning by @yewentao256 in #23689
[Bugfix] UnboundLocalError when GptOss reasoning specified by @coval3nte in #23054
feat: add triton fused moe config for GLM-4.5-Air-FP8 on B200 by @zixuanzhang226 in #23695
[Feature][Responses API] Support MCP tool in background mode by @wuhang2014 in #23494
fix pynccl reduce_scatter by @youzhedian in #23648
[quantization] use channel scales for w4a8 + misc fixes by @czhu-cohere in #23570
[gpt-oss] Enable unit test for response API harmony integration by @heheda12345 in #23533
[Bugfix] Lazy import gpt_oss_triton_kernels_moe for mxfp4 by @mgoin in #23678
[Docs] Fix math rendering in docs by @hmellor in #23676
[Bugfix][gpt-oss] passing the cache config in gpt-oss by @frank-wei in #23613
[Bugfix]: Qwen3 Coder Tool Parser by @ranpox in #23099
[Core] Asynchronous h2d in merge_multimodal_embeddings via pinned memory. by @huachenheli in #23686
[Model] Add Ernie4.5 VL Model Support by @CSWYF3634076 in #22514
[Frontend] Add --log-error-stack to print stack trace for error response by @heheda12345 in #22960
[Frontend] Optimize beam search performance by limiting concurrency by @heheda12345 in #23599
[Quantization] Expand compressed-tensors MoE matching logic to support NFP4 + FP8 MoEs by @dsikka in #22674
[XPU] Add xpu torch.compile support by @jikunshang in #22609
[CI/Build] Remove redundant LoRA model tests by @jeejeelee in #23706
[Bugfix] fix when config.yaml config value is list parse error by @lengrongfu in #23528
[Core] Use key-only cache for BaseMultiModalProcessor by @DarkLight1337 in #23018
[XPU]fix cuda event used in XPU model runner by @jikunshang in #23708
[CI/Build] Remove redundant register in model init tests by @DarkLight1337 in #23715
[Docs] Fix an admonition important by @windsonsea in #23726
Optimize input preparation for FlashInfer [2/N] by @WoosukKwon in #23174
[Misc] Move CpuGpuBuffer to vllm/v1/utils.py by @WoosukKwon in #23728
[FlashInfer] Cache hyper params in metadata builder by @WoosukKwon in #23732
[CI/Build] Reduce LoRA layer test cases by @jeejeelee in #23721
[XPU] Fix OOM issue for data parallel with Ray backend by @faaany in #22500
[Docs] Fix a 1-2-3 list and style issues in tpu.md by @windsonsea in #23729
[model] Support MiniCPM-V 4.5 by @tc-mb in #23586
[Bugfix] Fix task field initialization when PYTHONOPTIMIZE is enabled by @cndoit18 in #23718
[Misc] Remove unnecessary _send_reconfig_message() in core_client.py by @njhill in #23127
[V1] [Hybrid] Disable prefix caching by default for hybrid or mamba-based models by @tdoublep in #23716
[Model] Explicit default_pooling_type interface by @DarkLight1337 in #23736
Add vLLM Korea Meetup in the README.md and meetups.md by @rebel-hongseok in #23746
Fix pre-commit on main by @hmellor in #23747
[Model] Interface to enable batch-level DP support by @DarkLight1337 in #23733
Only run get_attr_docs if generating help text by @hmellor in #23723
[Feature] Add Hopper DeepGEMM E8M0 for DeepSeekV3.1 scale_fmt by @yewentao256 in #23666
[Model] Enable native HF format InternVL support by @Isotr0py in #23742
[Doc]: upgrade version of crate-ci tool for improved typo detection by @didier-durand in #23755
[LogitsProcs] Deduplicate built-in LP implementation logic by @njhill in #23362
[Docs] Remove in-tree Gaudi install instructions by @hmellor in #23628
[BugFix] Fix topk_softmax assert by @ProExpertProg in #19764
[Model] Merge SupportsMultiModalWithRawInput with SupportsMultiModal by @DarkLight1337 in #23749
[V1] [Hybrid] Enable compile and piecewise CUDA graph for MiniMax-Text models by @tdoublep in #22589
[Docs] Fix warnings in mkdocs build (continued) by @Zerohertz in #23743
ci: Add arm64 docker build to release pipeline by @seemethere in #23210
Disable torch.compile for dynamic rope models in Transformers backend by @hmellor in #23738
[Multimodal] Generate mm_hash based on request metadata when caching is turned off by @ywang96 in #23690
[V1][Mamba] - Enable V1 by default for Mamba Models by @Josephasafg in #23650
DP/EP Support for gpt-oss with deepep-ht comm kernel on SM100 by @zyongye in #23608
[Bugfix] Fix Marlin NVFP4 for modelopt by @mgoin in #23659
[Feature] Add VLLM_DISABLE_PAD_FOR_CUDAGRAPH to Avoid Hang Issue by @yewentao256 in #23595
[Bugfix] Fix for V1 priority scheduling crashes at preemption by @Hanchenli in #23713
Migrate Qwen inputs to TensorSchema by @bbeckca in #23473
[Feature] models: pass layer prefix to replace_linear_class for per-layer quantization routing. Addresses #23239 by @Shrey1306 in #23556
[Perf] Tune configs for triton block fp8 gemm H100/H200 by @mgoin in #23748
Gracefully handle edge cases in harmony utils by @Ithanil in #23155
[CI] make all multi-gpu weight loading tests run nightly by @killershrimp in #23792
Add deprecation warning for lora_extra_vocab_size by @ahengljh in #23635
[Transform] [Quantization] Add transforms to compressed tensors by @kylesayrs in #22486
[CI] enable idefics3 and fuyu-8b test in multimodal test by @ZJY0516 in #23790
[Bugfix] when set offline model running error by @lengrongfu in #23711
[Kernel] cuda kernels for upcoming decode context parallel feature by @youzhedian in #23791
[New Model]: Support GteNewModelForSequenceClassification by @noooop in #23524
[Model] Add PP support and VLM backbone compatability for GPT-OSS by @Isotr0py in #23680
[FIXBUG] Add return_success parameter to moe_wna16_weight_loader function by @JartX in #22797
[Doc]: fix typos in .md files (including those of #23751) by @didier-durand in #23825
[CI/Build][Bugfix] Fix Qwen VL tests on CPU by @bigPYJ1151 in #23818
[BugFix][Spec Decode] Use float64 for uniform_probs by @WoosukKwon in #23803
[Model] [gpt-oss] fix gpt-oss pp support by @ZJY0516 in #23815
[Doc]: fix typos in Python scripts by @didier-durand in #23828
[Bugfix] Fix benchmark_moe.py for blockwise fp8. by @crischeng in #23823
[CI] Fix linting error on main by @tdoublep in #23835
[Model][gpt-oss] Support DP+EP for GPT-OSS with FlashInfer trtllm-gen MoE by @nvpohanh in #23819
[Bugfix] Add fake mode around passes by @angelayi in #23349
[ci] breaks down V1 Test into 3 groups of approx 30 minutes runtime by @jeanschmidt in #23757
Add scale_config.yml file for Meta autoscalers for GH Actions by @jeanschmidt in #23840
Migrate Llama4ImagePatchInputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/22021
[ROCm][Aiter] Add triton fp8 bmm kernel for mla by @divakar-amd in https://github.com/vllm-project/vllm/pull/23264
[bugfix] [spec-decoding] fix data race in sample_recovered_tokens_kernel (vLLM v1) by @He-Jingkai in https://github.com/vllm-project/vllm/pull/23829
[NVIDIA] Support SiluMul + NVFP4 quant fusion by @elvischenv in https://github.com/vllm-project/vllm/pull/23671
chore: build release image by default by @simon-mo in https://github.com/vllm-project/vllm/pull/23852
[BugFix][FlashInfer] Fix potential race condition for paged_kv_indptr_cpu by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23737
[V1] Enable prefill optimization for Gemma3n by @sarckk in https://github.com/vllm-project/vllm/pull/22628
[Log] Use Debug Once for DeepGEMM E8M0 When not Enabled by @yewentao256 in https://github.com/vllm-project/vllm/pull/23858
[V0 Deprecation] Remove V0 Samplers test by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23862
[XPU] support data parallel for MoE models on XPU by @chaojun-zhang in https://github.com/vllm-project/vllm/pull/22887
[Models] Improve iteration over layers by @lgeiger in https://github.com/vllm-project/vllm/pull/19497
[ROCm][Fix] Fix rocm build caused by #23791 by @charlifu in https://github.com/vllm-project/vllm/pull/23847
[tests] Improve speed and reliability of test_transcription_api_correctness by @russellb in https://github.com/vllm-project/vllm/pull/23854
[Bugfix] Use ReplicatedLinear for SequenceClassification head by @Isotr0py in https://github.com/vllm-project/vllm/pull/23836
[BugFix][AMD][Deepseek] fix a dtype mismatch error for deepseek running on AMD by @KingsleyZhang123 in https://github.com/vllm-project/vllm/pull/23864
[Platform] import activation_quant_fusion for CUDA only by @wangxiyuan in https://github.com/vllm-project/vllm/pull/23882
Fix(async): Add support for truncate_prompt_tokens in AsyncLLM by @oneraghavan in https://github.com/vllm-project/vllm/pull/23800
[CI/Build] Clean up LoRA test by @jeejeelee in https://github.com/vllm-project/vllm/pull/23890
[mrope][Qwen2-VL] Fix edge case where getting index of image/video token can potentially throw in default vl mrope implementation. by @huachenheli in https://github.com/vllm-project/vllm/pull/23895
[Misc] Fix warnings for mistral model by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23552
Better errors for Transformers backend missing features by @hmellor in https://github.com/vllm-project/vllm/pull/23759
[V0 Deprecation] Remove pooling model support in V0 by @maxdebayser in https://github.com/vllm-project/vllm/pull/23434
[CPU] Enable data parallel for CPU backend by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/23903
[Performance] V1 Classify Models E2E Performance Optimization by @noooop in https://github.com/vllm-project/vllm/pull/23541
[Multimodal] Consolidate mm inputs into MultiModalFeatureSpec by @sfeng33 in https://github.com/vllm-project/vllm/pull/23779
Update PyTorch to 2.8.0 by @huydhn in https://github.com/vllm-project/vllm/pull/20358
Adds json_count_leaves utility function by @aditchawdhary in https://github.com/vllm-project/vllm/pull/23899
[MODEL] Apertus and XIELU by @EduardDurech in https://github.com/vllm-project/vllm/pull/23068
[Models] Use in-place adds in Idefics2Vision by @lgeiger in https://github.com/vllm-project/vllm/pull/23932
[BugFix] Async scheduling and PP compatibility with DP by @njhill in https://github.com/vllm-project/vllm/pull/23770
[CI] Add aiter to matching list of issue auto labeller for rocm tag by @vllmellm in https://github.com/vllm-project/vllm/pull/23942
[BUGFIX ] fix undefined silu_and_mul_nvfp4_quant by @youzhedian in https://github.com/vllm-project/vllm/pull/23929
[RL][BugFix] Fix missing tokenizer error for token-in-token-out by @22quinn in https://github.com/vllm-project/vllm/pull/23904
Tuned H100/H200 triton fp8 block configs for fused_qkv_a_proj by @mgoin in https://github.com/vllm-project/vllm/pull/23939
[Docs] [V1] [Hybrid] Add new documentation re: contributing mamba-based models by @tdoublep in https://github.com/vllm-project/vllm/pull/23824
Revert gemma3n fast prefill changes by @sarckk in https://github.com/vllm-project/vllm/pull/23897
[Misc] Make download_weights_from_hf more reliable by @hmellor in https://github.com/vllm-project/vllm/pull/23863
[CI] Fix unavailable image remote URL by @ywang96 in https://github.com/vllm-project/vllm/pull/23966
[Bugfix] Fix --config arg expansion called from api_server.py by @dubejf in https://github.com/vllm-project/vllm/pull/23944
Add routed_scaling_factor to MoE grouped topk by @xyang16 in https://github.com/vllm-project/vllm/pull/23123
[CI] Move testing image from remote URL to S3 by @ywang96 in https://github.com/vllm-project/vllm/pull/23980
[CI] Fix broken compile tests due to unsupported SiluMul+Nvfp4Quant fusion by @sarckk in https://github.com/vllm-project/vllm/pull/23973
[Core] Cleanup TPU model runner for MM by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23894
[V1] [Hybrid] Move MiniMaxLinearAttention into layers/mamba by @tdoublep in https://github.com/vllm-project/vllm/pull/23831
[Bugfix] Fix test_lora_resolvers.py by @jeejeelee in https://github.com/vllm-project/vllm/pull/23984
[UT] fix unify_kv_cache_configs when kv cache config needs sort by @andyxning in https://github.com/vllm-project/vllm/pull/23843
[Model] Enable encoder DP for MiniCPM-V by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23948
Add LoRA support for DeepSeek models (V2, V3, R1-0528) by @sadeghja1070 in https://github.com/vllm-project/vllm/pull/23971
[Misc] add reorder_batch AttentionMetadataBuilder by @andyxning in https://github.com/vllm-project/vllm/pull/23798
[Refactor] refactor freezing_value/cuda_event initialize outside try finally by @andyxning in https://github.com/vllm-project/vllm/pull/23758
[Misc] enhance type hint for rearrange return value by @andyxning in https://github.com/vllm-project/vllm/pull/23519
[LoRA] Much faster startup when LoRA is enabled by @andylolu2 in https://github.com/vllm-project/vllm/pull/23777
Fix wrong truncate_prompt_tokens type hint by @gmarinho2 in https://github.com/vllm-project/vllm/pull/22761
[Core][Multimodal] Allow passing multi_modal_uuids as multimodal identifiers. by @ywang96 in https://github.com/vllm-project/vllm/pull/23394
[Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24001
vllm fix check on max vocab size by @xw285cornell in https://github.com/vllm-project/vllm/pull/22471
[Minor] Fix some random typos in comments by @njhill in https://github.com/vllm-project/vllm/pull/24009
v1: Support KV events from connectors by @orozery in https://github.com/vllm-project/vllm/pull/19737
[BUGFIX] GPTQ quantization compatibility for Qwen3 MOE models (AutoGPTQ and AutoRound-GPTQ) by @JartX in https://github.com/vllm-project/vllm/pull/23994
[Misc] Avoid redundant copy for encoder-only models by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24012
Fix the bug related to loading GPTP INT3 weights. by @Jun-Howie in https://github.com/vllm-project/vllm/pull/23328
[Misc] Move fast prefill logic to separate method by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24013
[CI/Build] Improve Tensor Schema tests speed by avoid engine core initialization by @Isotr0py in https://github.com/vllm-project/vllm/pull/23357
[Misc] refactor code by import as for torch._inductor.config by @andyxning in https://github.com/vllm-project/vllm/pull/23677
Migrate Phi4 inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23471
[Misc] IO Processor plugins for pooling models by @christian-pinto in https://github.com/vllm-project/vllm/pull/22820
[Bugfix] Add support for <tool_call> format in streaming mode for XLAM Tool Parser by @DevonPeroutky in https://github.com/vllm-project/vllm/pull/22769
[Misc] add hash_function doc string by @andyxning in https://github.com/vllm-project/vllm/pull/24014
[Misc] Enable V1 FP16 inference on pre-Ampere GPUs by @Isotr0py in https://github.com/vllm-project/vllm/pull/24022
[Frontend] Update the warning log when using VLLM_ALLOW_LONG_MAX_MODEL_LEN by @noooop in https://github.com/vllm-project/vllm/pull/20904
[Kernel] Update DeepGEMM to latest commit by @jeejeelee in https://github.com/vllm-project/vllm/pull/23915
[Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24026
[Frontend] Gemma3n audio transcriptions/translations endpoint by @NickLucche in https://github.com/vllm-project/vllm/pull/23735
[Doc]: Fix CPU install docs: force torch-backend=cpu to avoid GPU torchvision errors by @yankay in https://github.com/vllm-project/vllm/pull/24033
[Model]: support KeyeVL-1_5-8B by @Kwai-Keye in https://github.com/vllm-project/vllm/pull/23838
Document multi-proc method selection for profiling by @hypdeb in https://github.com/vllm-project/vllm/pull/23802
[Misc] Minor code simplification for spec decode by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24053
[docs][misc] IOProcessor plugins fixes by @christian-pinto in https://github.com/vllm-project/vllm/pull/24046
[Model] Support DP for ViT on Kimi-VL-A3B-Thinking-2506 by @david6666666 in https://github.com/vllm-project/vllm/pull/23817
[Chore][V0 Deprecation] Move LogProb to a separate file by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24055
[bugfix]fix MTP hidden states by @luccafong in https://github.com/vllm-project/vllm/pull/24056
[Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24042
[V1][Mamba1] - FP32 SSM Kernel Support by @Josephasafg in https://github.com/vllm-project/vllm/pull/23506
[Bugfix] Fix the issue that Blip2ForConditionalGeneration' object has… by @DamonJiang777 in https://github.com/vllm-project/vllm/pull/24028
Remove runtime checks based on pooling params by @maxdebayser in https://github.com/vllm-project/vllm/pull/24051
Migrate OvisImagePatchInputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/22024
[XPU][Feature] fp8 online quantization support for XPU by @yma11 in https://github.com/vllm-project/vllm/pull/23148
Migrate Interns1 inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23510
[Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24077
[Model] Support dp on ViT on GLM-4.5V by @david6666666 in https://github.com/vllm-project/vllm/pull/23168
[CI]: reduce HTTP calls inside entrypoints openai tests by @AzizCode92 in https://github.com/vllm-project/vllm/pull/23646
correct LWS deployment yaml by @cberge908 in https://github.com/vllm-project/vllm/pull/23104
[Gemma3n] Fix audio batching by @NickLucche in https://github.com/vllm-project/vllm/pull/24052
[BugFix] Fix EXAONE4 rotary embeddings by @lkm2835 in https://github.com/vllm-project/vllm/pull/23918
[Model] Classification models support logit_bias / sigmoid_normalize by @noooop in https://github.com/vllm-project/vllm/pull/24031
[CI Failure] Skip failing nvfp4 silu test by @mgoin in https://github.com/vllm-project/vllm/pull/23959
[docs] add SYS_NICE cap & security-opt for docker/k8s by @panpan0000 in https://github.com/vllm-project/vllm/pull/24017
[Benchmark] Add support for local hf dataset path in benchmark by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23999
[Bugfix] Fix transform_config parsing in Compressed Tensors by @kylesayrs in https://github.com/vllm-project/vllm/pull/23945
Run ruff format on a few files. by @huachenheli in https://github.com/vllm-project/vllm/pull/24075
[Bugfix] Fix packed_factor missing attribute error by @kyuyeunk in https://github.com/vllm-project/vllm/pull/23902
[Metrics] Deprecate TPOT in favor of ITL by @markmc in https://github.com/vllm-project/vllm/pull/24110
Fix weights loading for Apertus by @nathanrchn in https://github.com/vllm-project/vllm/pull/24100
[Log] Only Print Profiler Results on Rank 0 by @yewentao256 in https://github.com/vllm-project/vllm/pull/23370
[CI] Enable all hf transformers baselines in test_hybrid by @tdoublep in https://github.com/vllm-project/vllm/pull/23936
[AMD][Kernel][Bugfix] Cast offsets tensor bn to tl.int64 to avoid GPU segfault by @rasmith in https://github.com/vllm-project/vllm/pull/23692
[Bug] R1 Accuracy: Fix routed_scaling_factor Double Mul Issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/24119
[CI/Build] Disable SiluMul NVFP4 quant fusion tests by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/24121
[XPU] Fix the bug of LoRA logits on the XPU platform by @chaojun-zhang in https://github.com/vllm-project/vllm/pull/24081
Update release pipeline post PyTorch 2.8.0 update by @youkaichao in https://github.com/vllm-project/vllm/pull/24073
Upgrade xgrammar to 0.1.23 by @russellb in https://github.com/vllm-project/vllm/pull/22988
[V1] Wrapper which plumbs request-level logits processors into vLLM batch-level logits processing by @afeldman-nm in https://github.com/vllm-project/vllm/pull/23656
fix some typos by @co63oc in https://github.com/vllm-project/vllm/pull/24071
[Compile] Fix Compile Warning for w4a8_mm_entry.cu by @yewentao256 in https://github.com/vllm-project/vllm/pull/23660
[Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24093
[Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24115
[Misc] Add check for dual_chunk_attention by @ZJY0516 in https://github.com/vllm-project/vllm/pull/24070
[BugFix] Fix routed_scaling_factor double mul for dots1 and glm4 MoE models by @sarckk in https://github.com/vllm-project/vllm/pull/24132
[distributed][rl] remove nccl cumem env var override by @youkaichao in https://github.com/vllm-project/vllm/pull/24141
[Nixl] Heterogeneous TP support FlashInfer by @NickLucche in https://github.com/vllm-project/vllm/pull/20189
[CI/Build] Serve images used by multimodal tests through local HTTP Server by @divyanshsinghvi in https://github.com/vllm-project/vllm/pull/23907
[Misc] Clean up deadcode for legacy processing pipeline by @Isotr0py in https://github.com/vllm-project/vllm/pull/24153
[CI] Accelerate mteb test by setting SentenceTransformers mteb score to a constant by @noooop in https://github.com/vllm-project/vllm/pull/24088
Support add_generation_prompt in embeddings endpoint with chat request by @biba10 in https://github.com/vllm-project/vllm/pull/23931
Fix MiniMax attention module prefix and remove useless code by @qscqesze in https://github.com/vllm-project/vllm/pull/23982
FIX: Add libnuma-dev to Dockerfile for dev stage by @dongbo910220 in https://github.com/vllm-project/vllm/pull/20388
[Bugfix] Fixing division by zero in triton_attn if query_heads/kv_heads > 16 by @bringlein in https://github.com/vllm-project/vllm/pull/23424
[V1] v1 engine + full CUDA graph support for PLaMo2 by @nopperl in https://github.com/vllm-project/vllm/pull/23998
[Kernels] Overlap shared experts with send/recv by @bnellnm in https://github.com/vllm-project/vllm/pull/23273
Migrate whisper inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23505
[Attention] Blackwell FP8 MLA support with CUTLASS_MLA backend by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/23289
[Feature][P/D]: Optimize NIXL Connector xfer Launch by @david6666666 in https://github.com/vllm-project/vllm/pull/23887
[Bugfix][DP] DP distribution does not require ray[default] by @kebe7jun in https://github.com/vllm-project/vllm/pull/23822
[Feature][gpt-oss] Add support for num_cached_tokens and num_reasoning_tokens tracking by @NagyGeorge in https://github.com/vllm-project/vllm/pull/23460
Remove deprecated PyNcclConnector by @panpan0000 in https://github.com/vllm-project/vllm/pull/24151
[Feature][Responses API]Support MCP tools with streaming mode + background mode by @wuhang2014 in https://github.com/vllm-project/vllm/pull/23927
[Kernel][Bugfix] Fix grouped topk cu by @mayuyuace in https://github.com/vllm-project/vllm/pull/24146
[Refactor] Introduce basic Renderer for completion-style request by @sfeng33 in https://github.com/vllm-project/vllm/pull/24010
Migrate ultravox inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23503
[CPU] Refactor CPU unquantized linear by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/24150
[Misc] Enhance output readability of helper script by @wdhongtw in https://github.com/vllm-project/vllm/pull/24214
[Model] Add MiDashengLM model support by @bingchen-mi in https://github.com/vllm-project/vllm/pull/23652
[Core][Model] Terratorch backend integration by @mgazz in https://github.com/vllm-project/vllm/pull/23513
Improve flexibility of auto_tune.sh execution. by @anthonsu in https://github.com/vllm-project/vllm/pull/23766
[Attention][Platform] Refactor MLA to support Custom Op by @whx-sjtu in https://github.com/vllm-project/vllm/pull/23332
[Bugfix] Fix Incremental Detokenization with tokenizers == 0.22.0 by @faaany in https://github.com/vllm-project/vllm/pull/24159
[Attention] FlashAttn MLA by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/14258
[Hardware][Apple-CPU] Disable OneDNN build for Apple Silicon by @ignaciosica in https://github.com/vllm-project/vllm/pull/24200
[Feature][Response API] Add streaming support for non-harmony by @kebe7jun in https://github.com/vllm-project/vllm/pull/23741
[Doc] Update vLLM Singapore Meetup info by @tjtanaa in https://github.com/vllm-project/vllm/pull/24234
[Model] Add pp support for hunyuan by @ZJY0516 in https://github.com/vllm-project/vllm/pull/24212
Use hidden_size_per_head as head_size fallback by @nopperl in https://github.com/vllm-project/vllm/pull/24221
[XPU] support Triton Attention backend on Intel GPU by @jikunshang in https://github.com/vllm-project/vllm/pull/24149
[LoRA]: Add lora support to qwen-2.5-omni by @pratapyash in https://github.com/vllm-project/vllm/pull/24231
[Misc] Removed force_fp8_e4m3fnuz from FP8LinearOp by @nvjullin in https://github.com/vllm-project/vllm/pull/23725
[Perf] Freeze core engine proc heap after init by @njhill in https://github.com/vllm-project/vllm/pull/24008
[Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24173
[Misc] Slight improve deepgemm print by @jeejeelee in https://github.com/vllm-project/vllm/pull/24085
Upgrade FlashInfer to v0.3.0 by @nvpohanh in https://github.com/vllm-project/vllm/pull/24086
QWEN3 Coder Fused MoE kernels Optimization configs by @samanamp in https://github.com/vllm-project/vllm/pull/24266
[Misc] Have AsyncLLM custom_stat_loggers extend default logger list by @eicherseiji in https://github.com/vllm-project/vllm/pull/20952
[Bugfix][Misc] Fix silu_and_mul_nvfp4_quant issue and extract common utils for nvfp4 kernel source files by @elvischenv in https://github.com/vllm-project/vllm/pull/23727
[CI/Build] Reduce the number of redundant cases to test for LoRA by @zhuohan123 in https://github.com/vllm-project/vllm/pull/24276
[Frontend] Skip unnecessary detokenization when token_id is requested by @NickLucche in https://github.com/vllm-project/vllm/pull/24236
[gpt-oss] tool parser supports for /chat/completions [1/n] by @aarnphm in https://github.com/vllm-project/vllm/pull/22386
[XPU][P/D] Add XPU support in NixlConnector by @zhenwei-intel in https://github.com/vllm-project/vllm/pull/22436
Adding int4 and int8 models for CPU benchmarking by @louie-tsai in https://github.com/vllm-project/vllm/pull/23709
[docs] add shenzhen meetup by @youkaichao in https://github.com/vllm-project/vllm/pull/24326
[gpt-oss][Bugfix]Fix streamableparser for missing handling of certain token_ids by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24306
[Bugfix] Fix silu_mul+quant fusion test by @elvischenv in https://github.com/vllm-project/vllm/pull/24341
[RFC] allow cancelation after shutdown in blocking collective_rpc by @842974287 in https://github.com/vllm-project/vllm/pull/23390
[CI] Add timeouts to tests by @rafvasq in https://github.com/vllm-project/vllm/pull/24260
[Perf][V1] Fully overlap model execution by @benchislett in https://github.com/vllm-project/vllm/pull/23569
Add @22quinn as code reviewer for RL related components by @22quinn in https://github.com/vllm-project/vllm/pull/24346
[Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24294
[KV Sharing] Raise error if using eagle with fast prefill by @sarckk in https://github.com/vllm-project/vllm/pull/24350
[Feature] Support Decode Context Parallel (DCP) for MLA by @youzhedian in https://github.com/vllm-project/vllm/pull/23734
[Bugfix] Catch and log invalid token ids in detokenizer by @njhill in https://github.com/vllm-project/vllm/pull/24351
[Core] Allow disabling TP sharding for parallel Linear layer by @Isotr0py in https://github.com/vllm-project/vllm/pull/23024
[New Model]: google/embeddinggemma-300m by @noooop in https://github.com/vllm-project/vllm/pull/24318
refactor: Turn GPUModelRunner.inputs_embeds to a CpuGpuBuffer by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/24345
[Multimodal] Improve max video embedding length estimation in V1 by @ywang96 in https://github.com/vllm-project/vllm/pull/24312
[CI] Disable flaky structured output test from CI by @ywang96 in https://github.com/vllm-project/vllm/pull/24366
Add @benchislett to codeowner for spec decode and structured outputs by @benchislett in https://github.com/vllm-project/vllm/pull/24362
[Bugfix] Avoid uninitialized usage of azp_val when AZP is false. by @mohankku in https://github.com/vllm-project/vllm/pull/24335
[Bugfix] Fix broken deepseek fp8 TP weights loading by @Isotr0py in https://github.com/vllm-project/vllm/pull/24367
[Bugfix] Fix test_mixtral_moe by @jeejeelee in https://github.com/vllm-project/vllm/pull/24371
Lora bias(enable_lora_bias) deprecate warning by @ashwin-phadke in https://github.com/vllm-project/vllm/pull/24339
[Fix] [gpt-oss] fix non-tool calling path for chat completion by @aarnphm in https://github.com/vllm-project/vllm/pull/24324
[Frontend][Responses API] Support reporting tool output tokens and fix reasoning token count by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24285
[Bugfix] Fix unstable silu_mul+nvfp4 quant fusion test by @elvischenv in https://github.com/vllm-project/vllm/pull/24370
break execute_model in gpu_model_runner into sub-functions for custom scopes by @bangshengtang in https://github.com/vllm-project/vllm/pull/24265
[V0 deprecation] Deprecate V0 Neuron backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/21159
[attention][DCP] use AttentionImpl.need_to_return_lse_for_decode by @youkaichao in https://github.com/vllm-project/vllm/pull/24372
Migrate Qwen2 inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23475
[CI][Fix] deterministic seed for flaky CI runs on structured outputs by @aarnphm in https://github.com/vllm-project/vllm/pull/24380
[Benchmark] add benchmark for custom activation op by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23908
QWEN3 Thinking Fused MoE kernels Optimization configs by @samanamp in https://github.com/vllm-project/vllm/pull/24330
[Misc] collect flashinfer version in collect_env.py by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24378
[Bugfix] Fix Qwen3-coder moe tuned config by @jeejeelee in https://github.com/vllm-project/vllm/pull/24072
[TPU] Remove TopKTopPSampler dependency for TPU sampler by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24391
Add renderer-based prompt processing for embedding and classification endpoints by @sfeng33 in https://github.com/vllm-project/vllm/pull/24356
Skip MM Encoder for non-first PP ranks by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24387
Add @luccafong to codeowner for spec decode by @luccafong in https://github.com/vllm-project/vllm/pull/24397
[Kernel] Support decode context parallelism on Blackwell with CUTLASS MLA by @minosfuture in https://github.com/vllm-project/vllm/pull/24385
[xpu] upgrade ipex/python3.12 for xpu by @yma11 in https://github.com/vllm-project/vllm/pull/23830
[Sampler] Support returning all prompt logprobs by @charlotte12l in https://github.com/vllm-project/vllm/pull/23868
[CI/Build] Disable flaky test_structured_output tests by @22quinn in https://github.com/vllm-project/vllm/pull/24404
[CI/Build] Fix local image inputs in test_pixtral.py by @huachenheli in https://github.com/vllm-project/vllm/pull/24401
[Doc] Fix UTF-8 encoding issues in documentation generation on Windows by @alhridoy in https://github.com/vllm-project/vllm/pull/24361
[P/D] Add a shutdown method to the Connector API by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/22699
[Model] Remove unnecessary CUDA sync of GLM-4.1V image and video preprocess by @what-in-the-nim in https://github.com/vllm-project/vllm/pull/24332
[Model] Remove unnecessary CUDA sync of Qwen2VL image and video preprocess by @what-in-the-nim in https://github.com/vllm-project/vllm/pull/24334
[gpt-oss][Responses API] Fix the function call id format by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24409
[Docs] Fix a tip indentation and typo by @windsonsea in https://github.com/vllm-project/vllm/pull/24419
[Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24417
[Doc] Fix issues in integrations/llamastack.md by @windsonsea in https://github.com/vllm-project/vllm/pull/24428
[Bugfix] Fix get_quant_config when using modelscope by @Potabk in https://github.com/vllm-project/vllm/pull/24421
[Bugfix] Fix mamba2 prefill chunking by @tomeras91 in https://github.com/vllm-project/vllm/pull/23279
[Misc] Terratorch related fixes by @christian-pinto in https://github.com/vllm-project/vllm/pull/24337
Move KVEventsConfig from config/__init__.py to config/kv_events.py by @hmellor in https://github.com/vllm-project/vllm/pull/24433
[Frontend] User-provided uuids for medias in chat. (RFC #22044) by @huachenheli in https://github.com/vllm-project/vllm/pull/23449
[Docs] Move feature compatibility tables to README by @hmellor in https://github.com/vllm-project/vllm/pull/24431
[Doc]: fix 2 hyperlinks leading to Ray site after they changed Ray's doc structure by @didier-durand in https://github.com/vllm-project/vllm/pull/24438
[Docs]add eplb_config param use docs by @lengrongfu in https://github.com/vllm-project/vllm/pull/24213
[Model] Enable BNB support for qwen2_5_omni_thinker by @jeejeelee in https://github.com/vllm-project/vllm/pull/24420
[Spec Decode][Benchmark] Add Spec Bench Dataset for benchmarking by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/23563
[Spec Decode][Benchmark] Add Blitzedit dataset by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/23605
[Model] Remove quantized mixtral by @jeejeelee in https://github.com/vllm-project/vllm/pull/24437
[CI] Enable encoder model compilation test by @ZJY0516 in https://github.com/vllm-project/vllm/pull/24442
[Model loader]: support multi-thread model weight loading by @BraveY in https://github.com/vllm-project/vllm/pull/23928
[Spec Decode] Fix offline spec_decode.py by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/24257
[Attention] FlashAttention MLA cudagraph support by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/23958
[Bugfix] Disable the statslogger if the api_server_count is greater than 1 by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/22227
[Hardware][IBM Z] Fix Outlines Core issue for s390x by @R3hankhan123 in https://github.com/vllm-project/vllm/pull/24034
[CI] Add nightly multiarch manifests to dockerhub by @csahithi in https://github.com/vllm-project/vllm/pull/24102
Update reviewers for modelopt related files by @Edwardf0t1 in https://github.com/vllm-project/vllm/pull/24468
[Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and EP MoE by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/24134
[gpt-oss] Harmony changes with container tool support by @morgendave in https://github.com/vllm-project/vllm/pull/23386
Bump actions/setup-python from 5.4.0 to 6.0.0 by @dependabot[bot] in https://github.com/vllm-project/vllm/pull/24414
[doc] update vllm serve cli args documentation by @cjackal in https://github.com/vllm-project/vllm/pull/24329
Bump actions/stale from 9.1.0 to 10.0.0 by @dependabot[bot] in https://github.com/vllm-project/vllm/pull/24412
Bump actions/github-script from 7.0.1 to 8.0.0 by @dependabot[bot] in https://github.com/vllm-project/vllm/pull/24413
Move KVTransferConfig from config/__init__.py to config/kv_transfer.py by @hmellor in https://github.com/vllm-project/vllm/pull/24434
[BugFix][Model] Fix Ernie4.5-VL hanging on long inputs by @CSWYF3634076 in https://github.com/vllm-project/vllm/pull/24074
[Flashinfer] Support Flashinfer TRTLLM FP8-qkv BF16/FP16-out Attention Kernel by @elvischenv in https://github.com/vllm-project/vllm/pull/23647
[Core] Use sha256 bytes instead of BlockHash to reduce GC overhead by @linzebing in https://github.com/vllm-project/vllm/pull/23673
Add data_parallel_size to VllmConfig string representation by @Prowindy in https://github.com/vllm-project/vllm/pull/24298
[Bugfix] Fix Apertus HF repo name by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/24447
[Misc] Improve Worker process title and logging prefix by @22quinn in https://github.com/vllm-project/vllm/pull/22205
[Doc] mention fpdb for multiprocess breakpoints by @mickaelseznec in https://github.com/vllm-project/vllm/pull/24452
[Misc] Support bench serve long context by @minosfuture in https://github.com/vllm-project/vllm/pull/24373
[Doc]: fixing typos to improve docs by @didier-durand in https://github.com/vllm-project/vllm/pull/24480
[Performance][MM] Building the inverse permutation in O(n) time in Qwen2_5_VisionTransformer by @david6666666 in https://github.com/vllm-project/vllm/pull/24443
[Misc] Add claude settings to gitignore by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24492
[Misc] Add Codex settings to gitignore by @ywang96 in https://github.com/vllm-project/vllm/pull/24493
[gpt-oss] Validate gpt-oss python tool during initialization by @heheda12345 in https://github.com/vllm-project/vllm/pull/23856
[RL] fast weight update with zmq + ipc handles by @weixiao-huang in https://github.com/vllm-project/vllm/pull/24295
[CI/Build][Doc] Fully deprecate old bench scripts for serving / throughput / latency by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24411
[Compilation][WideEP] Enable Piecewise CUDAGraph for DeepEPHT by @yewentao256 in https://github.com/vllm-project/vllm/pull/24123
[Model] Systematic support for fp32 head, pooling models part by @noooop in https://github.com/vllm-project/vllm/pull/23810
[Bugfix] Handle the edge case in detokenizer where processed tokens contain both stop str and eos token by @dtransposed in https://github.com/vllm-project/vllm/pull/23938
[Core] Run garbage collector after CUDA graph capture to fix throughput regression by @micah-wil in https://github.com/vllm-project/vllm/pull/24128
[Kernels] Add Flash Linear Attention Kernels by @youkaichao in https://github.com/vllm-project/vllm/pull/24518
[ROCm][CI/Build] Sync ROCm dockerfiles with the ROCm fork by @gshtras in https://github.com/vllm-project/vllm/pull/24279
[Bugfix] Fix hidden_size for multimodal classification model by @jeejeelee in https://github.com/vllm-project/vllm/pull/24501
Extend renderer with embedding support and integrate completion endpoint by @sfeng33 in https://github.com/vllm-project/vllm/pull/24405
[Misc] bump outlines_core to fix the version conflicts with outlines >= 1.2.0 by @serihiro in https://github.com/vllm-project/vllm/pull/24368
[Docs] Gemma3n transcriptions endpoint support by @NickLucche in https://github.com/vllm-project/vllm/pull/24512
[TPU] Fix tpu structured decoding in mixed batches by @Chenyaaang in https://github.com/vllm-project/vllm/pull/24458
[CI] execute all piecewise compilation tests together by @ZJY0516 in https://github.com/vllm-project/vllm/pull/24502
[Feature] Disallow FlashMLA on Blackwell by @yewentao256 in https://github.com/vllm-project/vllm/pull/24521
[Log] Use a relative path in debug-level logs to distinguish files with identical names by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23846
[Benchmark] Update bench doc with mtbench, blazedit, spec bench by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/24450
[Benchmark] Add option to skip oversampling in benchmark by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/24457
[ROCm][Feature] Enable Pipeline Parallelism with Ray Compiled Graph on ROCm by @charlifu in https://github.com/vllm-project/vllm/pull/24275
[Bugfix] Improve EPLB config validation error message by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/24524
[Bugfix] Fix for 24530. Fix naive all2all shared expert overlap. by @bnellnm in https://github.com/vllm-project/vllm/pull/24538
[Perf] Convert np array to torch tensor to index into block table for attn chunking by @sarckk in https://github.com/vllm-project/vllm/pull/24474
Add @heheda12345 to CODEOWNERS of KVCacheManager related code by @heheda12345 in https://github.com/vllm-project/vllm/pull/24546
[CI] Retry flaky fp8 cutlass mla tests by @njhill in https://github.com/vllm-project/vllm/pull/24536
[Hardware][Apple-CPU] Enable native bfloat16 on Apple Silicon (M2 and later) by @ignaciosica in https://github.com/vllm-project/vllm/pull/24129
[BugFix] Fix async core engine client finalizer by @njhill in https://github.com/vllm-project/vllm/pull/24540
[CI] Adjust threshold for flaky ngram spec decoding test by @njhill in https://github.com/vllm-project/vllm/pull/24528
[KV Connector] More async support for get_num_new_matched_tokens by @ApostaC in https://github.com/vllm-project/vllm/pull/23620
[P/D] MultiConnector supports shutdown by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24425
[BugFix][Spec Decode] Fix out-of-range index triggered by eagle3; re-enable test for LlamaForCausalLMEagle3 by @wwl2755 in https://github.com/vllm-project/vllm/pull/24392
[gpt-oss] Cache permute indices for faster MXFP4 MoE layer loading by @frank-wei in https://github.com/vllm-project/vllm/pull/24154
[Core] Simplify and unify mm uuid handling & auto-generated mm hash overrides processing. by @huachenheli in https://github.com/vllm-project/vllm/pull/24271
[Bugfix] Update Run:AI Model Streamer Loading Integration by @pwschuurman in https://github.com/vllm-project/vllm/pull/23845
[Docs] Enable relative links in examples to function when rendered in the docs by @hmellor in https://github.com/vllm-project/vllm/pull/24041
[docs] promo pytorch conf and ray summit by @simon-mo in https://github.com/vllm-project/vllm/pull/24562
[Bugfix] Guard _may_reorder_batch for encoder-only models on CPU (#24319) by @comsky in https://github.com/vllm-project/vllm/pull/24348
Consolidate rendering parameters into RenderConfig dataclass by @sfeng33 in https://github.com/vllm-project/vllm/pull/24543
[Model] Limit CPU threads for image transformations in InternVL to reduce cpu contention. by @li-jinpeng in https://github.com/vllm-project/vllm/pull/24519
[Attention] add DCP support for FLASH_ATTN_MLA backend by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/24453
[ROCm][Bugfix] Fix Aiter RMSNorm by @vllmellm in https://github.com/vllm-project/vllm/pull/23412
[Docs] Improve organisation of API Reference nav by @hmellor in https://github.com/vllm-project/vllm/pull/24569
[Docs] Document the extra memory footprint overhead when using EPLB by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/24537
Support for NemotronH Nano VLM by @danielafrimi in https://github.com/vllm-project/vllm/pull/23644
Feature/vit attention unification# 23880 by @baonudesifeizhai in https://github.com/vllm-project/vllm/pull/23978
[LoRA]: Add LoRA support to Mistral's Voxtral models by @pratapyash in https://github.com/vllm-project/vllm/pull/24517
Move LoadConfig from config/__init__.py to config/load.py by @hmellor in https://github.com/vllm-project/vllm/pull/24566
[BugFix][Multi Modal] Fix TensorSchema shape mismatch in Molmo by @wwl2755 in https://github.com/vllm-project/vllm/pull/24559
[BugFix][easy] Fix flaky test test_gpt_oss_multi_turn_chat by @lacora in https://github.com/vllm-project/vllm/pull/24549
[BugFix] Ensure integrity of reused CPU tensors during async scheduling by @njhill in https://github.com/vllm-project/vllm/pull/24527
[CI/Build] split true unit tests to Entrypoints Unit Tests by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24418
[rocm] enable torchao quantization for rocm by @draftbk in https://github.com/vllm-project/vllm/pull/24400
[CI] Add PPL test for generation models by @noooop in https://github.com/vllm-project/vllm/pull/24485
[CI/Build] bump timm dependency by @dtrifiro in https://github.com/vllm-project/vllm/pull/24189
fix some typos by @co63oc in https://github.com/vllm-project/vllm/pull/24167
Fix Auto_Round Quatization Loading on SM75 and Lower GPUs by @RoadToNowhereX in https://github.com/vllm-project/vllm/pull/24217
[Docs] Fix warnings in mkdocs build (continued) by @Zerohertz in https://github.com/vllm-project/vllm/pull/24092
[BugFix] python collect_env.py and vllm collect-env compatibility with uv venv by @yankay in https://github.com/vllm-project/vllm/pull/24066
[Platform] Custom ops support for LMhead and LogitsProcessor by @zzhx1 in https://github.com/vllm-project/vllm/pull/23564
[CI] Fix tensorizer test assertion by @pwschuurman in https://github.com/vllm-project/vllm/pull/24545
[Core] Split LoRA layers by @jeejeelee in https://github.com/vllm-project/vllm/pull/24574
[Doc] Add documentation for GLM-4.5 series models: tool-calling and reasoning parser by @WangErXiao in https://github.com/vllm-project/vllm/pull/24589
[Logging] allow config logging stream by @842974287 in https://github.com/vllm-project/vllm/pull/24336
[Bugfix] fix modelopt exclude_modules name mapping by @tomeras91 in https://github.com/vllm-project/vllm/pull/24178
[Bugfix] Fix DeepEP config for DP4TP4 by @minosfuture in https://github.com/vllm-project/vllm/pull/23619
[Core] Support configuration parsing plugin by @charlotte12l in https://github.com/vllm-project/vllm/pull/24277
[Misc] update log level debug to warning when process port is used by by @lengrongfu in https://github.com/vllm-project/vllm/pull/24226
[Bugfix] Enable FP8 KV cache for FlashInfer and Triton backend on non-sm100 GPUs by @gau-nernst in https://github.com/vllm-project/vllm/pull/24577
[CI] Fail subprocess tests with root-cause error by @njhill in https://github.com/vllm-project/vllm/pull/23795
[v1] Add Whisper model support (encoder-decoder) by @russellb in https://github.com/vllm-project/vllm/pull/21088
[torch.compile][ROCm][V1] Enable attention output FP8 fusion for V1 attention backends by @gshtras in https://github.com/vllm-project/vllm/pull/19767
[gpt-oss] raise error for flashinfer backend without trtllm by @heheda12345 in https://github.com/vllm-project/vllm/pull/24482
[Perf] Warmup FlashInfer attention during startup by @mgoin in https://github.com/vllm-project/vllm/pull/23439
[Kernel] Flashinfer MLA (trtllm-gen) decode kernel integration by @hjjq in https://github.com/vllm-project/vllm/pull/21078
[Misc] Make timeout passable in init_distributed_environment by @jberkhahn in https://github.com/vllm-project/vllm/pull/24522
[Models][Quantization] Add quantization configuration update in Voxtral model by @anmarques in https://github.com/vllm-project/vllm/pull/24122
[distributed] update known issues by @youkaichao in https://github.com/vllm-project/vllm/pull/24624
Add @chaunceyjiang to codeowner for reasoning Reasoning and Tool parser by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24406
[Bug] [Spec Decode] Fix model_initialization test and mismatch in aux_hidden_layers by @wwl2755 in https://github.com/vllm-project/vllm/pull/24613
[Ultravox] Fix Gemma instantiation, support quantization via --hf-overrides by @petersalas in https://github.com/vllm-project/vllm/pull/24131
[Bugfix] Add missing VIT backend dispatch on CPU by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/24623
[BugFix] Fix pipeline parallel by @njhill in https://github.com/vllm-project/vllm/pull/24621
[Engine][Chore] use local variable and remove output var assignment by @GuyStone in https://github.com/vllm-project/vllm/pull/24554
Kimi K2 Fused MoE kernels Optimization configs by @samanamp in https://github.com/vllm-project/vllm/pull/24597
Enable --profile in 'vllm bench throughput' by @tomasruizt in https://github.com/vllm-project/vllm/pull/24575
[Core] feat: Add --safetensors-load-strategy flag for faster safetensors loading from Lustre by @shengshiqi-google in https://github.com/vllm-project/vllm/pull/24469
[Doc]: fixing doc typos by @didier-durand in https://github.com/vllm-project/vllm/pull/24635
[Model] New model support for Motif-1-Tiny by @ca1207 in https://github.com/vllm-project/vllm/pull/23414
Remove redundant all gather + split by @chenxi-yang in https://github.com/vllm-project/vllm/pull/23441
[torchao] Support quantization configs using module swap by @jerryzh168 in https://github.com/vllm-project/vllm/pull/21982
Add the support for the qwen3 next model (a hybrid attention model). by @sighingnow in https://github.com/vllm-project/vllm/pull/24526
[Bugfix] Fix incorrect import of CacheConfig by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/24631
[Docs] Revise frameworks/anything-llm.md by @windsonsea in https://github.com/vllm-project/vllm/pull/24489
[Docs] Update V1 doc to reflect whisper support by @russellb in https://github.com/vllm-project/vllm/pull/24606
[Docs] Use 1-2-3 list for deploy steps in deployment/frameworks/ by @windsonsea in https://github.com/vllm-project/vllm/pull/24633
[CI]Add transformers_utils to Async Engine, Inputs, Utils, Worker Test by @charlotte12l in https://github.com/vllm-project/vllm/pull/24615
[Bugfix] Fix _synced_weight_loader by @kyuyeunk in https://github.com/vllm-project/vllm/pull/24565
[CI] Split pooling from entrypoints Test by @noooop in https://github.com/vllm-project/vllm/pull/24632
[Misc] Add @NickLucche to codeowners by @NickLucche in https://github.com/vllm-project/vllm/pull/24647
[CI Failure] fix models/language/pooling/test_auto_prefix_cache_support.py by @noooop in https://github.com/vllm-project/vllm/pull/24636
Fix typing for safetensors_load_strategy by @hmellor in https://github.com/vllm-project/vllm/pull/24641
Move LoRAConfig from config/__init__.py to config/lora.py by @hmellor in https://github.com/vllm-project/vllm/pull/24644
[XPU] add missing dependency tblib for XPU CI by @faaany in https://github.com/vllm-project/vllm/pull/24639
[Docs] Fixes a typo in the qwen3next model name. by @sighingnow in https://github.com/vllm-project/vllm/pull/24654
[build] add torch to tool.uv no-build-isolation-package by @youkaichao in https://github.com/vllm-project/vllm/pull/24303
[Bench] Add qwen-next in benchmark_moe.py by @jeejeelee in https://github.com/vllm-project/vllm/pull/24661
[CI] Split mteb test from Language Models Test by @noooop in https://github.com/vllm-project/vllm/pull/24634
Allow users to specify kv cache memory size by @BoyuanFeng in https://github.com/vllm-project/vllm/pull/21489
[HybridKVCache][Platform] Add support_hybrid_kv_cache for platform by @MengqingCao in https://github.com/vllm-project/vllm/pull/24646
[Bugifx] Fix qwen-next packed_modules_mapping by @jeejeelee in https://github.com/vllm-project/vllm/pull/24656
[Docs] Add transcription support to model by @NickLucche in https://github.com/vllm-project/vllm/pull/24664
[Doc] Fix Markdown Pre-commit Error by @yewentao256 in https://github.com/vllm-project/vllm/pull/24670
[Docs] Fix typos in EP deployment doc by @hmellor in https://github.com/vllm-project/vllm/pull/24669
[VLM] Optimize GLM4.5-V-style video processing to only decode necessary frames by @Isotr0py in https://github.com/vllm-project/vllm/pull/24161
[Kernels] Enable Torch Symmetric Memory All-Reduce By Default by @ilmarkov in https://github.com/vllm-project/vllm/pull/24111
[Bugfix] Fix platform-specific routing in CustomOp implementations by @kzawora-intel in https://github.com/vllm-project/vllm/pull/24444
Fix model name included in responses by @hmellor in https://github.com/vllm-project/vllm/pull/24663
fix some typos by @co63oc in https://github.com/vllm-project/vllm/pull/24616
[Docs] Fix formatting of transcription doc by @hmellor in https://github.com/vllm-project/vllm/pull/24676
[VLM] Migrate remain DP-supported ViT models to use disable_tp by @Isotr0py in https://github.com/vllm-project/vllm/pull/24363
[Ultravox] Use wrapped_model_config to instantiate inner model by @petersalas in https://github.com/vllm-project/vllm/pull/24679
[Doc] Remove Useless Comments by @yewentao256 in https://github.com/vllm-project/vllm/pull/24687
[Qwen3-Next] Add MoE Config for H200 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24688
[BugFix] Fix tokenize asyncio task leak by @njhill in https://github.com/vllm-project/vllm/pull/24677
update spec decode metrics to use throughput by @qandrew in https://github.com/vllm-project/vllm/pull/24127
[Kernel][B200] mxfp4 fused cutlass moe by @djmmoss in https://github.com/vllm-project/vllm/pull/23696
[flashinfer] [kernel] support for fp8 kv cache for trtllm prefill attention by @mxz297 in https://github.com/vllm-project/vllm/pull/24197
[Bugfix] Set VLLM_ALLREDUCE_USE_SYMM_MEM default to False by @yewentao256 in https://github.com/vllm-project/vllm/pull/24696
[Qwen3-Next] MoE configs for H200 TP=1,2,4 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24695
[CI/Build] Add bc-linter to vLLM CI by @zhewenl in https://github.com/vllm-project/vllm/pull/21234
[Qwen3-Next] Add B200 MoE configs for Qwen3-next by @vadiklyutiy in https://github.com/vllm-project/vllm/pull/24698
[Bugfix][Attention] Fix FlashInfer MLA block size logic by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/24692
[Perf] Use upstream CUTLASS for SM90 Block FP8 kernel by @mgoin in https://github.com/vllm-project/vllm/pull/23280
[Qwen3-Next] MOE configs for H100 TP4 by @heheda12345 in https://github.com/vllm-project/vllm/pull/24699
[Doc] Clarify cudagraph capture size logic and default behavior in scheduler by @Zazzle516 in https://github.com/vllm-project/vllm/pull/18698
[Bug] Fix Layer weight_block_size Assertion Issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/24674
[Startup] Make DeepGEMM warmup scale with max-num-batched-tokens by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/24693
[V1] feat:add engine v1 tracing by @RichardoMrMu in https://github.com/vllm-project/vllm/pull/20372
[Bugfix] fixes the causal_conv1d_update kernel update non-speculative decoding cases by @sighingnow in https://github.com/vllm-project/vllm/pull/24680

New Contributors

@DoubleVII made their first contribution in #23058
@carlory made their first contribution in #23090
@nikheal2 made their first contribution in #22725
@Tialo made their first contribution in #23172
@myselvess made their first contribution in #23084
@yiz-liu made their first contribution in #23169
@ultmaster made their first contribution in #22587
@KilJaeeun made their first contribution in #22790
@wzshiming made their first contribution in #23242
@misrasaurabh1 made their first contribution in #20413
@yannqi made their first contribution in #23246
@jaredoconnell made their first contribution in #23306
@paulpak58 made their first contribution in #22845
@zhuangqh made their first contribution in #23309
@tvalentyn made their first contribution in #23270
@arjunbreddy22 made their first contribution in #22495
@philipchung made their first contribution in #17149
@FoolPlayer made their first contribution in #23241
@namanlalitnyu made their first contribution in #23375
@hickeyma made their first contribution in #23353
@PapaGoose made their first contribution in #23337
@bppps made their first contribution in #23366
@AzizCode92 made their first contribution in #23416
@fengli1702 made their first contribution in #22527
@FFFfff1FFFfff made their first contribution in #23408
@ayushsatyam146 made their first contribution in #23171
@patemotter made their first contribution in #23574
@Terrencezzj made their first contribution in #23584
@Copilot made their first contribution in #23385
@oneraghavan made their first contribution in #23630
@lordmathis made their first contribution in #23634
@OYE93 made their first contribution in #23565
@TianyuLi0 made their first contribution in #23146
@yuekaizhang made their first contribution in #23623
@coval3nte made their first contribution in #23054
@youzhedian made their first contribution in #23648
@frank-wei made their first contribution in #23613
@faaany made their first contribution in #22500
@cndoit18 made their first contribution in #23718
@rebel-hongseok made their first contribution in #23746
@Hanchenli made their first contribution in #23713
@Shrey1306 made their first contribution in #23556
@Ithanil made their first contribution in #23155
@killershrimp made their first contribution in #23792
@crischeng made their first contribution in #23823
@angelayi made their first contribution in #23349
@jeanschmidt made their first contribution in #23757
@He-Jingkai made their first contribution in https://github.com/vllm-project/vllm/pull/23829
@aditchawdhary made their first contribution in https://github.com/vllm-project/vllm/pull/23899
@EduardDurech made their first contribution in https://github.com/vllm-project/vllm/pull/23068
@dubejf made their first contribution in https://github.com/vllm-project/vllm/pull/23944
@sadeghja1070 made their first contribution in https://github.com/vllm-project/vllm/pull/23971
@DevonPeroutky made their first contribution in https://github.com/vllm-project/vllm/pull/22769
@hypdeb made their first contribution in https://github.com/vllm-project/vllm/pull/23802
@DamonJiang777 made their first contribution in https://github.com/vllm-project/vllm/pull/24028
@cberge908 made their first contribution in https://github.com/vllm-project/vllm/pull/23104
@lkm2835 made their first contribution in https://github.com/vllm-project/vllm/pull/23918
@nathanrchn made their first contribution in https://github.com/vllm-project/vllm/pull/24100
@co63oc made their first contribution in https://github.com/vllm-project/vllm/pull/24071
@divyanshsinghvi made their first contribution in https://github.com/vllm-project/vllm/pull/23907
@biba10 made their first contribution in https://github.com/vllm-project/vllm/pull/23931
@dongbo910220 made their first contribution in https://github.com/vllm-project/vllm/pull/20388
@NagyGeorge made their first contribution in https://github.com/vllm-project/vllm/pull/23460
@wdhongtw made their first contribution in https://github.com/vllm-project/vllm/pull/24214
@bingchen-mi made their first contribution in https://github.com/vllm-project/vllm/pull/23652
@anthonsu made their first contribution in https://github.com/vllm-project/vllm/pull/23766
@whx-sjtu made their first contribution in https://github.com/vllm-project/vllm/pull/23332
@pratapyash made their first contribution in https://github.com/vllm-project/vllm/pull/24231
@samanamp made their first contribution in https://github.com/vllm-project/vllm/pull/24266
@mohankku made their first contribution in https://github.com/vllm-project/vllm/pull/24335
@ashwin-phadke made their first contribution in https://github.com/vllm-project/vllm/pull/24339
@bangshengtang made their first contribution in https://github.com/vllm-project/vllm/pull/24265
@charlotte12l made their first contribution in https://github.com/vllm-project/vllm/pull/23868
@alhridoy made their first contribution in https://github.com/vllm-project/vllm/pull/24361
@what-in-the-nim made their first contribution in https://github.com/vllm-project/vllm/pull/24332
@BraveY made their first contribution in https://github.com/vllm-project/vllm/pull/23928
@R3hankhan123 made their first contribution in https://github.com/vllm-project/vllm/pull/24034
@csahithi made their first contribution in https://github.com/vllm-project/vllm/pull/24102
@Prowindy made their first contribution in https://github.com/vllm-project/vllm/pull/24298
@micah-wil made their first contribution in https://github.com/vllm-project/vllm/pull/24128
@pwschuurman made their first contribution in https://github.com/vllm-project/vllm/pull/23845
@comsky made their first contribution in https://github.com/vllm-project/vllm/pull/24348
@li-jinpeng made their first contribution in https://github.com/vllm-project/vllm/pull/24519
@baonudesifeizhai made their first contribution in https://github.com/vllm-project/vllm/pull/23978
@lacora made their first contribution in https://github.com/vllm-project/vllm/pull/24549
@RoadToNowhereX made their first contribution in https://github.com/vllm-project/vllm/pull/24217
@zzhx1 made their first contribution in https://github.com/vllm-project/vllm/pull/23564
@hjjq made their first contribution in https://github.com/vllm-project/vllm/pull/21078
@tomasruizt made their first contribution in https://github.com/vllm-project/vllm/pull/24575
@shengshiqi-google made their first contribution in https://github.com/vllm-project/vllm/pull/24469
@ca1207 made their first contribution in https://github.com/vllm-project/vllm/pull/23414
@qandrew made their first contribution in https://github.com/vllm-project/vllm/pull/24127
@Zazzle516 made their first contribution in https://github.com/vllm-project/vllm/pull/18698
@RichardoMrMu made their first contribution in https://github.com/vllm-project/vllm/pull/20372

Full Changelog: v0.10.1.1...v0.10.2rc3

vllm-project/vllm v0.10.2rc3 v0.10.2 on GitHub