Highlights
This release contains 740 commits from 266 contributors (97 new)!
Breaking Changes: This release includes the PyTorch 2.8.0 upgrade, V0 deprecations, and API changes; please review the changelog carefully.
Model Support
- New model families and enhancements: Apertus (#23068), LFM2 (#22845), MiDashengLM (#23652), Motif-1-Tiny (#23414), Seed-Oss (#23241), Google EmbeddingGemma-300m (#24318), GTE sequence classification (#23524), Donut OCR model (#23229), KeyeVL-1.5-8B (#23838), R-4B vision model (#23246), Ernie4.5 VL (#22514), MiniCPM-V 4.5 (#23586), Ovis2.5 (#23084), Qwen3-Next with hybrid attention (#24526), InternVL3.5 with video support (#23658), Qwen2Audio embeddings (#23625), NemotronH Nano VLM (#23644), BLOOM V1 engine support (#23488), and Whisper encoder-decoder for V1 (#21088).
- Pipeline parallelism expansion: Added PP support for Hunyuan (#24212), Ovis2.5 (#23405), GPT-OSS (#23680), and Kimi-VL-A3B-Thinking-2506 (#23114).
- Data parallelism for vision models: Enabled DP for ViT across Qwen2.5VL (#22742), MiniCPM-V (#23948, #23327), Kimi-VL (#23817), and GLM-4.5V (#23168).
- LoRA ecosystem expansion: Added LoRA support to Voxtral (#24517), Qwen-2.5-Omni (#24231), and DeepSeek models V2/V3/R1-0528 (#23971), with significantly faster LoRA startup performance (#23777). See the adapter-loading sketch after this list.
- Classification and pooling enhancements: Multi-label classification support (#23173), logit bias and sigmoid normalization (#24031), and FP32 precision heads for pooling models (#23810).
- Performance optimizations: Removed unnecessary CUDA sync from GLM-4.1V (#24332) and Qwen2VL (#24334) preprocessing, eliminated redundant all-reduce in Qwen3 MoE (#23169), optimized InternVL CPU threading (#24519), and GLM4.5-V video frame decoding (#24161).
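For the LoRA additions above, usage follows the existing vLLM LoRA API; the sketch below is illustrative only, with a placeholder base model and adapter path.

```python
# Illustrative sketch only: serving one of the newly LoRA-enabled model families with an
# adapter. The base model and adapter path are placeholders; enable_lora and LoRARequest
# are the existing vLLM LoRA interface rather than anything introduced in this release.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite-Chat", enable_lora=True)  # placeholder base model
adapter = LoRARequest("my-adapter", 1, "/path/to/lora_adapter")  # name, unique int id, local path

outputs = llm.generate(
    ["Summarize this release in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
    lora_request=adapter,
)
print(outputs[0].outputs[0].text)
```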
Engine Core
- V1 engine maturation: Extended V1 support to compute capability < 8.0 (#23614, #24022), added cross-attention KV cache for encoder-decoder models (#23664), request-level logits processor integration (#23656), and KV events from connectors (#19737).
- Backend expansion: Terratorch backend integration (#23513) enabling non-language model tasks like semantic segmentation and geospatial applications with `--model-impl terratorch` support (see the sketch after this list).
- Hybrid and Mamba model improvements: Enabled full CUDA graphs by default for hybrid models (#22594), disabled prefix caching for hybrid/Mamba models (#23716), added FP32 SSM kernel support (#23506), full CUDA graph support for Mamba1 (#23035), and V1 as default for Mamba models (#23650).
- Performance core improvements: `--safetensors-load-strategy` for NFS-based file loading acceleration (#24469), critical CUDA graph capture throughput fix (#24128), scheduler optimization for single completions (#21917), multi-threaded model weight loading (#23928), and tensor core usage enforcement for FlashInfer decode (#23214).
- Multimodal enhancements: Multimodal cache tracking with mm_hash (#22711), UUID-based multimodal identifiers (#23394), improved V1 video embedding estimation (#24312), and simplified multimodal UUID handling (#24271).
- Sampling and structured outputs: Support for all prompt logprobs (#23868), final logprobs (#22387), grammar bitmask optimization (#23361), and user-configurable KV cache memory size (#21489).
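A rough sketch of how the Terratorch backend might be selected from the offline entrypoint, assuming the `--model-impl` flag maps to the `model_impl` engine argument as usual; the checkpoint name is a placeholder and the full task/output workflow is described in #23513.

```python
# Rough sketch, not a verified workflow: selecting the Terratorch backend offline.
# CLI equivalent: vllm serve <checkpoint> --model-impl terratorch
# The checkpoint below is a placeholder; see #23513 for supported tasks and output handling.
from vllm import LLM

llm = LLM(
    model="ibm-nasa-geospatial/Prithvi-EO-2.0-300M",  # placeholder geospatial checkpoint
    model_impl="terratorch",  # route model construction through the Terratorch backend
)
```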
Hardware & Performance
- NVIDIA Blackwell/SM100 generation: FP8 MLA support with CUTLASS backend (#23289), DeepGEMM Linear with 1.5% E2E throughput improvement (#23351), Hopper DeepGEMM E8M0 for DeepSeekV3.1 (#23666), SM100 FlashInfer CUTLASS MoE FP8 backend (#22357), MXFP4 fused CUTLASS MoE (#23696), default MXFP4 MoE on Blackwell (#23008), and GPT-OSS DP/EP support with 52,003 tokens/s throughput (#23608).
- Breaking change: FlashMLA disabled on Blackwell GPUs due to compatibility issues (#24521).
- Kernel and attention optimizations: FlashAttention MLA with CUDA graph support (#14258, #23958), V1 cross-attention support (#23297), FP8 support for FlashMLA (#22668), fused grouped TopK for MoE (#23274), Flash Linear Attention kernels (#24518), and W4A8 support on Hopper (#23198).
- Performance improvements: 13.7x speedup for token conversion (#20413), TTIT/TTFT improvements for disaggregated serving (#22760), symmetric memory all-reduce by default (#24111), FlashInfer warmup during startup (#23439), V1 model execution overlap (#23569), and various Triton configuration tuning (#23748, #23939).
- Platform expansion: Apple Silicon bfloat16 support for M2+ (#24129), IBM Z V1 engine support (#22725), Intel XPU torch.compile (#22609), XPU MoE data parallelism (#22887), XPU Triton attention (#24149), XPU FP8 quantization (#23148), and ROCm pipeline parallelism with Ray (#24275).
- Model-specific optimizations: Hardware-tuned MoE configurations for Qwen3-Next on B200/H200/H100 (#24698, #24688, #24699, #24695), GLM-4.5-Air-FP8 B200 configs (#23695), Kimi K2 optimization (#24597), and QWEN3 Coder/Thinking configs (#24266, #24330).
Quantization
- New quantization capabilities: Per-layer quantization routing (#23556), GGUF quantization with layer skipping (#23188), NFP4+FP8 MoE support (#22674), W4A8 channel scales (#23570), and AMD CDNA2/CDNA3 FP4 support (#22527).
- Advanced quantization infrastructure: Compressed tensors transforms for linear operations (#22486), enabling techniques such as SpinQuant (R1/R2/R4) and QuIP quantization methods.
- FlashInfer quantization integration: FP8 KV cache for TRTLLM prefill attention (#24197), FP8-qkv attention kernels (#23647), and FP8 per-tensor GEMMs (#22895).
- Platform-specific quantization: ROCm TorchAO quantization enablement (#24400) and TorchAO module swap configuration (#21982).
- Performance optimizations: MXFP4 MoE loading cache optimization (#24154) and compressed tensors version updates (#23202).
- Breaking change: Removed original Marlin quantization format (#23204).
API & Frontend
- OpenAI API enhancements: Gemma3n audio transcription/translation endpoints (#23735), transcription response usage statistics (#23576), and the return_token_ids parameter (#22587); see the request example after this list.
- Response API improvements: Streaming support for non-harmony responses (#23741), non-streaming logprobs (#23319), MCP tool background mode (#23494), MCP streaming+background support (#23927), and tool output token reporting (#24285).
- Frontend optimizations: Error stack traces with --log-error-stack (#22960), collective RPC endpoint (#23075), beam search concurrency optimization (#23599), unnecessary detokenization skipping (#24236), and custom media UUIDs (#23449).
- Configuration enhancements: Formalized --mm-encoder-tp-mode flag (#23190), VLLM_DISABLE_PAD_FOR_CUDAGRAPH environment variable (#23595), EPLB configuration parameter (#20562), embedding endpoint chat request support (#23931), and LM Format Enforcer V1 integration (#22564).
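As an illustration of the new return_token_ids parameter (#22587), a request against a locally running server might look like the following; the base URL and served model name are placeholders, and since return_token_ids is a vLLM-specific field it is passed through the OpenAI client's extra_body passthrough.

```python
# Hedged example: requesting token IDs from a locally running vLLM OpenAI-compatible server.
# The base URL and served model name are placeholders; return_token_ids is the vLLM-specific
# field named above, so it is passed through the OpenAI client's extra_body escape hatch.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder served model
    messages=[{"role": "user", "content": "Say hello."}],
    extra_body={"return_token_ids": True},
)
print(resp.choices[0].message.content)
# Inspect resp.model_dump() to see the extra token-ID fields returned by the server.
```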
Dependencies
- Major updates: PyTorch 2.8.0 upgrade (#20358), a breaking change requiring environment updates; FlashInfer v0.3.0 upgrade (#24086); and a FlashInfer 0.2.14.post1 maintenance update (#23537).
- Supporting updates: XGrammar 0.1.23 (#22988), TPU core dump fix with tpu_info 0.4.0 (#23135), and compressed tensors version bump (#23202).
- Deployment improvements: FlashInfer cubin directory environment variable (#22675) for offline environments and pre-cached CUDA binaries.
V0 Deprecation
- Backend removals: V0 Neuron backend deprecation (#21159), V0 pooling model support removal (#23434), V0 FlashInfer attention backend removal (#22776), and V0 test cleanup (#23418, #23862).
- API breaking changes: prompt_token_ids fallback removal from LLM.generate and LLM.embed (#18800), LoRA extra vocab size deprecation warning (#23635), LoRA bias parameter deprecation (#24339), and metrics naming change from TPOT to ITL (#24110).
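With the prompt_token_ids keyword fallback removed (#18800), pre-tokenized input goes through the explicit prompt types. A minimal sketch follows; the model name and token IDs are placeholders.

```python
# Minimal sketch of the replacement for the removed prompt_token_ids keyword fallback (#18800):
# pre-tokenized input now uses the explicit TokensPrompt type. Model and token IDs are placeholders.
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

# Previously: llm.generate(prompt_token_ids=[[1, 2, 3]], ...)  <- no longer accepted
outputs = llm.generate(
    TokensPrompt(prompt_token_ids=[1, 2, 3]),
    SamplingParams(max_tokens=8),
)
print(outputs[0].outputs[0].text)
```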
Breaking Changes
- PyTorch 2.8.0 upgrade - Environment dependency change requiring updated CUDA versions
- FlashMLA Blackwell restriction - FlashMLA disabled on Blackwell GPUs due to compatibility issues
- V0 feature removals - Neuron backend, pooling models, FlashInfer attention backend
- Quantization removals - Removed the quantized Mixtral hack implementation and the original Marlin format.
- Metrics renaming - TPOT deprecated in favor of ITL
What's Changed
- [Misc] Minor code cleanup for _get_prompt_logprobs_dict by @WoosukKwon in #23064
- [Misc] enhance static type hint by @andyxning in #23059
- [Bugfix] fix Qwen2.5-Omni processor output mapping by @DoubleVII in #23058
- [Bugfix][CI] Machete kernels: deterministic ordering for more cache hits by @andylolu2 in #23055
- [Misc] refactor function name by @andyxning in #23029
- [Misc] Fix backward compatibility from #23030 by @ywang96 in #23070
- [XPU] Fix compile size for xpu by @jikunshang in #23069
- [XPU][CI]add xpu env vars in CI scripts by @jikunshang in #22946
- [Refactor] Define MultiModalKwargsItems separate from MultiModalKwargs by @DarkLight1337 in #23053
- [Bugfix] fix IntermediateTensors equal method by @andyxning in #23027
- [Refactor] Get prompt updates earlier by @DarkLight1337 in #23097
- chore: remove unnecessary patch_padding_side for the chatglm model by @carlory in #23090
- [Bugfix] Support compile for Transformers multimodal by @zucchini-nlp in #23095
- [CI Bugfix] Pin `openai<1.100` to unblock CI by @mgoin in #23118
- fix: OpenAI SDK compat (ResponseTextConfig) by @h-brenoskuk in #23126
- Use Blackwell FlashInfer MXFP4 MoE by default if available by @mgoin in #23008
- Install tpu_info==0.4.0 to fix core dump for TPU by @xiangxu-google in #23135
- [Misc] Minor refactoring for prepare_inputs by @WoosukKwon in #23116
- [Spec Decode] Make `propose_draft_token_ids` non-blocking for lower TTFT by @WoosukKwon in #23041
- [Misc] Add @tdoublep as a maintainer of hybrid model and Triton-attention related code by @tdoublep in #23122
- [CI][V0 Deprecation] Removed V0 Only Chunked Prefill and Prefix Caching Tests by @robertgshaw2-redhat in #22871
- [V0 Deprecation] Remove V0 FlashInfer attention backend by @WoosukKwon in #22776
- chore: disable enable_cpp_symbolic_shape_guards by @xiszishu in #23048
- [TPU] make ptxla not imported when using tpu_commons by @yaochengji in #23081
- [Hardware][IBM Z]Enable v1 for s390x and s390x dockerfile fixes by @nikheal2 in #22725
- Migrate InternVLImagePixelInputs (in nemotron_vl.py) to TensorSchema by @bbeckca in #22023
- [Log] Warning Once for Cutlass MLA by @yewentao256 in #23137
- [Model] Support Pipeline Parallelism for moonshotai/Kimi-VL-A3B-Thinking-2506 by @ZJY0516 in #23114
- [misc] split engine_model into json file for nsys profile tool by @gracehonv in #23117
- [Benchmark] Add flag --served-model-name to benchmark_serving_multi_turn by @pliops-daniels in #22889
- Fix GLM-4.5V-FP8 numerical issue by @zixi-qi in #22949
- [Misc] Add request_id into benchmark_serve.py by @hustxiayang in #23065
- [Bugfix] Fix broken Minimax-01-VL model by @Isotr0py in #22116
- [bug fix] Fix llama4 spec decoding by @zixi-qi in #22691
- [Misc] Avoid accessing req_ids inside a loop by @WoosukKwon in #23159
- [Doc] use power of 2 by @Tialo in #23172
- [Misc] Fix seq_lens for graph capture by @WoosukKwon in #23175
- [NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel by @elvischenv in #21716
- [Model] Add transformers problem_type (e.g. multi_label_classification) support by @noooop in #23173
- [Model] support new model ovis2.5 by @myselvess in #23084
- [Bugfix] Fix benchmark_moe.py by @jeejeelee in #23177
- [FEAT] [Performance] Enable DP for ViT in Qwen2.5VL by @tjtanaa in #22742
- [Model] Removes redundant all-reduce operation in Qwen3MoeSparseMoeBlock by @yiz-liu in #23169
- Add return_token_ids parameter to OpenAI API endpoints by @ultmaster in #22587
- Migrate LlavaOnevisionMultiInputs to TensorSchema by @bbeckca in #21844
- [CI/Build] Update transformers to v4.55.2 by @Isotr0py in #23093
- [Misc] Fix the benchmark's README and improve the error messages for the benchmark's argument checks by @tanruixiang in #22654
- [Frontend] Add `/collective_rpc` API endpoint by @22quinn in #23075
- [Misc] Enable yapf for FlashInfer backend by @WoosukKwon in #23193
- [Bugfix] Fix accuracy issue when using flashinfer cutlass moe, TP=1 and modelopt. by @bnellnm in #23125
- fix: use cache_salt for gpt-oss by @dr75 in #23186
- [Misc] Minor refactoring for FlashInfer backend by @WoosukKwon in #23147
- [CI/Build] Add support for Python 3.13 by @mgoin in #13164
- [NVIDIA] Add SM100 Flashinfer Cutlass MoE fp8 backend by @amirkl94 in #22357
- [CI/Build] Replace lm-eval gsm8k tests with faster implementation by @mgoin in #23002
- [BugFix] fix CUTLASS MLA full cudagraph by @LucasWilkinson in #23200
- [Benchmarks] Add video inputs to ShareGPTDataset. by @huachenheli in #23199
- [Quantization] Bump Compressed Tensors Version by @kylesayrs in #23202
- [Core] Optimize scheduler request removal for single completions by @chi2liu in #21917
- [CI Perf] Only test bfloat16 for tests/compile/test_fusion_all_reduce.py by @mgoin in #23132
- [Core] Add torch profiler CPU traces for AsyncLLM. by @huachenheli in #21794
- [Doc] Update V1 status of various pooling models by @DarkLight1337 in #23189
- [Attention] Optimize make_local_attention_virtual_batches for Flash Attention by @linzebing in #23185
- Fix a performance comparison issue in Benchmark Suite by @louie-tsai in #23047
- chore: support pytorch format in lora by @KilJaeeun in #22790
- [CI/Build] Also check DP in benchmarks throughput script by @zhewenl in #23038
- [CI/Build] Sync multimodal tests by @DarkLight1337 in #23181
- [BugFix] Fix stuck stats/metrics after requests are aborted by @njhill in #22995
- fix cuda graph by @fsx950223 in #22721
- [Model] use autoWeightsLoader for gptoss by @calvin0327 in #22446
- Fix missing quotes by @wzshiming in #23242
- [Model] Support deepseek with eagle by @xyang16 in #21086
- [Bugfix] Ensure correctness of Cohere2Vision processing by @DarkLight1337 in #23245
- Update to flashinfer-python==0.2.12 and disable AOT compile for non-release image by @mgoin in #23129
- [Model][V1] Support Ernie MTP by @xyxinyang in #22169
- [Model] Improve olmo and olmo2 by @jeejeelee in #23228
- [Fix] fix offline env use local mode path by @lengrongfu in #22526
- [Bugfix] Ensure correctness of HCXVision processing by @DarkLight1337 in #23254
- [Kernel] CUTLASS MoE FP8: Integrate cuda moe permute/unpermute by @shixianc in #23045
- [CLI][Doc] Formalize `--mm-encoder-tp-mode` by @DarkLight1337 in #23190
- [Misc] Add max_seq_len to CommonAttentionMetadata by @WoosukKwon in #23216
- [FIXBUG ] Allow disabling rocm_aiter_fa backend for ROCm GPUs not compatible with AITER by @JartX in #22795
- Support conditional torch.compile per module by @sarckk in #22269
- Migrate Mistral3ImagePixelInputs to TensorSchema by @bbeckca in #21945
- Limit HTTP header count and size by @russellb in #23267
- Small fix for Command-A-Vision by @dongluw in #23268
- [Kernel/Quant] Remove the original marlin format and qqq by @mgoin in #23204
- [Fix] correct tool_id for kimi-k2 when use tool_choice=required by @MoyanZitto in #21259
- [Frontend] improve error logging of chat completion by @heheda12345 in #22957
- [Optimization] Speed up function `_convert_tokens_to_string_with_added_encoders` by 13.7x by @misrasaurabh1 in #20413
- Do not use eval() to convert unknown types by @russellb in #23266
- [Feature] use --eplb_config to set eplb param by @lengrongfu in #20562
- [misc] fix multiple arch wheels for the nightly index by @youkaichao in #23110
- Remove chunked_prefill_enabled flag in V1 MLA by @MatthewBonanni in #23183
- Feature/mla tests by @MatthewBonanni in #23195
- [Fix] remove is_marlin param in benchmark_moe by @shixianc in #23286
- [EP] Add logging for experts map by @22quinn in #22685
- Remove duplicate entry in vllm.attention.all by @russellb in #23296
- [CI Bugfix] Fix CI by fully removing --enable-prompt-adapter by @mgoin in #23284
- [Optimization] Make new_block_ids None if empty by @WoosukKwon in #23262
- [CPU] Refactor CPU W8A8 scaled_mm by @bigPYJ1151 in #23071
- [CI/Build] Split out mm processor tests by @DarkLight1337 in #23260
- [V1][Mamba1] - Full CUDA and Piecewise CUDA Graphs Support by @Josephasafg in #23035
- [Compile] Fix Compile Warning SM100 Cutlass MLA by @yewentao256 in #23287
- [Model][VLM] Support R-4B Model by @yannqi in #23246
- Delete images older than 24h. by @QiliangCui in #23291
- [CI] Block the cu126 wheel build while broken by @mgoin in #23285
- [Sampler] Support returning final logprobs by @22quinn in #22387
- [Bugfix] Fix extra whitespace in strings caused by newline by @DarkLight1337 in #23272
- [BugFix] Fix Python 3.9 Support by @jaredoconnell in #23306
- [Model] Add LFM2 architecture by @paulpak58 in #22845
- [Refactor] Simplify code for MM budget by @DarkLight1337 in #23310
- [Doc] Fix batch-level DP example by @DarkLight1337 in #23325
- [Performance] V1 Pooling Models E2E Performance Optimization by @noooop in #23162
- [V1] Remove unnecessary check for main thread by @robertgshaw2-redhat in #23298
- [Bugfix] set system_message in phi4mini chat template by @zhuangqh in #23309
- [Multimodal] Always enable hashing mm data by @ywang96 in #23308
- [ci/build] Fix abi tag for aarch64 by @youkaichao in #23329
- Migrate MolmoImageInputs to TensorSchema by @bbeckca in #22022
- Fix nvfp4 swizzling by @yiliu30 in #23140
- add tg-mxfp4-moe-test by @IwakuraRein in #22540
- [Bug] Fix R1 Accuracy 0 Bug by @yewentao256 in #23294
- [Bugfix] Fix port conflict by obtaining a list of open ports upfront by @minosfuture in #21894
- [Misc] Misc code cleanup/simplification by @njhill in #23304
- [BugFix][gpt-oss] Fix Chat Completion with Multiple Output Message by @heheda12345 in #23318
- [Misc] fix VLLM_TORCH_PROFILER_DIR to absolute path by @andyxning in #23191
- [Core] Always use tensor cores for Flashinfer Decode Wrapper by @pavanimajety in #23214
- Make sure that vectorize_with_alignment produced vectorized global loads by @elvircrn in #23182
- [Structured Outputs] Refactor bitmask construction into get_grammar_bitmask by @WoosukKwon in #23361
- [CI] Clean up actions: remove helm, publish workflows and improve pr … by @simon-mo in #23377
- [CI] improve pr comments bot by @simon-mo in #23380
- [Perf] Small optimizations for silu_mul_fp8_quant_deep_gemm by @mgoin in #23265
- Always use cache mounts when installing vllm to avoid populating pip cache in the image. Also remove apt cache. by @tvalentyn in #23270
- [Feature][Responses API] Support logprobs(non-stream) by @kebe7jun in #23319
- [Core] Support custom executor qualname by @22quinn in #23314
- [Kernel] Add FP8 support with FlashMLA backend by @MatthewBonanni in #22668
- [Deprecation] Remove `prompt_token_ids` arg fallback in `LLM.generate` and `LLM.embed` by @DarkLight1337 in #18800
- Migrate MllamaImagePixelInputs to TensorSchema by @bbeckca in #22020
- [CI/Build] Skip Idefics3 and SmolVLM generation test again by @Isotr0py in #23356
- [Feature] Enable DeepGEMM Linear on B200; 1.5% E2E throughput improvement by @yewentao256 in #23351
- [CI] Add end-to-end V1 min_tokens test coverage by @arjunbreddy22 in #22495
- [Misc] Add gemma3 chat template with pythonic-style function calling by @philipchung in #17149
- [New Model] Add Seed-Oss model by @FoolPlayer in #23241
- [Attention] Refactor AttentionMetadata Preparation for Encoder-only Models by @heheda12345 in #23154
- [P/D][Nixl] Make kv cache register compatible with hybrid memory allocator by @sfeng33 in #23079
- [gpt-oss] add input/output usage in responses api when harmony context is leveraged by @gcalmettes in #22667
- Migrate MiniCPMOAudioInputs to TensorSchema by @bbeckca in #21847
- [Bugfix] Fix pooling models on non-CUDA devices by @bigPYJ1151 in #23392
- [V0 Deprecation] Remove V0 LoRA test by @jeejeelee in #23418
- [Misc] Move M-RoPE init logic to _init_mrope_positions by @WoosukKwon in #23422
- [Attention] Allow V1 flash_attn to support cross-attention by @russellb in #23297
- [misc] Remove outdate comment about runai_model_streamer by @carlory in #23421
- [Doc] Update the doc for log probs + prefix caching by @heheda12345 in #23399
- [Misc] local import code clean by @andyxning in #23420
- [Bug fix] Dynamically setting the backend variable for genai_perf_tests in the run-nightly-benchmark script by @namanlalitnyu in #23375
- [Fix] Bump triton version in rocm-build requirements by @bringlein in #21630
- [Bugfix]: Installing dev environment due to pydantic incompatible version by @hickeyma in #23353
- [Speculators][Speculative Decoding] Fix Qwen 2 Eagle3 Support by @PapaGoose in #23337
- [BugFix] Fix the issue where image embeddings were incorrectly split.… by @bppps in #23366
- fix(tests): Ensure reliable CUDA cache clearing in MoE test by @AzizCode92 in #23416
- Add unit tests for batched guided and non-guided requests by @sarckk in #23389
- [Doc]: fix various typos in multiple files by @didier-durand in #23179
- [Model] Add Ovis2.5 PP support by @Isotr0py in #23405
- [Bugfix] Fix broken Florence-2 model by @Isotr0py in #23426
- [Quantization] Allow GGUF quantization to skip unquantized layer by @Isotr0py in #23188
- add an env var for path to pre-downloaded flashinfer cubin files by @842974287 in #22675
- [CI/Build] add EP dependencies to docker by @zhewenl in #21976
- [PERF] PyTorch Symmetric Memory All-Reduce by @ilmarkov in #20759
- [BugFix][AMD][Quantization] Fix torch.compile issue where wvSplitKQ not being called when it should when using quantized FP8 model by @rasmith in #22281
- [NVIDIA] Support Flashinfer TRTLLM FP8-q/kv NVFP4-out Attention Kernel by @elvischenv in #22703
- [BugFix] Fix batch updates for pooling models by @njhill in #23398
- [BugFix] Fix `MinPLogitsProcessor.update_states()` by @njhill in #23401
- [Model] Support DP for ViT on MiniCPM-V-4 by @david6666666 in #23327
- [UX] Move Dockerfile DeepGEMM install to tools/install_deepgemm.sh by @mgoin in #23360
- Quantization: support FP4 quantized models on AMD CDNA2/CDNA3 GPUs by @fengli1702 in #22527
- Add glm4.5v tp2,4 fp8 config on H100_80GB by @chenxi-yang in #23443
- Revert "[PERF] Use faster way of decode in tokenizer: avoid useless list-to-list conversion (#20000)" by @DarkLight1337 in #23396
- fix(tests): Correct unreachable assertion in truncation test by @AzizCode92 in #23425
- Support DeepSeek-V3.1 tool call by @Xu-Wenqing in #23454
- [Misc] Modify CacheConfig import by @jeejeelee in #23459
- [gpt-oss] Streaming Output for Python Tool by @ZJY0516 in #23409
- Migrate Pixtral inputs to TensorSchema by @bbeckca in #23472
- [Bugfix] Add strong reference to CUDA pluggable allocator callbacks by @22quinn in #23477
- Migrate Paligemma inputs to TensorSchema by @bbeckca in #23470
- [kernel] Support W4A8 on Hopper by @czhu-cohere in #23198
- [Misc] update dict parse to EPLBConfig from json dumps to dict unpacking by @lengrongfu in #23305
- (Misc): add missing test for zero truncation size. by @teekenl in #23457
- [New Model]Donut model by @princepride in #23229
- [Model] Enable BLOOM on V1 by @DarkLight1337 in #23488
- [Misc] Remove unused slot_mapping buffer by @WoosukKwon in #23502
- fix incompatibililty with non cuda platform for nvfp4 by @luccafong in #23478
- [Doc: ]fix various typos in multiple files by @didier-durand in #23487
- [Perf] Add Triton config for DeepSeek V3 FP8 EP32 H200 by @minosfuture in #23504
- Frontend: Adding LM Format Enforcer support to V1 engine by @noamgat in #22564
- [Bugfix] Fix Qwen2.5-VL quantized model weights loading by @zifeitong in #23512
- [Misc] Unified linear print info by @jeejeelee in #23516
- Migrate tarsier inputs to TensorSchema by @bbeckca in #23500
- Migrate skyworkr1v inputs to TensorSchema by @bbeckca in #23499
- Migrate DonutImagePixelInputs to TensorSchema by @bbeckca in #23509
- [Bugfix] Fix Dense module loading for sentence-transformers embedding models (simplified V2) by @FFFfff1FFFfff in #23408
- [gpt-oss] use reasoning channel for reasoning text in serving_chat by @yuguo68 in #22920
- [Refactor] Dynamic `target` and `content` for prompt updates by @DarkLight1337 in #23411
- [Core][Multimodal] Track encode cache entries by mm_hash and enable embedding sharing between requests by @fake0fan in #22711
- [Fix] DeepSeek V3.1 tool parser error message by @skyloevil in #23492
- Feature/benchmark/random mm data/images by @h-brenoskuk in #23119
- [Bugfix] Allow dynamic number of patches for llava_onevision by @DarkLight1337 in #23525
- [misc] add shanghai meetup by @youkaichao in #23535
- [Attention] Unify mamba and attention backend selection by @ayushsatyam146 in #23171
- [Doc] Add caution for API server scale-out by @DarkLight1337 in #23550
- [Refactor] Pass `tokenizer` explicitly instead of binding to prompt update by @DarkLight1337 in #23542
- Updates to Flex + VLLm integration by @drisspg in #21416
- [Bugfix] Fix Qwen3 MoE GPTQ inference by @Isotr0py in #23490
- [Refactor] Refactor persistent buffers with CpuGpuBuffer by @WoosukKwon in #23515
- [test][RL] Add sleep level 2 test and fix reload with sleep mode by @22quinn in #23521
- [Kernel] Add fused grouped_topk kernel for MoE by @xyang16 in #23274
- [Bugfix][V1][P/D]Fix the issue where repeated requests for the same input produce abnormal outputs for P2pNcclConnector by @Abatom in #23403
- [XPU] Delay BF16 check to worker init for spawn compatibility by @chaojun-zhang in #22979
- [TPU][Bugfix] Fixes prompt_token_ids error in tpu tests. by @patemotter in #23574
- [Docs] Update Documentation of Cohere Command-A Models by @Terrencezzj in #23584
- [Misc] Simplify FlashInfer attention metadata by @WoosukKwon in #23585
- [Misc] Add release note draft to PR template by @simon-mo in #23598
- [CI Fix] Pin deepep and pplx tags in tools/ep_kernels/, gate multigpu tests by @mgoin in #23568
- Update Flashinfer to 0.2.14.post1 by @weireweire in #23537
- [Bug] Fix DeepGEMM Env Control by @yewentao256 in #23591
- [CI/Build] Use vLLM client's user agent to fetch images by @DarkLight1337 in #23561
- Remove graph_pool as member of VllmBackend and argument to CUDAGraphWrapper by @Copilot in #23385
- [Disagg][Perf] Use CUDA event sync instead of blocking `tolist` to avoid unintentional copy ops blocking across different CUDA streams, improving disagg TTIT/TTFT by @liuzijing2014 in #22760
- [CI/Build] Fix typo in #23561 by @DarkLight1337 in #23616
- [fix] fix seed-oss-parser by @FoolPlayer in #23560
- [mypy] Fix incorrect type hint for EAGLE3 support by @DarkLight1337 in #23617
- [Benchmarks] add benchmark for embedding models by @ZJY0516 in #23000
- [Docs] Fix titles for multi-file examples that are rendered in the docs by @hmellor in #23573
- Fix CLI parameter documentation inconsistency in pooling_models.md by @oneraghavan in #23630
- [Bugfix] Fix Qwen25VL packed_modules_mapping by @jeejeelee in #23604
- [Bugfix] Fix scheduling when repeated images in one request by @ywang96 in #23544
- [V1] Enable V1 for compute capability < 8.0 + FP32 by @DarkLight1337 in #23614
- Fix nits from #20059 by @hmellor in #23548
- Fix writing benchmark results with tuple keys by @huydhn in #23633
- [Perf] Remove duplicated NVFP4 blockscales to save memory by @mgoin in #23379
- [Model] fix DeepSeek e_score_correction_bias dtype to fp32 by @jeejeelee in #23640
- [Bugfix] Add missing enable_log_outputs parameter to init_app_state function by @lordmathis in #23634
- feat: add usage to TranscriptionResponse (text and json response_format) by @gcalmettes in #23576
- Support FlashAttention Backend for Hybrid SSM Models by @heheda12345 in #23299
- [Docs] Fix broken links to `docs/api/summary.md` by @hmellor in #23637
- [Hardware][Mac] Fix the installation fail for Apple Silicon (CPU) by @OYE93 in #23565
- [Kernel] Added flashinfer fp8 per-tensor gemms by @nvjullin in #22895
- [Doc]: fix various spelling issues in multiple files by @didier-durand in #23636
- [CPU] add cpu fused moe pytorch native implementation by @TianyuLi0 in #23146
- [ROCm] Starting to add AMD code reviewers for ROCm components by @hongxiayang in #23496
- [Docs] Reduce requirements for docs build by @hmellor in #23651
- [Bugfix] fix bf16 multimodal model hash by @yuekaizhang in #23623
- [model] support qwen2audio embedding input by @yuekaizhang in #23625
- [Misc] Add override for allreduce fusion thresholds by @nvjullin in #23639
- [CI] [Doc]: Add GH Action for auto labeling issues with `rocm` tag by @vllmellm in #20988
- [Bugfix] Fix cuda event usage with CPU model runner by @bigPYJ1151 in #23643
- [Docs] Fix warnings in `mkdocs build` by @Zerohertz in #23649
- [Docs] [V1] [Hybrid] Update docs to remove FlashInfer constraint for hybrid models by @tdoublep in #23665
- [v1] Add cross-attention KV cache support for encoder-decoder models by @russellb in #23664
- [Bugfix] Fix incorrect original shape in hashing by @DarkLight1337 in #23672
- [Misc] Fix comments in `tests/kernels/quantization` by @ZJY0516 in #23675
- [Model] Enable video support for InternVL3.5 models by @Isotr0py in #23658
- [doc] Hybrid KV Cache Manager design doc by @heheda12345 in #22688
- Enhance the pre-notification policy by @sidhpurwala-huzaifa in #23532
- [Docs] Move quant supported hardware table to README by @hmellor in #23663
- [V1][P/D]P2pNcclConnector supports flashinfer by @Abatom in #23536
- [V1] [Hybrid] Enable Full CUDA graph by default for hybrid models in V1 by @tdoublep in #22594
- [Compile] Fix Cmake Warning by @yewentao256 in #23689
- [Bugfix] UnboundLocalError when GptOss reasoning specified by @coval3nte in #23054
- feat: add triton fused moe config for GLM-4.5-Air-FP8 on B200 by @zixuanzhang226 in #23695
- [Feature][Responses API] Support MCP tool in background mode by @wuhang2014 in #23494
- fix pynccl reduce_scatter by @youzhedian in #23648
- [quantization] use channel scales for w4a8 + misc fixes by @czhu-cohere in #23570
- [gpt-oss] Enable unit test for response API harmony integration by @heheda12345 in #23533
- [Bugfix] Lazy import gpt_oss_triton_kernels_moe for mxfp4 by @mgoin in #23678
- [Docs] Fix math rendering in docs by @hmellor in #23676
- [Bugfix][gpt-oss] passing the cache config in gpt-oss by @frank-wei in #23613
- [Bugfix]: Qwen3 Coder Tool Parser by @ranpox in #23099
- [Core] Asynchronous h2d in merge_multimodal_embeddings via pinned memory. by @huachenheli in #23686
- [Model] Add Ernie4.5 VL Model Support by @CSWYF3634076 in #22514
- [Frontend] Add --log-error-stack to print stack trace for error response by @heheda12345 in #22960
- [Frontend] Optimize beam search performance by limiting concurrency by @heheda12345 in #23599
- [Quantization] Expand compressed-tensors MoE matching logic to support NFP4 + FP8 MoEs by @dsikka in #22674
- [XPU] Add xpu torch.compile support by @jikunshang in #22609
- [CI/Build] Remove redundant LoRA model tests by @jeejeelee in #23706
- [Bugfix] fix when config.yaml config value is list parse error by @lengrongfu in #23528
- [Core] Use key-only cache for `BaseMultiModalProcessor` by @DarkLight1337 in #23018
- [XPU]fix cuda event used in XPU model runner by @jikunshang in #23708
- [CI/Build] Remove redundant register in model init tests by @DarkLight1337 in #23715
- [Docs] Fix an admonition important by @windsonsea in #23726
- Optimize input preparation for FlashInfer [2/N] by @WoosukKwon in #23174
- [Misc] Move CpuGpuBuffer to vllm/v1/utils.py by @WoosukKwon in #23728
- [FlashInfer] Cache hyper params in metadata builder by @WoosukKwon in #23732
- [CI/Build] Reduce LoRA layer test cases by @jeejeelee in #23721
- [XPU] Fix OOM issue for data parallel with Ray backend by @faaany in #22500
- [Docs] Fix a 1-2-3 list and style issues in tpu.md by @windsonsea in #23729
- [model] Support MiniCPM-V 4.5 by @tc-mb in #23586
- [Bugfix] Fix task field initialization when PYTHONOPTIMIZE is enabled by @cndoit18 in #23718
- [Misc] Remove unnecessary `_send_reconfig_message()` in `core_client.py` by @njhill in #23127
- [V1] [Hybrid] Disable prefix caching by default for hybrid or mamba-based models by @tdoublep in #23716
- [Model] Explicit `default_pooling_type` interface by @DarkLight1337 in #23736
- Add vLLM Korea Meetup in the README.md and meetups.md by @rebel-hongseok in #23746
- Fix pre-commit on main by @hmellor in #23747
- [Model] Interface to enable batch-level DP support by @DarkLight1337 in #23733
- Only run `get_attr_docs` if generating help text by @hmellor in #23723
- [Feature] Add Hopper DeepGEMM E8M0 for DeepSeekV3.1 scale_fmt by @yewentao256 in #23666
- [Model] Enable native HF format InternVL support by @Isotr0py in #23742
- [Doc]: upgrade version of crate-ci tool for improved typo detection by @didier-durand in #23755
- [LogitsProcs] Deduplicate built-in LP implementation logic by @njhill in #23362
- [Docs] Remove in-tree Gaudi install instructions by @hmellor in #23628
- [BugFix] Fix topk_softmax assert by @ProExpertProg in #19764
- [Model] Merge `SupportsMultiModalWithRawInput` with `SupportsMultiModal` by @DarkLight1337 in #23749
- [V1] [Hybrid] Enable compile and piecewise CUDA graph for MiniMax-Text models by @tdoublep in #22589
- [Docs] Fix warnings in `mkdocs build` (continued) by @Zerohertz in #23743
- ci: Add arm64 docker build to release pipeline by @seemethere in #23210
- Disable `torch.compile` for dynamic rope models in Transformers backend by @hmellor in #23738
- [Multimodal] Generate mm_hash based on request metadata when caching is turned off by @ywang96 in #23690
- [V1][Mamba] - Enable V1 by default for Mamba Models by @Josephasafg in #23650
- DP/EP Support for gpt-oss with deepep-ht comm kernel on SM100 by @zyongye in #23608
- [Bugfix] Fix Marlin NVFP4 for modelopt by @mgoin in #23659
- [Feature] Add `VLLM_DISABLE_PAD_FOR_CUDAGRAPH` to Avoid Hang Issue by @yewentao256 in #23595
- [Bugfix] Fix for V1 priority scheduling crashes at preemption by @Hanchenli in #23713
- Migrate Qwen inputs to TensorSchema by @bbeckca in #23473
- [Feature] models: pass layer prefix to replace_linear_class for per-layer quantization routing. Addresses #23239 by @Shrey1306 in #23556
- [Perf] Tune configs for triton block fp8 gemm H100/H200 by @mgoin in #23748
- Gracefully handle edge cases in harmony utils by @Ithanil in #23155
- [CI] make all multi-gpu weight loading tests run nightly by @killershrimp in #23792
- Add deprecation warning for lora_extra_vocab_size by @ahengljh in #23635
- [Transform] [Quantization] Add transforms to compressed tensors by @kylesayrs in #22486
- [CI] enable idefics3 and fuyu-8b test in multimodal test by @ZJY0516 in #23790
- [Bugfix] when set offline model running error by @lengrongfu in #23711
- [Kernel] cuda kernels for upcoming decode context parallel feature by @youzhedian in #23791
- [New Model]: Support GteNewModelForSequenceClassification by @noooop in #23524
- [Model] Add PP support and VLM backbone compatability for GPT-OSS by @Isotr0py in #23680
- [FIXBUG] Add return_success parameter to moe_wna16_weight_loader function by @JartX in #22797
- [Doc]: fix typos in .md files (including those of #23751) by @didier-durand in #23825
- [CI/Build][Bugfix] Fix Qwen VL tests on CPU by @bigPYJ1151 in #23818
- [BugFix][Spec Decode] Use float64 for uniform_probs by @WoosukKwon in #23803
- [Model] [gpt-oss] fix gpt-oss pp support by @ZJY0516 in #23815
- [Doc]: fix typos in Python scripts by @didier-durand in #23828
- [Bugfix] Fix benchmark_moe.py for blockwise fp8. by @crischeng in #23823
- [CI] Fix linting error on main by @tdoublep in #23835
- [Model][gpt-oss] Support DP+EP for GPT-OSS with FlashInfer trtllm-gen MoE by @nvpohanh in #23819
- [Bugfix] Add fake mode around passes by @angelayi in #23349
- [ci] breaks down V1 Test into 3 groups of approx 30 minutes runtime by @jeanschmidt in #23757
- Add scale_config.yml file for Meta autoscalers for GH Actions by @jeanschmidt in #23840
- Migrate Llama4ImagePatchInputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/22021
- [ROCm][Aiter] Add triton fp8 bmm kernel for mla by @divakar-amd in https://github.com/vllm-project/vllm/pull/23264
- [bugfix] [spec-decoding] fix data race in sample_recovered_tokens_kernel (vLLM v1) by @He-Jingkai in https://github.com/vllm-project/vllm/pull/23829
- [NVIDIA] Support SiluMul + NVFP4 quant fusion by @elvischenv in https://github.com/vllm-project/vllm/pull/23671
- chore: build release image by default by @simon-mo in https://github.com/vllm-project/vllm/pull/23852
- [BugFix][FlashInfer] Fix potential race condition for paged_kv_indptr_cpu by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23737
- [V1] Enable prefill optimization for Gemma3n by @sarckk in https://github.com/vllm-project/vllm/pull/22628
- [Log] Use Debug Once for DeepGEMM E8M0 When not Enabled by @yewentao256 in https://github.com/vllm-project/vllm/pull/23858
- [V0 Deprecation] Remove V0 Samplers test by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23862
- [XPU] support data parallel for MoE models on XPU by @chaojun-zhang in https://github.com/vllm-project/vllm/pull/22887
- [Models] Improve iteration over layers by @lgeiger in https://github.com/vllm-project/vllm/pull/19497
- [ROCm][Fix] Fix rocm build caused by #23791 by @charlifu in https://github.com/vllm-project/vllm/pull/23847
- [tests] Improve speed and reliability of test_transcription_api_correctness by @russellb in https://github.com/vllm-project/vllm/pull/23854
- [Bugfix] Use `ReplicatedLinear` for SequenceClassification head by @Isotr0py in https://github.com/vllm-project/vllm/pull/23836
- [BugFix][AMD][Deepseek] fix a dtype mismatch error for deepseek running on AMD by @KingsleyZhang123 in https://github.com/vllm-project/vllm/pull/23864
- [Platform] import activation_quant_fusion for CUDA only by @wangxiyuan in https://github.com/vllm-project/vllm/pull/23882
- Fix(async): Add support for truncate_prompt_tokens in AsyncLLM by @oneraghavan in https://github.com/vllm-project/vllm/pull/23800
- [CI/Build] Clean up LoRA test by @jeejeelee in https://github.com/vllm-project/vllm/pull/23890
- [mrope][Qwen2-VL] Fix edge case where getting index of image/video token can potentially throw in default vl mrope implementation. by @huachenheli in https://github.com/vllm-project/vllm/pull/23895
- [Misc] Fix warnings for mistral model by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23552
- Better errors for Transformers backend missing features by @hmellor in https://github.com/vllm-project/vllm/pull/23759
- [V0 Deprecation] Remove pooling model support in V0 by @maxdebayser in https://github.com/vllm-project/vllm/pull/23434
- [CPU] Enable data parallel for CPU backend by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/23903
- [Performance] V1 Classify Models E2E Performance Optimization by @noooop in https://github.com/vllm-project/vllm/pull/23541
- [Multimodal] Consolidate mm inputs into MultiModalFeatureSpec by @sfeng33 in https://github.com/vllm-project/vllm/pull/23779
- Update PyTorch to 2.8.0 by @huydhn in https://github.com/vllm-project/vllm/pull/20358
- Adds `json_count_leaves` utility function by @aditchawdhary in https://github.com/vllm-project/vllm/pull/23899
- [MODEL] `Apertus` and `XIELU` by @EduardDurech in https://github.com/vllm-project/vllm/pull/23068
- [Models] Use in-place adds in Idefics2Vision by @lgeiger in https://github.com/vllm-project/vllm/pull/23932
- [BugFix] Async scheduling and PP compatibility with DP by @njhill in https://github.com/vllm-project/vllm/pull/23770
- [CI] Add `aiter` to matching list of issue auto labeller for `rocm` tag by @vllmellm in https://github.com/vllm-project/vllm/pull/23942
- [BUGFIX ] fix undefined silu_and_mul_nvfp4_quant by @youzhedian in https://github.com/vllm-project/vllm/pull/23929
- [RL][BugFix] Fix missing tokenizer error for token-in-token-out by @22quinn in https://github.com/vllm-project/vllm/pull/23904
- Tuned H100/H200 triton fp8 block configs for fused_qkv_a_proj by @mgoin in https://github.com/vllm-project/vllm/pull/23939
- [Docs] [V1] [Hybrid] Add new documentation re: contributing mamba-based models by @tdoublep in https://github.com/vllm-project/vllm/pull/23824
- Revert gemma3n fast prefill changes by @sarckk in https://github.com/vllm-project/vllm/pull/23897
- [Misc] Make `download_weights_from_hf` more reliable by @hmellor in https://github.com/vllm-project/vllm/pull/23863
- [CI] Fix unavailable image remote URL by @ywang96 in https://github.com/vllm-project/vllm/pull/23966
- [Bugfix] Fix --config arg expansion called from api_server.py by @dubejf in https://github.com/vllm-project/vllm/pull/23944
- Add routed_scaling_factor to MoE grouped topk by @xyang16 in https://github.com/vllm-project/vllm/pull/23123
- [CI] Move testing image from remote URL to S3 by @ywang96 in https://github.com/vllm-project/vllm/pull/23980
- [CI] Fix broken compile tests due to unsupported SiluMul+Nvfp4Quant fusion by @sarckk in https://github.com/vllm-project/vllm/pull/23973
- [Core] Cleanup TPU model runner for MM by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23894
- [V1] [Hybrid] Move MiniMaxLinearAttention into layers/mamba by @tdoublep in https://github.com/vllm-project/vllm/pull/23831
- [Bugfix] Fix test_lora_resolvers.py by @jeejeelee in https://github.com/vllm-project/vllm/pull/23984
- [UT] fix unify_kv_cache_configs when kv cache config needs sort by @andyxning in https://github.com/vllm-project/vllm/pull/23843
- [Model] Enable encoder DP for MiniCPM-V by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23948
- Add LoRA support for DeepSeek models (V2, V3, R1-0528) by @sadeghja1070 in https://github.com/vllm-project/vllm/pull/23971
- [Misc] add reorder_batch AttentionMetadataBuilder by @andyxning in https://github.com/vllm-project/vllm/pull/23798
- [Refactor] refactor freezing_value/cuda_event initialize outside try finally by @andyxning in https://github.com/vllm-project/vllm/pull/23758
- [Misc] enhance type hint for rearrange return value by @andyxning in https://github.com/vllm-project/vllm/pull/23519
- [LoRA] Much faster startup when LoRA is enabled by @andylolu2 in https://github.com/vllm-project/vllm/pull/23777
- Fix wrong truncate_prompt_tokens type hint by @gmarinho2 in https://github.com/vllm-project/vllm/pull/22761
- [Core][Multimodal] Allow passing `multi_modal_uuids` as multimodal identifiers. by @ywang96 in https://github.com/vllm-project/vllm/pull/23394
- [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24001
- vllm fix check on max vocab size by @xw285cornell in https://github.com/vllm-project/vllm/pull/22471
- [Minor] Fix some random typos in comments by @njhill in https://github.com/vllm-project/vllm/pull/24009
- v1: Support KV events from connectors by @orozery in https://github.com/vllm-project/vllm/pull/19737
- [BUGFIX] GPTQ quantization compatibility for Qwen3 MOE models (AutoGPTQ and AutoRound-GPTQ) by @JartX in https://github.com/vllm-project/vllm/pull/23994
- [Misc] Avoid redundant copy for encoder-only models by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24012
- Fix the bug related to loading GPTP INT3 weights. by @Jun-Howie in https://github.com/vllm-project/vllm/pull/23328
- [Misc] Move fast prefill logic to separate method by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24013
- [CI/Build] Improve Tensor Schema tests speed by avoid engine core initialization by @Isotr0py in https://github.com/vllm-project/vllm/pull/23357
- [Misc] refactor code by import as for torch._inductor.config by @andyxning in https://github.com/vllm-project/vllm/pull/23677
- Migrate Phi4 inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23471
- [Misc] IO Processor plugins for pooling models by @christian-pinto in https://github.com/vllm-project/vllm/pull/22820
- [Bugfix] Add support for `<tool_call>` format in streaming mode for XLAM Tool Parser by @DevonPeroutky in https://github.com/vllm-project/vllm/pull/22769
- [Misc] add hash_function doc string by @andyxning in https://github.com/vllm-project/vllm/pull/24014
- [Misc] Enable V1 FP16 inference on pre-Ampere GPUs by @Isotr0py in https://github.com/vllm-project/vllm/pull/24022
- [Frontend] Update the warning log when using VLLM_ALLOW_LONG_MAX_MODEL_LEN by @noooop in https://github.com/vllm-project/vllm/pull/20904
- [Kernel] Update DeepGEMM to latest commit by @jeejeelee in https://github.com/vllm-project/vllm/pull/23915
- [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24026
- [Frontend] Gemma3n audio `transcriptions`/`translations` endpoint by @NickLucche in https://github.com/vllm-project/vllm/pull/23735
- [Doc]: Fix CPU install docs: force torch-backend=cpu to avoid GPU torchvision errors by @yankay in https://github.com/vllm-project/vllm/pull/24033
- [Model]: support KeyeVL-1_5-8B by @Kwai-Keye in https://github.com/vllm-project/vllm/pull/23838
- Document multi-proc method selection for profiling by @hypdeb in https://github.com/vllm-project/vllm/pull/23802
- [Misc] Minor code simplification for spec decode by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24053
- [docs][misc] IOProcessor plugins fixes by @christian-pinto in https://github.com/vllm-project/vllm/pull/24046
- [Model] Support DP for ViT on Kimi-VL-A3B-Thinking-2506 by @david6666666 in https://github.com/vllm-project/vllm/pull/23817
- [Chore][V0 Deprecation] Move LogProb to a separate file by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24055
- [bugfix]fix MTP hidden states by @luccafong in https://github.com/vllm-project/vllm/pull/24056
- [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24042
- [V1][Mamba1] - FP32 SSM Kernel Support by @Josephasafg in https://github.com/vllm-project/vllm/pull/23506
- [Bugfix] Fix the issue that Blip2ForConditionalGeneration' object has… by @DamonJiang777 in https://github.com/vllm-project/vllm/pull/24028
- Remove runtime checks based on pooling params by @maxdebayser in https://github.com/vllm-project/vllm/pull/24051
- Migrate OvisImagePatchInputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/22024
- [XPU][Feature] fp8 online quantization support for XPU by @yma11 in https://github.com/vllm-project/vllm/pull/23148
- Migrate Interns1 inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23510
- [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24077
- [Model] Support dp on ViT on GLM-4.5V by @david6666666 in https://github.com/vllm-project/vllm/pull/23168
- [CI]: reduce HTTP calls inside entrypoints openai tests by @AzizCode92 in https://github.com/vllm-project/vllm/pull/23646
- correct LWS deployment yaml by @cberge908 in https://github.com/vllm-project/vllm/pull/23104
- [Gemma3n] Fix audio batching by @NickLucche in https://github.com/vllm-project/vllm/pull/24052
- [BugFix] Fix EXAONE4 rotary embeddings by @lkm2835 in https://github.com/vllm-project/vllm/pull/23918
- [Model] Classification models support logit_bias / sigmoid_normalize by @noooop in https://github.com/vllm-project/vllm/pull/24031
- [CI Failure] Skip failing nvfp4 silu test by @mgoin in https://github.com/vllm-project/vllm/pull/23959
- [docs] add SYS_NICE cap & `security-opt` for docker/k8s by @panpan0000 in https://github.com/vllm-project/vllm/pull/24017
- [Benchmark] Add support for local hf dataset path in benchmark by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23999
- [Bugfix] Fix transform_config parsing in Compressed Tensors by @kylesayrs in https://github.com/vllm-project/vllm/pull/23945
- Run ruff format on a few files. by @huachenheli in https://github.com/vllm-project/vllm/pull/24075
- [Bugfix] Fix packed_factor missing attribute error by @kyuyeunk in https://github.com/vllm-project/vllm/pull/23902
- [Metrics] Deprecate TPOT in favor of ITL by @markmc in https://github.com/vllm-project/vllm/pull/24110
- Fix weights loading for Apertus by @nathanrchn in https://github.com/vllm-project/vllm/pull/24100
- [Log] Only Print Profiler Results on Rank 0 by @yewentao256 in https://github.com/vllm-project/vllm/pull/23370
- [CI] Enable all hf transformers baselines in test_hybrid by @tdoublep in https://github.com/vllm-project/vllm/pull/23936
- [AMD][Kernel][Bugfix] Cast offsets tensor bn to tl.int64 to avoid GPU segfault by @rasmith in https://github.com/vllm-project/vllm/pull/23692
- [Bug] R1 Accuracy: Fix `routed_scaling_factor` Double Mul Issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/24119
- [CI/Build] Disable SiluMul NVFP4 quant fusion tests by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/24121
- [XPU] Fix the bug of LoRA logits on the XPU platform by @chaojun-zhang in https://github.com/vllm-project/vllm/pull/24081
- Update release pipeline post PyTorch 2.8.0 update by @youkaichao in https://github.com/vllm-project/vllm/pull/24073
- Upgrade xgrammar to 0.1.23 by @russellb in https://github.com/vllm-project/vllm/pull/22988
- [V1] Wrapper which plumbs request-level logits processors into vLLM batch-level logits processing by @afeldman-nm in https://github.com/vllm-project/vllm/pull/23656
- fix some typos by @co63oc in https://github.com/vllm-project/vllm/pull/24071
- [Compile] Fix Compile Warning for `w4a8_mm_entry.cu` by @yewentao256 in https://github.com/vllm-project/vllm/pull/23660
- [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24093
- [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24115
- [Misc] Add check for dual_chunk_attention by @ZJY0516 in https://github.com/vllm-project/vllm/pull/24070
- [BugFix] Fix routed_scaling_factor double mul for dots1 and glm4 MoE models by @sarckk in https://github.com/vllm-project/vllm/pull/24132
- [distributed][rl] remove nccl cumem env var override by @youkaichao in https://github.com/vllm-project/vllm/pull/24141
- [Nixl] Heterogeneous TP support FlashInfer by @NickLucche in https://github.com/vllm-project/vllm/pull/20189
- [CI/Build] Serve images used by multimodal tests through local HTTP Server by @divyanshsinghvi in https://github.com/vllm-project/vllm/pull/23907
- [Misc] Clean up deadcode for legacy processing pipeline by @Isotr0py in https://github.com/vllm-project/vllm/pull/24153
- [CI] Accelerate mteb test by setting SentenceTransformers mteb score to a constant by @noooop in https://github.com/vllm-project/vllm/pull/24088
- Support add_generation_prompt in embeddings endpoint with chat request by @biba10 in https://github.com/vllm-project/vllm/pull/23931
- Fix MiniMax attention module prefix and remove useless code by @qscqesze in https://github.com/vllm-project/vllm/pull/23982
- FIX: Add libnuma-dev to Dockerfile for dev stage by @dongbo910220 in https://github.com/vllm-project/vllm/pull/20388
- [Bugfix] Fixing division by zero in triton_attn if query_heads/kv_heads > 16 by @bringlein in https://github.com/vllm-project/vllm/pull/23424
- [V1] v1 engine + full CUDA graph support for PLaMo2 by @nopperl in https://github.com/vllm-project/vllm/pull/23998
- [Kernels] Overlap shared experts with send/recv by @bnellnm in https://github.com/vllm-project/vllm/pull/23273
- Migrate whisper inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23505
- [Attention] Blackwell FP8 MLA support with CUTLASS_MLA backend by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/23289
- [Feature][P/D]: Optimize NIXL Connector xfer Launch by @david6666666 in https://github.com/vllm-project/vllm/pull/23887
- [Bugfix][DP] DP distribution does not require ray[default] by @kebe7jun in https://github.com/vllm-project/vllm/pull/23822
- [Feature][gpt-oss] Add support for num_cached_tokens and num_reasoning_tokens tracking by @NagyGeorge in https://github.com/vllm-project/vllm/pull/23460
- Remove deprecated `PyNcclConnector` by @panpan0000 in https://github.com/vllm-project/vllm/pull/24151
- [Feature][Responses API]Support MCP tools with streaming mode + background mode by @wuhang2014 in https://github.com/vllm-project/vllm/pull/23927
- [Kernel][Bugfix] Fix grouped topk cu by @mayuyuace in https://github.com/vllm-project/vllm/pull/24146
- [Refactor] Introduce basic Renderer for completion-style request by @sfeng33 in https://github.com/vllm-project/vllm/pull/24010
- Migrate ultravox inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23503
- [CPU] Refactor CPU unquantized linear by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/24150
- [Misc] Enhance output readability of helper script by @wdhongtw in https://github.com/vllm-project/vllm/pull/24214
- [Model] Add MiDashengLM model support by @bingchen-mi in https://github.com/vllm-project/vllm/pull/23652
- [Core][Model] Terratorch backend integration by @mgazz in https://github.com/vllm-project/vllm/pull/23513
- Improve flexibility of auto_tune.sh execution. by @anthonsu in https://github.com/vllm-project/vllm/pull/23766
- [Attention][Platform] Refactor MLA to support Custom Op by @whx-sjtu in https://github.com/vllm-project/vllm/pull/23332
- [Bugfix] Fix Incremental Detokenization with `tokenizers == 0.22.0` by @faaany in https://github.com/vllm-project/vllm/pull/24159
- [Attention] FlashAttn MLA by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/14258
- [Hardware][Apple-CPU] Disable OneDNN build for Apple Silicon by @ignaciosica in https://github.com/vllm-project/vllm/pull/24200
- [Feature][Response API] Add streaming support for non-harmony by @kebe7jun in https://github.com/vllm-project/vllm/pull/23741
- [Doc] Update vLLM Singapore Meetup info by @tjtanaa in https://github.com/vllm-project/vllm/pull/24234
- [Model] Add pp support for hunyuan by @ZJY0516 in https://github.com/vllm-project/vllm/pull/24212
- Use hidden_size_per_head as head_size fallback by @nopperl in https://github.com/vllm-project/vllm/pull/24221
- [XPU] support Triton Attention backend on Intel GPU by @jikunshang in https://github.com/vllm-project/vllm/pull/24149
- [LoRA]: Add lora support to qwen-2.5-omni by @pratapyash in https://github.com/vllm-project/vllm/pull/24231
- [Misc] Removed force_fp8_e4m3fnuz from FP8LinearOp by @nvjullin in https://github.com/vllm-project/vllm/pull/23725
- [Perf] Freeze core engine proc heap after init by @njhill in https://github.com/vllm-project/vllm/pull/24008
- [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24173
- [Misc] Slight improve deepgemm print by @jeejeelee in https://github.com/vllm-project/vllm/pull/24085
- Upgrade FlashInfer to v0.3.0 by @nvpohanh in https://github.com/vllm-project/vllm/pull/24086
- QWEN3 Coder Fused MoE kernels Optimization configs by @samanamp in https://github.com/vllm-project/vllm/pull/24266
- [Misc] Have AsyncLLM `custom_stat_loggers` extend default logger list by @eicherseiji in https://github.com/vllm-project/vllm/pull/20952
- [Bugfix][Misc] Fix silu_and_mul_nvfp4_quant issue and extract common utils for nvfp4 kernel source files by @elvischenv in https://github.com/vllm-project/vllm/pull/23727
- [CI/Build] Reduce the number of redundant cases to test for LoRA by @zhuohan123 in https://github.com/vllm-project/vllm/pull/24276
- [Frontend] Skip unnecessary detokenization when token_id is requested by @NickLucche in https://github.com/vllm-project/vllm/pull/24236
- [gpt-oss] tool parser supports for /chat/completions [1/n] by @aarnphm in https://github.com/vllm-project/vllm/pull/22386
- [XPU][P/D] Add XPU support in NixlConnector by @zhenwei-intel in https://github.com/vllm-project/vllm/pull/22436
- Adding int4 and int8 models for CPU benchmarking by @louie-tsai in https://github.com/vllm-project/vllm/pull/23709
- [docs] add shenzhen meetup by @youkaichao in https://github.com/vllm-project/vllm/pull/24326
- [gpt-oss][Bugfix]Fix streamableparser for missing handling of certain token_ids by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24306
- [Bugfix] Fix silu_mul+quant fusion test by @elvischenv in https://github.com/vllm-project/vllm/pull/24341
- [RFC] allow cancelation after shutdown in blocking collective_rpc by @842974287 in https://github.com/vllm-project/vllm/pull/23390
- [CI] Add timeouts to tests by @rafvasq in https://github.com/vllm-project/vllm/pull/24260
- [Perf][V1] Fully overlap model execution by @benchislett in https://github.com/vllm-project/vllm/pull/23569
- Add @22quinn as code reviewer for RL related components by @22quinn in https://github.com/vllm-project/vllm/pull/24346
- [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24294
- [KV Sharing] Raise error if using eagle with fast prefill by @sarckk in https://github.com/vllm-project/vllm/pull/24350
- [Feature] Support Decode Context Parallel (DCP) for MLA by @youzhedian in https://github.com/vllm-project/vllm/pull/23734
- [Bugfix] Catch and log invalid token ids in detokenizer by @njhill in https://github.com/vllm-project/vllm/pull/24351
- [Core] Allow disabling TP sharding for parallel Linear layer by @Isotr0py in https://github.com/vllm-project/vllm/pull/23024
- [New Model]: google/embeddinggemma-300m by @noooop in https://github.com/vllm-project/vllm/pull/24318
- refactor: Turn GPUModelRunner.inputs_embeds to a CpuGpuBuffer by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/24345
- [Multimodal] Improve max video embedding length estimation in V1 by @ywang96 in https://github.com/vllm-project/vllm/pull/24312
- [CI] Disable flaky structured output test from CI by @ywang96 in https://github.com/vllm-project/vllm/pull/24366
- Add @benchislett to codeowner for spec decode and structured outputs by @benchislett in https://github.com/vllm-project/vllm/pull/24362
- [Bugfix] Avoid uninitialized usage of azp_val when AZP is false. by @mohankku in https://github.com/vllm-project/vllm/pull/24335
- [Bugfix] Fix broken deepseek fp8 TP weights loading by @Isotr0py in https://github.com/vllm-project/vllm/pull/24367
- [Bugfix] Fix test_mixtral_moe by @jeejeelee in https://github.com/vllm-project/vllm/pull/24371
- LoRA bias (enable_lora_bias) deprecation warning by @ashwin-phadke in https://github.com/vllm-project/vllm/pull/24339
- [Fix] [gpt-oss] fix non-tool calling path for chat completion by @aarnphm in https://github.com/vllm-project/vllm/pull/24324
- [Frontend][Responses API] Support reporting tool output tokens and fix reasoning token count by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24285
- [Bugfix] Fix unstable silu_mul+nvfp4 quant fusion test by @elvischenv in https://github.com/vllm-project/vllm/pull/24370
- break execute_model in gpu_model_runner into sub-functions for custom scopes by @bangshengtang in https://github.com/vllm-project/vllm/pull/24265
- [V0 deprecation] Deprecate V0 Neuron backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/21159
- [attention][DCP] use AttentionImpl.need_to_return_lse_for_decode by @youkaichao in https://github.com/vllm-project/vllm/pull/24372
- Migrate Qwen2 inputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/23475
- [CI][Fix] deterministic seed for flaky CI runs on structured outputs by @aarnphm in https://github.com/vllm-project/vllm/pull/24380
- [Benchmark] add benchmark for custom activation op by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23908
- QWEN3 Thinking Fused MoE kernels Optimization configs by @samanamp in https://github.com/vllm-project/vllm/pull/24330
- [Misc] collect flashinfer version in collect_env.py by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24378
- [Bugfix] Fix Qwen3-coder moe tuned config by @jeejeelee in https://github.com/vllm-project/vllm/pull/24072
- [TPU] Remove TopKTopPSampler dependency for TPU sampler by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24391
- Add renderer-based prompt processing for embedding and classification endpoints by @sfeng33 in https://github.com/vllm-project/vllm/pull/24356
- Skip MM Encoder for non-first PP ranks by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24387
- Add @luccafong to codeowner for spec decode by @luccafong in https://github.com/vllm-project/vllm/pull/24397
- [Kernel] Support decode context parallelism on Blackwell with CUTLASS MLA by @minosfuture in https://github.com/vllm-project/vllm/pull/24385
- [xpu] upgrade ipex/python3.12 for xpu by @yma11 in https://github.com/vllm-project/vllm/pull/23830
- [Sampler] Support returning all prompt logprobs by @charlotte12l in https://github.com/vllm-project/vllm/pull/23868
- [CI/Build] Disable flaky test_structured_output tests by @22quinn in https://github.com/vllm-project/vllm/pull/24404
- [CI/Build] Fix local image inputs in test_pixtral.py by @huachenheli in https://github.com/vllm-project/vllm/pull/24401
- [Doc] Fix UTF-8 encoding issues in documentation generation on Windows by @alhridoy in https://github.com/vllm-project/vllm/pull/24361
- [P/D] Add a shutdown method to the Connector API by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/22699
- [Model] Remove unnecessary CUDA sync of GLM-4.1V image and video preprocess by @what-in-the-nim in https://github.com/vllm-project/vllm/pull/24332
- [Model] Remove unnecessary CUDA sync of Qwen2VL image and video preprocess by @what-in-the-nim in https://github.com/vllm-project/vllm/pull/24334
- [gpt-oss][Responses API] Fix the function call id format by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24409
- [Docs] Fix a tip indentation and typo by @windsonsea in https://github.com/vllm-project/vllm/pull/24419
- [Doc]: fix typos in Python comments by @didier-durand in https://github.com/vllm-project/vllm/pull/24417
- [Doc] Fix issues in integrations/llamastack.md by @windsonsea in https://github.com/vllm-project/vllm/pull/24428
- [Bugfix] Fix get_quant_config when using modelscope by @Potabk in https://github.com/vllm-project/vllm/pull/24421
- [Bugfix] Fix mamba2 prefill chunking by @tomeras91 in https://github.com/vllm-project/vllm/pull/23279
- [Misc] Terratorch related fixes by @christian-pinto in https://github.com/vllm-project/vllm/pull/24337
- Move `KVEventsConfig` from `config/__init__.py` to `config/kv_events.py` by @hmellor in https://github.com/vllm-project/vllm/pull/24433
- [Frontend] User-provided uuids for medias in chat. (RFC #22044) by @huachenheli in https://github.com/vllm-project/vllm/pull/23449
- [Docs] Move feature compatibility tables to README by @hmellor in https://github.com/vllm-project/vllm/pull/24431
- [Doc]: fix 2 hyperlinks leading to Ray site after they changed Ray's doc structure by @didier-durand in https://github.com/vllm-project/vllm/pull/24438
- [Docs] Add eplb_config parameter usage docs by @lengrongfu in https://github.com/vllm-project/vllm/pull/24213
- [Model] Enable BNB support for qwen2_5_omni_thinker by @jeejeelee in https://github.com/vllm-project/vllm/pull/24420
- [Spec Decode][Benchmark] Add Spec Bench Dataset for benchmarking by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/23563
- [Spec Decode][Benchmark] Add Blitzedit dataset by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/23605
- [Model] Remove quantized mixtral by @jeejeelee in https://github.com/vllm-project/vllm/pull/24437
- [CI] Enable encoder model compilation test by @ZJY0516 in https://github.com/vllm-project/vllm/pull/24442
- [Model loader]: support multi-thread model weight loading by @BraveY in https://github.com/vllm-project/vllm/pull/23928
- [Spec Decode] Fix offline spec_decode.py by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/24257
- [Attention] FlashAttention MLA cudagraph support by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/23958
- [Bugfix] Disable the statslogger if the api_server_count is greater than 1 by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/22227
- [Hardware][IBM Z] Fix Outlines Core issue for s390x by @R3hankhan123 in https://github.com/vllm-project/vllm/pull/24034
- [CI] Add nightly multiarch manifests to dockerhub by @csahithi in https://github.com/vllm-project/vllm/pull/24102
- Update reviewers for modelopt related files by @Edwardf0t1 in https://github.com/vllm-project/vllm/pull/24468
- [Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and EP MoE by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/24134
- [gpt-oss] Harmony changes with container tool support by @morgendave in https://github.com/vllm-project/vllm/pull/23386
- Bump actions/setup-python from 5.4.0 to 6.0.0 by @dependabot[bot] in https://github.com/vllm-project/vllm/pull/24414
- [doc] update `vllm serve` cli args documentation by @cjackal in https://github.com/vllm-project/vllm/pull/24329
- Bump actions/stale from 9.1.0 to 10.0.0 by @dependabot[bot] in https://github.com/vllm-project/vllm/pull/24412
- Bump actions/github-script from 7.0.1 to 8.0.0 by @dependabot[bot] in https://github.com/vllm-project/vllm/pull/24413
- Move `KVTransferConfig` from `config/__init__.py` to `config/kv_transfer.py` by @hmellor in https://github.com/vllm-project/vllm/pull/24434
- [BugFix][Model] Fix Ernie4.5-VL hanging on long inputs by @CSWYF3634076 in https://github.com/vllm-project/vllm/pull/24074
- [Flashinfer] Support Flashinfer TRTLLM FP8-qkv BF16/FP16-out Attention Kernel by @elvischenv in https://github.com/vllm-project/vllm/pull/23647
- [Core] Use sha256 bytes instead of BlockHash to reduce GC overhead by @linzebing in https://github.com/vllm-project/vllm/pull/23673
- Add data_parallel_size to VllmConfig string representation by @Prowindy in https://github.com/vllm-project/vllm/pull/24298
- [Bugfix] Fix Apertus HF repo name by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/24447
- [Misc] Improve Worker process title and logging prefix by @22quinn in https://github.com/vllm-project/vllm/pull/22205
- [Doc] mention fpdb for multiprocess breakpoints by @mickaelseznec in https://github.com/vllm-project/vllm/pull/24452
- [Misc] Support bench serve long context by @minosfuture in https://github.com/vllm-project/vllm/pull/24373
- [Doc]: fixing typos to improve docs by @didier-durand in https://github.com/vllm-project/vllm/pull/24480
- [Performance][MM] Building the inverse permutation in O(n) time in Qwen2_5_VisionTransformer by @david6666666 in https://github.com/vllm-project/vllm/pull/24443
- [Misc] Add claude settings to gitignore by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24492
- [Misc] Add Codex settings to gitignore by @ywang96 in https://github.com/vllm-project/vllm/pull/24493
- [gpt-oss] Validate gpt-oss python tool during initialization by @heheda12345 in https://github.com/vllm-project/vllm/pull/23856
- [RL] fast weight update with zmq + ipc handles by @weixiao-huang in https://github.com/vllm-project/vllm/pull/24295
- [CI/Build][Doc] Fully deprecate old bench scripts for serving / throughput / latency by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24411
- [Compilation][WideEP] Enable Piecewise CUDAGraph for DeepEPHT by @yewentao256 in https://github.com/vllm-project/vllm/pull/24123
- [Model] Systematic support for fp32 head, pooling models part by @noooop in https://github.com/vllm-project/vllm/pull/23810
- [Bugfix] Handle the edge case in detokenizer where processed tokens contain both `stop` str and `eos` token by @dtransposed in https://github.com/vllm-project/vllm/pull/23938
- [Core] Run garbage collector after CUDA graph capture to fix throughput regression by @micah-wil in https://github.com/vllm-project/vllm/pull/24128
- [Kernels] Add Flash Linear Attention Kernels by @youkaichao in https://github.com/vllm-project/vllm/pull/24518
- [ROCm][CI/Build] Sync ROCm dockerfiles with the ROCm fork by @gshtras in https://github.com/vllm-project/vllm/pull/24279
- [Bugfix] Fix hidden_size for multimodal classification model by @jeejeelee in https://github.com/vllm-project/vllm/pull/24501
- Extend renderer with embedding support and integrate completion endpoint by @sfeng33 in https://github.com/vllm-project/vllm/pull/24405
- [Misc] bump outlines_core to fix the version conflicts with outlines >= 1.2.0 by @serihiro in https://github.com/vllm-project/vllm/pull/24368
- [Docs] Gemma3n `transcriptions` endpoint support by @NickLucche in https://github.com/vllm-project/vllm/pull/24512
- [TPU] Fix tpu structured decoding in mixed batches by @Chenyaaang in https://github.com/vllm-project/vllm/pull/24458
- [CI] execute all piecewise compilation tests together by @ZJY0516 in https://github.com/vllm-project/vllm/pull/24502
- [Feature] Disallow FlashMLA on Blackwell by @yewentao256 in https://github.com/vllm-project/vllm/pull/24521
- [Log] Use a relative path in debug-level logs to distinguish files with identical names by @ZJY0516 in https://github.com/vllm-project/vllm/pull/23846
- [Benchmark] Update bench doc with mtbench, blazedit, spec bench by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/24450
- [Benchmark] Add option to skip oversampling in benchmark by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/24457
- [ROCm][Feature] Enable Pipeline Parallelism with Ray Compiled Graph on ROCm by @charlifu in https://github.com/vllm-project/vllm/pull/24275
- [Bugfix] Improve EPLB config validation error message by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/24524
- [Bugfix] Fix for #24530: fix naive all2all shared expert overlap by @bnellnm in https://github.com/vllm-project/vllm/pull/24538
- [Perf] Convert np array to torch tensor to index into block table for attn chunking by @sarckk in https://github.com/vllm-project/vllm/pull/24474
- Add @heheda12345 to CODEOWNERS of KVCacheManager related code by @heheda12345 in https://github.com/vllm-project/vllm/pull/24546
- [CI] Retry flaky fp8 cutlass mla tests by @njhill in https://github.com/vllm-project/vllm/pull/24536
- [Hardware][Apple-CPU] Enable native bfloat16 on Apple Silicon (M2 and later) by @ignaciosica in https://github.com/vllm-project/vllm/pull/24129
- [BugFix] Fix async core engine client finalizer by @njhill in https://github.com/vllm-project/vllm/pull/24540
- [CI] Adjust threshold for flaky ngram spec decoding test by @njhill in https://github.com/vllm-project/vllm/pull/24528
- [KV Connector] More async support for `get_num_new_matched_tokens` by @ApostaC in https://github.com/vllm-project/vllm/pull/23620
- [P/D] MultiConnector supports shutdown by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24425
- [BugFix][Spec Decode] Fix out-of-range index triggered by eagle3; re-enable test for LlamaForCausalLMEagle3 by @wwl2755 in https://github.com/vllm-project/vllm/pull/24392
- [gpt-oss] Cache permute indices for faster MXFP4 MoE layer loading by @frank-wei in https://github.com/vllm-project/vllm/pull/24154
- [Core] Simplify and unify mm uuid handling & auto-generated mm hash overrides processing. by @huachenheli in https://github.com/vllm-project/vllm/pull/24271
- [Bugfix] Update Run:AI Model Streamer Loading Integration by @pwschuurman in https://github.com/vllm-project/vllm/pull/23845
- [Docs] Enable relative links in examples to function when rendered in the docs by @hmellor in https://github.com/vllm-project/vllm/pull/24041
- [docs] promo pytorch conf and ray summit by @simon-mo in https://github.com/vllm-project/vllm/pull/24562
- [Bugfix] Guard `_may_reorder_batch` for encoder-only models on CPU (#24319) by @comsky in https://github.com/vllm-project/vllm/pull/24348
- Consolidate rendering parameters into RenderConfig dataclass by @sfeng33 in https://github.com/vllm-project/vllm/pull/24543
- [Model] Limit CPU threads for image transformations in InternVL to reduce cpu contention. by @li-jinpeng in https://github.com/vllm-project/vllm/pull/24519
- [Attention] add DCP support for FLASH_ATTN_MLA backend by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/24453
- [ROCm][Bugfix] Fix Aiter RMSNorm by @vllmellm in https://github.com/vllm-project/vllm/pull/23412
- [Docs] Improve organisation of API Reference nav by @hmellor in https://github.com/vllm-project/vllm/pull/24569
- [Docs] Document the extra memory footprint overhead when using EPLB by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/24537
- Support for NemotronH Nano VLM by @danielafrimi in https://github.com/vllm-project/vllm/pull/23644
- Feature: ViT attention unification (#23880) by @baonudesifeizhai in https://github.com/vllm-project/vllm/pull/23978
- [LoRA]: Add LoRA support to Mistral's Voxtral models by @pratapyash in https://github.com/vllm-project/vllm/pull/24517
- Move `LoadConfig` from `config/__init__.py` to `config/load.py` by @hmellor in https://github.com/vllm-project/vllm/pull/24566
- [BugFix][Multi Modal] Fix TensorSchema shape mismatch in Molmo by @wwl2755 in https://github.com/vllm-project/vllm/pull/24559
- [BugFix][easy] Fix flaky test test_gpt_oss_multi_turn_chat by @lacora in https://github.com/vllm-project/vllm/pull/24549
- [BugFix] Ensure integrity of reused CPU tensors during async scheduling by @njhill in https://github.com/vllm-project/vllm/pull/24527
- [CI/Build] split true unit tests to Entrypoints Unit Tests by @yeqcharlotte in https://github.com/vllm-project/vllm/pull/24418
- [rocm] enable torchao quantization for rocm by @draftbk in https://github.com/vllm-project/vllm/pull/24400
- [CI] Add PPL test for generation models by @noooop in https://github.com/vllm-project/vllm/pull/24485
- [CI/Build] bump timm dependency by @dtrifiro in https://github.com/vllm-project/vllm/pull/24189
- fix some typos by @co63oc in https://github.com/vllm-project/vllm/pull/24167
- Fix Auto_Round Quatization Loading on SM75 and Lower GPUs by @RoadToNowhereX in https://github.com/vllm-project/vllm/pull/24217
- [Docs] Fix warnings in `mkdocs build` (continued) by @Zerohertz in https://github.com/vllm-project/vllm/pull/24092
- [BugFix] `python collect_env.py` and `vllm collect-env` compatibility with uv venv by @yankay in https://github.com/vllm-project/vllm/pull/24066
- [Platform] Custom ops support for LMhead and LogitsProcessor by @zzhx1 in https://github.com/vllm-project/vllm/pull/23564
- [CI] Fix tensorizer test assertion by @pwschuurman in https://github.com/vllm-project/vllm/pull/24545
- [Core] Split LoRA layers by @jeejeelee in https://github.com/vllm-project/vllm/pull/24574
- [Doc] Add documentation for GLM-4.5 series models: tool-calling and reasoning parser by @WangErXiao in https://github.com/vllm-project/vllm/pull/24589
- [Logging] allow config logging stream by @842974287 in https://github.com/vllm-project/vllm/pull/24336
- [Bugfix] fix modelopt exclude_modules name mapping by @tomeras91 in https://github.com/vllm-project/vllm/pull/24178
- [Bugfix] Fix DeepEP config for DP4TP4 by @minosfuture in https://github.com/vllm-project/vllm/pull/23619
- [Core] Support configuration parsing plugin by @charlotte12l in https://github.com/vllm-project/vllm/pull/24277
- [Misc] Update log level from debug to warning when the process port is already in use by @lengrongfu in https://github.com/vllm-project/vllm/pull/24226
- [Bugfix] Enable FP8 KV cache for FlashInfer and Triton backend on non-sm100 GPUs by @gau-nernst in https://github.com/vllm-project/vllm/pull/24577
- [CI] Fail subprocess tests with root-cause error by @njhill in https://github.com/vllm-project/vllm/pull/23795
- [v1] Add Whisper model support (encoder-decoder) by @russellb in https://github.com/vllm-project/vllm/pull/21088
- [torch.compile][ROCm][V1] Enable attention output FP8 fusion for V1 attention backends by @gshtras in https://github.com/vllm-project/vllm/pull/19767
- [gpt-oss] raise error for flashinfer backend without trtllm by @heheda12345 in https://github.com/vllm-project/vllm/pull/24482
- [Perf] Warmup FlashInfer attention during startup by @mgoin in https://github.com/vllm-project/vllm/pull/23439
- [Kernel] Flashinfer MLA (trtllm-gen) decode kernel integration by @hjjq in https://github.com/vllm-project/vllm/pull/21078
- [Misc] Make timeout passable in init_distributed_environment by @jberkhahn in https://github.com/vllm-project/vllm/pull/24522
- [Models][Quantization] Add quantization configuration update in Voxtral model by @anmarques in https://github.com/vllm-project/vllm/pull/24122
- [distributed] update known issues by @youkaichao in https://github.com/vllm-project/vllm/pull/24624
- Add @chaunceyjiang to codeowner for Reasoning and Tool parser by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/24406
- [Bug] [Spec Decode] Fix model_initialization test and mismatch in aux_hidden_layers by @wwl2755 in https://github.com/vllm-project/vllm/pull/24613
- [Ultravox] Fix Gemma instantiation, support quantization via --hf-overrides by @petersalas in https://github.com/vllm-project/vllm/pull/24131
- [Bugfix] Add missing VIT backend dispatch on CPU by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/24623
- [BugFix] Fix pipeline parallel by @njhill in https://github.com/vllm-project/vllm/pull/24621
- [Engine][Chore] use local variable and remove output var assignment by @GuyStone in https://github.com/vllm-project/vllm/pull/24554
- Kimi K2 Fused MoE kernels Optimization configs by @samanamp in https://github.com/vllm-project/vllm/pull/24597
- Enable --profile in 'vllm bench throughput' by @tomasruizt in https://github.com/vllm-project/vllm/pull/24575
- [Core] feat: Add --safetensors-load-strategy flag for faster safetensors loading from Lustre by @shengshiqi-google in https://github.com/vllm-project/vllm/pull/24469 (a usage sketch follows at the end of this list)
- [Doc]: fixing doc typos by @didier-durand in https://github.com/vllm-project/vllm/pull/24635
- [Model] New model support for Motif-1-Tiny by @ca1207 in https://github.com/vllm-project/vllm/pull/23414
- Remove redundant all gather + split by @chenxi-yang in https://github.com/vllm-project/vllm/pull/23441
- [torchao] Support quantization configs using module swap by @jerryzh168 in https://github.com/vllm-project/vllm/pull/21982
- Add support for the Qwen3-Next model (a hybrid attention model) by @sighingnow in https://github.com/vllm-project/vllm/pull/24526
- [Bugfix] Fix incorrect import of CacheConfig by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/24631
- [Docs] Revise frameworks/anything-llm.md by @windsonsea in https://github.com/vllm-project/vllm/pull/24489
- [Docs] Update V1 doc to reflect whisper support by @russellb in https://github.com/vllm-project/vllm/pull/24606
- [Docs] Use 1-2-3 list for deploy steps in deployment/frameworks/ by @windsonsea in https://github.com/vllm-project/vllm/pull/24633
- [CI] Add transformers_utils to Async Engine, Inputs, Utils, Worker Test by @charlotte12l in https://github.com/vllm-project/vllm/pull/24615
- [Bugfix] Fix _synced_weight_loader by @kyuyeunk in https://github.com/vllm-project/vllm/pull/24565
- [CI] Split pooling from entrypoints Test by @noooop in https://github.com/vllm-project/vllm/pull/24632
- [Misc] Add @NickLucche to codeowners by @NickLucche in https://github.com/vllm-project/vllm/pull/24647
- [CI Failure] fix models/language/pooling/test_auto_prefix_cache_support.py by @noooop in https://github.com/vllm-project/vllm/pull/24636
- Fix typing for `safetensors_load_strategy` by @hmellor in https://github.com/vllm-project/vllm/pull/24641
- Move `LoRAConfig` from `config/__init__.py` to `config/lora.py` by @hmellor in https://github.com/vllm-project/vllm/pull/24644
- [XPU] add missing dependency tblib for XPU CI by @faaany in https://github.com/vllm-project/vllm/pull/24639
- [Docs] Fixes a typo in the qwen3next model name. by @sighingnow in https://github.com/vllm-project/vllm/pull/24654
- [build] add torch to tool.uv no-build-isolation-package by @youkaichao in https://github.com/vllm-project/vllm/pull/24303
- [Bench] Add qwen-next in benchmark_moe.py by @jeejeelee in https://github.com/vllm-project/vllm/pull/24661
- [CI] Split mteb test from Language Models Test by @noooop in https://github.com/vllm-project/vllm/pull/24634
- Allow users to specify kv cache memory size by @BoyuanFeng in https://github.com/vllm-project/vllm/pull/21489 (see the sketch at the end of this list)
- [HybridKVCache][Platform] Add support_hybrid_kv_cache for platform by @MengqingCao in https://github.com/vllm-project/vllm/pull/24646
- [Bugfix] Fix qwen-next packed_modules_mapping by @jeejeelee in https://github.com/vllm-project/vllm/pull/24656
- [Docs] Add transcription support to model by @NickLucche in https://github.com/vllm-project/vllm/pull/24664
- [Doc] Fix Markdown Pre-commit Error by @yewentao256 in https://github.com/vllm-project/vllm/pull/24670
- [Docs] Fix typos in EP deployment doc by @hmellor in https://github.com/vllm-project/vllm/pull/24669
- [VLM] Optimize GLM4.5-V-style video processing to only decode necessary frames by @Isotr0py in https://github.com/vllm-project/vllm/pull/24161
- [Kernels] Enable Torch Symmetric Memory All-Reduce By Default by @ilmarkov in https://github.com/vllm-project/vllm/pull/24111
- [Bugfix] Fix platform-specific routing in CustomOp implementations by @kzawora-intel in https://github.com/vllm-project/vllm/pull/24444
- Fix model name included in responses by @hmellor in https://github.com/vllm-project/vllm/pull/24663
- fix some typos by @co63oc in https://github.com/vllm-project/vllm/pull/24616
- [Docs] Fix formatting of transcription doc by @hmellor in https://github.com/vllm-project/vllm/pull/24676
- [VLM] Migrate remaining DP-supported ViT models to use `disable_tp` by @Isotr0py in https://github.com/vllm-project/vllm/pull/24363
- [Ultravox] Use wrapped_model_config to instantiate inner model by @petersalas in https://github.com/vllm-project/vllm/pull/24679
- [Doc] Remove Useless Comments by @yewentao256 in https://github.com/vllm-project/vllm/pull/24687
- [Qwen3-Next] Add MoE Config for H200 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24688
- [BugFix] Fix tokenize asyncio task leak by @njhill in https://github.com/vllm-project/vllm/pull/24677
- update spec decode metrics to use throughput by @qandrew in https://github.com/vllm-project/vllm/pull/24127
- [Kernel][B200] `mxfp4` fused cutlass moe by @djmmoss in https://github.com/vllm-project/vllm/pull/23696
- [flashinfer] [kernel] support for fp8 kv cache for trtllm prefill attention by @mxz297 in https://github.com/vllm-project/vllm/pull/24197
- [Bugfix] Set `VLLM_ALLREDUCE_USE_SYMM_MEM` default to False by @yewentao256 in https://github.com/vllm-project/vllm/pull/24696
- [Qwen3-Next] MoE configs for H200 TP=1,2,4 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/24695
- [CI/Build] Add bc-linter to vLLM CI by @zhewenl in https://github.com/vllm-project/vllm/pull/21234
- [Qwen3-Next] Add B200 MoE configs for Qwen3-next by @vadiklyutiy in https://github.com/vllm-project/vllm/pull/24698
- [Bugfix][Attention] Fix FlashInfer MLA block size logic by @MatthewBonanni in https://github.com/vllm-project/vllm/pull/24692
- [Perf] Use upstream CUTLASS for SM90 Block FP8 kernel by @mgoin in https://github.com/vllm-project/vllm/pull/23280
- [Qwen3-Next] MOE configs for H100 TP4 by @heheda12345 in https://github.com/vllm-project/vllm/pull/24699
- [Doc] Clarify cudagraph capture size logic and default behavior in scheduler by @Zazzle516 in https://github.com/vllm-project/vllm/pull/18698
- [Bug] Fix Layer `weight_block_size` Assertion Issue by @yewentao256 in https://github.com/vllm-project/vllm/pull/24674
- [Startup] Make DeepGEMM warmup scale with max-num-batched-tokens by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/24693
- [V1] feat: add engine v1 tracing by @RichardoMrMu in https://github.com/vllm-project/vllm/pull/20372
- [Bugfix] Fix the causal_conv1d_update kernel update in non-speculative decoding cases by @sighingnow in https://github.com/vllm-project/vllm/pull/24680
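
Two of the user-facing additions above are easy to picture in code. The sketch below is illustrative only: it assumes the new `--safetensors-load-strategy` flag (#24469) and the user-specified KV cache size (#21489) are reachable from the Python `LLM` entry point via vLLM's usual CLI-to-`EngineArgs` keyword mapping, and the keyword names and concrete values shown are assumptions rather than details taken from the PRs themselves.

```python
# Minimal, hypothetical sketch: keyword names assume the standard
# CLI-to-EngineArgs mapping; "eager" and the 4 GiB cap are placeholder values.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    # --safetensors-load-strategy (#24469): read safetensors files eagerly
    # instead of relying on mmap, which can be much faster on network
    # filesystems such as NFS or Lustre.
    safetensors_load_strategy="eager",   # assumed value; default is lazy loading
    # User-specified KV cache size (#21489): cap the KV cache allocation
    # explicitly instead of deriving it from gpu_memory_utilization alone.
    kv_cache_memory_bytes=4 * 1024**3,   # assumed keyword name; ~4 GiB
)

print(llm.generate(["Hello!"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```

Server deployments would pass the corresponding flags to `vllm serve` instead (e.g. `--safetensors-load-strategy`).
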
New Contributors
- @DoubleVII made their first contribution in https://github.com/vllm-project/vllm/pull/23058
- @carlory made their first contribution in https://github.com/vllm-project/vllm/pull/23090
- @nikheal2 made their first contribution in https://github.com/vllm-project/vllm/pull/22725
- @Tialo made their first contribution in https://github.com/vllm-project/vllm/pull/23172
- @myselvess made their first contribution in https://github.com/vllm-project/vllm/pull/23084
- @yiz-liu made their first contribution in https://github.com/vllm-project/vllm/pull/23169
- @ultmaster made their first contribution in https://github.com/vllm-project/vllm/pull/22587
- @KilJaeeun made their first contribution in https://github.com/vllm-project/vllm/pull/22790
- @wzshiming made their first contribution in https://github.com/vllm-project/vllm/pull/23242
- @misrasaurabh1 made their first contribution in https://github.com/vllm-project/vllm/pull/20413
- @yannqi made their first contribution in https://github.com/vllm-project/vllm/pull/23246
- @jaredoconnell made their first contribution in https://github.com/vllm-project/vllm/pull/23306
- @paulpak58 made their first contribution in https://github.com/vllm-project/vllm/pull/22845
- @zhuangqh made their first contribution in https://github.com/vllm-project/vllm/pull/23309
- @tvalentyn made their first contribution in https://github.com/vllm-project/vllm/pull/23270
- @arjunbreddy22 made their first contribution in https://github.com/vllm-project/vllm/pull/22495
- @philipchung made their first contribution in https://github.com/vllm-project/vllm/pull/17149
- @FoolPlayer made their first contribution in https://github.com/vllm-project/vllm/pull/23241
- @namanlalitnyu made their first contribution in https://github.com/vllm-project/vllm/pull/23375
- @hickeyma made their first contribution in https://github.com/vllm-project/vllm/pull/23353
- @PapaGoose made their first contribution in https://github.com/vllm-project/vllm/pull/23337
- @bppps made their first contribution in https://github.com/vllm-project/vllm/pull/23366
- @AzizCode92 made their first contribution in https://github.com/vllm-project/vllm/pull/23416
- @fengli1702 made their first contribution in https://github.com/vllm-project/vllm/pull/22527
- @FFFfff1FFFfff made their first contribution in https://github.com/vllm-project/vllm/pull/23408
- @ayushsatyam146 made their first contribution in https://github.com/vllm-project/vllm/pull/23171
- @patemotter made their first contribution in https://github.com/vllm-project/vllm/pull/23574
- @Terrencezzj made their first contribution in https://github.com/vllm-project/vllm/pull/23584
- @Copilot made their first contribution in https://github.com/vllm-project/vllm/pull/23385
- @oneraghavan made their first contribution in https://github.com/vllm-project/vllm/pull/23630
- @lordmathis made their first contribution in https://github.com/vllm-project/vllm/pull/23634
- @OYE93 made their first contribution in https://github.com/vllm-project/vllm/pull/23565
- @TianyuLi0 made their first contribution in https://github.com/vllm-project/vllm/pull/23146
- @yuekaizhang made their first contribution in https://github.com/vllm-project/vllm/pull/23623
- @coval3nte made their first contribution in https://github.com/vllm-project/vllm/pull/23054
- @youzhedian made their first contribution in https://github.com/vllm-project/vllm/pull/23648
- @frank-wei made their first contribution in https://github.com/vllm-project/vllm/pull/23613
- @faaany made their first contribution in https://github.com/vllm-project/vllm/pull/22500
- @cndoit18 made their first contribution in https://github.com/vllm-project/vllm/pull/23718
- @rebel-hongseok made their first contribution in https://github.com/vllm-project/vllm/pull/23746
- @Hanchenli made their first contribution in https://github.com/vllm-project/vllm/pull/23713
- @Shrey1306 made their first contribution in https://github.com/vllm-project/vllm/pull/23556
- @Ithanil made their first contribution in https://github.com/vllm-project/vllm/pull/23155
- @killershrimp made their first contribution in https://github.com/vllm-project/vllm/pull/23792
- @crischeng made their first contribution in https://github.com/vllm-project/vllm/pull/23823
- @angelayi made their first contribution in https://github.com/vllm-project/vllm/pull/23349
- @jeanschmidt made their first contribution in https://github.com/vllm-project/vllm/pull/23757
- @He-Jingkai made their first contribution in https://github.com/vllm-project/vllm/pull/23829
- @aditchawdhary made their first contribution in https://github.com/vllm-project/vllm/pull/23899
- @EduardDurech made their first contribution in https://github.com/vllm-project/vllm/pull/23068
- @dubejf made their first contribution in https://github.com/vllm-project/vllm/pull/23944
- @sadeghja1070 made their first contribution in https://github.com/vllm-project/vllm/pull/23971
- @DevonPeroutky made their first contribution in https://github.com/vllm-project/vllm/pull/22769
- @hypdeb made their first contribution in https://github.com/vllm-project/vllm/pull/23802
- @DamonJiang777 made their first contribution in https://github.com/vllm-project/vllm/pull/24028
- @cberge908 made their first contribution in https://github.com/vllm-project/vllm/pull/23104
- @lkm2835 made their first contribution in https://github.com/vllm-project/vllm/pull/23918
- @nathanrchn made their first contribution in https://github.com/vllm-project/vllm/pull/24100
- @co63oc made their first contribution in https://github.com/vllm-project/vllm/pull/24071
- @divyanshsinghvi made their first contribution in https://github.com/vllm-project/vllm/pull/23907
- @biba10 made their first contribution in https://github.com/vllm-project/vllm/pull/23931
- @dongbo910220 made their first contribution in https://github.com/vllm-project/vllm/pull/20388
- @NagyGeorge made their first contribution in https://github.com/vllm-project/vllm/pull/23460
- @wdhongtw made their first contribution in https://github.com/vllm-project/vllm/pull/24214
- @bingchen-mi made their first contribution in https://github.com/vllm-project/vllm/pull/23652
- @anthonsu made their first contribution in https://github.com/vllm-project/vllm/pull/23766
- @whx-sjtu made their first contribution in https://github.com/vllm-project/vllm/pull/23332
- @pratapyash made their first contribution in https://github.com/vllm-project/vllm/pull/24231
- @samanamp made their first contribution in https://github.com/vllm-project/vllm/pull/24266
- @mohankku made their first contribution in https://github.com/vllm-project/vllm/pull/24335
- @ashwin-phadke made their first contribution in https://github.com/vllm-project/vllm/pull/24339
- @bangshengtang made their first contribution in https://github.com/vllm-project/vllm/pull/24265
- @charlotte12l made their first contribution in https://github.com/vllm-project/vllm/pull/23868
- @alhridoy made their first contribution in https://github.com/vllm-project/vllm/pull/24361
- @what-in-the-nim made their first contribution in https://github.com/vllm-project/vllm/pull/24332
- @BraveY made their first contribution in https://github.com/vllm-project/vllm/pull/23928
- @R3hankhan123 made their first contribution in https://github.com/vllm-project/vllm/pull/24034
- @csahithi made their first contribution in https://github.com/vllm-project/vllm/pull/24102
- @Prowindy made their first contribution in https://github.com/vllm-project/vllm/pull/24298
- @micah-wil made their first contribution in https://github.com/vllm-project/vllm/pull/24128
- @pwschuurman made their first contribution in https://github.com/vllm-project/vllm/pull/23845
- @comsky made their first contribution in https://github.com/vllm-project/vllm/pull/24348
- @li-jinpeng made their first contribution in https://github.com/vllm-project/vllm/pull/24519
- @baonudesifeizhai made their first contribution in https://github.com/vllm-project/vllm/pull/23978
- @lacora made their first contribution in https://github.com/vllm-project/vllm/pull/24549
- @RoadToNowhereX made their first contribution in https://github.com/vllm-project/vllm/pull/24217
- @zzhx1 made their first contribution in https://github.com/vllm-project/vllm/pull/23564
- @hjjq made their first contribution in https://github.com/vllm-project/vllm/pull/21078
- @tomasruizt made their first contribution in https://github.com/vllm-project/vllm/pull/24575
- @shengshiqi-google made their first contribution in https://github.com/vllm-project/vllm/pull/24469
- @ca1207 made their first contribution in https://github.com/vllm-project/vllm/pull/23414
- @qandrew made their first contribution in https://github.com/vllm-project/vllm/pull/24127
- @Zazzle516 made their first contribution in https://github.com/vllm-project/vllm/pull/18698
- @RichardoMrMu made their first contribution in https://github.com/vllm-project/vllm/pull/20372
Full Changelog: v0.10.1.1...v0.10.2rc3