vllm-project/vllm v0.10.1

Highlights

The v0.10.1 release includes 727 commits from 245 contributors, 105 of them new.

Model Support

  • New model families: GPT-OSS with comprehensive tool calling and streaming support (#22327, #22330, #22332, #22335, #22339, #22340, #22342), Command-A-Vision (#22660), mBART (#22883), and SmolLM3 using Transformers backend (#22665).
  • Vision-language models: Official Eagle multimodal support with Llama4 backend (#20788), Step3 vision-language models (#21998), Gemma3n multimodal (#20495), MiniCPM-V 4.0 (#22166), HyperCLOVAX-SEED-Vision-Instruct-3B (#20931), Emu3 with Transformers backend (#21319), Intern-S1 (#21628), and Prithvi in online serving mode (#21518).
  • Enhanced existing models: NemotronH support (#22349), Ernie 4.5 Base 0.3B model name change (#21735), GLM-4.5 series improvements (#22215), Granite models with fused MoE configurations (#21332) and quantized checkpoint loading (#22925), Ultravox support for Llama 4 and Gemma 3 backends (#17818), Mamba1 and Jamba model support in V1 (without CUDA graphs) (#21249).
  • Advanced model capabilities: Qwen3 EPLB (#20815) and dual-chunk attention support (#21924), Qwen native Eagle3 target support (#22333).
  • Architecture expansions: Encoder-only models without KV-cache enabling BERT-style architectures (#21270), expanded tensor parallelism support in Transformers backend (#22651), tensor parallelism for Deepseek_vl2 vision transformer (#21494), and tensor/pipeline parallelism with Mamba2 kernel for PLaMo2 (#19674).
  • V1 engine compatibility: Extended support for additional pooling models (#21747) and Step3VisionEncoder distributed processing option (#22697).

Engine Core

  • CUDA graph performance: Full CUDA graph support with separate attention routines, adding FA2 and FlashInfer compatibility (#20059), plus a 6% end-to-end throughput improvement from Cutlass MLA (#22763).
  • Attention system advances: Multiple attention metadata builders per KV cache specification (#21588), tree attention backend for v1 engine (experimental) (#20401), FlexAttention encoder-only support (#22273), upgraded FlashAttention 3 with attention sink support (#22313), and multiple attention groups for KV sharing patterns (#22672).
  • Speculative decoding optimizations: N-gram speculative decoding with single KMP token proposal algorithm (#22437), explicit EAGLE3 interface for enhanced compatibility (#22642).
  • Default behavior improvements: Pooling models now default to chunked prefill and prefix caching (#20930), disabled chunked local attention by default for Llama4 for better performance (#21761).
  • Extensibility and configuration: Model loader plugin system (#21067), custom operations support for FusedMoe (#22509), rate limiting with bucket algorithm for proxy server (#22643), torch.compile support for bailing MoE (#21664).
  • Performance optimizations: Improved startup time by disabling C++ compilation of symbolic shapes (#20836), enhanced headless models for pooling in Transformers backend (#21767).

Hardware & Performance

  • NVIDIA Blackwell (SM100) optimizations: CutlassMLA as default backend (#21626), FlashInfer MoE per-tensor scale FP8 backend (#21458), SM90 CUTLASS FP8 GEMM with kernel tuning and swap AB support (#20396).
  • NVIDIA RTX 5090/RTX PRO 6000 (SM120) support: Block FP8 quantization (#22131) and CUTLASS NVFP4 4-bit weights/activations support (#21309).
  • AMD ROCm platform enhancements: Flash Attention backend for Qwen-VL models (#22069), AITER HIP block quantization kernels (#21242), reduced device-to-host transfers (#22683), and optimized kernel performance for small batch sizes 1-4 (#21350).
  • Attention and compute optimizations: FlashAttention 3 attention sinks performance boost (#22478), Triton-based multi-dimensional RoPE replacing PyTorch implementation (#22375), async tensor parallelism for scaled matrix multiplication (#20155), optimized FlashInfer metadata building (#21137).
  • Memory and throughput improvements: Mamba2 reduced device-to-device copy overhead (#21075), fused Triton kernels for RMSNorm (#20839, #22184), improved multimodal hasher performance for repeated image prompts (#22825), multithreaded async multimodal loading (#22710).
  • Parallelization and MoE optimizations: Guided decoding throughput improvements (#21862), balanced expert sharding for MoE models (#21497), expanded fused kernel support for topk softmax (#22211), fused MoE for nomic-embed-text-v2-moe (#18321).
  • Hardware compatibility and kernels: ARM CPU build fixes for systems without BF16 support (#21848), Machete memory-bound performance improvements (#21556), FlashInfer TRT-LLM prefill attention kernel support (#22095), optimized reshape_and_cache_flash CUDA kernel (#22036), CPU transfer support in NixlConnector (#18293).
  • Specialized CUDA kernels: GPT-OSS activation functions (#22538), RLHF weight loading acceleration (#21164).

Quantization

  • Advanced quantization techniques: MXFP4 and bias support for Marlin kernel (#22428), NVFP4 GEMM FlashInfer backends (#22346), compressed-tensors mixed-precision model loading (#22468), FlashInfer MoE support for NVFP4 (#21639).
  • Hardware-optimized quantization: Dynamic 4-bit quantization with Kleidiai kernels for CPU inference (#17112), TensorRT-LLM FP4 quantization optimized for MoE low-latency inference (#21331).
  • Expanded model quantization support: BitsAndBytes quantization for InternS1 (#21953) and additional MoE models (#21370, #21548), Gemma3n quantization compatibility (#21974), calibration-free RTN quantization for MoE models (#20766), ModelOpt Qwen3 NVFP4 support (#20101).
  • Performance and compatibility improvements: CUDA kernel optimization for Int8 per-token group quantization (#21476), non-contiguous tensor support in FP8 quantization (#21961), automatic detection of ModelOpt quantization formats (#22073).
  • Breaking change: Removed AQLM quantization support (#22943); users should migrate to an alternative quantization method (a minimal migration sketch follows this list).
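
For users affected by the AQLM removal, the migration path is to load a checkpoint quantized with one of the still-supported methods. Below is a minimal sketch, assuming an AWQ-quantized checkpoint is available; the model name is a placeholder and AWQ is only one of several supported options.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint: an AWQ-quantized model used in place of a removed AQLM one.
llm = LLM(model="my-org/my-model-AWQ", quantization="awq")

# Ordinary generation against the requantized checkpoint.
sampling = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["The quick brown fox"], sampling)
print(outputs[0].outputs[0].text)
```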

API & Frontend

  • OpenAI API compatibility: Unix domain socket support for local communication (#18097), improved error response format to match the upstream specification (#22099), and aligned tool_choice="required" behavior with OpenAI when the tools list is empty (#21052).
  • New API capabilities: Dedicated LLM.reward interface for reward models (#21720; a usage sketch follows this list), chunked processing for long inputs in embedding models (#22280), and proper AsyncLLM response handling for aborted requests (#22283).
  • Configuration and environment: Multiple API keys support for enhanced authentication (#18548), custom vLLM tuned configuration paths (#22791), environment variable control for logging statistics (#22905), multimodal cache size (#22441), and DeepGEMM E8M0 scaling behavior (#21968).
  • CLI and tooling improvements: V1 API support for run-batch command (#21541), custom process naming for better monitoring (#21445), improved help display showing available choices (#21760), optional memory profiling skip for multimodal models (#22950), enhanced logging of non-default arguments (#21680).
  • Tool and parser support: HermesToolParser for models without special tokens (#16890), multi-turn conversation benchmarking tool (#20267).
  • Distributed serving enhancements: Enhanced hybrid distributed serving with multiple API servers in load balancing mode (#21510), request_id support for external load balancers (#21009).
  • User experience enhancements: Improved error messaging for multimodal items (#22114), per-request pooling control via PoolingParams (#20538; a second sketch follows below).
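
The dedicated LLM.reward interface noted above (#21720) gives reward models their own offline entry point alongside generation and embedding. A minimal sketch, assuming a checkpoint that vLLM loads as a reward (pooling) model; the model name is a placeholder and the exact shape of the returned output may differ.

```python
from vllm import LLM

# Placeholder: assumes a checkpoint vLLM recognizes as a reward (pooling) model.
llm = LLM(model="my-org/my-reward-model")

# Score a prompt/response transcript with the dedicated reward entry point.
outputs = llm.reward(["User: Is the sky blue?\nAssistant: Yes, on a clear day."])
for out in outputs:
    # Assumption: the pooled reward score is exposed via .outputs.data.
    print(out.outputs.data)
```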
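
Per-request pooling control via PoolingParams (#20538) means pooling options can be set per call rather than fixed at engine startup. A hedged sketch follows, assuming an embedding model that supports Matryoshka-style truncation via the dimensions field and that LLM.embed accepts pooling_params; the model name and field choice are illustrative only.

```python
from vllm import LLM, PoolingParams

# Placeholder: assumes an embedding model that supports Matryoshka dimensions.
llm = LLM(model="my-org/my-embedding-model")

# Per-request pooling options: truncate this request's embedding to 256 dims.
outputs = llm.embed(
    ["vLLM v0.10.1 release notes"],
    pooling_params=PoolingParams(dimensions=256),
)
print(len(outputs[0].outputs.embedding))
```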

Dependencies

  • FlashInfer updates: Updated to v0.2.8 for improved performance (#21385), and moved FlashInfer to an optional dependency installable with pip install vllm[flashinfer] (#21959).
  • Mamba SSM restructuring: Updated to version 2.2.5 (#21421), removed from core requirements to reduce installation complexity (#22541).
  • Docker and deployment: Docker-aware precompiled wheel support for easier containerized deployment (#21127, #22106).
  • Python package updates: OpenAI Python dependency updated to latest version for API compatibility (#22316).
  • Dependency optimizations: Removed xformers requirement for Mistral-format Pixtral and Mistral3 models (#21154), deprecation warnings added for old DeepGEMM version (#22194).

V0 Deprecation

Important: As part of the ongoing V0 engine cleanup, several breaking changes have been introduced:

  • CLI flag updates: Replaced --task with --runner and --convert options (#21470), deprecated --disable-log-requests in favor of --enable-log-requests for clearer semantics (#21739), renamed --expand-tools-even-if-tool-choice-none to --exclude-tools-when-tool-choice-none for consistency (#20544).
  • API cleanup: Removed previously deprecated arguments and methods from the V0 engine codebase (#21907).

What's Changed

New Contributors

Full Changelog: v0.10.0...v0.10.1
