What's Changed
- Deduplicate Transformers backend code using inheritance by @hmellor in #21461
- [Bugfix][ROCm] Fix for warp_size uses on host by @gshtras in #21205
- [TPU][Bugfix] fix moe layer by @yaochengji in #21340
- [v1][Core] Clean up usages of `SpecializedManager` by @zhouwfang in #21407
- [Misc] Fix duplicate FusedMoEConfig debug messages by @njhill in #21455
- [Core] Support model loader plugins by @22quinn in #21067
- Remove incorrect GLM-4 quantization code by @zRzRzRzRzRzRzR in #21435
- Replace `--expand-tools-even-if-tool-choice-none` with `--exclude-tools-when-tool-choice-none` for v0.10.0 by @okdshin in #20544
- [Misc] Improve comment for DPEngineCoreActor._set_cuda_visible_devices() by @ruisearch42 in #21501
- [Feat] Allow custom naming of vLLM processes by @chaunceyjiang in #21445
- bump `flashinfer` to `v0.2.8` by @cjackal in #21385
- [Attention] Optimize FlashInfer MetadataBuilder Build call by @LucasWilkinson in #21137
- [Model] Officially support Emu3 with Transformers backend by @hmellor in #21319
- [Bugfix] Fix CUDA arch flags for MoE permute by @minosfuture in #21426
- [Fix] Update mamba_ssm to 2.2.5 by @elvischenv in #21421
- [Docs] Update Tensorizer usage documentation by @sangstar in #21190
- [Docs] Rewrite Distributed Inference and Serving guide by @crypdick in #20593
- [Bug] Fix Compressed Tensor NVFP4 `cutlass_fp4_group_mm` illegal memory access by @yewentao256 in #21465
- Update flashinfer CUTLASS MoE Kernel by @wenscarl in #21408
- [XPU] Conditionally import CUDA-specific passes to avoid import errors on xpu platform by @chaojun-zhang in #21036
- [P/D] Move FakeNixlWrapper to test dir by @ruisearch42 in #21328
- [P/D] Support CPU Transfer in NixlConnector by @juncgu in #18293
- [Docs][minor] Fix broken gh-file link in distributed serving docs by @crypdick in #21543
- [Docs] Add Expert Parallelism Initial Documentation by @simon-mo in #21373
- update flashinfer to v0.2.9rc1 by @weireweire in #21485
- [TPU][TEST] Set HF_HUB_DISABLE_XET=1 for test 3. by @QiliangCui in #21539
- [MoE] More balanced expert sharding by @WoosukKwon in #21497
- [Frontend] `run-batch` supports V1 by @DarkLight1337 in #21541
- [Docs] Fix `site_url` for RunLLM by @hmellor in #21564
- [Bug] Fix DeepGemm Init Error by @yewentao256 in #21554
- Fix GLM-4 missing layer when using PP. by @zRzRzRzRzRzRzR in #21531
- [Kernel] adding fused_moe configs for upcoming granite4 by @bringlein in #21332
- [Bugfix] DeepGemm utils : Fix hardcoded type-cast by @varun-sundar-rabindranath in #21517
- [DP] Support api-server-count > 0 in hybrid DP LB mode by @njhill in #21510
- [TPU][Test] Temporarily suspend this MoE model in test_basic.py. by @QiliangCui in #21560
- [Docs] Add `requirements/common.txt` to run unit tests by @zhouwfang in #21572
- Integrate TensorSchema with shape validation for Phi3VImagePixelInputs by @bbeckca in #21232
- [CI] Update CODEOWNERS for CPU and Intel GPU by @bigPYJ1151 in #21582
- [Bugfix] fix modelscope snapshot_download serialization by @andyxning in #21536
- [Model] Support tensor parallel for timm ViT in Deepseek_vl2 by @wzqd in #21494
- [Model] Fix a check that tested for None when the return value was an empty list in Gemma3 MM vision_embeddings by @hfan in #21479
- [Misc][Tools] make max-model-len a parameter in auto_tune script by @yaochengji in #21321
- [CI/Build] fix cpu_extension for apple silicon by @ignaciosica in #21195
- [Misc] Removed undefined cmake variables MOE_PERMUTE_ARCHS by @chenyang78 in #21262
- [TPU][Bugfix] fix OOM issue in CI test by @yaochengji in #21550
- [Tests] Harden DP tests by @njhill in #21508
- Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct by @Xu-Wenqing in #21598
- [Bugfix] GGUF: fix AttributeError: 'PosixPath' object has no attribute 'startswith' by @kebe7jun in #21579
- [Quantization] Enable BNB support for more MoE models by @jeejeelee in #21370
- [V1] Get supported tasks from model runner instead of model config by @DarkLight1337 in #21585
- [Bugfix][Logprobs] Fix logprobs op to support more backend by @MengqingCao in #21591
- [Model] Fix Ernie4.5MoE e_score_correction_bias parameter by @xyxinyang in #21586
- [MODEL] New model support for naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B by @bigshanedogg in #20931
- [Frontend] Add request_id to the Request object so they can be controlled better via external load balancers by @kouroshHakha in #21009
- [Model] Replace Mamba2 RMSNorm Gated with Fused Triton Kernel by @cyang49 in #20839
- [ROCm][AITER] Enable fp8 kv cache on rocm aiter backend. by @fsx950223 in #20295
- [Kernel] Improve machete memory bound perf by @czhu-cohere in #21556
- Add support for Prithvi in Online serving mode by @mgazz in #21518
- [CI] Unifying Dockerfiles for ARM and X86 Builds by @kebe7jun in #21343
- [Docs] add auto-round quantization readme by @wenhuach21 in #21600
- [TPU][Test] Rollback PR-21550. by @QiliangCui in #21619
- Add Unsloth to RLHF.md by @danielhanchen in #21636
- [Perf] Cuda Kernel for Int8 Per Token Group Quant by @yewentao256 in #21476
- Add interleaved RoPE test for Llama4 (Maverick) by @sarckk in #21478
- [Bugfix] Fix sync_and_slice_intermediate_tensors by @ruisearch42 in #21537
- [Bugfix] Always set RAY_ADDRESS for Ray actor before spawn by @ruisearch42 in #21540
- [TPU] Update ptxla nightly version to 20250724 by @yaochengji in #21555
- [Feature] Add support for MoE models in the calibration-free RTN-based quantization by @sakogan in #20766
- [Model] Ultravox: Support Llama 4 and Gemma 3 backends by @farzadab in #17818
- [Docs] add offline serving multi-modal video input example for Qwen2.5-VL by @david6666666 in #21530
- Correctly kill vLLM processes after finishing serving benchmarks by @huydhn in #21641
- [Bugfix] Fix isinstance check for tensor types in _load_prompt_embeds to use dtype comparison by @Mitix-EPI in #21612
- [TPU][Test] Divide TPU v1 Test into 2 parts. by @QiliangCui in #21431
- Support Intern-S1 by @lvhan028 in #21628
- [Misc] remove unused try-except in pooling config check by @reidliu41 in #21618
- [Take 2] Correctly kill vLLM processes after benchmarks by @huydhn in #21646
- Migrate AriaImagePixelInputs to TensorSchema for shape validation by @bbeckca in #21620
- Migrate AyaVisionImagePixelInputs to TensorSchema for shape validation by @bbeckca in #21622
- [Bugfix] Investigate Qwen2-VL failing test by @Isotr0py in #21527
- Support encoder-only models without KV-Cache by @maxdebayser in #21270
- [Bug] Fix `has_flashinfer_moe` Import Error when it is not installed by @yewentao256 in #21634
- [Misc] Improve memory profiling debug message by @yeqcharlotte in #21429
- [BugFix] Fix shared storage connector load kv only load attention layer by @david6666666 in #21428
- [Refactor] Remove `moe_align_block_size_triton` by @yewentao256 in #21335
- [Bugfix][Apple Silicon] fix missing symbols when building from source on Mac with Apple Silicon by @zhouyeju in #21380
- [CI/Build][Doc] Move existing benchmark scripts in CI/document/example to vllm bench CLI by @yeqcharlotte in #21355
- [NVIDIA] Explicitly disable shuffled weights for flashinfer blockscale moe fp8 kernels by @kaixih in #21411
- Remove xformers requirement for Mistral-format Pixtral and Mistral3 by @wenchen76 in #21154
- support `torch.compile` for bailing moe by @jinzhen-lin in #21664
- Migrate Blip2ImagePixelInputs and Blip2ImageEmbeddingInputs to TensorSchema by @bbeckca in #21656
- Migrate DeepseekVL2ImageInputs to TensorSchema by @bbeckca in #21658
- Migrate FuyuImagePatchInputs to TensorSchema by @bbeckca in #21662
- Migrate ChameleonImagePixelInputs to TensorSchema by @bbeckca in #21657
- [VLM] Support HF format Phi-4-MM model by @Isotr0py in #17121
- Handle non-serializable objects in vllm bench by @huydhn in #21665
- [CI/Build][Doc] Clean up more docs that point to old bench scripts by @yeqcharlotte in #21667
- Refactor: Remove numpy dependency from LoggingStatLogger by @skyloevil in #20529
- [Misc] add default value for file pattern arg by @andyxning in #21659
- Migrate Florence2ImagePixelInputs to TensorSchema by @bbeckca in #21663
- [VLM] Add video support for Intern-S1 by @Isotr0py in #21671
- [Refactor] Refactor MOE NVFP4 Code Base: ModelOpt + Compressed Tensor by @yewentao256 in #21631
- Fix CUDA permute/unpermute for use with DeepGemm Moe by @CalebDu in #17934
- [Misc] Refactor vllm config str by @andyxning in #21666
- [Attention] Make CutlassMLA the default backend for SM100 (blackwell) by @alexm-redhat in #21626
- [Deprecation][2/N] Replace `--task` with `--runner` and `--convert` by @DarkLight1337 in #21470
- Fix typo for limit-mm-per-prompt in docs by @joa-stdn in #21697
- Fix GLM tool parser by @zRzRzRzRzRzRzR in #21668
- [Misc] Add fused_moe configs for Qwen3-Coder-480B-A35B-Instruct-FP8 by @jeejeelee in #21700
- [V1] Exception Handling when Loading KV Cache from Remote Store by @liuyumoye in #21534
- [Model] Support TP/PP/mamba2 kernel for PLaMo2 by @Alnusjaponica in #19674
- [FEAT] [ROCm] [AITER]: Add AITER HIP block quant kernel by @tjtanaa in #21242
- Migrate Gemma3ImagePixelInputs to TensorSchema by @bbeckca in #21676
- Migrate Glm4vImageInputs, Glm4vVideoInputs to TensorSchema by @bbeckca in #21678
- Migrate GLMVImagePixelInputs to TensorSchema by @bbeckca in #21679
- Migrate GraniteSpeechAudioInputs to TensorSchema by @bbeckca in #21682
- Migrate Idefics3ImagePixelInputs and Idefics3ImageEmbeddingInputs to … by @bbeckca in #21683
- [Bugfix] [issue-21565] Fix the incompatibility issue with stream and named function calling when Thinking is disabled by @hsliuustc0106 in #21573
- [bugfix] fix profile impact benchmark results by @lengrongfu in #21507
- [Bugfix] Fix shape checking for Fuyu by @DarkLight1337 in #21709
- [Bugfix] fix max-file-size type from str to int by @andyxning in #21675
- [BugFix] Fix ChunkedLocalAttention when the hybrid kv-cache is disabled by @LucasWilkinson in #21707
- [v1][mamba] Added mamba_type into MambaSpec by @Josephasafg in #21715
- Migrate KeyeImageInputs and KeyeVideoInputs to TensorSchema by @bbeckca in #21686
- [Model] Prioritize Transformers fallback over suffix matching by @DarkLight1337 in #21719
- [feature] add log non default args in LLM by @lengrongfu in #21680
- [Bugfix] Fix Ernie4_5_MoeForCausalLM shared experts by @jeejeelee in #21717
- [Bugfix] Fix environment variable setting in CPU Dockerfile by @bigPYJ1151 in #21730
- [Bugfix] Fix glm4.1v video_grid_thw tensor shape scheme by @Isotr0py in #21744
- [PD] let p2p nccl toy proxy handle /chat/completions by @chaunceyjiang in #21734
- [`Ernie 4.5`] Name Change for Base 0.3B Model by @vasqu in #21735
- [Bugfix] Improve JSON extraction in LlamaToolParser by @key4ng in #19024
- [Docs] Add revision date to rendered docs by @hmellor in #21752
- [Bugfix] Check health for engine core process exiting unexpectedly by @wuhang2014 in #21728
- [Bugfix][CI/Build] Update peft version in test requirement by @Isotr0py in #21729
- [Logs] Change flashinfer sampler logs to once by @mgoin in #21759
- [Misc] Reduce logs for model resolution by @DarkLight1337 in #21765
- [Bugfix] Mistral crashes on tool with no description by @HugoMichard in #21167
- [CI/Build] Fix plugin tests by @DarkLight1337 in #21758
- [XPU] IPEX-optimized Punica Wrapper on XPU by @chaojun-zhang in #21703
- [Bugfix] Fix granite speech shape validation by @DarkLight1337 in #21762
- [P/D] Log warnings related to prefill KV expiry by @njhill in #21753
- Use `metavar` to list the choices for a CLI arg when custom values are also accepted by @hmellor in #21760
- update flashinfer to v0.2.9rc2 by @weireweire in #21701
- [AMD][BugFix] Fix omission of wvSplitK kernel for small batch sizes (1-4) due to torch.compile by @rasmith in #21350
- [Bug] Enforce contiguous input for `dynamic_scaled_fp8_quant` and `static_scaled_fp8_quant` by @yewentao256 in #21773
- [AMD][CI/Build] Fix the AMD issue caused by inappropriate of symbol exposure by @houseroad in #21647
- Revert "[V1] Exception Handling when Loading KV Cache from Remote Store" by @KuntaiDu in #21778
- [Bugfix] DeepGEMM is not enabled on B200 due to `_lazy_init()` by @smarterclayton in #21472
- [Feat]: Add support for Dynamic Quant 4 bit CPU kleidiai kernels by @nikhil-arm in #17112
- [Perf] Disable chunked local attention by default with llama4 by @LucasWilkinson in #21761
- [Kernel] SM90 CUTLASS FP8 GEMM: add support for swap AB + kernel tuning by @LyrisZhong in #20396
- [Docs] Minimize spacing for supported_hardware.md table by @mgoin in #21779
- [Refactor] Merge Compressed Tensor FP8 `CompressedTensorsW8A8Fp8MoEMethod` and `CompressedTensorsW8A8Fp8MoECutlassMethod` by @yewentao256 in #21775
- [CI] Parallelize Kernels MoE Test by @mgoin in #21764
- skip fusedmoe layer for start_load_kv by @calvin0327 in #21378
- [AMD][CI/Build][Bugfix] Guarding CUDA specific functions by ifndef ROCM by @gshtras in #21766
- Migrate InternVLImageInputs and InternVLVideoInputs to TensorSchema by @bbeckca in #21684
- [Misc] Rework process titles by @njhill in #21780
- [Doc] Link to RFC for pooling optimizations by @DarkLight1337 in #21806
- [Model]: Fused MoE for nomic-embed-text-v2-moe by @Isotr0py in #18321
- [V0 deprecation] Guided decoding by @rzabarazesh in #21347
- [Model] Refactor JambaForCausalLM by @jeejeelee in #21394
- [Docs] Fix the outdated URL for installing from vLLM binaries by @yankay in #21523
- [KVCache] Make KVCacheSpec hashable by @heheda12345 in #21791
- [Doc] Update compatibility matrix for pooling and multimodal models by @DarkLight1337 in #21831
- [Bugfix] VLLM_V1 supports passing other compilation levels by @zou3519 in #19340
- [Docs] Merge design docs for a V1 only future by @hmellor in #21832
- [TPU] Add an optimization doc on TPU by @bvrockwell in #21155
- [Bugfix] Fix mixed bits and visual language model quantization in AutoRound by @wenhuach21 in #21802
- [Bugfix] Fix workspace buffer None issue for Flashinfer TRTLLM Backend by @elvischenv in #21525
- [Docs] use `uv` in GPU installation docs by @davidxia in #20277
- [Doc] Add FusedMoE Modular Kernel Documentation by @varun-sundar-rabindranath in #21623
- [Doc] update Contributing page's testing section by @davidxia in #18272
- Add `flashinfer_python` to CUDA wheel requirements by @mgoin in #21389
- docker: docker-aware precompiled wheel support by @dougbtv in #21127
- Revert "[AMD][CI/Build] Fix the AMD issue caused by inappropriate of symbol exposure (#21647)" by @gshtras in #21850
- [BugFix] Fix interleaved sliding window not set for Gemma3n by @sarckk in #21863
- [ci] add b200 test placeholder by @simon-mo in #21866
- [ci] mark blackwell test optional for now by @simon-mo in #21878
- [Bugfix] Correct max tokens for non-contiguous embeds by @milesial in #21798
- [v1][attention] Support Hybrid Allocator + FlashInfer by @heheda12345 in #21412
- [Docs] Switch to better markdown linting pre-commit hook by @hmellor in #21851
- [DOC] Fix path of v1 related figures by @heheda12345 in #21868
- [Docs] Update docker.md with HF_TOKEN, new model, and podman fix by @mgoin in #21856
- Expose PyTorch profiler configuration to environment variables by @Csrayz in #21803
- [Bugfix] Fix shape mismatch assertion error when loading Gemma3n model with BitsAndBytes quantization by @sydarb in #21808
- [Bugfix] Fix comment typo of get_num_common_prefix_blocks() by @MingzhenHan in #21827
- [Bugfix] Actually disable processing cache when API server is scaled out by @DarkLight1337 in #21839
- [Perf] Using `__nv_fp8_e4m3` instead of `c10::e4m3` for `per_token_group_quant` by @yewentao256 in #21867
- [Frontend] Add LLM.reward specific to reward models by @noooop in #21720
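
  #21720 introduces a dedicated `LLM.reward` entry point for reward models. The snippet below is a minimal sketch of how it might be called; the model name, the `task` argument, and the exact shape of `reward()`'s inputs and outputs are assumptions rather than the confirmed interface.

  ```python
  # Hedged sketch for the new LLM.reward API (#21720). Model name, task argument,
  # and the reward() signature are assumptions; check the vLLM docs for the
  # released interface.
  from vllm import LLM

  llm = LLM(model="internlm/internlm2-1_8b-reward", task="reward")
  outputs = llm.reward(["vLLM makes serving LLMs simple."])
  for output in outputs:
      # Each output is expected to carry the reward score(s) for its prompt.
      print(output.outputs.data)
  ```
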
- [XPU] use `ZE_AFFINITY_MASK` for device select on xpu by @jikunshang in #21815
- Add @sighingnow as maintainer of qwen's related files. by @sighingnow in #21895
- [CI/Build] Fix pre-commit failure in docs by @DarkLight1337 in #21897
- [Docs] Expand introduction to Ray in Multi-node deployment section by @crypdick in #21584
- Update vLLM Benchmark Suite for Xeon based on 0.9.2 release by @louie-tsai in #21486
- [Misc] Remove redundant config definitions by @DarkLight1337 in #21891
- [Doc] Update Intern-S1 info by @jeejeelee in #21908
- [CI] rollback lint-and-deploy pipeline using amd machine by @kebe7jun in #21912
- [Tests] Fixing bug inside MultiModalProfiler. by @shenoyvvarun in #21842
- [Model] Remove DSV2 unused code by @jeejeelee in #21903
- [benchmark] add max-concurrency in result table by @panpan0000 in #21095
- [Doc] Update partial support by @DarkLight1337 in #21916
- [Docs] Fix the example code of streaming chat completions in reasoning by @hsliuustc0106 in #21825
- Add @patrickvonplaten as maintainer of mistral's related files. by @patrickvonplaten in #21928
- [Hardware][CPU] Build fix for ARM without BF16 by @ericcurtin in #21848
- [Feature][EPLB] Add eplb support for Qwen3 by @aladerran in #20815
- [Doc] Remove vLLM prefix and add citation for PagedAttention by @DarkLight1337 in #21910
- [Bugfix] Use metavar instead of choices by @lengrongfu in #21902
- [Feature] Support multiple api keys in server by @Yanpas in #18548
- [misc] skip p2p check by default by @youkaichao in #21904
- [Test] Add Benchmark and Unit Test for `per_token_group_quant` by @yewentao256 in #21860
- [CI/Build] Only run markdownlint in CI by @DarkLight1337 in #21892
- Reduce time wasted in GitHub Actions using `concurrency` by @hmellor in #21919
- [Misc] Improve code readability of KVCacheManager by @tanruixiang in #21673
- [NVIDIA] Fix Llama4 Scout FP4 functionality issues by @nvpohanh in #21499
- [Docs] Reduce the size of the built docs by @hmellor in #21920
- [Bugfix] Fix OOM tests in initialization test by @Isotr0py in #21921
- [Bugfix] Fix multi-api server not working for text models by @DarkLight1337 in #21933
- Override attention metadata for fast prefill in some KV sharing setups by @sarckk in #21590
- [Bugfix] Fix TypeError in scheduler when comparing mixed request_id types by @chi2liu in #21816
- [CI/Build] Fix registry tests by @DarkLight1337 in #21934
- [Bugfix] SharedStorage Connector for V1 PD multimodal by @fake0fan in #21611
- feat(distributed): add `get_required_kvcache_layout` class method to kv connector api by @wxsms in #20433
- [TPU] Support Pathways in vLLM by @wenxindongwork in #21417
- [Misc] Support more collective_rpc return types by @njhill in #21845
- For VLLM_USE_PRECOMPILED, only compiled .so files should be extracted by @dougbtv in #21964
- [Misc] Use dracut on CentOS and skip clone if repo exists for EP kernel installation by @minosfuture in #21635
- [Feature] Add async tensor parallelism for scaled mm by @cascade812 in #20155
- [Bugfix] Fix None value handling in trace span creation for cancelled requests by @br4mm in #20272
- [Core] Move EngineCoreRequest to Request conversion out of EngineCore by @linzebing in #21627
- [Example] Add `async_llm_streaming.py` example for AsyncLLM streaming in python by @mgoin in #21763 (see the sketch below)
- [Bugfix] Relax lang pin for voxtral by @sanchit-gandhi in #21833
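
  As a companion to the `async_llm_streaming.py` example added in #21763, here is a minimal streaming sketch built on the long-standing `AsyncLLMEngine` API; the new example targets the V1 `AsyncLLM` and may differ in details, so treat this as an illustration only.

  ```python
  # Illustrative async streaming sketch; the example added in #21763 uses the V1
  # AsyncLLM and may differ in details.
  import asyncio

  from vllm import SamplingParams
  from vllm.engine.arg_utils import AsyncEngineArgs
  from vllm.engine.async_llm_engine import AsyncLLMEngine


  async def main() -> None:
      engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))
      params = SamplingParams(max_tokens=32)
      # generate() is an async generator that yields partial RequestOutputs
      # as new tokens are produced.
      async for output in engine.generate("Hello, my name is", params, request_id="stream-0"):
          print(output.outputs[0].text)


  if __name__ == "__main__":
      asyncio.run(main())
  ```
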
- [UX] Rename CUTLASS_MLA_VLLM_V1 to CUTLASS_MLA by @mgoin in #21966
- [Misc] Expand SUPPORTED_HIDDEN_SIZES for DeepEP low-latency kernels by @jeejeelee in #21818
- [CI Bugfix] Fix CI OOM for `test_shared_storage_connector_hashes` by @mgoin in #21973
- [Bugfix]: fix metadata file copy in test_sharded_state_loader by @andyxning in #21830
- [Deprecation] Remove deprecated args and methods by @DarkLight1337 in #21907
- [CI/Build] get rid of unused VLLM_FA_CMAKE_GPU_ARCHES by @dtrifiro in #21599
- [Model][CI] Let more pooling models support v1 by @noooop in #21747
- [BugFix] Fix case where `collective_rpc` returns `None` by @njhill in #22006
- [NVIDIA] Add SM100 Flashinfer MoE per tensor scale fp8 backend by @amirkl94 in #21458
- [Model] Add step3 vl by @Oliver-ss in #21998
- [ez] Remove a trailing space from compilation/decorators.py by @zhxchen17 in #22028
- fix(setup): improve precompiled wheel setup for Docker builds by @dougbtv in #22025
- Removing amdproduction Tests by @Alexei-V-Ivanov-AMD in #22027
- Update torch_xla pin to 20250730 by @vanbasten23 in #21956
- [Meta] Official Eagle mm support, first enablement on llama4 by @morgendave in #20788
- [Misc] Add unit tests for chunked local attention by @sarckk in #21692
- [Bugfix] Fix MTP weight loading by @benchislett in #21941
- Add FlashInfer allreduce RMSNorm Quant fusion by @ilmarkov in #21069
- [Feature] Add Flashinfer MoE Support for Compressed Tensor NVFP4 by @yewentao256 in #21639
- Add DeepGEMM to Dockerfile in vllm-base image by @MatthewBonanni in #21533
- Move flashinfer-python to optional extra `vllm[flashinfer]` by @mgoin in #21959
- [Refactor] Remove Duplicate `per_block_cast_to_fp8`, Remove Dependencies of DeepGEMM by @yewentao256 in #21787
- [Bugfix] Fix multi LoRAs with tp >= 2 and LRU cache by @charent in #20873
- [Misc] Automatically resolve HF processor init kwargs by @DarkLight1337 in #22005
- [BugFix] fix: aot passes kvcache dtype information by @mickaelseznec in #19750
- [Model] [Quantization] Support quantization for Gemma3n by @kylesayrs in #21974
- [Doc] Add Voxtral to Supported Models page by @DarkLight1337 in #22059
- Update sampling_metadata.py by @Aviadr-neureality in #21937
- [Doc] Fix a syntax error of example code in structured_outputs.md by @hsliuustc0106 in #22045
- [Bugfix] Disable multi-modal preprocessor cache for DP by @DarkLight1337 in #21896
- [Core] Avoid repeated len(block_token_ids) check in hash_request_tokens by @linzebing in #21781
- [Frontend] Align tool_choice="required" behavior with OpenAI when tools is empty by @n0gu-furiosa in #21052
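
  #21052 aligns vLLM's handling of `tool_choice="required"` with an empty `tools` list to match OpenAI's behavior. A hedged sketch of the request shape involved, using the standard OpenAI Python client against a local vLLM server (the URL, API key, and model name are assumptions):

  ```python
  # Request shape affected by #21052: tool_choice="required" with an empty tools
  # list. Base URL, API key, and model name are assumptions for illustration.
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
  resp = client.chat.completions.create(
      model="meta-llama/Llama-3.1-8B-Instruct",
      messages=[{"role": "user", "content": "What's the weather in Paris?"}],
      tools=[],                # no tools supplied
      tool_choice="required",  # this combination now follows OpenAI's behavior
  )
  print(resp.choices[0].message)
  ```
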
- Revert precompile wheel changes by @simon-mo in #22055
- [Doc] Add example for Step3-VL by @ywang96 in #22061
- [Bugfix] Add log prefix in non-dp mode engine core by @wuhang2014 in #21889
- [Misc] Remove upper bound in openai package version by @WoosukKwon in #22060
- [Doc] Added warning of speculating with draft model by @david6666666 in #22047
- [Quantization] Enable BNB support for InternS1 by @jeejeelee in #21953
- Revert "Update sampling_metadata.py (#21937)" by @hmellor in #22088
- [Speculative Decoding] Add `speculators` config support by @dsikka in #21345
- [BUG] [ROCm] Fix import bug on ROCm by @tjtanaa in #22083
- Fix `get_kwargs` for case where type hint is `list[Union[str, type]]` by @hmellor in #22016
- [Bugfix] Check NVIDIA artifactory is accessible before using flashinfer cubin kernels by @mgoin in #21893
- feat(multimodal): Add customizable background color for RGBA to RGB conversion by @ahengljh in #22052
- [Bugfix][PD] set max_completion_tokens=1 if req has this value by @Abirdcfly in #21841
- [Refactor] Fix Compile Warning #1444-D by @yewentao256 in #21462
- [BugFix] Update AttnFusionPass cache key by @zou3519 in #21947
- [BugFix] Don't change title of top-level process by @njhill in #22032
- [Docs] use `uv` in CPU installation docs by @davidxia in #22089
- Deprecate `--disable-log-requests` and replace with `--enable-log-requests` by @hmellor in #21739
- Improve documentation of `ModelConfig.try_get_generation_config` to prevent future confusion by @hmellor in #21526
- [Bugfix] Fix glm4.1v video inference issue by @Isotr0py in #22067
- [Bugfix] fix when skip tokenizer init by @lengrongfu in #21922
- security policy: take 1 by @sidhpurwala-huzaifa in #21119
- [Bugfix] [Performance] DeepEPHighThroughput + DeepSeek : Quant before Dispatch by @varun-sundar-rabindranath in #21837
- Enable headless models for pooling in the Transformers backend by @hmellor in #21767
- [Misc] Minor enhancement of benchmark_moe by @jeejeelee in #22068
- Fix pre-commit failure for SECURITY.md by @mgoin in #22102
- [compile][startup] Disable C++ compilation of symbolic shapes by @anijain2305 in #20836
- Introduce RayPPCommunicator for ray-based PP by @ruisearch42 in #21660
- Add lora test for tp>1 case for TPU. by @vanbasten23 in #21970
- [BugFix] Harden distributed DP startup by @njhill in #21538
- [CI] Initial tests for SM100 Blackwell runner by @mgoin in #21877
- [Perf] Optimize `reshape_and_cache_flash` CUDA Kernel by @yewentao256 in #22036
- feat: Add support for GPTQ quantization MoE on ROCm vllm serve by @JartX in #21733
- [V1][CUDA] Full cudagraph support for FlashInfer by @fhl2000 in #21367
- [Model] Qwen2.5 VL SiLU-and-Mul by @vllmellm in #22066
- [Misc] `VLLM_TARGET_DEVICE.lower()` by @NickLucche in #22101
- [Misc] DeepGemmExperts : Avoid JIT generation in the hot-path by @varun-sundar-rabindranath in #21955
- [Speculators][Speculative Decoding] Add Qwen Eagle3 Support by @dsikka in #21835
- [BugFix] Improve internal DP load balancing by @njhill in #21617
- [Test] Add Unit Test for Batched DeepGEMM by @yewentao256 in #21559
- [Attention][DBO] Add support for "splitting" the CommonAttentionMetadata by @SageMoore in #21153
- [FEAT][ROCm] Enable running Flash Attention as ViT attn backend for Qwen-VL models on ROCm platform. by @vllmellm in #22069
- [Misc] Getting and passing ray runtime_env to workers by @ruisearch42 in #22040
- Fix test_kv_sharing_fast_prefill flakiness by @sarckk in #22038
- [Bugfix] Mamba2 remove bugged initial state condition in chunk scan by @cyang49 in #22034
- docs: remove deprecated disable-log-requests flag by @ywang96 in #22113
- [PERF] Use faster way of decode in tokenizer: avoid useless list-to-list conversion by @vadiklyutiy in #20000
- GLM-4.1V update by @zRzRzRzRzRzRzR in #22000
- [Model] Mamba2 preallocate SSM output tensor to avoid d2d copy overhead by @cyang49 in #21075
- [Frontend] Improve error message for too many mm items by @DarkLight1337 in #22114
- [V1] [Hybrid] Validate compatibility of attention backend batch reordering at init time by @tdoublep in #21557
- [xpu]support moe models on XPU platform by @yma11 in #21643
- Revert "[compile][startup] Disable C++ compilation of symbolic shapes" by @xiszishu in #22122
- [Misc] Bump ray to 2.48.0 by @ruisearch42 in #22123
- [Fix] Fix llama4 modelopt weight loading error by @jiahanc in #22107
- [Misc] Add tensor schema test coverage for multimodal models by @Isotr0py in #21754
- [Benchmark] Support ready check timeout in `vllm bench serve` by @yeqcharlotte in #21696
- Support CUTLASS NVFP4 (w4a4) for Blackwell Geforce GPUs (SM120) by @LopezCastroRoberto in #21309
- [Misc] update doc comment for send by @andyxning in #22026
- [executor] feat: add supports_pp attr to executors by @eric-haibin-lin in #21786
- [V1] [P/D] Refactor KV Connector Path by @sdavidbd in #21980
- [Responses API] Disable response store by default by @WoosukKwon in #22137
- [CI/Build][Bugfix] Fix Qwen2.5 tests in CPU CI via fallback silu_and_mul to torch native implementation by @bigPYJ1151 in #22145
- Add chat doc in quick start by @TankNee in #21213
- fuse fp32 for GLM-4.5 e_score_correction_bias by @zRzRzRzRzRzRzR in #22143
- [Bugfix] Fix failing multimodal standard test by @Isotr0py in #22153
- Use `aiohttp` connection pool for benchmarking by @eicherseiji in #21981
- [fix] fix assertion syntax error in attention utils. by @skyloevil in #22154
- [RLHF] Fix torch.dtype not serializable in example by @22quinn in #22158
- [PD] add test for chat completions endpoint by @Abirdcfly in #21925
- remove duplicate code within cleanup_dist_env_and_memory by @andyxning in #22147
- Add tree attention backend for v1 (part 1) by @TheEpicDolphin in #20401
- [refactor] improve ConstantList exception specificity by @skyloevil in #22156
- Remove index_put from MM embeddings merging by @chenxi-yang in #22105
- [CI Bugfix] Fix wNa16 kernel not found for test_shared_storage_connector_hashes by @tlrmchlsmth in #22163
- [Misc] Modify the organization of GLM series by @jeejeelee in #22171
- [feat] move WEIGHT_SCALE_SUPPORTED into raise block to accelerate RLHF weight loading by @weixiao-huang in #21164
- [Bugfix] Fix failing GGUF models test by @Isotr0py in #22174
- [Sampler] Support returning all logprobs or logits by @22quinn in #21792
- [Doc] Update pooling model docs by @DarkLight1337 in #22186
- Fix Arcee model weight loading: Add custom load_weights by @alyosha-swamy in #21725
- [Responses API] Ignore `store=True` and process the request by default by @WoosukKwon in #22185
- [Bug] Update auto_tune.sh to separate benchmarking and profiling. by @ericehanley in #21629
- [Bugfix][V1][P/D]Fix the uneven polling issue in the toy proxy for P2pNcclConnector by @Abatom in #21819
- [NVIDIA] Auto detect modelopt quant and fix DSR1-FP4 weight loading by @nvpohanh in #22073
- [Bugfix] V1 Fix the cursor leakage issue during request scheduling. by @CLFutureX in #21173
- Revert "[Bugfix] V1 Fix the cursor leakage issue during request scheduling." by @WoosukKwon in #22223
- [V1] reduce block size for tree attention correctness test to fix 'ou… by @TheEpicDolphin in #22207
- [V0 deprecation][P/D] Deprecate v0 `KVConnectorBase` code (1/2) by @lk-chen in #21785
- [FEAT] Refactor ROPE into module by @tjtanaa in #22192
- [ROCm][Bugfix] Compilation passes fix by @gshtras in #22202
- self.gate dtype update for GLM-4.5 by @zRzRzRzRzRzRzR in #22203
- [Log] DeepGEMM Update Log for Unaligned Problem Size by @yewentao256 in #22208
- fix: kimi_k2 return empty tool call list by @tlipoca9 in #22149
- [Misc] Remove pass_config from CompilationConfig dump_json excluded by @elvischenv in #21911
- [Doc] add backend to doc string of initialize_model_parallel by @andyxning in #22142
- [Misc] log more detailed message for ensure_model_parallel_initialized by @andyxning in #22144
- Optimize configuration access with LRU cache in custom ops by @skyloevil in #22204
- [Bugfix] Misaligned params in TreeAttentionImpl by @DarkLight1337 in #22226
- [UX] Fail if an invalid attention backend is specified by @mgoin in #22217
- [Core] Factor out common logic for MM budget calculation by @DarkLight1337 in #22228
- [Model] Pooling model activation supports per request control by PoolingParams by @noooop in #20538
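
  #20538 lets pooling models control activation per request through `PoolingParams`. A hedged sketch, under the assumption that the switch is exposed as an `activation` field (inferred from the PR title) and that the example classifier model is supported:

  ```python
  # Hedged sketch for #20538. The `activation` field on PoolingParams is inferred
  # from the PR title; model name and output layout are assumptions.
  from vllm import LLM, PoolingParams

  llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
  outputs = llm.encode(
      ["vLLM is easy to use."],
      pooling_params=PoolingParams(activation=False),  # assumed: skip softmax, return raw scores
  )
  print(outputs[0].outputs.data)
  ```
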
- [Docs][TPU] Highlight TPU Software version selection by @NickLucche in #22242
- Migrate KimiVLImagePixelInputs to TensorSchema by @bbeckca in #21769
- [Feature] Non-contiguous Support for FP8 Quantization by @yewentao256 in #21961
- [NVIDIA] Support Flashinfer TRT-LLM Prefill Attention Kernel by @elvischenv in #22095
- [Misc] correct static type check for GroupCoordinator by @andyxning in #21946
- [V0 Deprecation][TPU] Remove V1 flag check from tests by @NickLucche in #22248
- Use UV_LINK_MODE=copy in Dockerfile to avoid hardlink fail by @mgoin in #22128
- [CI/Build] Update flashinfer to 0.2.9 by @mgoin in #22233
- [Refactor] Remove Unused Environment Variable `VLLM_NO_DEPRECATION_WARNING` by @yewentao256 in #22199
- [V1] port xformers backend to v1 by @TheEpicDolphin in #21342
- [bugfix] fix blackwell deepep installation by @youkaichao in #22255
- [CI][TPU] Fix docker clean up by @lsy323 in #22271
- [Bugfix] Remove faulty test for oot attention backend by @mgoin in #22286
- [Bugfix] Fix 3D input passed into cutlass_scaled_mm by @mgoin in #22278
- [Bugfix] Fix MoE BNB version by @jeejeelee in #22260
- [Perf] Parallelize fill_bitmask to accelerate high-throughput guided decoding by @benchislett in #21862
- [Bugfix] Skip dead and non-GPU nodes for Ray DP engine allocation by @ruisearch42 in #22275
- [Bugfix][CI/Build][ROCm] Make sure to use the headers from the build folder on ROCm by @gshtras in #22264
- Upgrade FA3 for attention sink by @WoosukKwon in #22313
- Increase openai-python version by @WoosukKwon in #22316
- Add attention sink in attention backends by @WoosukKwon in #22320
- Update transformers to `v4.55` by @hmellor in #21931
- Add GPT-OSS model code and config [1/N] by @WoosukKwon in #22327
- [ROCm] Add attention sink to use_rocm_custom_paged_attention by @WoosukKwon in #22329
- [GptOss] Add GptOss reasoning parser to support structure output by @heheda12345 in #22322
- [gpt-oss] flashinfer attention sink init by @zyongye in #22330
- [gpt-oss] Add openai-harmony as default dependency by @WoosukKwon in #22332
- [Misc] Clean up duplicated hf overrides by @Isotr0py in #22311
- [gpt-oss] Add Tool/ConversationContext classes and harmony_utils by @WoosukKwon in #22340
- [gpt-oss] add model to supported models doc by @ywang96 in #22336
- [gpt-oss] Support chat completion api by @WoosukKwon in #22342
- [Minor] Fix type by @WoosukKwon in #22347
- [BugFix] Fix FA2 RuntimeError when sinks is provided by @LucasWilkinson in #22365
- Add code to check the number of AMD Instinct GPUs by @zhangnju in #22367
- [BugFix] Fix triton compile error in `kernel_unified_attention_2/3d` caused by attention sinks by @LucasWilkinson in #22368
- [Bugfix] Make condition in triton kernel constexpr by @gshtras in #22370
- [gpt-oss] Add loop for built-in tool call by @WoosukKwon in #22374
- [gpt-oss] attention sink init fix gemini by @zyongye in #22335
- [gpt-oss] flashinfer mxfp4 by @zyongye in #22339
- [v1] - Mamba1 Attention Metadata by @Josephasafg in #21249
- [Bug] Fix B200 DeepGEMM E8M0 Accuracy Issue by @yewentao256 in #22399
- [gpt-oss] add demo tool server by @heheda12345 in #22393
- [gpt-oss] fix model config with hf_config by @zyongye in #22401
- Fix trtllm-gen attention env and add attention sink by @IwakuraRein in #22378
- Update `flashinfer-python==0.2.10` by @mgoin in #22389
- [model] Support MiniCPM-V 4.0 by @tc-mb in #22166
- Support encoder_only attention for FlexAttention by @maxdebayser in #22273
- [Attention] Support multiple attention metadata builders per kv_cache_spec + proper local attention no hybrid kv cache fix by @LucasWilkinson in #21588
- [XPU] Fix `flash_attn_varlen_func` interface on xpu by @jikunshang in #22350
- [Qwen3] Enable dual-chunk-attention support for Qwen3 models. by @sighingnow in #21924
- [Bugfix] Fix wrong method name in Intern-S1 image processor by @DarkLight1337 in #22417
- Use float32 for test_completion.py by @mgoin in #22385
- [Bugfix]: Fix the streaming output for function calls in the minimax by @qscqesze in #22015
- [Bugfix] Add proper comparison for package versions by @syedmba in #22314
- Update `hf_xet` pin to resolve hangs by @hmellor in #22356
- Optimize logger init performance by using module-level constants by @skyloevil in #22373
- preload heavy modules when mp method is forkserver by @lionelvillard in #22214
- [gpt-oss] Convert user input to harmony format by @heheda12345 in #22402
- [Bugfix] EPLB load statistics problem by @david6666666 in #22167
- [CI] Skip the pooling models that do not support transformers v4.55 by @noooop in #22411
- [Bench] Split serve.py:main into sync/async versions by @lk-chen in #22405
- [Model] Switch to Fused RMS norm in Qwen2.5_VL model. by @vllmellm in #22184
- [Frontend] Update OpenAI error response to upstream format by @msanft in #22099
- [Misc] Support routing logic simulation by @minosfuture in #21990
- feat: Add --enable-log-outputs flag for logging model generations by @mizadri in #20707
- [Docs] Add missing dependency for docs build by @hmellor in #22435
- Add H20-3e fused MoE kernel tuning configs for GLM-4.5 by @JaceyShao in #22433
- [Misc] Enhance code formatting in mxfp4.py by @WoosukKwon in #22423
- [Doc] Fix link to prefix caching design by @sarckk in #22384
- [Docs] Factor out troubleshooting to its own guide; add section for Ray Observability by @crypdick in #21578
- [Doc] update docs for nightly benchmarks by @andrewkchan in #12022
- [Docs] Update features/disagg_prefill, add v1 examples and development by @david6666666 in #22165
- [Core] Store only the keys for multi-modal data in P0 by @DarkLight1337 in #22198
- [Bugfix] Add missing `packed_modules_mapping` to `DeepseekV2ForCausalLM` by @fxmarty-amd in #22352
- [Tool] Fix auto tool call by @heheda12345 in #22434
- [gpt-oss] Generate ResponseOutputItem from Harmony Message by @heheda12345 in #22410
- Fix pre-commit error in main by @WoosukKwon in #22462
- [Core] Simplify mm processing cache by @DarkLight1337 in #22457
- [Frontend] Use engine argument to control MM cache size by @DarkLight1337 in #22441
- Remove `from_dict` from `SpeculativeConfig` by @hmellor in #22451
- [Misc] normalize multiprocessing Queue usage by @andyxning in #22371
- [ROCm] [V1] [SpecDec] Enable Speculative Decoding on ROCm V1 Engine by @tjtanaa in #21496
- [PERF] Use pybase64 to more quickly decode prompt embeddings by @qthequartermasterman in #22469
- Add ModelOpt Qwen3 nvfp4 support by @Edwardf0t1 in #20101
- Support Tensorrt-LLM MoE fp4 for low-latency by @wenscarl in #21331
- Fix Flashinfer CUTLASS MOE Allgather by @wenscarl in #21963
- [Kernel] Add support for block FP8 on SM120 (NVIDIA 5090 and RTX PRO 6000) by @0xjunhao in #22131
- [Bugfix] Fix RuntimeError: Index put requires the source and destination dtypes match by @chaunceyjiang in #22065
- not tie_word_embeddings for glm-4.5 and glm-4.5v by @zRzRzRzRzRzRzR in #22460
- Optimize MiniCPMO mask creation with vectorized implementation by @skyloevil in #22464
- Fix pre-commit by @DarkLight1337 in #22487
- [bugfix] Fix Llama3/4 issues caused by FlashInfer 0.2.10 by @nvpohanh in #22426
- [Doc] Sleep mode documentation by @iAmir97 in #22310
- [bench] Fix benchmark/serve.py to ignore unavailable results by @lk-chen in #22382
- [CI/Build] Fix multimodal tests by @DarkLight1337 in #22491
- [Misc] Begin deprecation of `get_tensor_model_*_group` by @DarkLight1337 in #22494
- [Misc] fix openai version by @lengrongfu in #22485
- [BugFix] Don't cancel asyncio tasks directly from destructors by @njhill in #22476
- [Docs] Improve API docs (+small tweaks) by @hmellor in #22459
- Remove exception for Python 3.8 typing from linter by @hmellor in #22506
- [gpt-oss] triton kernel mxfp4 by @zyongye in #22421
- [Benchmark] Add benchmark tool for multi turn conversations by @pliops-daniels in #20267
- [gpt-oss] guard import when triton kernel is not installed by @zyongye in #22529
- [Docs] Rename “Distributed inference and serving” to “Parallelism & Scaling” by @crypdick in #22466
- [gpt-oss] Support tool call and implement MCP tool server by @heheda12345 in #22427
- [BugFix] Fix IMA FlashMLA full cuda-graph and DP + Update FlashMLA by @LucasWilkinson in #21691
- [Misc] DeepGEMM : Avoid JIT generation in the hot-path by @varun-sundar-rabindranath in #22215
- [Bugfix] Update FA commit hash by @tdoublep in #22546
- Skip Qwen 1 in CI because remote code is no longer compatible with Transformers by @hmellor in #22536
- [Docs] fix broken links in metrics.md by @GuyStone in #22315
- [Frontend] Add unix domain socket support by @yyweiss in #18097
- Extract `CompilationConfig` from `config.py` by @hmellor in #22524
- Drop flaky test_healthcheck_response_time by @russellb in #22539
- [XPU] upgrade torch 2.8 on for XPU by @jikunshang in #22300
- [BugFix] [P/D] Handle lookahead token count edge-case with Eagle Spec Decoding and P/D by @Pradyun92 in #22317
- [Bugfix] Fix ModernBert cuda graph capturing in v1 by @Isotr0py in #21901
- Implicit language-model-only mode via limit-mm-per-prompt by @ywang96 in #22299
- [Doc] Add usage of implicit text-only mode by @ywang96 in #22561
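
  #22299 and #22561 add an implicit language-model-only mode driven by `limit-mm-per-prompt`. A minimal offline sketch, assuming that capping every modality at 0 is what triggers text-only operation (the model name is an assumption):

  ```python
  # Sketch of the implicit text-only mode (#22299/#22561): setting all
  # multi-modal limits to 0 is assumed to run the model as a plain LM,
  # skipping multi-modal processing.
  from vllm import LLM, SamplingParams

  llm = LLM(
      model="Qwen/Qwen2.5-VL-3B-Instruct",           # assumed multimodal model
      limit_mm_per_prompt={"image": 0, "video": 0},  # text-only mode
  )
  outputs = llm.generate(["Summarize the vLLM release notes."], SamplingParams(max_tokens=32))
  print(outputs[0].outputs[0].text)
  ```
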
- Remove mamba_ssm from vLLM requirements; install inside test container using `--no-build-isolation` by @tdoublep in #22541
- [Log] Add Warning for Deprecation of DeepGEMM old version by @yewentao256 in #22194
- [V1] [Hybrid] Support Minimax-Text-01 in V1 by @tdoublep in #22151
- v1: Pass KVConnectorOutput to scheduler-side by @orozery in #22157
- [Misc] Use config definitions from Transformers library by @DarkLight1337 in #21913
- Fix loading of quantized BigCode models by @eldarkurtic in #22463
- [TPU] Add support for online w8a8 quantization by @kyuyeunk in #22425
- [ROCm][Misc] Rename the context_len to seq_len in ROCm custom paged attention kernel by @charlifu in #22097
- [Bugfix] Fix failing GPT-OSS initialization test by @Isotr0py in #22557
- [Bugfix] Fix CI moe kernel failure by @jeejeelee in #22556
- Update docs for Minimax-Text support by @tdoublep in #22562
- GLM-4.5V with new class name at transformers by @zRzRzRzRzRzRzR in #22520
- [CI] [Hybrid] Speed up hybrid models test by removing large models by @tdoublep in #22563
- [Docs] Reduce noise in docs and `--help` from the JSON tip by @hmellor in https://github.com/vllm-project/vllm/pull/22567
- Move `ParallelConfig` from `config/__init__.py` to `config/parallel.py` by @hmellor in https://github.com/vllm-project/vllm/pull/22565
- [Model] Gemma3n MM by @NickLucche in https://github.com/vllm-project/vllm/pull/20495
- [Bugfix] Fix basic models tests hanging due to mm processor creation by @Isotr0py in https://github.com/vllm-project/vllm/pull/22571
- [FEAT] [Performance] Add triton mrope to replace the torch code path by @tjtanaa in https://github.com/vllm-project/vllm/pull/22375
- [V1] [Hybrid] Enable Full CUDA Graph (decode-only) for Mamba layers by @tdoublep in https://github.com/vllm-project/vllm/pull/21401
- [oss] Init gpt-oss bf16 support by @jeejeelee in https://github.com/vllm-project/vllm/pull/22508
- [Config] add "qwen" as a native eagle3 target supported model by @lec77 in https://github.com/vllm-project/vllm/pull/22333
- Improve fast_topk function with type hints and documentation by @skyloevil in https://github.com/vllm-project/vllm/pull/22530
- [TPU] kv cache update kernel doesn't need slices padded to a multiple of num_slices_per_block by @yaochengji in https://github.com/vllm-project/vllm/pull/22394
- Refactor sliding window configuration to Transformers best practice by @hmellor in https://github.com/vllm-project/vllm/pull/21927
- [Minor] Fix pre-commit error on main by @Isotr0py in https://github.com/vllm-project/vllm/pull/22579
- [Misc] code clean duplicate set_current_vllm_config in _set_vllm_config by @andyxning in https://github.com/vllm-project/vllm/pull/22566
- [Doc] Fix API doc link in side navigation by @22quinn in https://github.com/vllm-project/vllm/pull/22585
- [Misc] Further refine type annotations in parallel state by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/22499
- [Docs] Fix warnings in docs build by @hmellor in https://github.com/vllm-project/vllm/pull/22588
- [Misc] Replace flaky image urls in pixtral test by @Isotr0py in https://github.com/vllm-project/vllm/pull/22574
- Move `CacheConfig` from `config/__init__.py` to `config/cache.py` by @hmellor in https://github.com/vllm-project/vllm/pull/22586
- [doc] add beijing meetup links by @youkaichao in https://github.com/vllm-project/vllm/pull/22596
- [doc] add alibaba cloud as sponsor by @youkaichao in https://github.com/vllm-project/vllm/pull/22597
- [Bugfix][Kernel] Support partial rotary embedding for MRoPE triton kernel by @Isotr0py in https://github.com/vllm-project/vllm/pull/22593
- Fix(benchmarks): allow multiple mm contents in OpenAI Chat Completion Benchmarks by @h-brenoskuk in https://github.com/vllm-project/vllm/pull/22534
- Migrate LlavaNextImageInputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/21774
- Remove redundant row_indices unsqueeze operation in MiniCPMO by @skyloevil in https://github.com/vllm-project/vllm/pull/22528
- Fix TensorSchema validation test for symbolic dims by @bbeckca in https://github.com/vllm-project/vllm/pull/22366
- enable Docker-aware precompiled wheel setup by @dougbtv in https://github.com/vllm-project/vllm/pull/22106
- Migrate LlavaNextVideoPixelInputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/21843
- Migrate LlavaImageInputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/21770
- [CI/Build] Fix tensorizer test for load_format change by @22quinn in https://github.com/vllm-project/vllm/pull/22583
- [BugFix] Fix KVConnectorOutput TPU breakage by @njhill in https://github.com/vllm-project/vllm/pull/22598
- [Misc][gpt-oss] Add rules to label gpt-oss related PRs by @draftbk in https://github.com/vllm-project/vllm/pull/22600
- [Misc][gpt-oss] guard import when triton kernel is not up to date by @zhewenl in https://github.com/vllm-project/vllm/pull/22584
- [BugFix] Fix logits repetition penalty cuda check by @PicoCreator in https://github.com/vllm-project/vllm/pull/22592
- [ROCm][AITER] Support AITER Rope ops in RotaryEmbedding Module. by @vllmellm in https://github.com/vllm-project/vllm/pull/22521
- Support token_type_ids in V1 with less code changes by @maxdebayser in https://github.com/vllm-project/vllm/pull/21985
- [Misc] benchmark_moe supports expert parallel by @jeejeelee in https://github.com/vllm-project/vllm/pull/22251
- [BUGFIX] KeyError 'layers.14.mlp.gate.g_idx' for Qwen3-MoE with GPTQ on ROCm by @JartX in https://github.com/vllm-project/vllm/pull/22017
- [Docs] Add comprehensive CLI reference for all large `vllm` subcommands by @hmellor in https://github.com/vllm-project/vllm/pull/22601
- [Misc] Move tensor schema tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/22612
- [Misc] Move jsontree to utils by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/22622
- [Model] NemotronH Support by @danielafrimi in https://github.com/vllm-project/vllm/pull/22349
- Document aarch64 CPU support works by @ericcurtin in https://github.com/vllm-project/vllm/pull/22646
- [Misc] Further clean up some redundant config definitions by @Isotr0py in https://github.com/vllm-project/vllm/pull/22649
- [Feature] Add `VLLM_USE_DEEP_GEMM_E8M0` Env to Control E8M0 Scale by @yewentao256 in https://github.com/vllm-project/vllm/pull/21968 (see the sketch below)
- fix: NIXL connector transfers partial block to pass full multi-modal context by @GuanLuo in https://github.com/vllm-project/vllm/pull/21074
- [Model] Pooling models default to using chunked prefill & prefix caching if supported. by @noooop in https://github.com/vllm-project/vllm/pull/20930
- [CI/Build] Skip Mllama HF runner tests with Transformers v4.55.0 by @Isotr0py in https://github.com/vllm-project/vllm/pull/22659
- [BugFix] [Spec Decode] Remove LlamaForCausalLMEagle3 to fix CI by @22quinn in https://github.com/vllm-project/vllm/pull/22611
- [CI] Skip Tree Attn Test in `test_max_len.py` to unblock CI by @tjtanaa in https://github.com/vllm-project/vllm/pull/22664
- Support more parallel styles in Transformers backend TP by @hmellor in https://github.com/vllm-project/vllm/pull/22651
- [gpt-oss] Support streaming in response API by @heheda12345 in https://github.com/vllm-project/vllm/pull/22431
- [gpt-oss] Add test for response API + harmony (but skipped) by @heheda12345 in https://github.com/vllm-project/vllm/pull/22554
- Enable 4bit bnb prequant MOE by @py-andy-c in https://github.com/vllm-project/vllm/pull/21548
- Re-enable Xet on TPU tests now that `hf_xet` has been updated by @hmellor in https://github.com/vllm-project/vllm/pull/22666
- Upgrade FlashInfer to v0.2.11 by @nvpohanh in https://github.com/vllm-project/vllm/pull/22613
- [CI Failure] Use float32 for tests/entrypoints/openai/test_audio.py by @mgoin in https://github.com/vllm-project/vllm/pull/22686
- [CI] Increase timeout for test_completion_with_image_embeds by @mgoin in https://github.com/vllm-project/vllm/pull/22670
- Migrate MiniCPMVImageInputs to TensorSchema by @bbeckca in https://github.com/vllm-project/vllm/pull/21939
- [gpt-oss] Fix mxfp4 support by @heheda12345 in https://github.com/vllm-project/vllm/pull/22700
- [gpt-oss] Small bug fixes for frontend by @heheda12345 in https://github.com/vllm-project/vllm/pull/22512
- Fix passing `SpeculativeConfig` from the CLI by @hmellor in https://github.com/vllm-project/vllm/pull/22652
- [Doc] Added unmentioned required option "method" in the usage of EAGLE-3 based models by @hsliuustc0106 in https://github.com/vllm-project/vllm/pull/21737
- [doc] Update x86 CPU-inference installation doc to reflect optionality of AVX512f by @sooraj-satheesh in https://github.com/vllm-project/vllm/pull/22707
- [Bugfix] Fix ModernBert load & Enable sliding window attention for bidirectional attention. by @noooop in https://github.com/vllm-project/vllm/pull/22637
- Move `SchedulerConfig` from `config/__init__.py` to `config/scheduler.py` by @hmellor in https://github.com/vllm-project/vllm/pull/22626
- [DOC] update v1_guide with INTEL HW by @xuechendi in https://github.com/vllm-project/vllm/pull/22679
- [New Model] Support Command-A-Vision by @dongluw in https://github.com/vllm-project/vllm/pull/22660
- [V0] Correct CUDA Graph capture for encoder-decoder models by @Sugar-zsg in https://github.com/vllm-project/vllm/pull/22630
- [Bugfix] Fix erroneous randomly generated cases in bad word testing by @phantomlei3 in https://github.com/vllm-project/vllm/pull/22170
- Fix: AWQ Marlin get_quant_method does not recognize "modules_to_not_convert" by @Jun-Howie in https://github.com/vllm-project/vllm/pull/21888
- [Bugfix] Mamba2 SSD varlen bug fix initstates decay, improve test, assert chunk pwr 2 by @RishiAstra in https://github.com/vllm-project/vllm/pull/21783
- [LMCache][Example] Align the PYTHONHASHSEED for prefillers and decoders for KV chunks hashing by @zejunchen-zejun in https://github.com/vllm-project/vllm/pull/21161
- [Misc] remove GH discussions link by @jeejeelee in https://github.com/vllm-project/vllm/pull/22722
- [gpt-oss] Enable gpt-oss on ampere by @zyongye in https://github.com/vllm-project/vllm/pull/22714
- [Docs] Improve docs navigation by @hmellor in https://github.com/vllm-project/vllm/pull/22720
- [BugFix][Nixl][PD] Fix heterogenous TP by @NickLucche in https://github.com/vllm-project/vllm/pull/22663
- Officially support SmolLM3 using the Transformers backend by @hmellor in https://github.com/vllm-project/vllm/pull/22665
- [CI Failure] fix tests/entrypoints/openai/test_skip_tokenizer.py by @noooop in https://github.com/vllm-project/vllm/pull/22708
- Fix Llama4 FlashInfer FP4 MoE issues by @nvpohanh in https://github.com/vllm-project/vllm/pull/22511
- [Bugfix][CI] Fix `test_remote_decode_lifecycle.py::test_short_prompt_lifecycle` by @NickLucche in https://github.com/vllm-project/vllm/pull/22727
- [Benchmark] Fix terminal colors in benchmark_serving_multi_turn (python 3.12) by @pliops-daniels in https://github.com/vllm-project/vllm/pull/22730
- Add: `SupportsEagle3` interface for explicit EAGLE3 support by @rahul-tuli in https://github.com/vllm-project/vllm/pull/22642
- Add more test scenarios for tensor schema by @teekenl in https://github.com/vllm-project/vllm/pull/22733
- [Chore] Update CODEOWNERS to include @yewentao256 for CUDA kernels, attention backends, quantization, and related tests by @yewentao256 in https://github.com/vllm-project/vllm/pull/22741
- [Kernel][AMD] Avoid D2H copy and cumsum kernel by @mxz297 in https://github.com/vllm-project/vllm/pull/22683
- [CI][Nixl] Check kv cache layout during handshake by @NickLucche in https://github.com/vllm-project/vllm/pull/22745
- Fix torch version check for SM100 mxfp4 by @zifeitong in https://github.com/vllm-project/vllm/pull/22535
- [Misc] parametrize 'dtype' in test_flash_mla by @RUTHLESS-BOT in https://github.com/vllm-project/vllm/pull/22641
- [Bugfix] Bump DeepGEMM Version to Fix SMXX Layout Issues by @frankwang28 in https://github.com/vllm-project/vllm/pull/22606
- [Docs] Hide the navigation and toc sidebars on home page by @hmellor in https://github.com/vllm-project/vllm/pull/22749
- Fix Transformers backend tensor parallel for multimodal models by @hmellor in https://github.com/vllm-project/vllm/pull/22673
- [Model] Decouple glm4v by @jeejeelee in https://github.com/vllm-project/vllm/pull/22751
- Add hardware plugins to installation doc by @mgoin in https://github.com/vllm-project/vllm/pull/22732
- [V0 Deprecation] Remove multi-step scheduling by @WoosukKwon in https://github.com/vllm-project/vllm/pull/22138
- [Misc] Remove tests/multi_step/init.py by @WoosukKwon in https://github.com/vllm-project/vllm/pull/22778
- [V0 Deprecation] Remove args for multi-step scheduling by @WoosukKwon in https://github.com/vllm-project/vllm/pull/22779
- Fix cuda illegal mem access with Llama4 TP8 + rms_norm custom op by @nvpohanh in https://github.com/vllm-project/vllm/pull/22701
- [Bugfix] Fix default enable for CUTLASS MLA on SM100 by @mgoin in https://github.com/vllm-project/vllm/pull/22738
- Force TRTLLM attention for gpt-oss on SM100 by @mgoin in https://github.com/vllm-project/vllm/pull/22678
- Remove unneeded ROCm platform import when using CUDA by @mgoin in https://github.com/vllm-project/vllm/pull/22765
- [Bug] Fix Unexpected Keyword Argument 'w1_bias' by @yewentao256 in https://github.com/vllm-project/vllm/pull/22757
- [Perf] Support topk softmax fused kernel for broader num_experts by @shixianc in https://github.com/vllm-project/vllm/pull/22211
- [gpt-oss] upgrade gpt-oss to v0.0.3 and add version check by @heheda12345 in https://github.com/vllm-project/vllm/pull/22768
- [Model] Add option to run Step3VisionEncoder in DP by @zzh142857 in https://github.com/vllm-project/vllm/pull/22697
- [Model] Add missing prefix to glm4_1v by @zRzRzRzRzRzRzR in https://github.com/vllm-project/vllm/pull/22716
- [Bugfix] Fix Nemotron VL image processing by @ducviet00 in https://github.com/vllm-project/vllm/pull/22739
- [Doc] Add max_lora_rank configuration guide by @chi2liu in https://github.com/vllm-project/vllm/pull/22782
- [V1] Add tree drafting tests for eagle spec decoding by @TheEpicDolphin in https://github.com/vllm-project/vllm/pull/22705
- [Platform] Custom ops support for FusedMoe by @wangxiyuan in https://github.com/vllm-project/vllm/pull/22509
- [Frontend] Add chunked processing to handle long inputs in embedding models by @x22x22 in https://github.com/vllm-project/vllm/pull/22280
- [FEATURE] support custom vllm tuned config path by @vermouth1992 in https://github.com/vllm-project/vllm/pull/22791
- [Nixl][CI] Fix tests by @NickLucche in https://github.com/vllm-project/vllm/pull/22806
- [Bugfix][mamba] Fix type annotation of Mamba2Metadata by @heheda12345 in https://github.com/vllm-project/vllm/pull/22787
- Remove unnecessary CUDA sync of qwen image and video preprocess by @cyyever in https://github.com/vllm-project/vllm/pull/22792
- Fix GGUF loader for Qwen3 MoE. by @Gh0u1L5 in https://github.com/vllm-project/vllm/pull/22785
- [Frontend] Multithreaded async multimodal load_bytes by @milesial in https://github.com/vllm-project/vllm/pull/22710
- [Core] Use individual MM items in P0/P1 cache and model runner by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/22570
- [Misc] clear and separate error messages for input too long and input + max-tokens too long by @ywang96 in https://github.com/vllm-project/vllm/pull/22803
- [Bugfix] Fix MiniCPMV Image input inference failed by @jio-H in https://github.com/vllm-project/vllm/pull/22813
- [CI/Build] Update VLM common tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/22841
- [CI] Fix `tests/v1/e2e/test_kv_sharing_fast_prefill.py` import on test by @NickLucche in https://github.com/vllm-project/vllm/pull/22815
- [CI/Build] Fix param mismatch in `test_eagle_correctness` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/22847
- [CI/Build] Skip gpt_big model test because of broken HF model by @Isotr0py in https://github.com/vllm-project/vllm/pull/22848
- [ROCm][Bugfix] Fix compilation error in topk softmax fused kernel by @kliuae in https://github.com/vllm-project/vllm/pull/22819
- Move checklist in PR template by @ProExpertProg in https://github.com/vllm-project/vllm/pull/22852
- [Core] [N-gram SD Optimization][1/n] Propose tokens with a single KMP by @Jialin in https://github.com/vllm-project/vllm/pull/22437
- [CI/Build] Increase pooling tolerance to pass CI by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/22844
- [CI][Entrypoints]: add filter to generation to filter out invalid tool calls by @wseaton in https://github.com/vllm-project/vllm/pull/22826
- [CI] Fix `tests/distributed/test_ca_buffer_sharing.py` by @ilmarkov in https://github.com/vllm-project/vllm/pull/22849
- [CI] remove flaky v0 test by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/22864
- vLLM Benchmark suite improvement by @louie-tsai in https://github.com/vllm-project/vllm/pull/22119
- [Bugfix] Fix `PixtralHFImagePixelInputs` dynamic shape check by @Isotr0py in https://github.com/vllm-project/vllm/pull/22827
- [BugFix] Threadsafe close async zmq sockets by @njhill in https://github.com/vllm-project/vllm/pull/22877
- Remove Phi 4 Flash configuration workaround by @hmellor in https://github.com/vllm-project/vllm/pull/22723
- [Bugfix] Add reset prefix cache for online serving by @iAmir97 in https://github.com/vllm-project/vllm/pull/22726
- [Doc] fix dead link by @dtrifiro in https://github.com/vllm-project/vllm/pull/22898
- [CI] Re-enable transcriptions `test_long_audio_request` by @NickLucche in https://github.com/vllm-project/vllm/pull/22890
- [Perf] Don't create unnecessary pooling params by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/22876
- [Model] Modify the gate implementation of glm4_moe by @jeejeelee in https://github.com/vllm-project/vllm/pull/22832
- [Bugfix] Replace custom Encoding class with BatchEncoding in MistralTokenizer by @ZJY0516 in https://github.com/vllm-project/vllm/pull/22786
- [Bugfix] Fix parsing of `--disable-mm-preprocessor-cache` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/22909
- [CI] [Hybrid] Bump min transformers version for Bamba and Jamba by @tdoublep in https://github.com/vllm-project/vllm/pull/22908
- [Kernel] [Quantization] Add MXFP4 and bias support for marlin kernel by @jinzhen-lin in https://github.com/vllm-project/vllm/pull/22428
- docs: update fastsafetensors usage instructions by @NirLevy98 in https://github.com/vllm-project/vllm/pull/22891
- [CI] Temporarily disable flaky test by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/22930
- [Kernel] Add nvfp4 gemm flashinfer backends by @nvjullin in https://github.com/vllm-project/vllm/pull/22346
- [Quantization]: Support compressed-tensors mixed-precision model loading by @dsikka in https://github.com/vllm-project/vllm/pull/22468
- [Core] Return final response for aborted requests from `AsyncLLM.generate` by @njhill in https://github.com/vllm-project/vllm/pull/22283
- [BugFix] Fix initial DP request load imbalance by @njhill in https://github.com/vllm-project/vllm/pull/22910
- [Bugfix] use flash attn on sm90 by @zyongye in https://github.com/vllm-project/vllm/pull/22933
- [Kernel] Add cuda kernel for gpt_oss activation by @jeejeelee in https://github.com/vllm-project/vllm/pull/22538
- Revert "[Kernel] Add cuda kernel for gpt_oss activation" by @simon-mo in https://github.com/vllm-project/vllm/pull/22948
- [BugFix][KVConn] Fix use of `get_required_kvcache_layout` by @njhill in https://github.com/vllm-project/vllm/pull/22734
- [BugFix] Fix port lookup in internal DP LB tests by @njhill in https://github.com/vllm-project/vllm/pull/22252
- [CI Perf] Prune tests in `tests/kernels/quantization/` by @mgoin in https://github.com/vllm-project/vllm/pull/22942
- [CI Perf] Prune tests in `tests/kernels/moe/` by @mgoin in https://github.com/vllm-project/vllm/pull/22939
- [CI Perf] Prune tests in `tests/kernels/attention/` by @mgoin in https://github.com/vllm-project/vllm/pull/22936
- refactor: Change scaling factors calculation for flashinfer FusedMoE by @amirkl94 in https://github.com/vllm-project/vllm/pull/22812
- [Feature] Full Cuda Graph Support for Cutlass MLA and 6% E2E Throughput Improvement by @yewentao256 in https://github.com/vllm-project/vllm/pull/22763
- [Mamba] - refactor: Renamed mamba_attn to mamba2_attn by @Josephasafg in https://github.com/vllm-project/vllm/pull/22818
- Revert "[ROCm][AITER] Support AITER Rope ops in RotaryEmbedding Module." by @tjtanaa in https://github.com/vllm-project/vllm/pull/22956
- [P/D] Provide bucket algorithm rate limiter for proxy_server by @frankie-ys in https://github.com/vllm-project/vllm/pull/22643
- [CI] Pooling models mteb test uses enforce_eager by @noooop in https://github.com/vllm-project/vllm/pull/22878
- [V1] - Split Prefill and Decode for Mamba1 models by @amirai21 in https://github.com/vllm-project/vllm/pull/22653
- [Bugfix] Unquote file uri before reading image by @sayandipdutta in https://github.com/vllm-project/vllm/pull/22912
- [Bugfix] fix cuda 12.6 and 11.8 build by @jinzhen-lin in https://github.com/vllm-project/vllm/pull/22952
- [MM] Allow skipping memory profiling for multimodal models. by @ywang96 in https://github.com/vllm-project/vllm/pull/22950
- Improve multimodal hasher performance for re-used Image prompts by @p88h in https://github.com/vllm-project/vllm/pull/22825
- [V1] [Hybrid] Support using float32 for state in Hybrid Models (Mamba2, Mamba1, Minimax) by @tdoublep in https://github.com/vllm-project/vllm/pull/22928
- [Misc] Ignore ep_kernels_workspace by @jeejeelee in https://github.com/vllm-project/vllm/pull/22807
- [CI] Remove duplicated docs build from buildkite by @hmellor in https://github.com/vllm-project/vllm/pull/22924
- [Frontend] Expose do_log_stats interval to env by @Csrayz in https://github.com/vllm-project/vllm/pull/22905
- [Core] Allow full cudagraph with separate attention routines and orthogonal to compilation, add support for FA2 and FlashInfer by @fhl2000 in https://github.com/vllm-project/vllm/pull/20059
- [V0 Deprecation] Remove advance_step by @WoosukKwon in https://github.com/vllm-project/vllm/pull/22969
- [BugFix] Skip the Q component for QKVParallelLinear in the case of QKVCrossParallelLinear since its width is 0 by @sstamenk in https://github.com/vllm-project/vllm/pull/22369
- [FIXBUG] Correctly Apply Grammar Bitmask in Mixed Batches by @JartX in https://github.com/vllm-project/vllm/pull/22896
- [Benchmarks] Include image data when ShareGPT4V dataset is used. by @huachenheli in https://github.com/vllm-project/vllm/pull/22955
- [Structured Output] Make the output of structured output example more complete by @shen-shanshan in https://github.com/vllm-project/vllm/pull/22481
- [Kernels] Clean up FusedMoeMethodBase and modular kernel setup. Remove extra arguments from modular kernel methods. by @bnellnm in https://github.com/vllm-project/vllm/pull/22035
- [Model] Granite-4 support loading quantized checkpoint by @cyang49 in https://github.com/vllm-project/vllm/pull/22925
- [Log] Debug Once for Randomizing dummy data for DP Rank by @yewentao256 in https://github.com/vllm-project/vllm/pull/22860
- [Core] direct indexing on self.block_table_np in compute_slot_mapping by @linzebing in https://github.com/vllm-project/vllm/pull/22940
- [Bugfix] Added more env vars to hash by @nvjullin in https://github.com/vllm-project/vllm/pull/22449
- Use regex in convert-results-json-to-markdown.py by @mgoin in https://github.com/vllm-project/vllm/pull/22989
- [CI] Speed up Whisper tests by reusing server by @mgoin in https://github.com/vllm-project/vllm/pull/22859
- [Fix] enable swap_ab for pplx problem size computation by @shixianc in https://github.com/vllm-project/vllm/pull/22991
- Add PrefixRepetitionRandomDataset to `vllm bench serve` datasets by @eicherseiji in https://github.com/vllm-project/vllm/pull/20638
- minor: zero workspace buffer init for flashinfer trtllm-gen attn by @yyihuang in https://github.com/vllm-project/vllm/pull/22603
- [Attention] FA3 Attention Sinks Perf Boost by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/22478
- [BugFix] Fix regression caused by mamba state dtype PR by @tdoublep in https://github.com/vllm-project/vllm/pull/22998
- ci: Add CUDA + arm64 release builds by @seemethere in https://github.com/vllm-project/vllm/pull/21201
- [Structured Outputs] [Bug] Fix misalignment in apply_grammar_bitmask causing unintended masking and NaN logits by @rishitdholakia13 in https://github.com/vllm-project/vllm/pull/22963
- [BugFix] Handle case where async utility call is cancelled by @njhill in https://github.com/vllm-project/vllm/pull/22996
- [v1] Move block_hashes from KVCacheManager to Request.block_hashes (#19728) by @orozery in https://github.com/vllm-project/vllm/pull/19728
- Support multiple attention groups for KV sharing by @sarckk in https://github.com/vllm-project/vllm/pull/22672
- [BugFix] Make `run_once` thread-safe by @oraluben in https://github.com/vllm-project/vllm/pull/22978
- [Misc] Support passing multiple request ids at once to `AsyncLLM.abort()` by @njhill in https://github.com/vllm-project/vllm/pull/22944
- [Kernel] Simplify `get_kv_cache_layout` and cache `use_trtllm_attention` env-dependent bit by @NickLucche in https://github.com/vllm-project/vllm/pull/22735
- [Bugfix] Fix DeepSeek MTP by @benchislett in https://github.com/vllm-project/vllm/pull/22934
- [Frontend] Avoid list copies in `serving_chat.py` by @njhill in https://github.com/vllm-project/vllm/pull/22947
- [V1] support min_tokens for detokener by @calvin0327 in https://github.com/vllm-project/vllm/pull/22014
- [misc] nsys profile output kernel classifier and visualizer by @gracehonv in https://github.com/vllm-project/vllm/pull/22971
- [XPU]avoid circular import during XPU init by @jikunshang in https://github.com/vllm-project/vllm/pull/23017
- [Build] Env var to disable sccache by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/22968
- [BugFix] Add support for loading prompt embeds tensors serialized on unavailable devices and sparse tensors by @qthequartermasterman in https://github.com/vllm-project/vllm/pull/22962
- [Misc] Add --save-dir option to benchmark_moe by @jeejeelee in https://github.com/vllm-project/vllm/pull/23020
- [Multimodal] Update Tensor schema test to cover arbitrary shape mm inputs by @Isotr0py in https://github.com/vllm-project/vllm/pull/22867
- [Core] Make cudagraph check cuda platform only by @yaochengji in https://github.com/vllm-project/vllm/pull/23005
- [CI][Bugfix] Skip Ovis2 generation test because of broken remote code by @Isotr0py in https://github.com/vllm-project/vllm/pull/22954
- Add docs for PrefixRepetitionDataset + enable usage with `vllm bench throughput` by @eicherseiji in https://github.com/vllm-project/vllm/pull/23012
- [Refactor] Allow optional MultiModalKwargsItem in IPC by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23022
- [New Model]mBART model by @princepride in https://github.com/vllm-project/vllm/pull/22883
- Fix handling of `max_num_batched_tokens` for pooling tasks by @maxdebayser in https://github.com/vllm-project/vllm/pull/23004
- [Frontend] Added support for HermesToolParser for models without special tokens by @minpeter in https://github.com/vllm-project/vllm/pull/16890
- [Bugfix gpt-oss] Fix float32 convert for flashinfer sink support by @mgoin in https://github.com/vllm-project/vllm/pull/23016
- [Flaky CI] Increase timeout tolerance for test_mp_crash_detection+test_default_mm_lora_chat_completions by @mgoin in https://github.com/vllm-project/vllm/pull/23028
- [Kernel/Quant] Remove AQLM by @mgoin in https://github.com/vllm-project/vllm/pull/22943
- [V1] Logits processors extensibility by @afeldman-nm in https://github.com/vllm-project/vllm/pull/19912
- [Bugfix] fix qwen3 moe fp8 accuracy issue by @jinzhen-lin in https://github.com/vllm-project/vllm/pull/23031
- [UX] Separate marlin moe config logic from triton moe by @mgoin in https://github.com/vllm-project/vllm/pull/23006
- [Refactor] Defer tensor data construction in MultiModalKwargs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/23030
- [Misc] method name typo fix by @andyxning in https://github.com/vllm-project/vllm/pull/23042
- [Kernel] Add cuda kernel for gpt_oss activation by @jeejeelee in https://github.com/vllm-project/vllm/pull/22951
- [Bugfix] should use stack instead of concat by @947132885 in https://github.com/vllm-project/vllm/pull/22972
- [Misc] fix typo in the multimodal doc by @KevinZeng08 in https://github.com/vllm-project/vllm/pull/23051
- [BugFix] Fix for IMA in FA3 varlen combine by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/22967
- [Misc] Remove dead return by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23061
- [Misc] Convert use_structured_output property into constant by @WoosukKwon in https://github.com/vllm-project/vllm/pull/23060
- [XPU] fix xpu to set cudagraph batch sizes by @calvin0327 in https://github.com/vllm-project/vllm/pull/23044
- fix: gptq marlin weight loading failure by @simon-mo in https://github.com/vllm-project/vllm/pull/23066
New Contributors
- @zhouwfang made their first contribution in #21407
- @juncgu made their first contribution in #18293
- @weireweire made their first contribution in #21485
- @bbeckca made their first contribution in #21232
- @wzqd made their first contribution in #21494
- @hfan made their first contribution in #21479
- @ignaciosica made their first contribution in #21195
- @xyxinyang made their first contribution in #21586
- @bigshanedogg made their first contribution in #20931
- @fsx950223 made their first contribution in #20295
- @mgazz made their first contribution in #21518
- @Mitix-EPI made their first contribution in #21612
- @lvhan028 made their first contribution in #21628
- @zhouyeju made their first contribution in #21380
- @wenchen76 made their first contribution in #21154
- @skyloevil made their first contribution in #20529
- @joa-stdn made their first contribution in #21697
- @liuyumoye made their first contribution in #21534
- @hsliuustc0106 made their first contribution in #21573
- @Josephasafg made their first contribution in #21715
- @vasqu made their first contribution in #21735
- @key4ng made their first contribution in #19024
- @wuhang2014 made their first contribution in #21728
- @HugoMichard made their first contribution in #21167
- @smarterclayton made their first contribution in #21472
- @nikhil-arm made their first contribution in #17112
- @LyrisZhong made their first contribution in #20396
- @rzabarazesh made their first contribution in #21347
- @milesial made their first contribution in #21798
- @Csrayz made their first contribution in #21803
- @MingzhenHan made their first contribution in #21827
- @aladerran made their first contribution in #20815
- @Yanpas made their first contribution in #18548
- @tanruixiang made their first contribution in #21673
- @nvpohanh made their first contribution in #21499
- @chi2liu made their first contribution in #21816
- @fake0fan made their first contribution in #21611
- @wxsms made their first contribution in #20433
- @wenxindongwork made their first contribution in #21417
- @br4mm made their first contribution in #20272
- @linzebing made their first contribution in #21627
- @sanchit-gandhi made their first contribution in #21833
- @amirkl94 made their first contribution in #21458
- @zhxchen17 made their first contribution in #22028
- @charent made their first contribution in #20873
- @Aviadr-neureality made their first contribution in #21937
- @n0gu-furiosa made their first contribution in #21052
- @ahengljh made their first contribution in #22052
- @sidhpurwala-huzaifa made their first contribution in #21119
- @anijain2305 made their first contribution in #20836
- @JartX made their first contribution in #21733
- @xiszishu made their first contribution in #22122
- @LopezCastroRoberto made their first contribution in #21309
- @TankNee made their first contribution in #21213
- @TheEpicDolphin made their first contribution in #20401
- @chenxi-yang made their first contribution in #22105
- @weixiao-huang made their first contribution in #21164
- @CLFutureX made their first contribution in #21173
- @tlipoca9 made their first contribution in #22149
- @zyongye made their first contribution in #22330
- @zhangnju made their first contribution in #22367
- @tc-mb made their first contribution in #22166
- @syedmba made their first contribution in #22314
- @msanft made their first contribution in #22099
- @mizadri made their first contribution in #20707
- @JaceyShao made their first contribution in #22433
- @andrewkchan made their first contribution in #12022
- @iAmir97 made their first contribution in #22310
- @pliops-daniels made their first contribution in #20267
- @yyweiss made their first contribution in #18097
- @Pradyun92 made their first contribution in #22317
- @kyuyeunk made their first contribution in #22425
- @lec77 made their first contribution in https://github.com/vllm-project/vllm/pull/22333
- @h-brenoskuk made their first contribution in https://github.com/vllm-project/vllm/pull/22534
- @zhewenl made their first contribution in https://github.com/vllm-project/vllm/pull/22584
- @PicoCreator made their first contribution in https://github.com/vllm-project/vllm/pull/22592
- @danielafrimi made their first contribution in https://github.com/vllm-project/vllm/pull/22349
- @GuanLuo made their first contribution in https://github.com/vllm-project/vllm/pull/21074
- @sooraj-satheesh made their first contribution in https://github.com/vllm-project/vllm/pull/22707
- @dongluw made their first contribution in https://github.com/vllm-project/vllm/pull/22660
- @Sugar-zsg made their first contribution in https://github.com/vllm-project/vllm/pull/22630
- @phantomlei3 made their first contribution in https://github.com/vllm-project/vllm/pull/22170
- @RishiAstra made their first contribution in https://github.com/vllm-project/vllm/pull/21783
- @zejunchen-zejun made their first contribution in https://github.com/vllm-project/vllm/pull/21161
- @teekenl made their first contribution in https://github.com/vllm-project/vllm/pull/22733
- @mxz297 made their first contribution in https://github.com/vllm-project/vllm/pull/22683
- @RUTHLESS-BOT made their first contribution in https://github.com/vllm-project/vllm/pull/22641
- @frankwang28 made their first contribution in https://github.com/vllm-project/vllm/pull/22606
- @zzh142857 made their first contribution in https://github.com/vllm-project/vllm/pull/22697
- @ducviet00 made their first contribution in https://github.com/vllm-project/vllm/pull/22739
- @x22x22 made their first contribution in https://github.com/vllm-project/vllm/pull/22280
- @Gh0u1L5 made their first contribution in https://github.com/vllm-project/vllm/pull/22785
- @jio-H made their first contribution in https://github.com/vllm-project/vllm/pull/22813
- @ZJY0516 made their first contribution in https://github.com/vllm-project/vllm/pull/22786
- @NirLevy98 made their first contribution in https://github.com/vllm-project/vllm/pull/22891
- @nvjullin made their first contribution in https://github.com/vllm-project/vllm/pull/22346
- @frankie-ys made their first contribution in https://github.com/vllm-project/vllm/pull/22643
- @amirai21 made their first contribution in https://github.com/vllm-project/vllm/pull/22653
- @sayandipdutta made their first contribution in https://github.com/vllm-project/vllm/pull/22912
- @yyihuang made their first contribution in https://github.com/vllm-project/vllm/pull/22603
- @rishitdholakia13 made their first contribution in https://github.com/vllm-project/vllm/pull/22963
- @oraluben made their first contribution in https://github.com/vllm-project/vllm/pull/22978
- @minpeter made their first contribution in https://github.com/vllm-project/vllm/pull/16890
- @947132885 made their first contribution in https://github.com/vllm-project/vllm/pull/22972
- @KevinZeng08 made their first contribution in https://github.com/vllm-project/vllm/pull/23051
Full Changelog: v0.10.0...v0.10.1rc1