vllm-project/vllm v0.10.0rc1

Pre-release · one month ago

What's Changed

  • [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in #18864
  • [Misc] Fix "Unable to detect current VLLM config. Defaulting to NHD kv cache layout" warning by @NickLucche in #20400
  • [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in #19510
  • Change warn_for_unimplemented_methods to debug by @mgoin in #20455
  • [Platform] Add custom default max tokens by @gmarinho2 in #18557
  • Add ignore consolidated file in mistral example code by @princepride in #20420
  • [Misc] small update by @reidliu41 in #20462
  • [Structured Outputs][V1] Skip for models that don't contain tokenizers by @aarnphm in #20365
  • [Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels by @yewentao256 in #20331
  • [Misc] Add SPDX-FileCopyrightText by @jeejeelee in #20428
  • Support Llama 4 for fused_marlin_moe by @mgoin in #20457
  • [Bug][Frontend] Fix structure of transcription's decoder_prompt by @sangbumlikeagod in #18809
  • [Model][3/N] Automatic conversion of CrossEncoding model by @noooop in #20168
  • [Doc] Fix classification table in list of supported models by @DarkLight1337 in #20489
  • [CI] add kvcache-connector dependency definition and add into CI build by @panpan0000 in #18193
  • [Misc] Small: Remove global media connector. Each test should have its own test connector object. by @huachenheli in #20395
  • Enable V1 for Hybrid SSM/Attention Models by @tdoublep in #20016
  • [feat]: CUTLASS block scaled group gemm for SM100 by @djmmoss in #19757
  • [CI Bugfix] Fix pre-commit failures on main by @mgoin in #20502
  • [Doc] Fix multimodal_inputs.md GitHub examples link by @GuyStone in #20497
  • [Misc] Add security warning for development mode endpoints by @reidliu41 in #20508
  • [doc] small fix by @reidliu41 in #20506
  • [Misc] Remove the unused LoRA test code by @jeejeelee in #20494
  • Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod by @luccafong in #20507
  • [v1] Re-add fp32 support to v1 engine through FlexAttention by @Isotr0py in #19754
  • [Misc] Add logger.exception for TPU information collection failures by @reidliu41 in #20510
  • [Misc] remove unused import by @reidliu41 in #20517
  • test_attention compat with coming xformers change by @bottler in #20487
  • [BUG] Fix #20484. Support empty sequence in cuda penalty kernel by @vadiklyutiy in #20491
  • [Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe by @luccafong in #20509
  • [BugFix] Fix: ImportError when building on hopper systems by @LucasWilkinson in #20513
  • [TPU][Bugfix] fix the MoE OOM issue by @yaochengji in #20339
  • [Frontend] Support image object in llm.chat by @sfeng33 in #19635
  • [Benchmark] Add support for multiple batch size benchmark through CLI in benchmark_moe.py + Add Triton Fused MoE kernel config for FP8 E=16 on B200 by @b8zhong in #20516
  • [Misc] call the pre-defined func by @reidliu41 in #20518
  • [V0 deprecation] Remove V0 CPU/XPU/TPU backends by @WoosukKwon in #20412
  • [V1] Support any head size for FlexAttention backend by @DarkLight1337 in #20467
  • [BugFix][Spec Decode] Fix spec token ids in model runner by @WoosukKwon in #20530
  • [Bugfix] Add use_cross_encoder flag to use correct activation in ClassifierPooler by @DarkLight1337 in #20527
  • Implement OpenAI Responses API [1/N] by @WoosukKwon in #20504 (usage sketch after this list)
  • [Misc] add a tip for pre-commit by @reidliu41 in #20536
  • [Refactor] Abstract Platform Interface for Distributed Backend and Add xccl Support for Intel XPU by @dbyoung18 in #19410
  • [CI/Build] Enable phi2 lora test by @jeejeelee in #20540
  • [XPU][CI] add v1/core test in xpu hardware ci by @Liangliang-Ma in #20537
  • Add docstrings to url_schemes.py to improve readability by @windsonsea in #20545
  • [XPU] log clean up for XPU platform by @yma11 in #20553
  • [Docs] Clean up tables in supported_models.md by @windsonsea in #20552
  • [Misc] remove unused jinaai_serving_reranking by @Abirdcfly in #18878
  • [Misc] Set the minimum openai version by @jeejeelee in #20539
  • [Doc] Remove extra whitespace from CI failures doc by @hmellor in #20565
  • [Doc] Use gh-pr and gh-issue everywhere we can in the docs by @hmellor in #20564
  • [Doc] Fix internal links so they don't always point to latest by @hmellor in #20563
  • [Doc] Add outline for content tabs by @hmellor in #20571
  • [Doc] Fix some MkDocs snippets used in the installation docs by @hmellor in #20572
  • [Model][Last/4] Automatic conversion of CrossEncoding model by @noooop in #19675
  • [Bugfix] Prevent IndexError for cached requests when pipeline parallelism is disabled by @panpan0000 in #20486
  • [Feature] microbatch tokenization by @ztang2370 in #19334
  • [DP] Copy environment variables to Ray DPEngineCoreActors by @ruisearch42 in #20344
  • [Kernel] Optimize Prefill Attention in Unified Triton Attention Kernel by @jvlunteren in #20308
  • [Misc] Add fully interleaved support for multimodal 'string' content format by @Dekakhrone in #14047
  • [Misc] feat: output content in streaming responses by @lengrongfu in #19608
  • Fix links in multi-modal model contributing page by @hmellor in #18615
  • [Config] Refactor mistral configs by @patrickvonplaten in #20570
  • [Misc] Improve logging for dynamic shape cache compilation by @kyolebu in #20573
  • [Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe by @minosfuture in #20167
  • [Optimize] Don't send token ids when kv connector is not used by @WoosukKwon in #20586
  • Make distinct code and console admonitions so readers are less likely to miss them by @hmellor in #20585
  • [Bugfix]: Fix messy code when using logprobs by @chaunceyjiang in #19209
  • [Doc] Syntax highlight request responses as JSON instead of bash by @hmellor in #20582
  • [Docs] Rewrite offline inference guide by @crypdick in #20594
  • [Docs] Improve docstring for ray data llm example by @crypdick in #20597
  • [Docs] Add Ray Serve LLM section to openai compatible server guide by @crypdick in #20595
  • [Docs] Add Anyscale to frameworks by @crypdick in #20590
  • [Misc] improve error msg by @reidliu41 in #20604
  • [CI/Build][CPU] Fix CPU CI and remove all CPU V0 files by @bigPYJ1151 in #20560
  • [TPU] Temporary fix vmem oom for long model len by reducing page size by @Chenyaaang in #20278
  • [Frontend] [Core] Integrate Tensorizer in to S3 loading machinery, allow passing arbitrary arguments during save/load by @sangstar in #19619
  • [PD][Nixl] Remote consumer READ timeout for clearing request blocks by @NickLucche in #20139
  • [Docs] Improve documentation for Deepseek R1 on Ray Serve LLM by @crypdick in #20601
  • Remove unnecessary explicit title anchors and use relative links instead by @hmellor in #20620
  • Stop using title frontmatter and fix doc that can only be reached by search by @hmellor in #20623
  • [xpu]feat: support multi-lora on xpu by @yma11 in #20616
  • Update torch/xla pin to 20250703 by @vanbasten23 in #20589
  • [Model] Implement missing get_language_model for Keye-VL by @DarkLight1337 in #20631
  • Revert invalid spellchecker fix on deepseek_vl2 by @viravera in #20618
  • [CI] Increase the threshold of the MTEB RERANK tests by @noooop in #20615
  • [Bugfix] Fix topk_ids indices_type for CUTLASS w8a8 FP8 MoE by @minosfuture in #20166
  • [Core] Rename get_max_tokens_per_item for backward compatibility by @DarkLight1337 in #20630
  • [Bugfix] Fix GLM-4.1-V video prompt update by @Isotr0py in #20635
  • [TPU][Bugfix] disable phi-3 test by @QiliangCui in #20632
  • Replace multiply_add with homogeneous_multiply_add to Address Clang Template Parameter Issue by @wenxin0319 in #20142
  • [Misc] Refactor Platform.set_device method by @jikunshang in #20262
  • [tech debt] Revisit lora request model checker by @kouroshHakha in #20636
  • [BugFix][Intel GPU] Use refactored API for dist_backend in V1 worker by @ratnampa in #20596
  • [Docs] Improve documentation for multi-node service helper script by @crypdick in #20600
  • [Hardware][PPC64LE] Enable V1 for ppc64le and ARM by @Akashcodes732 in #20554
  • [Bugfix] Set default cuda_graph_sizes to min(self.max_num_seqs * 2, 512) by @izhuhaoran in #20628
  • [feat] enable SM100 CUTLASS block scaled group gemm for smaller batch sizes by @djmmoss in #20640
  • Fix bullets in incremental_build.md by @mgoin in #20642
  • [Misc] Fix the size of batched_dummy_mm_inputs in profile_run by @B-201 in #20434
  • [XPU] Use spawn with XPU multiprocessing by @dvrogozh in #20649
  • [Intel GPU] support ray as distributed executor backend for XPU. by @jikunshang in #20659
  • [Docs] fix minimax tool_calling docs error by @qscqesze in #20667
  • [Bugfix] Fix the issue where reasoning_content is None when Thinking is enabled and tool_choice is set to 'required'. by @chaunceyjiang in #20662
  • [V1] [Doc] Update V1 docs for Mamba models by @tdoublep in #20499
  • [Doc] Update notes by @DarkLight1337 in #20668
  • [Benchmark] Parameterization of streaming loading of multimodal datasets by @Potabk in #20528
  • [Docs] Improve docs for RLHF co-location example by @crypdick in #20599
  • [doc] update doc format by @reidliu41 in #20673
  • [Bugfix] Fix handling of Tensorizer arguments for LoadConfig by @sangstar in #20643
  • [TPU][Bugfix] fix test_pallas by @yaochengji in #20666
  • [XPU][CI] enhance xpu test support by @Liangliang-Ma in #20652
  • [Bench] Add NVFP4 GEMM benchmark script by @mgoin in #20578
  • [Doc] Update CPU doc by @bigPYJ1151 in #20676
  • Remove heading from installation inc.md file by @hmellor in #20697
  • [CI/Build] Enlarge tolerance for a CPU multi-modal test by @bigPYJ1151 in #20684
  • Support Llama 4 for cutlass_moe_fp4 by @mgoin in #20453
  • [Kernel] Triton implementation of causal-conv1d for Mamba-based models by @thoangtrvn in #18218
  • [Kernel] Add Conch backend for mixed-precision linear layer by @jmanning-stackav in #19818
  • [Feature][Quantization] MXFP4 support for MOE models by @fxmarty-amd in #17888
  • [BugFix]: Properly set engine_id when using multi connector by @Missmiaom in #19487
  • [Misc] Simplify the prefix caching logic on draft tokens by @WoosukKwon in #20701
  • [CI/Build] Fix FlashInfer double build in Dockerfile by @mgoin in #20651
  • [Misc] DP : Add ExpertTokensMetadata by @varun-sundar-rabindranath in #20332
  • Use NVCC --compress-mode to reduce binary size by 30% by @mgoin in #20694
  • Correct PPMissingLayer handling in Deepseek-V2-Lite PP deployment by @eicherseiji in #20665
  • [Frontend] Support Tool Calling with both tool_choice='required' and $defs. by @chaunceyjiang in #20629
  • [BugFix][CPU] Fix CPU worker dependency on cumem_allocator by @njhill in #20696
  • [BugFix] Fix VllmConfig() construction on all platforms by @njhill in #20695
  • [TPU][Core] Make the load-weights-exceed-HBM error more instructive for customers by @Chenyaaang in #20644
  • [KVConnector] Aggregate finished requests on the scheduler by @orozery in #19555
  • [Misc] Loosen new-model tagger conditions by @Isotr0py in #20747
  • [CI/Build] Fix Basic Models Test by @jeejeelee in #20728
  • [Bugfix][Build][Non-CUDA] Only referencing CMAKE_CUDA_COMPILER_VERSION on CUDA where it is defined by @gshtras in #20738
  • [doc] fix ordered list by @reidliu41 in #20749
  • [CI Bugfix] Skip failing Tensorizer+LoRA test by @mgoin in #20724
  • Normalize lm-eval command between baseline and correctness test by @mgoin in #18560
  • [Misc] Clean up mark to fork process in BNB tests by @Isotr0py in #20692
  • [Doc] Add engine args back in to the docs by @hmellor in #20674
  • Update Dockerfile FlashInfer to v0.2.8rc1 by @mgoin in #20718
  • [Hardware][CPU] vLLM int8 quantization enablement for ARM CPU by @nishith-fujitsu in #14129
  • [ROCm][Regression] Remove tensor creation that harms performance on ROCm by @gshtras in #20741
  • [Model] Add reasoning parser for Hunyuan A13B model by @kzjeef in #20625
  • [Model][VLM] Support JinaVL Reranker by @shineran96 in #20260
  • Fix DeepSeek-R1-0528 chat template by @sfbemerk in #20717
  • [Test] Remove docker build from test. by @QiliangCui in #20542
  • [Bugfix] [CI] Fix Tensorizer LoRA test by @sangstar in #20760
  • [V0][V1][Core] Add outlines integration for V1, and update V0 integration. by @unaidedelf8777 in #15975
  • [CI] Fix pre commit issue by @yewentao256 in #20782
  • [Bugfix] Remove assertion of expert_map being None by @minosfuture in #20714
  • [Core] Add Support for Default Modality Specific LoRAs [generate / chat completions] by @alex-jw-brooks in #19126
  • [Bugfix] Fused MoE Modular Kernel chunking loop by @varun-sundar-rabindranath in #20392
  • [KVConnector] Always call connector clear_metadata() at end of step by @njhill in #20756
  • [Misc] MoE ModularKernel : Introduce TopKWeightAndReduce by @varun-sundar-rabindranath in #20648
  • [Bugfix][Benchmark] Make sure the output length > 0 when testing prefill workload. by @KuntaiDu in #20786
  • [Docs] Lazy import gguf by @simon-mo in #20785
  • [CI Bugfix] Specify same TORCH_CUDA_ARCH_LIST for flashinfer aot and install by @mgoin in #20772
  • Add kimi-k2 tool parser by @MoyanZitto in #20789
  • [fix]: disable cutlass block scaled group gemm for EP by @djmmoss in #20781
  • [Model] Support HF format of minimax by @mgoin in #20211
  • [Attention] MLA - Flashinfer Ragged Prefill by @alexm-redhat in #20034
  • [Feature] Integrate SM100 DeepGEMM support by @yewentao256 in #20087
  • [XPU] XCCL support enabled in torch 2.8.0.dev nightly builds by @ratnampa in #20705
  • [Perf][fp8] Use CustomOp abstraction for fp8 quant for better perf by @ProExpertProg in #19830
  • [V1] Enable Mamba2 layers other than MambaMixer2 in the v1 engine by @nopperl in #20660
  • [doc] fold long code block by @reidliu41 in #20795
  • [Bugfix] Upgrade depyf to 0.19 and streamline custom pass logging by @ProExpertProg in #20777
  • [Quantization][1/N] MoE support BNB-Inflight Quantization by @jeejeelee in #20061
  • [Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). by @pavanimajety in #19825
  • [Bugfix] Refactor /invocations to be task-agnostic by @DarkLight1337 in #20764
  • Temporarily suspend google/gemma-3-1b-it. by @QiliangCui in #20722
  • [Bugfix] Add missing field to TritonLanguagePlaceholder by @bigPYJ1151 in #20812
  • [doc] fix ordered list issue by @reidliu41 in #20819
  • [Misc] Add unit tests for MoE ModularKernel combinations + Profiling utility by @varun-sundar-rabindranath in #20449
  • [Kernel] Basic tuned configs for NVFP4 CUTLASS dense GEMM by @mgoin in #20646
  • [Docs] Data Parallel deployment documentation by @njhill in #20768
  • [Bugfix] Fix OOM in language generation test by @Isotr0py in #20814
  • Update kimi-k2 tool calling docs, enable unit tests by @MoyanZitto in #20821
  • [CI Bug] Fix Async Engine, Inputs, Utils, Worker Test: 'State' object has no attribute 'enable_server_load_tracking' by @yewentao256 in #20845
  • Integration SM100 FlashInfer fused allreduce RMSNorm by @ilmarkov in #20691
  • Add pynccl all-gatherv and reducescatterv by @trevor-m in #20154
  • [Misc] Restrict deep_gemm's log output by @jeejeelee in #20827
  • [Bugfix] Lazy import fused_experts in BitsAndBytesMoEMethod to avoid break not-cuda-alike devices by @bigPYJ1151 in #20822
  • [Bugfix] Fix tensor parallel issue in Qwen3 reranker weight loading by @yurhett in #20682
  • [CI/Build] Ensure compatibility with Transformers v4.53 by @Isotr0py in #20541
  • [Bugfix]: Fix typo: logger.warn_once -> logger.warning_once by @varun-sundar-rabindranath in #20852
  • [Frontend] Abstract prompt and SpeechToTextConfig for transcriptions models by @NickLucche in #20637
  • [Bugfix] Replace unavailable video url in multimodal test by @Isotr0py in #20854
  • [Misc] Respect no_use_tqdm_on_load flag while capturing CUDA graph by @lk-chen in #20834
  • [Bug] Fix DeepGemm for EP low latency case by @yewentao256 in #20833
  • [Docs] Update basic.md by @luccafong in #20846
  • [Bugfix] Fix torch.compile x LoRA for PyTorch 2.8 by @zou3519 in #20823
  • [cold start time] add envs.VLLM_COMPILE_DEPYF to guard decompile by @BoyuanFeng in #20790
  • Remove extra tensor on CPU by @maxdebayser in #20693
  • Enable ModelOpt Llama4 fp8 checkpoint deployment by @Edwardf0t1 in #20419
  • Revert "Use NVCC --compress-mode to reduce binary size by 30% #20694" by @mgoin in #20853
  • [Model] New model support for microsoft/Phi-4-mini-flash-reasoning by @congcongchen123 in #20702
  • [Bugfix] Fix Tensor Parallelism Padding Consistency in Granite Models by @alex-jw-brooks in #20843
  • [docs] convert supported configs to table by @reidliu41 in #20858
  • [Bugfix] Restrict Machete to only run on Hopper by @mgoin in #20830
  • [Sched] Enhance the logic to remove stopped requests from queues by @WoosukKwon in #20739
  • [Perf] Use Triton instead of Torch for DeepGEMM Per Token Group Quant by @yewentao256 in #20841
  • [Bugfix] Fix a couple PPLX+CUTLASS MoE bugs by @ElizaWszola in #20825
  • [Refactor] Change the way of import triton by @yewentao256 in #20774
  • [Core] Support multiple tasks per model by @NickLucche in #20771
  • Re-enable google/gemma-3-1b-it accuracy test. by @QiliangCui in #20866
  • Support for LlamaForSequenceClassification by @thechaos16 in #20807
  • [Bugfix] Fix: add patch_rope_scaling after hf override by @Wangmerlyn in #20857
  • [Bugfix] Fix definition of RerankDocument by @Liuchenlong in #20877
  • [V1] [ROCm] [AITER] Upgrade AITER to commit 916bf3c and bugfix APIs by @tjtanaa in #20880
  • [V1] Hybrid allocator without prefix caching by @nopperl in #20661
  • [Core] Add update_config RPC method by @22quinn in #20095
  • [Prefix Cache] Add reproducible prefix-cache block hashing using SHA-256 + CBOR (64bit) by @vMaroon in #20511
  • Removing redundant python version check by @Dannyso05 in #20888
  • Fix: Add missing EOFError handling in CLI complete command by @reidliu41 in #20896
  • [ROCm] [Bugfix] [Critical]: Fix mamba compilation bug by @tjtanaa in #20883
  • [Quantization] add BNB for MixtralForCausalLM by @jeejeelee in #20893
  • [Refactor][V1] Move outlines utils for V1 imports by @aarnphm in #20878
  • [MISC] Move bind_kv_cache to worker module by @wangxiyuan in #20900
  • [CI/Build] Fix OOM issue in Jina-VL test by @DarkLight1337 in #20907
  • [Bugfix] Bump up mistral_common to support v13 tokenizer by @22quinn in #20905
  • [Misc] Remove unused function by @reidliu41 in #20909
  • [Bugfix]: Fix messy code when using logprobs by @chaunceyjiang in #20910
  • [Misc] Log the reason for falling back to FlexAttention by @DarkLight1337 in #20699
  • [Model] Add Ling implementation by @ant-yy in #20680
  • [CI] cc folks on changes to vllm/compilation by @zou3519 in #20925
  • [CI] Update codeowner for compilation code by @houseroad in #20929
  • [Misc] Clean up Aimv2 config registration in Ovis config by @Isotr0py in #20921
  • [CI/Build] Add Transformers nightly tests in CI by @Isotr0py in #20924
  • Change default model to Qwen3-0.6B by @tlrmchlsmth in #20335
  • Add benchmark dataset for mlperf llama tasks by @mgoin in #20338
  • [Misc] ModularKernel : Perform WeightAndReduce inside TritonExperts & DeepGemmExperts by @varun-sundar-rabindranath in #20725
  • [Misc] Relax translations tests by @NickLucche in #20856
  • Fix overflow indexing in causal_conv1d kernel by @tdoublep in #20938
  • [Docs] remove outdated performance benchmark by @KuntaiDu in #20935
  • Fall back if flashinfer comm module not found by @sarckk in #20936
  • SM100 Cutlass MLA decode with unrestricted num_heads (< 128) for DeepSeek TP by @alexm-redhat in #20769
  • [BugFix] VLLM_DISABLE_COMPILE_CACHE=1 should disable all reads and writes from the cache by @zou3519 in #20942
  • [Bugfix] Fix incorrect dispatch for CutlassBlockScaledGroupedGemm and DeepGEMM by @mgoin in #20933
  • [CI/Build] Split Entrypoints Test into LLM and API Server by @mgoin in #20945
  • Use w8a8 quantized matmul Pallas kernel by @vanbasten23 in #19170
  • [Docs] Add Kuberay to deployment integrations by @crypdick in #20592
  • feat: add image zoom to improve image viewing experience by @reidliu41 in #20763
  • [CI] Fix flaky test_streaming_response test by @NickLucche in #20913
  • Enabled BnB NF4 inference on Gaudi by @rsshaik1 in #20172
  • [Bugfix] Switch bailout logic for kv-cache-dtype with SM100 Flashinfer by @pavanimajety in #20934
  • [Doc] Clearer mistral3 and pixtral model support description by @Isotr0py in #20926
  • [cold start] replace VLLM_COMPILE_DEPYF with debug_dump_dir by @BoyuanFeng in #20940
  • [Model] Add AutoWeightsLoader support for BERT, RoBERTa by @jennifurhe in #20534
  • Implement Async Scheduling by @WoosukKwon in #19970
  • [Misc] Refactor AllReduceFusionPass. Remove parameter by @ilmarkov in #20918
  • [frontend] Add --help=page option for paginated help output by @reidliu41 in #20961
  • [Docs] Improve documentation for RLHF example by @crypdick in #20598
  • [frontend] Refactor CLI Args for a better modular integration by @kouroshHakha in #20206
  • [Docs] Improve documentation for ray cluster launcher helper script by @crypdick in #20602
  • [TPU] Optimize kv cache update kernel by @tengyifei in #20415
  • [V1] [Hybrid] Refactor mamba state shape calculation; enable V1 via cli by @tdoublep in #20840
  • [MISC] Add init files for python package by @Potabk in #20908
  • [doc] Add more details for Ray-based DP by @ruisearch42 in #20948
  • [Deprecation] Remove TokenizerPoolConfig by @hmellor in #20968
  • [v1][core] Support for attention free models by @christian-pinto in #20811
  • Voxtral by @patrickvonplaten in #20970
  • [CI/Build] Fix wrong path in Transformers Nightly Models Test by @DarkLight1337 in #20994
  • [Deprecation] Remove everything scheduled for removal in v0.10.0 by @hmellor in #20979
  • Configure Gemini by @hmellor in #20971
  • [Deprecation] Remove nullable_kvs by @hmellor in #20969
  • Add full serve CLI reference back to docs by @hmellor in #20978
  • [ROCm] warpSize is being made non constexpr in ROCm 7.0 by @gshtras in #20330
  • [BugFix] fix 3 issues: (1) using metadata for causal-conv1d, (2) indexing overflow in v1 vLLM, and (3) init_states in v0 by @thoangtrvn in #20838
  • [Frontend] Support cache_salt in /v1/completions and /v1/responses by @dr75 in #20981
  • [Bug Fix] get_distributed_init_method should get the ip from get_ip i… by @Relics in #20889
  • [Nvidia] Integrate SM100 cudnn prefill API to MLA prefill by @elfiegg in #20411
  • [Frontend] OpenAI Responses API supports input image by @chaunceyjiang in #20975
  • [Frontend] Remove print left in FrontendArgs.add_cli_args by @mgoin in #21004
  • [Model] Add ModelConfig class for GraniteMoeHybrid to override default max_seq_len_to_capture by @tdoublep in #20923
  • [Misc] bump xgrammar version to v0.1.21 by @chaunceyjiang in #20992
  • [Chore] Remove outdated transformers check by @b8zhong in #20989
  • [Misc] Refactor: Improve argument handling for conda command by @reidliu41 in #20481
  • [Docs] Enhance Anyscale documentation, add quickstart links for vLLM by @crypdick in #21018
  • [Bugfix] Correct per_act_token in CompressedTensorsW8A8Fp8MoECutlassM… by @minosfuture in #20937
  • Add Dockerfile argument for VLLM_USE_PRECOMPILED environment by @dougbtv in #20943
  • [CI][HPU] Update for V0 deprecation by switching to VLLM_TARGET_DEVICE=empty by @xuechendi in #21006
  • [Bugfix] Fix Mistral3 support on SM100/SM120 by @mgoin in #20998
  • [Doc] Remove duplicate docstring by @yewentao256 in #21012
  • [Voxtral] Add more tests by @patrickvonplaten in #21010
  • Avoid direct comparison of floating point numbers by @maxdebayser in #21002
  • [Meta] Llama4 EAGLE Support by @morgendave in #20591
  • [TPU] fix kv_cache_update kernel block size choosing logic by @yaochengji in #21007
  • [BugFix] Fix import error on non-blackwell machines by @LucasWilkinson in #21020
  • Fix inadvertently silenced PP tests for mp, add DeepSeek V2/V3 model family to PP tests by @eicherseiji in #20831
  • [Docs] Add intro and fix 1-2-3 list in frameworks/open-webui.md by @windsonsea in #19199
  • [Model] Consolidate pooler implementations by @DarkLight1337 in #20927
  • feat - add a new endpoint get_tokenizer_info to provide tokenizer/chat-template information by @m-misiura in #20575
  • [fix] fix qwen image_embeds input by @h-avsha in #21049
  • Remove Qwen Omni workaround that's no longer necessary by @hmellor in #21057
  • [Model] Remove model sampler by @DarkLight1337 in #21059
  • Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) by @nirda7 in #12010
  • Remove torch_xla.tpu.version() from pallas.py. by @QiliangCui in #21065
  • Update PyTorch to torch==2.7.1 for CUDA by @mgoin in #21011
  • [Bugfix] weight loading use correct tp_group with patch_tensor_parallel_group by @Kevin-XiongC in #21024
  • [Docker] Allow FlashInfer to be built in the ARM CUDA Dockerfile by @mgoin in #21013
  • [TPU] Start using python 3.12 by @vanbasten23 in #21000
  • [Bugfix] Fix Machete zero point issue for GPTQ models on SM90 by @mgoin in #21066
  • [Attention] Refactor attention metadata builder interface by @LucasWilkinson in #20466
  • [V1][P/D]Enhance Performance and code readability for P2pNcclConnector by @Abatom in #20906
  • [V1] [KVConnector] Fix MultiprocExecutor worker output aggregation by @sdavidbd in #21048
  • [Misc] Fix PhiMoE expert mapping by @jeejeelee in #21085
  • [Bugfix]: Fix final_res_batch list index out of range error by @chaunceyjiang in #21055
  • [Kernel] DeepGemm MoE : Integrate triton permute / unpermute kernels by @varun-sundar-rabindranath in #20903
  • [Model] Add ToolParser and MoE Config for Hunyuan A13B by @kzjeef in #20820
  • [VLM] Add Nemotron-Nano-VL-8B-V1 support by @kylehh in #20349
  • [Docs] Improve docstring formatting for FusedMoEParallelConfig.make by @hmellor in #21117
  • [Misc] Avoid unnecessary import by @wangxiyuan in #21106
  • [Docs] Move code block out of admonition now that it's short by @hmellor in #21118
  • [Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE by @ElizaWszola in #20762
  • [Model] Update pooling model interface by @DarkLight1337 in #21058
  • [Misc] Qwen MoE model supports LoRA by @jeejeelee in #20932
  • On environments where NUMA cannot be detected, we get 0 by @ericcurtin in #21115
  • [V0 deprecation] Remove V0 HPU backend by @WoosukKwon in #21131
  • [Log] Debugging Log with more Information by @yewentao256 in #20770
  • [Bugfix] Fix the tensor non-contiguous issue for Flashinfer TRT-LLM backend attention kernel by @elvischenv in #21133
  • [Docs] Add minimal demo of Ray Data API usage by @crypdick in #21080
  • [Docs] Update supported models documentation with missing models by @luccafong in #20844
  • [Attention] Make local attention backend agnostic by @LucasWilkinson in #21093
  • [Doc] Add inplace weights loading example by @22quinn in #19640
  • [Core] FlashInfer CUTLASS fused MoE backend (NVFP4) by @wenscarl in #20037
  • [Perf] Add swap_ab to SM90 FP8 non-block CUTLASS moe grouped gemm by @shixianc in #20911
  • [Misc] Do not print async output warning for v1 by @WoosukKwon in #21151
  • [benchmark] Sending request strictly follows the random intervals by @Jialin in #21108
  • [Misc] Make MM embedding merge interface explicit in model runner by @ywang96 in #21147
  • [Model] Re-add the implicit conversion feature for as_seq_cls_model by @noooop in #21103
  • [Bugfix] The special_tokens in tokenizer should also be controlled by do_lower_case in encoder_config. by @noooop in #20750
  • [Doc] Fix typo in model name by @DarkLight1337 in #21178
  • [Bugfix] Allocate less memory in non-batched CUTLASS MoE by @ElizaWszola in #21121
  • [Core] Set pooling params based on task and model by @DarkLight1337 in #21128
  • Let GraniteMoeAttention use YaRN by @tdoublep in #21174
  • [CI] Update CODEOWNERS for vllm/compilation by @zou3519 in #21185
  • [Kernel] Apply torch.Tag.needs_fixed_stride_order only for torch==2.6.0 by @zou3519 in #19346
  • [Core] Avoid KVCacheBlock.__eq__ invocations in FreeKVCacheBlockQueue by @JialinOuyang-Meta in #21005
  • [Bugfix] Voxtral on Blackwell GPUs (RTX 50 series) by @hax0r31337 in #21077
  • Elastic Expert Parallel Initial Support by @ruisearch42 in #20775
  • [Quantization] Enable BNB support for more MoE models by @jeejeelee in #21100
  • [Core] Support Local Chunked Attention for Hybrid KV Cache by @luccafong in #19351
  • [Bugfix][Model] Fix LoRA for Mistral-Small-3.1-24B-Instruct-2503 by @varun-sundar-rabindranath in #21183
  • [V0 Deprecation] Remove V0 Spec Decode workers by @WoosukKwon in #21152
  • [Kernel][Performance] Tweak MoE Batched silu_mul_fp8_quant_deep_gemm kernel by @varun-sundar-rabindranath in #21193
  • [BugFix][CPU] Fix TorchSDPABackendImpl doesn't have use_irope by @LucasWilkinson in #21200
  • [Bug] DeepGemm: Fix TypeError: per_block_cast_to_fp8() missing 1 required positional argument: 'use_ue8m0' for SM100 by @yewentao256 in #21187
  • [Model] EXAONE 4.0 model support by @Deepfocused in #21060
  • [Misc][Tools][Benchmark] Add readme file for auto_tune script by @Chenyaaang in #20779
  • Fix a couple of Voxtral tests by @huydhn in #21218
  • [V0 deprecation] Remove long context LoRA by @jeejeelee in #21169
  • [Bugfix] Fix ndarray video color from VideoAsset by @Isotr0py in #21064
  • [BugFix] Fix potential cuda-graph IMA by @LucasWilkinson in #21196
  • Add torch golden impl for moe_align_block_size kernel test by @shixianc in #20653
  • [NVIDIA] Add SM100 Flashinfer MoE blockscale fp8 backend for low latency by @kaixih in #20645
  • [Bugfix][Frontend] Fix openai CLI arg middleware by @22quinn in #21220
  • [bugfix] Fix auto thread-binding when world_size > 1 in CPU backend and refactor code by @bigPYJ1151 in #21032
  • Fix/remove some broken model executor tests by @rabi in #21224
  • [CI/CD][bugfix]fix: error argument to loads has incompatible type by @llsj14 in #21223
  • [Docs] Update the link to the 'Prometheus/Grafana' example by @1195343015 in #21225
  • [BugFix] Make PD work with Ray by @kouroshHakha in #21072
  • [V1] [Hybrid] Enable piecewise CUDA Graph for mamba layers by @tdoublep in #21194
  • [V0 Deprecation] Deprecate BlockSparse Attention & Phi3-Small by @WoosukKwon in #21217
  • [BugFix] Fix full cuda graph slot_mapping by @fhl2000 in #21228
  • GLM-4 Update by @zRzRzRzRzRzRzR in #20736
  • [Docs] [V1] Update docs to remove enforce_eager limitation for hybrid models. by @tdoublep in #21233
  • [TPU] support fp8 kv cache quantization by @yaochengji in #19292
  • Enable v1 metrics tests by @eicherseiji in #20953
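
A notable addition in this release is initial OpenAI Responses API support (#20504), with follow-ups for image input (#20975) and cache_salt (#20981). Below is a minimal usage sketch against a locally served model; the server address, the model name, and the assumption that the initial [1/N] implementation accepts a plain text input are illustrative rather than confirmed details of this release.

```python
# Minimal sketch, assuming a vLLM OpenAI-compatible server was started with
# `vllm serve Qwen/Qwen3-0.6B` on localhost:8000 and that the initial
# Responses API support (#20504) handles a plain text input.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Send a simple request to the new /v1/responses endpoint.
response = client.responses.create(
    model="Qwen/Qwen3-0.6B",
    input="Summarize what a KV cache is in one sentence.",
)

# The OpenAI SDK aggregates the generated text into `output_text`.
print(response.output_text)
```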

New Contributors

  • @sangbumlikeagod made their first contribution in #18809
  • @djmmoss made their first contribution in #19757
  • @GuyStone made their first contribution in #20497
  • @bottler made their first contribution in #20487
  • @dbyoung18 made their first contribution in #19410
  • @Abirdcfly made their first contribution in #18878
  • @Dekakhrone made their first contribution in #14047
  • @minosfuture made their first contribution in #20167
  • @crypdick made their first contribution in #20594
  • @viravera made their first contribution in #20618
  • @wenxin0319 made their first contribution in #20142
  • @ratnampa made their first contribution in #20596
  • @dvrogozh made their first contribution in #20649
  • @thoangtrvn made their first contribution in #18218
  • @jmanning-stackav made their first contribution in #19818
  • @Missmiaom made their first contribution in #19487
  • @orozery made their first contribution in #19555
  • @nishith-fujitsu made their first contribution in #14129
  • @kzjeef made their first contribution in #20625
  • @shineran96 made their first contribution in #20260
  • @unaidedelf8777 made their first contribution in #15975
  • @MoyanZitto made their first contribution in #20789
  • @nopperl made their first contribution in #20660
  • @trevor-m made their first contribution in #20154
  • @yurhett made their first contribution in #20682
  • @thechaos16 made their first contribution in #20807
  • @Wangmerlyn made their first contribution in #20857
  • @Liuchenlong made their first contribution in #20877
  • @vMaroon made their first contribution in #20511
  • @Dannyso05 made their first contribution in #20888
  • @ant-yy made their first contribution in #20680
  • @rsshaik1 made their first contribution in #20172
  • @jennifurhe made their first contribution in #20534
  • @tengyifei made their first contribution in #20415
  • @Relics made their first contribution in #20889
  • @dougbtv made their first contribution in #20943
  • @morgendave made their first contribution in #20591
  • @m-misiura made their first contribution in #20575
  • @nirda7 made their first contribution in #12010
  • @Kevin-XiongC made their first contribution in #21024
  • @ericcurtin made their first contribution in #21115
  • @elvischenv made their first contribution in #21133
  • @shixianc made their first contribution in #20911
  • @Jialin made their first contribution in #21108
  • @JialinOuyang-Meta made their first contribution in #21005
  • @hax0r31337 made their first contribution in #21077
  • @Deepfocused made their first contribution in #21060
  • @fhl2000 made their first contribution in #21228

Full Changelog: v0.9.2rc1...v0.10.0rc1
