vLLM v0.9.1

Highlights

This release features 274 commits from 123 contributors (27 new contributors!).

  • Progress in large scale serving
    • DP Attention + Expert Parallelism: CUDA graph support (#18724), DeepEP dispatch-combine kernel (#18434), batched/masked deepgemm kernel (#19111), CUTLASS MoE kernel with PPLX (#18762)
    • Heterogeneous TP (#18833), NixlConnector: enable FlashInfer backend (#19090)
    • DP: API-server scaleout with many-to-many server-engine comms (#17546), support DP with Ray (#18779), allow AsyncLLMEngine.generate to target a specific DP rank (#19102), add data parallel rank to KVEventBatch (#18925)
    • Tooling: simplify EP kernels installation (#19412)
  • RLHF workflow: support in-place model weight loading (#18745)
  • Initial full support for Hybrid Memory Allocator (#17996), support cross-layer KV sharing (#18212)
  • Add FlexAttention to vLLM V1 (#16078)
  • Various production hardening for full CUDA graph mode (#19171, #19106, #19321); see the sketch below
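
For the full CUDA graph hardening above, a minimal offline sketch of turning the mode on. This assumes CompilationConfig still exposes a full_cuda_graph flag in this release and that the chosen attention backend supports full-graph capture; check your installed vllm.config before relying on it.

```python
# Hedged sketch: enable full CUDA graph capture on the V1 engine.
# `full_cuda_graph` is assumed to be a CompilationConfig field in this release;
# the model id is only a placeholder.
from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    compilation_config=CompilationConfig(full_cuda_graph=True),
)
out = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```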

Model Support

  • Support Magistral (#19193; serving sketch below), LoRA support for InternVL (#18842), MiniCPM Eagle support (#18943), NemotronH support (#18863, #19249)
  • Enable data parallel for Llama4 vision encoder (#18368)
  • Add DeepSeek-R1-0528 function call chat template (#18874)
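
For the Magistral support above, a hedged offline-inference sketch. The Mistral-native tokenizer/config/load formats follow the usual Mistral recipe, and the model id mistralai/Magistral-Small-2506 is an assumption not taken from these notes; substitute whichever Magistral checkpoint you actually deploy.

```python
# Hedged sketch: load a Magistral checkpoint with Mistral-native formats.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Magistral-Small-2506",  # assumed checkpoint id
    tokenizer_mode="mistral",
    config_format="mistral",
    load_format="mistral",
)
out = llm.chat(
    [{"role": "user", "content": "Summarize vLLM in one sentence."}],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```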

Hardware Support & Performance Optimizations

  • Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (#19205), Qwen3-235B-A22B (#19315)
  • Blackwell: Add Cutlass MLA backend (#17625), Tunings for SM100 FP8 CUTLASS kernel (#18778), Use FlashInfer by default on Blackwell GPUs (#19118), Tune scaled_fp8_quant by increasing vectorization (#18844)
  • FP4: Add compressed-tensors NVFP4 support (#18312), FP4 MoE kernel optimization (#19110)
  • CPU: V1 support for the CPU backend (#16441)
  • ROCm: Add AITER grouped topk for DeepSeekV2 (#18825)
  • POWER: Add IBM POWER11 Support to CPU Extension Detection (#19082)
  • TPU: Initial support of model parallelism with single worker using SPMD (#18011), Multi-LoRA Optimisations for the V1 TPU backend (#15655)
  • Neuron: add multi-LoRA support (#18284), add multi-modal model support (#18921), support quantization on Neuron (#18283)
  • Platform: Make torch distributed process group extendable (#18763)

Engine features

  • Add LoRA support to beam search (#18346)
  • Add rerank support to run_batch endpoint (#16278)
  • CLI: add run-batch command (#18804)
  • Server: custom logging (#18403), allowed_token_ids in ChatCompletionRequest (#19143)
  • LLM API: make use_tqdm accept a callable for custom progress bars (#19357); see the example after this list
  • Perf: CUDA kernel for applying repetition penalty in the sampler (#18437)
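
For the use_tqdm change above (#19357), a minimal sketch of supplying a custom progress bar. The release note only says use_tqdm now accepts a callable; the assumption here is that the callable acts as a drop-in tqdm factory, i.e. it is invoked with tqdm-style keyword arguments and must return a tqdm-like object.

```python
# Hedged sketch: customize the generation progress bar via use_tqdm.
from tqdm import tqdm

from vllm import LLM, SamplingParams


def custom_bar(**kwargs):
    # Override the default description while keeping whatever keyword
    # arguments (total, etc.) the engine passes through.
    kwargs["desc"] = "my batch"
    return tqdm(**kwargs)


llm = LLM(model="facebook/opt-125m")
prompts = [f"Prompt {i}" for i in range(8)]
outputs = llm.generate(
    prompts,
    SamplingParams(max_tokens=8),
    use_tqdm=custom_bar,  # previously only True/False was accepted
)
print(len(outputs), "completions")
```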

API Deprecations

  • Disallow positional arguments other than model when initializing LLM (#18802); see the example after this list
  • Remove inputs arg fallback in Engine classes (#18799)
  • Remove fallbacks for Embeddings API (#18795)
  • Remove mean pooling default for Qwen2EmbeddingModel (#18913)
  • Require overriding get_dummy_text and get_dummy_mm_data (#18796)
  • Remove metrics that were deprecated in 0.8 (#18837)
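
For the positional-argument deprecation above (#18802), a short before/after sketch; the model id and keyword choices are just placeholders.

```python
# Hedged sketch: after #18802, LLM() accepts only `model` positionally;
# every other option must be passed by keyword.
from vllm import LLM

# Fine: model as the sole positional argument, the rest as keywords.
llm = LLM("facebook/opt-125m", dtype="auto", gpu_memory_utilization=0.8)

# No longer accepted in this release: a second positional argument
# (previously this slot was the tokenizer).
# llm = LLM("facebook/opt-125m", "facebook/opt-125m")
```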

Documentation

  • Add CLI doc (#18871)
  • Update SECURITY.md with link to our security guide (#18961), Add security warning to bug report template (#19365)

What's Changed

  • [CI/Build] [TPU] Fix TPU CI exit code by @CAROLZXYZXY in #18282
  • [Neuron] Support quantization on neuron by @aws-satyajith in #18283
  • Support datasets in vllm bench serve and sync with benchmark_[serving,datasets].py by @mgoin in #18566
  • [Bugfix] Disable prefix caching by default for benchmark by @cascade812 in #18771
  • [Build] Fixes for CMake install by @ProExpertProg in #18570
  • [Core] Improve Tensor serialisation by @lgeiger in #18774
  • [rocm] Fix wrong attention log by @fxmarty-amd in #18764
  • [Bugfix] Fix nomic max_model_len by @noooop in #18755
  • [Bugfix]: correctly propagate errors message caught at the chat_templating step to the client by @gcalmettes in #18769
  • [V1] fix torch profiling for V1 offline scenarios by @divakar-amd in #18445
  • [V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (2) by @RonaldBXu in #18781
  • [Bugfix][FailingTest]Fix test_model_load_with_params.py by @rabi in #18758
  • [Deprecation] Require overriding get_dummy_text and get_dummy_mm_data by @DarkLight1337 in #18796
  • [Deprecation] Remove unused sync methods in async_timeout by @DarkLight1337 in #18792
  • [Deprecation] Remove fallbacks for Embeddings API by @DarkLight1337 in #18795
  • [CI] improve embed testing by @noooop in #18747
  • Fix PiecewiseCompileInterpreter by @zou3519 in #17338
  • [BugFix] FA2 MLA Accuracy Issue by @LucasWilkinson in #18807
  • [Platform][Dist] Make torch distributed process group extendable by @MengqingCao in #18763
  • Enable Pydantic mypy checks and convert configs to Pydantic dataclasses by @hmellor in #17599
  • [Frontend] add run batch to CLI by @reidliu41 in #18804
  • decrement server_load on listen for disconnect by @daniel-salib in #18784
  • [Core] Add Lora Support to Beam Search by @alex-jw-brooks in #18346
  • [Chore] update ty configuration by @aarnphm in #18839
  • [Misc] fix olmoe model layer for TP > 1 by @lengrongfu in #18828
  • [V1][Metrics] Remove metrics that were deprecated in 0.8 by @markmc in #18837
  • [Chore][Spec Decode] Update check NoneType instead of assigning variables by @aarnphm in #18836
  • [Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend by @Akshat-Tripathi in #15655
  • Remove checks for None for fields which should never be None by @hmellor in #17985
  • [Core] Enable CUDA graphs for DP + All2All kernels by @varun-sundar-rabindranath in #18724
  • [Bugfix][ROCm] fix the power of 2 exception from triton_unified_attention.py when running llama4 models and unit test fix by @hongxiayang in #18100
  • Prevent the cross-encoder logic from being applied to classification tasks by @maxdebayser in #18838
  • Add ability to use CUDAGraphs with use_inductor=False by @zou3519 in #17345
  • [Bugfix][TPU] fix moe custom kernel import by @yaochengji in #18853
  • [Doc][Neuron] Update documentation for Neuron by @elaineyz in #18868
  • Skip device and quant Pydantic validation to make plugin device work by @Yikun in #18843
  • Fixes a dead link in nightly benchmark readme by @nerdalert in #18856
  • [Neuron] Add multi-LoRA support for Neuron. by @aws-satyajith in #18284
  • [LoRA] Add LoRA support for InternVL by @jeejeelee in #18842
  • [Doc] Remove redundant spaces from compatibility_matrix.md by @windsonsea in #18891
  • [doc] add CLI doc by @reidliu41 in #18871
  • [Bugfix] Fix misleading information in the documentation by @jeejeelee in #18845
  • [Misc] Replace TODO in serving transcription by @NickLucche in #18895
  • [Bugfix] Ensure tensors are contiguous during serialisation by @lgeiger in #18860
  • [BugFix] Update pydantic to fix error on python 3.10 by @ProExpertProg in #18852
  • Fix an error in dummy weight loading for quantization models by @Chenyaaang in #18855
  • [Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp. by @Duyi-Wang in #18692
  • [Doc] Fix codeblocks formatting in LoRA adapters documentation by @Zerohertz in #18907
  • [Bugfix] Fix the failing gte embedding test by @Isotr0py in #18720
  • [Attention][V1] Toggle for v1 attention backend by @gshtras in #18275
  • [ROCm][V0][Attention] Revert to the previous FA triton kernel by @gshtras in #18226
  • [Deprecation] Disallow pos-args other than model when initializing LLM by @DarkLight1337 in #18802
  • [Misc] Remove duplicate init for self.vllm_config by @googs1025 in #18896
  • [V1] Allocate kv_cache with stride order for V1 by @NickLucche in #18775
  • [BugFix] Make DP work with connector-delayed new requests by @njhill in #18559
  • [P/D] NixlConnector DP fixes by @wseaton in #18903
  • Use standalone_compile by default in torch >= 2.8.0 by @zou3519 in #18846
  • [TPU] remove transpose ops in moe kernel by @yaochengji in #18923
  • [Bugfix] Fix PP default fallback behavior for V1 by @mgoin in #18915
  • [Misc] Update type annotation for rotary embedding base by @DarkLight1337 in #18914
  • [TPU][CI/CD] Clean up docker for TPU tests. by @CAROLZXYZXY in #18926
  • improve the robustness of parsing vlms config in AutoRound by @wenhuach21 in #18894
  • [Bugfix] Consistent ascii handling in tool parsers by @chaunceyjiang in #18883
  • [Model] Use AutoWeightsLoader for mamba2 by @jinyouzhi in #18918
  • [docs] fix: fix markdown syntax by @eric-haibin-lin in #18927
  • [ROCm] Remove unnecessary assertion of max_model_len in ROCM_AITER_MLA attention backend. by @vllmellm in #18938
  • [Bugfix] Remove NVFP4 scales assertions to fix load_format=dummy by @mgoin in #18861
  • [Deprecation] Remove mean pooling default for Qwen2EmbeddingModel by @DarkLight1337 in #18913
  • [Misc]Fix benchmarks/README.md for speculative decoding by @rabi in #18897
  • [doc] add mkdocs doc by @reidliu41 in #18930
  • [Model] Use in-place adds in SigLIP by @lgeiger in #18922
  • [Bugfix][Failing Test] Fix test_vllm_port.py by @rabi in #18618
  • [Misc]Fix typo by @Always-Naive in #18947
  • [Bugfix][TPU] Fix tpu model runner testcase failure by @CAROLZXYZXY in #18810
  • [CI/Build] remove regex from build dependencies by @dtrifiro in #18945
  • [Feature] minicpm eagle support by @huangyuxiang03 in #18943
  • [doc] show the count for fork and watch by @reidliu41 in #18950
  • [Docs] Update SECURITY.md with link to our security guide by @russellb in #18961
  • Improve "failed to get the hash of the compiled graph" error by @zou3519 in #18956
  • [Perf] API-server scaleout with many-to-many server-engine comms by @njhill in #17546
  • Benchmark script for fp8 vs bf16 gemm by @mgoin in #17126
  • [VLM] Add PP support and fix GPTQ inference for Ovis models by @Isotr0py in #18958
  • [Misc] add group_size is -1 in awq quantization by @lengrongfu in #18910
  • Tool parser regex timeout handling by @wseaton in #18960
  • [Docs] Correct multiprocessing design doc by @lgeiger in #18964
  • create util function for batched arange by @yuguo68 in #18937
  • [Frontend] Add rerank support to run_batch endpoint by @pooyadavoodi in #16278
  • [Misc] Fix estimated max model len msg by @sarckk in #18966
  • [Bugfix]: Fix the incompatibility issue with Structured Outputs when Thinking is disabled by @chaunceyjiang in #18879
  • fix security issue of logging llm output by @luccafong in #18980
  • [Neuron] Add Multi-Modal model support for Neuron by @aws-satyajith in #18921
  • [doc] fix the list rendering issue - security.md by @reidliu41 in #18982
  • [BugFix] Pydantic part 2 by @ProExpertProg in #18911
  • [FEAT][ROCm] Add AITER grouped topk for DeepSeekV2 by @vllmellm in #18825
  • [Bugfix] Fix for issue 17396 by @frreiss in #18773
  • [ROCm][Kernel] Add gfx950 support for skinny gemms by @charlifu in #18010
  • [P/D] NixlConnector use cache device index for memory registration by @ptarasiewiczNV in #18969
  • [BugFix] Fix multi-node offline data-parallel by @njhill in #18981
  • [Misc] add return token strs for tokenize by @reidliu41 in #18941
  • [Misc][Benchmark] Add support for CustomDataset by @ekagra-ranjan in #18511
  • [Bugfix] Fix EAGLE3 broken logits by @benchislett in #18909
  • [Core] Rework dtype resolution by @DarkLight1337 in #18751
  • [LoRA] Support dynamically initialize packed_modules_mapping for VLM with arbitrary components by @Isotr0py in #18987
  • [doc] small fix - mkdocs by @reidliu41 in #18996
  • Let max_num_batched_tokens use human_readable_int for large numbers by @mgoin in #18968
  • [BugFix] fix data parallel construct ipv6 url address by @lengrongfu in #18991
  • [BugFix] Fix incorrect metrics shutdown error log message by @njhill in #18992
  • [doc] wrong output by @reidliu41 in #19000
  • [Misc] reuse num_tokens_across_dp of get_dp_padding to avoid unnecessary dp all reduce in set_forward_context by @izhuhaoran in #18935
  • [Bugfix][Nixl] Fix DP Metadata Handshake by @robertgshaw2-redhat in #19008
  • [Core] Support inplace model weights loading by @22quinn in #18745
  • [doc] add pytest tips by @reidliu41 in #19010
  • [Model] enable data parallel for Llama4 vision encoder by @jennyyyyzhen in #18368
  • [Frontend] enable custom logging for the uvicorn server (OpenAI API server) by @fpaupier in #18403
  • [Bugfix][Model] Attempt to fix eagle in V0. by @gshtras in #18978
  • add an absolute path for run.sh by @calvin0327 in #18258
  • [Hardware][TPU] Initial support of model parallelism with single worker using SPMD by @lsy323 in #18011
  • [Doc] Remove duplicate TOCs during MkDocs migration by @Zerohertz in #19021
  • [Bugfix][EP+DP] Use pplx-kernel internode instead of intranode by @tlrmchlsmth in #19034
  • Adding "LoRA Test %N" to AMD production tests by @Concurrensee in #18929
  • [CPU][CI] Re-enable the CPU CI tests by @bigPYJ1151 in #19046
  • [ROCm][Build] Clean up the ROCm build by @gshtras in #19040
  • [V1] Support DP with Ray by @ruisearch42 in #18779
  • Add tarsier model support by @princepride in #18985
  • [bugfix] small fix logic issue by @reidliu41 in #18999
  • Reduce logs in CLI scripts and plugin loader by @mgoin in #18970
  • [Bugfix] Use cmake 3.26.1 instead of 3.26 to avoid build failure by @houseroad in #19019
  • [v1][KVCacheManager] Rename BlockHashType to BlockHash by @heheda12345 in #19015
  • Update docker docs with ARM CUDA cross-compile by @mgoin in #19037
  • [Doc] Add InternVL LoRA support by @jeejeelee in #19055
  • [Misc] Update WeightsMapper for qwen2-vl/qwen2.5-vl by @Isotr0py in #19054
  • [Doc] Update V1 user guide for embedding and enc-dec models by @DarkLight1337 in #19060
  • [doc] clarify windows support by @youkaichao in #19088
  • [CI/Build] Remove V0 LoRA test by @jeejeelee in #19066
  • Fix underscores in dict keys passed via CLI by @hmellor in #19030
  • [Bugfix] disable processor cache by @zucchini-nlp in #19068
  • [Doc] Improve the Pull Request template with key components by @houseroad in #19086
  • [Misc] Add missing _Backend enums by @NickLucche in #19081
  • [Misc] fix: add miss best_of param validation by @googs1025 in #18555
  • [Misc] Add SPDX-FileCopyrightText by @simon-mo in #19100
  • [Doc] Readme standardization by @SorenDreano in #18695
  • [doc] update docker version by @reidliu41 in #19074
  • [Kernel] DeepEP dispatch-combine kernel integration by @varun-sundar-rabindranath in #18434
  • [V1] Support cross-layer KV sharing by @sarckk in #18212
  • [Perf] Tune scaled_fp8_quant by increasing vectorization by @mgoin in #18844
  • Fix interaction between Optional and Annotated in CLI typing by @hmellor in #19093
  • [v1] Re-init input batch for multiple kv cache groups by @heheda12345 in #18654
  • [V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix by @ekagra-ranjan in #18971
  • [Bugfix] get_num_blocks_to_allocate with null_block by @heheda12345 in #19031
  • [Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled by @chaunceyjiang in #19075
  • [Bugfix][P/D] Fix Prefix Cache Bug by @NickLucche in #18411
  • [Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers by @heheda12345 in #19029
  • feat: add data parallel rank to KVEventBatch by @PeaBrane in #18925
  • [Misc] Fix path and python alias errors in disagg_prefill examples by @Jeffwan in #18919
  • [Docs] Add developer doc about CI failures by @russellb in #18782
  • [CPU] V1 support for the CPU backend by @bigPYJ1151 in #16441
  • [Core] Cast multimodal input in hf processor by @lgeiger in #18862
  • [KERNEL] Sampler. CUDA kernel for applying repetition penalty by @vadiklyutiy in #18437
  • [Cleanup][v1]: remove guided-decoding-backend for example by @calvin0327 in #19059
  • [NVIDIA] Add Cutlass MLA backend by @kaixih in #17625
  • [Bugfix] Fix FA3 full cuda graph correctness by @WoosukKwon in #19106
  • Fix #19130 by @princepride in #19132
  • [TPU] Skip hanging tests by @lsy323 in #19115
  • Fix ValueError: Missing value for tag key(s): model_name,engine. by @eicherseiji in #19113
  • [Misc] Add packages for benchmark as extra dependency by @Isotr0py in #19089
  • Improve the output precision of embedding models by @noooop in #19092
  • [CI/Build][Bugfix] Ensure compatibility with transformers 4.52 by @DarkLight1337 in #18678
  • Add DeepSeek-R1-0528 function call chat template by @Xu-Wenqing in #18874
  • Sm100 blockwise fp8 swap ab by @IwakuraRein in #18564
  • [Doc] Update V1 Guide for embedding models by @DarkLight1337 in #19141
  • Allow AsyncLLMEngine.generate to target a specific DP rank by @jmswen in #19102
  • [Bugfix][EP+DP] Fix internode check by @tlrmchlsmth in #19112
  • [Perf] Tunings for SM100 FP8 CUTLASS kernel by @mgoin in #18778
  • [TPU] Update dynamo dump file name in compilation test by @lsy323 in #19108
  • [Bugfix] fix v1 cpu worker fails on macOS by @kebe7jun in #19121
  • [Kernel] Integrate batched/masked deepgemm kernel by @varun-sundar-rabindranath in #19111
  • [Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM by @googs1025 in #18817
  • [P/D] Heterogeneous TP by @NickLucche in #18833
  • [doc] small fix by @reidliu41 in #19167
  • [Bugfix][Nixl] Fix full prefix cache hit bug by @robertgshaw2-redhat in #18632
  • [Bugfix] Fix port handling in make_zmq_path by @mgoin in #19117
  • [Torch Nightly]add missing dependency by @yangw-dev in #18770
  • Handle non-serializable objects when dumping benchmark results by @huydhn in #19114
  • [BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 by @WoosukKwon in #19171
  • [Bugfix]: Fix the incompatibility issue with stream when Thinking is disabled by @chaunceyjiang in #19135
  • [Build] Annotate wheel and container path for release workflow by @simon-mo in #19162
  • [Misc] Remove unnecessary fallback to prefill-decode attention by @vllmellm in #19138
  • [Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly by @22quinn in #19105
  • [Frontend] improve vllm run-batch --help display by @reidliu41 in #19187
  • [Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided by @gcalmettes in #19202
  • [mistral_common] Add v11 tokenizer by @patrickvonplaten in #19193
  • Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 by @Xu-Wenqing in #19205
  • [Hardware][NVIDIA] FP4 MoE kernel optimization by @dubcyfor3 in #19110
  • [MISC][Bugfix] Use less CPU when message queue has been empty for some time by @p12tic in #16226
  • [P/D][NixlConnector] Enable FlashInfer backend by @NickLucche in #19090
  • [Quantization] Skip Fp4 Test for compressed-tensors by @dsikka in #19217
  • [V1] Use FlashInfer by default on Blackwell GPUs by @mgoin in #19118
  • [Model] NemotronH support by @vegaluisjose in #18863
  • Fix AOPerModuleConfig name changes by @jerryzh168 in #18869
  • [Bugfix] Fix EAGLE vocab embedding construction for Llama 70B by @benchislett in #19033
  • [v1] Hybrid Memory Allocator by @heheda12345 in #17996
  • [TPU] update torch_xla pin by @yaochengji in #19231
  • Support allowed_token_ids in ChatCompletionRequest by @xu-song in #19143
  • [Chore] update CODEOWNERS by @aarnphm in #19247
  • [v1][P/D] Fix an edge case in kv cache schedule by @KingsleyZhang123 in #19182
  • [TPU] fix kv cache dtype in model runner by @yaochengji in #19244
  • [Quantization] Bump compressed-tensors version; update NVFP4A16 test model by @dsikka in #19224
  • [Docs] Improve V1 KVConnector interface documentation by @njhill in #19172
  • Fix CompilationConfig repr by @zou3519 in #19091
  • Unit Test for run_dp_sharded_vision_model by @cryptopic in #19103
  • [Model] Optimize nemotron_h implementation by @jeejeelee in #19249
  • [Core] Raise when non-multi-instance DP clients target a DP rank by @jmswen in #19227
  • improve logits bias by @yuguo68 in #19041
  • Fixed ppc build when it runs on non-RHEL based linux distros by @npanpaliya in #18422
  • [BugFix] Fix MultiConnector test after HMA changes by @njhill in #19291
  • [Bugfix][Core] Update cancellation logic in generate() to handle Generator exits by @Adolfo-Karim in #19225
  • [Core] Fix abrupt request abort by @NickLucche in #18485
  • [BugFix] Fix tpu_model_runner block_id concatenation by @njhill in #19228
  • [Misc][Tools][Benchmark] Fix and improve auto tune script by @Chenyaaang in #19163
  • [Build][ROCm] Update Dockerfile.rocm by @Alexei-V-Ivanov-AMD in #19296
  • [Easy][Test] Simplify test_function_tool_use with multiple parametrizes by @houseroad in #19269
  • [Kernel] Integrate CUTLASS MoE kernel with PPLX by @ElizaWszola in #18762
  • [TPU][Test] Add script to run benchmark on TPU for buildkite by @QiliangCui in #19039
  • [CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py by @AaruniAggarwal in #19253
  • Add FlexAttention to V1 by @drisspg in #16078
  • [Misc] refactor context extension by @reidliu41 in #19246
  • [CI/Build] Improve Llama GGUF test robustness by @Isotr0py in #19287
  • [Nit][Benchmark]Fix example in benchmark_serving_structured_output.py by @draftbk in #19311
  • [AMD] Update compatible packaging version by @pramenku in #19309
  • [BugFix][V1] Fix memory profiling bug by @ProExpertProg in #18974
  • [Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer by @chaunceyjiang in #19283
  • [Bugfix] Re-enable use_cudagraph in vLLM v1 by @zou3519 in #19299
  • [Misc] Change tests/compile to use VLLM_V1 by default by @zou3519 in #19302
  • Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B by @Xu-Wenqing in #19315
  • [Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection by @Akashcodes732 in #19082
  • [Quantization] Add compressed-tensors NVFP4 support by @dsikka in #18312
  • [Multi Modal] Add an env var for message queue max chunk bytes by @jennyyyyzhen in #19242
  • [Bugfix] model_max_length should consider max_model_len in tokenizer_config by @noooop in #19201
  • [Deprecation] Remove inputs arg fallback in Engine classes by @DarkLight1337 in #18799
  • [Misc] Add documentation update reminder to PR template by @Isotr0py in #19289
  • [Frontend] Remove unreachable code from llm.py by @KsuParkhamchuk in #19288
  • [Misc] Cleanup compilation tests by @zou3519 in #19343
  • [doc] improve ci doc by @reidliu41 in #19307
  • [Doc] Fix description in the Automatic Prefix Caching design doc by @cr7258 in #19333
  • [CI/Build] Fix LoRA test by @jeejeelee in #19350
  • [Fix] Allow kernel compilation for CUDA capability 8.7 by @conroy-cheers in #19328
  • [CI] Introduce rules for llama auto-label by @houseroad in #19323
  • [Docs] Fix a bullet list in usage/security.md by @windsonsea in #19358
  • [full_graph] Fix query_start_loc padding by @yinghai in #19321
  • [v1] Add fp32 support to v1 engine through flex attn by @Isotr0py in #19319
  • [Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. by @varun-sundar-rabindranath in #19298
  • [Bugfix][Core] Prevent token lengths exceeding max_model_len in V0 by @22quinn in #19348
  • [Quantization] Bump compressed-tensors version by @kylesayrs in #19295
  • [Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var by @liusiqian-tal in #18472
  • [TPU]Fix KV cache sharing tests by @lsy323 in #19371
  • [HOT-FIX] Add kv_sharing_target_layer_name argument to cutlass_mla backend by @pavanimajety in #19374
  • [Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration by @lsy323 in #19383
  • [V1] Reuse V0's memory_profiling util for gpu worker memory profiling by @yeqcharlotte in #19312
  • [Bugfix] Fix benchmark_moe.py by @gty111 in #19016
  • Use xla flag to improve the quantized model performance by @vanbasten23 in #19303
  • Fix docs/mkdocs/hooks/remove_announcement.py by @hmellor in #19382
  • [Frontend] Make use_tqdm accept a callable for custom progress bars by @reidliu41 in #19357
  • [Core] Use tuple for kv cache group block ids by @njhill in #19175
  • [Bugfix] Fix modelscope token passed in by @Potabk in #19389
  • [Core] Batch multi modal input using pinned memory by @lgeiger in #19169
  • Add security warning to bug report template by @russellb in #19365
  • [Misc] refactor neuron_multimodal and profiling by @reidliu41 in #19397
  • Add clear documentation around the impact of debugging flag by @annapendleton in #19369
  • Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. by @louie-tsai in #17930
  • Revert "[v1] Add fp32 support to v1 engine through flex attn" by @Isotr0py in #19404
  • [BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword use_irope by @YUNQIUGUO in #19134
  • [BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral by @bigPYJ1151 in #19411
  • Simplify ep kernels installation by @youkaichao in #19412
  • [Misc] Slight improvement of the BNB by @jeejeelee in #19418

New Contributors

  • @nerdalert made their first contribution in #18856
  • @Duyi-Wang made their first contribution in #18692
  • @jinyouzhi made their first contribution in #18918
  • @eric-haibin-lin made their first contribution in #18927
  • @Always-Naive made their first contribution in #18947
  • @yuguo68 made their first contribution in #18937
  • @ptarasiewiczNV made their first contribution in #18969
  • @izhuhaoran made their first contribution in #18935
  • @jennyyyyzhen made their first contribution in #18368
  • @zucchini-nlp made their first contribution in #19068
  • @SorenDreano made their first contribution in #18695
  • @PeaBrane made their first contribution in #18925
  • @jmswen made their first contribution in #19102
  • @dubcyfor3 made their first contribution in #19110
  • @p12tic made their first contribution in #16226
  • @KingsleyZhang123 made their first contribution in #19182
  • @cryptopic made their first contribution in #19103
  • @Adolfo-Karim made their first contribution in #19225
  • @QiliangCui made their first contribution in #19039
  • @draftbk made their first contribution in #19311
  • @pramenku made their first contribution in #19309
  • @KsuParkhamchuk made their first contribution in #19288
  • @cr7258 made their first contribution in #19333
  • @liusiqian-tal made their first contribution in #18472
  • @annapendleton made their first contribution in #19369
  • @louie-tsai made their first contribution in #17930
  • @YUNQIUGUO made their first contribution in #19134

Full Changelog: v0.9.0...v0.9.1
