Highlights
- vLLM's V1 engine is ready for testing! This is a rewritten engine designed for performance and architectural simplicity. You can turn it on by setting the environment variable `VLLM_USE_V1=1`. See our blog for more details. (44 commits)
- New methods (`LLM.sleep`, `LLM.wake_up`, `LLM.collective_rpc`, `LLM.reset_prefix_cache`) in vLLM for post-training frameworks! (#12361, #12084, #12284)
- `torch.compile` is now fully integrated in vLLM, and enabled by default in V1. You can turn it on via the `-O3` engine parameter. (#11614, #12243, #12043, #12191, #11677, #12182, #12246) A combined usage sketch follows this list.
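A minimal sketch of how these pieces fit together, assuming the standard `vllm.LLM` entry point; the `enable_sleep_mode` and `compilation_config` constructor arguments and the worker method name passed to `collective_rpc` are illustrative assumptions rather than confirmed signatures:

```python
import os

# Opt in to the rewritten V1 engine before vLLM is imported.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM

# compilation_config=3 is assumed to mirror the -O3 CLI engine parameter, and
# enable_sleep_mode is assumed to be required for sleep()/wake_up(); both are
# shown here for illustration only.
llm = LLM(
    model="facebook/opt-125m",
    enable_sleep_mode=True,
    compilation_config=3,
)

print(llm.generate(["Hello, my name is"])[0].outputs[0].text)

# New post-training helpers from the highlight above.
llm.reset_prefix_cache()  # invalidate cached prefixes, e.g. after a weight update
llm.sleep()               # release GPU memory while the trainer runs
llm.wake_up()             # restore the engine before the next rollout

# collective_rpc broadcasts a call to every worker; "report_device_id" is a
# hypothetical worker method named only for illustration.
device_ids = llm.collective_rpc("report_device_id")
```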
This release features
- 400 commits from 132 contributors, including 57 new contributors.
- 28 CI and build enhancements, including testing for nightly torch (#12270) and inclusion of genai-perf for benchmark (#10704).
- 58 documentation enhancements, including reorganized documentation structure (#11645, #11755, #11766, #11843, #11896).
- More than 161 bug fixes and miscellaneous enhancements.
Features
Models
- New generative models: CogAgent (#11742), Deepseek-VL2 (#11578, #12068, #12169), fairseq2 Llama (#11442), InternLM3 (#12037), Whisper (#11280)
- New pooling models: Qwen2 PRM (#12202), InternLM2 reward models (#11571)
- VLM: Merged multi-modal processor is now ready for model developers! (#11620, #11900, #11682, #11717, #11669, #11396)
- Any model that implements the merged multi-modal processor and the `get_*_embeddings` methods according to this guide is automatically supported by the V1 engine (a rough interface sketch follows this list).
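Illustrative only: a rough outline of the shape such a model exposes. The method names below follow the `get_*_embeddings` pattern mentioned above; treat the exact signatures as placeholders for what the guide specifies.

```python
from typing import Optional

import torch
from torch import nn


class MyVLForCausalLM(nn.Module):
    """Sketch of a multi-modal model the V1 engine can pick up automatically."""

    def __init__(self, vision_encoder: nn.Module, embed_tokens: nn.Embedding,
                 image_token_id: int):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.embed_tokens = embed_tokens
        self.image_token_id = image_token_id

    def get_multimodal_embeddings(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Encode images into embeddings matching the LLM hidden size.
        return self.vision_encoder(pixel_values)

    def get_input_embeddings(
        self,
        input_ids: torch.Tensor,
        multimodal_embeddings: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        # Embed text tokens, then overwrite placeholder-token positions with
        # the multi-modal embeddings produced above.
        inputs_embeds = self.embed_tokens(input_ids)
        if multimodal_embeddings is not None:
            mask = input_ids == self.image_token_id
            inputs_embeds[mask] = multimodal_embeddings.to(inputs_embeds.dtype)
        return inputs_embeds
```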
Hardware
- Apple: Native support for macOS Apple Silicon (#11696)
- AMD: MI300 FP8 format for block_quant (#12134), Tuned MoE configurations for multiple models (#12408, #12049), block size heuristic for avg 2.8x speedup for int8 models (#11698)
- TPU: support for `W8A8` (#11785)
- x86: Multi-LoRA (#11100) and MoE Support (#11831)
- Progress in out-of-tree hardware support (#12009, #11981, #11948, #11609, #12264, #11516, #11503, #11369, #11602)
Features
- Distributed:
- API Server: Jina- and Cohere-compatible Rerank API (#12376); see the example request after this list.
- Kernels:
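A hedged example of calling the new rerank endpoint against a locally running `vllm serve` instance hosting a reranker model; the `/rerank` path and the `query`/`documents` field names are assumptions based on the Jina-style API shape, so check the server docs for the exact schema:

```python
import requests

# Assumes: vllm serve BAAI/bge-reranker-v2-m3  (listening on localhost:8000)
resp = requests.post(
    "http://localhost:8000/rerank",
    json={
        "model": "BAAI/bge-reranker-v2-m3",
        "query": "What is the capital of France?",
        "documents": [
            "The capital of Brazil is Brasilia.",
            "The capital of France is Paris.",
        ],
    },
)
print(resp.json())  # expected to contain per-document relevance scores
```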
Others
- Benchmark: new script for CPU offloading (#11533)
- Security: Set `weights_only=True` when using `torch.load()` (#12366); see the example below.
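For context, a small example of the safer loading pattern referenced above (the checkpoint path is placeholder):

```python
import torch

# weights_only=True restricts unpickling to tensors and other allow-listed
# types, so a malicious checkpoint cannot execute arbitrary code on load.
state_dict = torch.load("model.pt", weights_only=True, map_location="cpu")
```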
What's Changed
- [Docs] Document Deepseek V3 support by @simon-mo in #11535
- Update openai_compatible_server.md by @robertgshaw2-redhat in #11536
- [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling by @WoosukKwon in #11394
- [V1] Fix yapf by @WoosukKwon in #11538
- [CI] Fix broken CI by @robertgshaw2-redhat in #11543
- [misc] fix typing by @youkaichao in #11540
- [V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly by @robertgshaw2-redhat in #11534
- [BugFix] Deepseekv3 broke quantization for all other methods by @robertgshaw2-redhat in #11547
- [Platform] Move model arch check to platform by @MengqingCao in #11503
- Update deploying_with_k8s.md with AMD ROCm GPU example by @AlexHe99 in #11465
- [Bugfix] Fix TeleChat2ForCausalLM weights mapper by @jeejeelee in #11546
- [Misc] Abstract out the logic for reading and writing media content by @DarkLight1337 in #11527
- [Doc] Add xgrammar in doc by @Chen-0210 in #11549
- [VLM] Support caching in merged multi-modal processor by @DarkLight1337 in #11396
- [MODEL] Update LoRA modules supported by Jamba by @ErezSC42 in #11209
- [Misc]Add BNB quantization for MolmoForCausalLM by @jeejeelee in #11551
- [Misc] Improve BNB loader to handle mixture of sharded and merged weights with same suffix by @Isotr0py in #11566
- [Bugfix] Fix for ROCM compressed tensor support by @selalipop in #11561
- [Doc] Update mllama example based on official doc by @heheda12345 in #11567
- [V1] [4/N] API Server: ZMQ/MP Utilities by @robertgshaw2-redhat in #11541
- [Bugfix] Last token measurement fix by @rajveerb in #11376
- [Model] Support InternLM2 Reward models by @Isotr0py in #11571
- [Model] Remove hardcoded image tokens ids from Pixtral by @ywang96 in #11582
- [Hardware][AMD]: Replace HIPCC version with more precise ROCm version by @hj-wei in #11515
- [V1][Minor] Set pin_memory=False for token_ids_cpu tensor by @WoosukKwon in #11581
- [Doc] Minor documentation fixes by @DarkLight1337 in #11580
- [bugfix] interleaving sliding window for cohere2 model by @youkaichao in #11583
- [V1] [5/N] API Server: unify `Detokenizer` and `EngineCore` input by @robertgshaw2-redhat in #11545
- [Doc] Convert list tables to MyST by @DarkLight1337 in #11594
- [v1][bugfix] fix cudagraph with inplace buffer assignment by @youkaichao in #11596
- [Misc] Use registry-based initialization for KV cache transfer connector. by @KuntaiDu in #11481
- Remove print statement in DeepseekScalingRotaryEmbedding by @mgoin in #11604
- [v1] fix compilation cache by @youkaichao in #11598
- [Docker] bump up neuron sdk v2.21 by @liangfu in #11593
- [Build][Kernel] Update CUTLASS to v3.6.0 by @tlrmchlsmth in #11607
- [CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels by @bigPYJ1151 in #11618
- [platforms] enable platform plugins by @youkaichao in #11602
- [VLM] Abstract out multi-modal data parsing in merged processor by @DarkLight1337 in #11620
- [V1] [6/N] API Server: Better Shutdown by @robertgshaw2-redhat in #11586
- [Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseModel by @whyiug in #11631
- [benchmark] Remove dependency for H100 benchmark step by @khluu in #11572
- [Model][LoRA]LoRA support added for MolmoForCausalLM by @ayylemao in #11439
- [Bugfix] Fix OpenAI parallel sampling when using xgrammar by @mgoin in #11637
- [Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) by @JohnGiorgi in #6909
- [Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. by @sakunkun in #11565
- [V1] Simpify vision block hash for prefix caching by removing offset from hash by @heheda12345 in #11646
- [V1][VLM] V1 support for selected single-image models. by @ywang96 in #11632
- [Benchmark] Add benchmark script for CPU offloading by @ApostaC in #11533
- [Bugfix][Refactor] Unify model management in frontend by @joerunde in #11660
- [VLM] Add max-count checking in data parser for single image models by @DarkLight1337 in #11661
- [Misc] Optimize Qwen2-VL LoRA test by @jeejeelee in #11663
- [Misc] Replace space with - in the file names by @houseroad in #11667
- [Doc] Fix typo by @serihiro in #11666
- [V1] Implement Cascade Attention by @WoosukKwon in #11635
- [VLM] Move supported limits and max tokens to merged multi-modal processor by @DarkLight1337 in #11669
- [VLM][Bugfix] Multi-modal processor compatible with V1 multi-input by @DarkLight1337 in #11674
- [mypy] Pass type checking in vllm/inputs by @CloseChoice in #11680
- [VLM] Merged multi-modal processor for LLaVA-NeXT by @DarkLight1337 in #11682
- According to vllm.EngineArgs, the name should be distributed_executor_backend by @chunyang-wen in #11689
- [Bugfix] Free cross attention block table for preempted-for-recompute sequence group. by @kathyyu-google in #10013
- [V1][Minor] Optimize token_ids_cpu copy by @WoosukKwon in #11692
- [Bugfix] Change kv scaling factor by param json on nvidia gpu by @bjmsong in #11688
- Resolve race conditions in Marlin kernel by @wchen61 in #11493
- [Misc] Minimum requirements for SageMaker compatibility by @nathan-az in #11576
- Update default max_num_batch_tokens for chunked prefill by @SachinVarghese in #11694
- [Bugfix] Check chain_speculative_sampling before calling it by @houseroad in #11673
- [perf-benchmark] Fix dependency for steps in benchmark pipeline by @khluu in #11710
- [Model] Whisper model implementation by @aurickq in #11280
- [V1] Simplify Shutdown by @robertgshaw2-redhat in #11659
- [Bugfix] Fix ColumnParallelLinearWithLoRA slice by @zinccat in #11708
- [V1] Improve TP>1 Error Handling + Stack Trace by @robertgshaw2-redhat in #11721
- [Misc]Add BNB quantization for Qwen2VL by @jeejeelee in #11719
- Update requirements-tpu.txt to support python 3.9 and 3.11 by @mgoin in #11695
- [V1] Chore: cruft removal by @robertgshaw2-redhat in #11724
- log GPU blocks num for MultiprocExecutor by @WangErXiao in #11656
- Update tool_calling.md by @Bryce1010 in #11701
- Update bnb.md with example for OpenAI by @bet0x in #11718
- [V1] Add `RayExecutor` support for `AsyncLLM` (api server) by @jikunshang in #11712
- [V1] Add kv cache utils tests. by @xcnick in #11513
- [Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture by @yanburman in #11233
- [VLM] Merged multi-modal processors for LLaVA-NeXT-Video and LLaVA-OneVision by @DarkLight1337 in #11717
- [Bugfix] Fix precision error in LLaVA-NeXT feature size calculation by @DarkLight1337 in #11735
- [Model] Remove unnecessary weight initialization logic by @DarkLight1337 in #11736
- [Bugfix][V1] Fix test_kv_cache_utils.py by @jeejeelee in #11738
- [MISC] Replace c10::optional with std::optional by @houseroad in #11730
- [distributed] remove pynccl's redundant stream by @cennn in #11744
- fix: [doc] fix typo by @RuixiangMa in #11751
- [Frontend] Improve `StreamingResponse` Exception Handling by @robertgshaw2-redhat in #11752
- [distributed] remove pynccl's redundant change_state by @cennn in #11749
- [Doc] [1/N] Reorganize Getting Started section by @DarkLight1337 in #11645
- [Bugfix] Remove block size constraint by @comaniac in #11723
- [V1] Add BlockTable class by @WoosukKwon in #11693
- [Misc] Fix typo for valid_tool_parses by @ruisearch42 in #11753
- [V1] Refactor get_executor_cls by @ruisearch42 in #11754
- [mypy] Forward pass function type hints in lora by @lucas-tucker in #11740
- k8s-config: Update the secret to use stringData by @surajssd in #11679
- [VLM] Separate out profiling-related logic by @DarkLight1337 in #11746
- [Doc][2/N] Reorganize Models and Usage sections by @DarkLight1337 in #11755
- [Bugfix] Fix max image size for LLaVA-Onevision by @ywang96 in #11769
- [doc] explain how to add interleaving sliding window support by @youkaichao in #11771
- [Bugfix][V1] Fix molmo text-only inputs by @jeejeelee in #11676
- [Kernel] Move attn_type to Attention.init() by @heheda12345 in #11690
- [V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision by @ywang96 in #11685
- [Bugfix] Fix LLaVA-NeXT feature size precision error (for real) by @DarkLight1337 in #11772
- [Model] Future-proof Qwen2-Audio multi-modal processor by @DarkLight1337 in #11776
- [XPU] Make pp group initilized for pipeline-parallelism by @ys950902 in #11648
- [Doc][3/N] Reorganize Serving section by @DarkLight1337 in #11766
- [Kernel][LoRA]Punica prefill kernels fusion by @jeejeelee in #11234
- [Bugfix] Update attention interface in `Whisper` by @ywang96 in #11784
- [CI] Fix neuron CI and run offline tests by @liangfu in #11779
- fix init error for MessageQueue when n_local_reader is zero by @XiaobingSuper in #11768
- [Doc] Create a vulnerability management team by @russellb in #9925
- [CI][CPU] adding build number to docker image name by @zhouyuan in #11788
- [V1][Doc] Update V1 support for `LLaVa-NeXT-Video` by @ywang96 in #11798
- [Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calculation by @DarkLight1337 in #11800
- [doc] add doc to explain how to use uv by @youkaichao in #11773
- [V1] Support audio language models on V1 by @ywang96 in #11733
- [doc] update how pip can install nightly wheels by @youkaichao in #11806
- [Doc] Add note to `gte-Qwen2` models by @DarkLight1337 in #11808
- [optimization] remove python function call for custom op by @youkaichao in #11750
- [Bugfix] update the prefix for qwen2 by @jiangjiadi in #11795
- [Doc]Add documentation for using EAGLE in vLLM by @sroy745 in #11417
- [Bugfix] Significant performance drop on CPUs with --num-scheduler-steps > 1 by @DamonFool in #11794
- [Doc] Group examples into categories by @hmellor in #11782
- [Bugfix] Fix image input for Pixtral-HF by @DarkLight1337 in #11741
- [Misc] sort torch profiler table by kernel timing by @divakar-amd in #11813
- Remove the duplicate imports of MultiModalKwargs and PlaceholderRange… by @WangErXiao in #11824
- Fixed docker build for ppc64le by @npanpaliya in #11518
- [OpenVINO] Fixed Docker.openvino build by @ilya-lavrenov in #11732
- [Bugfix] Add checks for LoRA and CPU offload by @jeejeelee in #11810
- [Docs] reorganize sponsorship page by @simon-mo in #11639
- [Bug] Fix pickling of `ModelConfig` when RunAI Model Streamer is used by @DarkLight1337 in #11825
- [misc] improve memory profiling by @youkaichao in #11809
- [doc] update wheels url by @youkaichao in #11830
- [Docs] Update sponsor name: 'Novita' to 'Novita AI' by @simon-mo in #11833
- [Hardware][Apple] Native support for macOS Apple Silicon by @wallashss in #11696
- [torch.compile] consider relevant code in compilation cache by @youkaichao in #11614
- [VLM] Reorganize profiling/processing-related code by @DarkLight1337 in #11812
- [Doc] Move examples into categories by @hmellor in #11840
- [Doc][4/N] Reorganize API Reference by @DarkLight1337 in #11843
- [CI/Build][Bugfix] Fix CPU CI image clean up by @bigPYJ1151 in #11836
- [Bugfix][XPU] fix silu_and_mul by @yma11 in #11823
- [Misc] Move some model utils into vision file by @DarkLight1337 in #11848
- [Doc] Expand Multimodal API Reference by @DarkLight1337 in #11852
- [Misc]add some explanations for BlockHashType by @WangErXiao in #11847
- [TPU][Quantization] TPU `W8A8` by @robertgshaw2-redhat in #11785
- [Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models by @rasmith in #11698
- [Docs] Add Google Cloud Meetup by @simon-mo in #11864
- [CI] Turn on basic correctness tests for V1 by @tlrmchlsmth in #10864
- treat do_lower_case in the same way as the sentence-transformers library by @maxdebayser in #11815
- [Doc] Recommend uv and python 3.12 for quickstart guide by @mgoin in #11849
- [Misc] Move `print_*_once` from utils to logger by @DarkLight1337 in #11298
- [Doc] Intended links Python multiprocessing library by @guspan-tanadi in #11878
- [perf]fix current stream by @youkaichao in #11870
- [Bugfix] Override dunder methods of placeholder modules by @DarkLight1337 in #11882
- [Bugfix] fix beam search input errors and latency benchmark script by @yeqcharlotte in #11875
- [Doc] Add model development API Reference by @DarkLight1337 in #11884
- [platform] Allow platform specify attention backend by @wangxiyuan in #11609
- [ci]try to fix flaky multi-step tests by @youkaichao in #11894
- [Misc] Provide correct Pixtral-HF chat template by @DarkLight1337 in #11891
- [Docs] Add Modal to deployment frameworks by @charlesfrye in #11907
- [Doc][5/N] Move Community and API Reference to the bottom by @DarkLight1337 in #11896
- [VLM] Enable tokenized inputs for merged multi-modal processor by @DarkLight1337 in #11900
- [Doc] Show default pooling method in a table by @DarkLight1337 in #11904
- [torch.compile] Hide KV cache behind torch.compile boundary by @heheda12345 in #11677
- [Bugfix] Validate lora adapters to avoid crashing server by @joerunde in #11727
- [BUGFIX] Fix `UnspecifiedPlatform` package name by @jikunshang in #11916
- [ci] fix gh200 tests by @youkaichao in #11919
- [optimization] remove python function call for custom activation op by @cennn in #11885
- [platform] support pytorch custom op pluggable by @wangxiyuan in #11328
- Replace "online inference" with "online serving" by @hmellor in #11923
- [ci] Fix sampler tests by @youkaichao in #11922
- [Doc] [1/N] Initial guide for merged multi-modal processor by @DarkLight1337 in #11925
- [platform] support custom torch.compile backend key by @wangxiyuan in #11318
- [Doc] Rename offline inference examples by @hmellor in #11927
- [Docs] Fix docstring in `get_ip` function by @KuntaiDu in #11932
- [Doc] Docstring fix in `benchmark_long_document_qa_throughput.py` by @KuntaiDu in #11933
- [Hardware][CPU] Support MOE models on x86 CPU by @bigPYJ1151 in #11831
- [Misc] Clean up debug code in Deepseek-V3 by @Isotr0py in #11930
- [Misc] Update benchmark_prefix_caching.py fixed example usage by @remimin in #11920
- [Bugfix] Check that number of images matches number of <|image|> tokens with mllama by @tjohnson31415 in #11939
- [mypy] Fix mypy warnings in api_server.py by @frreiss in #11941
- [ci] fix broken distributed-tests-4-gpus by @youkaichao in #11937
- [Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design by @llsj14 in #11672
- [Bugfix] fused_experts_impl wrong compute type for float32 by @shaochangxu in #11921
- [CI/Build] Move model-specific multi-modal processing tests by @DarkLight1337 in #11934
- [Doc] Basic guide for writing unit tests for new models by @DarkLight1337 in #11951
- [Bugfix] Fix RobertaModel loading by @NickLucche in #11940
- [Model] Add cogagent model support vLLM by @sixsixcoder in #11742
- [V1] Avoid sending text prompt to core engine by @ywang96 in #11963
- [CI/Build] Add markdown linter by @rafvasq in #11857
- [Model] Initialize support for Deepseek-VL2 models by @Isotr0py in #11578
- [Hardware][CPU] Multi-LoRA implementation for the CPU backend by @Akshat-Tripathi in #11100
- [Hardware][TPU] workaround fix for MoE on TPU by @avshalomman in #11764
- [V1][Core][1/n] Logging and Metrics by @robertgshaw2-redhat in #11962
- [Model] Support GGUF models newly added in `transformers` 4.46.0 by @Isotr0py in #9685
- [V1] [2/n] Logging and Metrics - `OutputProcessor` Abstraction by @robertgshaw2-redhat in #11973
- [MISC] fix typo in kv transfer send recv test by @yyccli in #11983
- [Bug] Fix usage of `.transpose()` and `.view()` consecutively. by @liaoyanqing666 in #11979
- [CI][Spec Decode] fix: broken test for EAGLE model by @llsj14 in #11972
- [Misc] Fix Deepseek V2 fp8 kv-scale remapping by @Concurrensee in #11947
- [Misc]Minor Changes about Worker by @noemotiovon in #11555
- [platform] add ray_device_key by @youkaichao in #11948
- Fix Max Token ID for Qwen-VL-Chat by @alex-jw-brooks in #11980
- [Kernel] Attention.forward with unified_attention when use_direct_call=True by @heheda12345 in #11967
- [Doc][V1] Update model implementation guide for V1 support by @ywang96 in #11998
- [Doc] Organise installation documentation into categories and tabs by @hmellor in #11935
- [platform] add device_control env var by @youkaichao in #12009
- [Platform] Move get_punica_wrapper() function to Platform by @shen-shanshan in #11516
- bugfix: Fix signature mismatch in benchmark's `get_tokenizer` function by @e1ijah1 in #11982
- [Doc] Fix build from source and installation link in README.md by @Yikun in #12013
- [Bugfix] Fix deepseekv3 gate bias error by @SunflowerAries in #12002
- [Docs] Add Sky Computing Lab to project intro by @WoosukKwon in #12019
- [Hardware][Gaudi][Bugfix] Fix set_forward_context arguments and CI test execution by @kzawora-intel in #12014
- [Doc] Update Quantization Hardware Support Documentation by @tjtanaa in #12025
- [HPU][misc] add comments for explanation by @youkaichao in #12034
- [Bugfix] Fix various bugs in multi-modal processor by @DarkLight1337 in #12031
- [Kernel] Revert the API change of Attention.forward by @heheda12345 in #12038
- [Platform] Add output for Attention Backend by @wangxiyuan in #11981
- [Bugfix][Kernel] Give unique name to BlockSparseFlashAttention by @heheda12345 in #12040
- Explain where the engine args go when using Docker by @hmellor in #12041
- [Doc]: Update the Json Example of the `Engine Arguments` document by @maang-h in #12045
- [Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping by @jeejeelee in #11924
- [Kernel] Support MulAndSilu by @jeejeelee in #11624
- [HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in setup.py by @kzawora-intel in #12046
- [Platform] Refactor current_memory_usage() function in DeviceMemoryProfiler to Platform by @shen-shanshan in #11369
- [V1][BugFix] Fix edge case in VLM scheduling by @WoosukKwon in #12065
- [Misc] Add multipstep chunked-prefill support for FlashInfer by @elfiegg in #10467
- [core] Turn off GPU communication overlap for Ray executor by @ruisearch42 in #12051
- [core] platform agnostic executor via collective_rpc by @youkaichao in #11256
- [Doc] Update examples to remove SparseAutoModelForCausalLM by @kylesayrs in #12062
- [V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager by @heheda12345 in #12003
- Fix: cases with empty sparsity config by @rahul-tuli in #12057
- Type-fix: make execute_model output type optional by @youngkent in #12020
- [Platform] Do not raise error if _Backend is not found by @wangxiyuan in #12023
- [Model]: Support internlm3 by @RunningLeon in #12037
- Misc: allow to use proxy in `HTTPConnection` by @zhouyuan in #12042
- [Misc][Quark] Upstream Quark format to VLLM by @kewang-xlnx in #10765
- [Doc]: Update `OpenAI-Compatible Server` documents by @maang-h in #12082
- [Bugfix] use right truncation for non-generative tasks by @joerunde in #12050
- [V1][Core] Autotune encoder cache budget by @ywang96 in #11895
- [Bugfix] Fix _get_lora_device for HQQ marlin by @varun-sundar-rabindranath in #12090
- Allow hip sources to be directly included when compiling for rocm. by @tvirolai-amd in #12087
- [Core] Default to using per_token quantization for fp8 when cutlass is supported. by @elfiegg in #8651
- [Doc] Add documentation for specifying model architecture by @DarkLight1337 in #12105
- Various cosmetic/comment fixes by @mgoin in #12089
- [Bugfix] Remove hardcoded `head_size=256` for Deepseek v2 and v3 by @Isotr0py in #12067
- Support torchrun and SPMD-style offline inference by @youkaichao in #12071
- [core] LLM.collective_rpc interface and RLHF example by @youkaichao in #12084
- [Bugfix] Fix max image feature size for Llava-one-vision by @ywang96 in #12104
- [misc] Add LoRA kernel micro benchmarks by @varun-sundar-rabindranath in #11579
- [Model] Add support for deepseek-vl2-tiny model by @Isotr0py in #12068
- [Bugfix] Set enforce_eager automatically for mllama by @heheda12345 in #12127
- [Bugfix] Fix a path bug in disaggregated prefill example script. by @KuntaiDu in #12121
- [CI]add genai-perf benchmark in nightly benchmark by @jikunshang in #10704
- [Doc] Add instructions on using Podman when SELinux is active by @terrytangyuan in #12136
- [Bugfix] Revert PR #11435: Fix issues in CPU build Dockerfile. Fixes #9182 by @terrytangyuan in #12135
- [BugFix] add more `is not None` check in VllmConfig.__post_init__ by @heheda12345 in #12138
- [Misc] Add deepseek_vl2 chat template by @Isotr0py in #12143
- [ROCm][MoE] moe tuning support for rocm by @divakar-amd in #12049
- [V1] Move more control of kv cache initialization from model_executor to EngineCore by @heheda12345 in #11960
- [Misc][LoRA] Improve the readability of LoRA error messages during loading by @jeejeelee in #12102
- [CI/Build][CPU][Bugfix] Fix CPU CI by @bigPYJ1151 in #12150
- [core] allow callable in collective_rpc by @youkaichao in #12151
- [Bugfix] Fix score api for missing max_model_len validation by @wallashss in #12119
- [Bugfix] Mistral tokenizer encode accept list of str by @jikunshang in #12149
- [AMD][FP8] Using MI300 FP8 format on ROCm for block_quant by @gshtras in #12134
- [torch.compile] disable logging when cache is disabled by @youkaichao in #12043
- [misc] fix cross-node TP by @youkaichao in #12166
- [AMD][CI/Build][Bugfix] updated pytorch stale wheel path by using stable wheel by @hongxiayang in #12172
- [core] further polish memory profiling by @youkaichao in #12126
- [Docs] Fix broken link in SECURITY.md by @russellb in #12175
- [Model] Port deepseek-vl2 processor and remove `deepseek_vl2` dependency by @Isotr0py in #12169
- [core] clean up executor class hierarchy between v1 and v0 by @youkaichao in #12171
- [Misc] Support register quantization method out-of-tree by @ice-tong in #11969
- [V1] Collect env var for usage stats by @simon-mo in #12115
- [BUGFIX] Move scores to float32 in case of running xgrammar on cpu by @madamczykhabana in #12152
- [Bugfix] Fix multi-modal processors for transformers 4.48 by @DarkLight1337 in #12187
- [torch.compile] store inductor compiled Python file by @youkaichao in #12182
- benchmark_serving support --served-model-name param by @gujingit in #12109
- [Misc] Add BNB support to GLM4-V model by @Isotr0py in #12184
- [V1] Add V1 support of Qwen2-VL by @ywang96 in #12128
- [Model] Support for fairseq2 Llama by @MartinGleize in #11442
- [Bugfix] Fix num_heads value for simple connector when tp enabled by @ShangmingCai in #12074
- [torch.compile] fix sym_tensor_indices by @youkaichao in #12191
- Move linting to `pre-commit` by @hmellor in #11975
- [DOC] Fix typo in SingleStepOutputProcessor docstring and assert message by @terrytangyuan in #12194
- [DOC] Add missing docstring for additional args in LLMEngine.add_request() by @terrytangyuan in #12195
- [Bugfix] Fix incorrect types in LayerwiseProfileResults by @terrytangyuan in #12196
- [Model] Add Qwen2 PRM model support by @Isotr0py in #12202
- [Core] Interface for accessing model from `VllmRunner` by @DarkLight1337 in #10353
- [misc] add placeholder format.sh by @youkaichao in #12206
- [CI/Build] Remove dummy CI steps by @DarkLight1337 in #12208
- [CI/Build] Make pre-commit faster by @DarkLight1337 in #12212
- [Model] Upgrade Aria to transformers 4.48 by @DarkLight1337 in #12203
- [misc] print a message to suggest how to bypass commit hooks by @youkaichao in #12217
- [core][bugfix] configure env var during import vllm by @youkaichao in #12209
- [V1] Remove `_get_cache_block_size` by @heheda12345 in #12214
- [Misc] Pass `attention` to impl backend by @wangxiyuan in #12218
- [Bugfix] Fix `HfExampleModels.find_hf_info` by @DarkLight1337 in #12223
- [CI] Pass local python version explicitly to pre-commit mypy.sh by @heheda12345 in #12224
- [Misc] Update CODEOWNERS by @ywang96 in #12229
- fix: update platform detection for M-series arm based MacBook processors by @isikhi in #12227
- [misc] add cuda runtime version to usage data by @youkaichao in #12190
- [bugfix] catch xgrammar unsupported array constraints by @Jason-CKY in #12210
- [Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) by @jinzhen-lin in #12222
- Add quantization and guided decoding CODEOWNERS by @mgoin in #12228
- [AMD][Build] Porting dockerfiles from the ROCm/vllm fork by @gshtras in #11777
- [BugFix] Fix GGUF tp>1 models when vocab_size is not divisible by 64 by @NickLucche in #12230
- [ci/build] disable failed and flaky tests by @youkaichao in #12240
- [Misc] Rename `MultiModalInputsV2 -> MultiModalInputs` by @DarkLight1337 in #12244
- [Misc]Add BNB quantization for PaliGemmaForConditionalGeneration by @jeejeelee in #12237
- [Misc] Remove redundant TypeVar from base model by @DarkLight1337 in #12248
- [Bugfix] Fix mm_limits access for merged multi-modal processor by @DarkLight1337 in #12252
- [torch.compile] transparent compilation with more logging by @youkaichao in #12246
- [V1][Bugfix] Fix data item ordering in mixed-modality inference by @ywang96 in #12259
- [Bugfix] Remove comments re: pytorch for outlines + compressed-tensors dependencies by @tdoublep in #12260
- [Platform] improve platforms getattr by @MengqingCao in #12264
- [ci/build] add nightly torch for test by @youkaichao in #12270
- [Bugfix] fix race condition that leads to wrong order of token returned by @joennlae in #10802
- [Kernel] fix moe_align_block_size error condition by @jinzhen-lin in #12239
- [v1][stats][1/n] Add RequestStatsUpdate and RequestStats types by @rickyyx in #10907
- [Bugfix] Multi-sequence broken by @andylolu2 in #11898
- [Misc] Remove experimental dep from tracing.py by @codefromthecrypt in #12007
- [Misc] Set default backend to SDPA for get_vit_attn_backend by @wangxiyuan in #12235
- [Core] Free CPU pinned memory on environment cleanup by @janimo in #10477
- [bugfix] moe tuning. rm is_navi() by @divakar-amd in #12273
- [BUGFIX] When skip_tokenize_init and multistep are set, execution crashes by @maleksan85 in #12277
- [Documentation][AMD] Add information about prebuilt ROCm vLLM docker for perf validation purpose by @hongxiayang in #12281
- [VLM] Simplify post-processing of replacement info by @DarkLight1337 in #12269
- [ci/lint] Add back default arg for pre-commit by @khluu in #12279
- [CI] add docker volume prune to neuron CI by @liangfu in #12291
- [Ci/Build] Fix mypy errors on main by @DarkLight1337 in #12296
- [Benchmark] More accurate TPOT calc in `benchmark_serving.py` by @njhill in #12288
- [core] separate builder init and builder prepare for each batch by @youkaichao in #12253
- [Build] update requirements of no-device by @MengqingCao in #12299
- [Core] Support fully transparent sleep mode by @youkaichao in #11743
- [VLM] Avoid unnecessary tokenization by @DarkLight1337 in #12310
- [Model][Bugfix]: correct Aria model output by @xffxff in #12309
- [Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 by @ywang96 in #12313
- [Doc] Add docs for prompt replacement by @DarkLight1337 in #12318
- [Misc] Fix the error in the tip for the --lora-modules parameter by @WangErXiao in #12319
- [Misc] Improve the readability of BNB error messages by @jeejeelee in #12320
- [Hardware][Gaudi][Bugfix] Fix HPU tensor parallelism, enable multiprocessing executor by @kzawora-intel in #12167
- [Core] Support `reset_prefix_cache` by @comaniac in #12284
- [Frontend][V1] Online serving performance improvements by @njhill in #12287
- [AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD by @rasmith in #12282
- [Bugfix] Fixing AMD LoRA CI test. by @Alexei-V-Ivanov-AMD in #12329
- [Docs] Update FP8 KV Cache documentation by @mgoin in #12238
- [Docs] Document vulnerability disclosure process by @russellb in #12326
- [V1] Add `uncache_blocks` by @comaniac in #12333
- [doc] explain common errors around torch.compile by @youkaichao in #12340
- [Hardware][Gaudi][BugFix] Fix dataclass error due to triton package update by @zhenwei-intel in #12338
- [Bugfix] Fix k_proj's bias for whisper self attention by @Isotr0py in #12342
- [Kernel] Flash Attention 3 Support by @LucasWilkinson in #12093
- [Doc] Troubleshooting errors during model inspection by @DarkLight1337 in #12351
- [V1] Simplify M-RoPE by @ywang96 in #12352
- [Bugfix] Fix broken internvl2 inference with v1 by @Isotr0py in #12360
- [core] add wake_up doc and some sanity check by @youkaichao in #12361
- [torch.compile] decouple compile sizes and cudagraph sizes by @youkaichao in #12243
- [FP8][Kernel] Dynamic kv cache scaling factors computation by @gshtras in #11906
- [TPU] Update TPU CI to use torchxla nightly on 20250122 by @lsy323 in #12334
- [Docs] Document Phi-4 support by @Isotr0py in #12362
- [BugFix] Fix parameter names and `process_after_weight_loading` for W4A16 MoE Group Act Order by @dsikka in #11528
- [Misc] Fix OpenAI API Compatibility Issues in Benchmark Script by @jsato8094 in #12357
- [Docs] Add meetup slides by @WoosukKwon in #12345
- [Docs] Update spec decode + structured output in compat matrix by @russellb in #12373
- [V1][Frontend] Coalesce bunched `RequestOutput`s by @njhill in #12298
- Set weights_only=True when using torch.load() by @russellb in #12366
- [Bugfix] Path join when building local path for S3 clone by @omer-dayan in #12353
- Update compressed-tensors version by @dsikka in #12367
- [V1] Increase default batch size for H100/H200 by @WoosukKwon in #12369
- [perf] fix perf regression from #12253 by @youkaichao in #12380
- [Misc] Use VisionArena Dataset for VLM Benchmarking by @ywang96 in #12389
- [ci/build] fix wheel size check by @youkaichao in #12396
- [Hardware][Gaudi][Doc] Add missing step in setup instructions by @MohitIntel in #12382
- [ci/build] sync default value for wheel size by @youkaichao in #12398
- [Misc] Enable proxy support in benchmark script by @jsato8094 in #12356
- [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build by @LucasWilkinson in #12375
- [Misc] Remove deprecated code by @DarkLight1337 in #12383
- [Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). by @LucasWilkinson in #12405
- [Bugfix][Kernel] Fix moe align block issue for mixtral by @ElizaWszola in #12413
- [Bugfix] Fix BLIP-2 processing by @DarkLight1337 in #12412
- [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 by @divakar-amd in #12408
- [Misc] Add FA2 support to ViT MHA layer by @Isotr0py in #12355
- [TPU][CI] Update torchxla version in requirement-tpu.txt by @lsy323 in #12422
- [Misc][Bugfix] FA3 support to ViT MHA layer by @ywang96 in #12435
- [V1][Perf] Reduce scheduling overhead in model runner after cuda sync by @youngkent in #12094
- [V1][Bugfix] Fix assertion when mm hashing is turned off by @ywang96 in #12439
- [Misc] Revert FA on ViT #12355 and #12435 by @ywang96 in #12445
- [Frontend] Set server's maximum number of generated tokens using generation_config.json by @mhendrey in #12242
- [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 by @tlrmchlsmth in #12417
- [Bugfix/CI] Fix broken kernels/test_mha.py by @tlrmchlsmth in #12450
- [Bugfix][Kernel] Fix perf regression caused by PR #12405 by @LucasWilkinson in #12434
- [Build/CI] Fix libcuda.so linkage by @tlrmchlsmth in #12424
- [Frontend] Rerank API (Jina- and Cohere-compatible API) by @K-Mistele in #12376
- [DOC] Add link to vLLM blog by @terrytangyuan in #12460
- [V1] Avoid list creation in input preparation by @WoosukKwon in #12457
- [Frontend] Support scores endpoint in run_batch by @pooyadavoodi in #12430
- [Bugfix] Fix Granite 3.0 MoE model loading by @DarkLight1337 in #12446
New Contributors
- @Chen-0210 made their first contribution in #11549
- @ErezSC42 made their first contribution in #11209
- @selalipop made their first contribution in #11561
- @rajveerb made their first contribution in #11376
- @hj-wei made their first contribution in #11515
- @ayylemao made their first contribution in #11439
- @JohnGiorgi made their first contribution in #6909
- @sakunkun made their first contribution in #11565
- @ApostaC made their first contribution in #11533
- @houseroad made their first contribution in #11667
- @serihiro made their first contribution in #11666
- @CloseChoice made their first contribution in #11680
- @chunyang-wen made their first contribution in #11689
- @kathyyu-google made their first contribution in #10013
- @bjmsong made their first contribution in #11688
- @nathan-az made their first contribution in #11576
- @SachinVarghese made their first contribution in #11694
- @zinccat made their first contribution in #11708
- @WangErXiao made their first contribution in #11656
- @Bryce1010 made their first contribution in #11701
- @bet0x made their first contribution in #11718
- @yanburman made their first contribution in #11233
- @RuixiangMa made their first contribution in #11751
- @surajssd made their first contribution in #11679
- @ys950902 made their first contribution in #11648
- @XiaobingSuper made their first contribution in #11768
- @jiangjiadi made their first contribution in #11795
- @guspan-tanadi made their first contribution in #11878
- @yeqcharlotte made their first contribution in #11875
- @charlesfrye made their first contribution in #11907
- @remimin made their first contribution in #11920
- @frreiss made their first contribution in #11941
- @shaochangxu made their first contribution in #11921
- @Akshat-Tripathi made their first contribution in #11100
- @liaoyanqing666 made their first contribution in #11979
- @Concurrensee made their first contribution in #11947
- @shen-shanshan made their first contribution in #11516
- @e1ijah1 made their first contribution in #11982
- @Yikun made their first contribution in #12013
- @SunflowerAries made their first contribution in #12002
- @maang-h made their first contribution in #12045
- @rahul-tuli made their first contribution in #12057
- @youngkent made their first contribution in #12020
- @RunningLeon made their first contribution in #12037
- @kewang-xlnx made their first contribution in #10765
- @tvirolai-amd made their first contribution in #12087
- @ice-tong made their first contribution in #11969
- @madamczykhabana made their first contribution in #12152
- @gujingit made their first contribution in #12109
- @MartinGleize made their first contribution in #11442
- @isikhi made their first contribution in #12227
- @Jason-CKY made their first contribution in #12210
- @andylolu2 made their first contribution in #11898
- @codefromthecrypt made their first contribution in #12007
- @zhenwei-intel made their first contribution in #12338
- @MohitIntel made their first contribution in #12382
- @mhendrey made their first contribution in #12242
Full Changelog: v0.6.6...v0.7.0