vLLM v0.7.0

Highlights

  • vLLM's V1 engine is ready for testing! This is a rewritten engine designed for performance and architectural simplicity. You can turn it on by setting the environment variable VLLM_USE_V1=1; see the sketch after this list. See our blog for more details. (44 commits)
  • New methods (LLM.sleep, LLM.wake_up, LLM.collective_rpc, LLM.reset_prefix_cache) for post-training frameworks (#12361, #12084, #12284).
  • torch.compile is now fully integrated into vLLM and enabled by default in V1. You can turn it on via the -O3 engine parameter (#11614, #12243, #12043, #12191, #11677, #12182, #12246).
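
The sketch below ties the first two highlights together: where the VLLM_USE_V1 switch goes, and what a post-training loop might do with the new control methods. The model name, the enable_sleep_mode flag, and the call order are illustrative assumptions based on the linked PRs, not prescribed usage; check the API reference for the exact signatures.

```python
# Minimal sketch, not vLLM's canonical example.
# To try the new V1 engine, export the environment variable before vLLM is imported:
#   export VLLM_USE_V1=1
# torch.compile can also be requested explicitly via the -O3 engine parameter,
# e.g. `vllm serve <model> -O3` (assumed CLI form; see #11614 and related PRs).

from vllm import LLM

# enable_sleep_mode is an assumption based on #11743/#12361: it lets the engine
# release GPU memory between generation rounds of an RLHF-style training loop.
llm = LLM(model="facebook/opt-125m", enable_sleep_mode=True)

print(llm.generate(["The capital of France is"])[0].outputs[0].text)

llm.sleep()               # free GPU memory while the trainer owns the device
llm.wake_up()             # restore the engine before the next generation round
llm.reset_prefix_cache()  # drop cached prefixes after the weights have changed
# llm.collective_rpc("method_name") broadcasts a call to every worker; see the
# RLHF example added in #12084 for a complete pattern.
```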

This release features

  • 400 commits from 132 contributors, including 57 new contributors.
    • 28 CI and build enhancements, including testing for nightly torch (#12270) and inclusion of genai-perf in the nightly benchmarks (#10704).
    • 58 documentation enhancements, including reorganized documentation structure (#11645, #11755, #11766, #11843, #11896).
    • more than 161 bug fixes and miscellaneous enhancements.

Features

  • Distributed:
    • Support torchrun and SPMD-style offline inference (#12071)
    • New collective_rpc abstraction (#12151, #11256)
  • API Server: Jina- and Cohere-compatible Rerank API (#12376); see the example after this list
  • Kernels:
    • Flash Attention 3 Support (#12093)
    • Punica prefill kernels fusion (#11234)
    • For DeepSeek-V3: optimized moe_align_block_size for CUDA graphs and large num_experts (#12222)
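
As a concrete illustration of the new Rerank API, here is a hedged client-side sketch against a locally running OpenAI-compatible vLLM server. The route and payload follow the Jina/Cohere rerank conventions described in #12376; the exact path (/rerank, /v1/rerank, or /v2/rerank), the model name, and the response shape are assumptions to confirm against the documentation for your deployment.

```python
# Hypothetical client for the new rerank endpoint; assumes a reranker model is
# served at http://localhost:8000 and that the Jina-style /rerank route is exposed.
import requests

resp = requests.post(
    "http://localhost:8000/rerank",
    json={
        "model": "BAAI/bge-reranker-base",
        "query": "What is the capital of France?",
        "documents": [
            "Paris is the capital of France.",
            "Berlin is the capital of Germany.",
        ],
    },
)
resp.raise_for_status()
print(resp.json())  # expected: each document with a relevance score, best match first
```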

Others

  • Benchmark: new script for CPU offloading (#11533)
  • Security: Set weights_only=True when using torch.load() (#12366)
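
The torch.load() hardening in #12366 follows the standard PyTorch pattern; a minimal illustration of the safer call (not a copy of vLLM's actual call sites) looks like this:

```python
import torch

# With the default settings, torch.load() unpickles arbitrary Python objects,
# which is unsafe for untrusted checkpoints. weights_only=True restricts
# deserialization to tensors and plain containers, which is what #12366 enforces.
state_dict = torch.load("checkpoint.pt", map_location="cpu", weights_only=True)
```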

What's Changed

  • [Docs] Document Deepseek V3 support by @simon-mo in #11535
  • Update openai_compatible_server.md by @robertgshaw2-redhat in #11536
  • [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling by @WoosukKwon in #11394
  • [V1] Fix yapf by @WoosukKwon in #11538
  • [CI] Fix broken CI by @robertgshaw2-redhat in #11543
  • [misc] fix typing by @youkaichao in #11540
  • [V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly by @robertgshaw2-redhat in #11534
  • [BugFix] Deepseekv3 broke quantization for all other methods by @robertgshaw2-redhat in #11547
  • [Platform] Move model arch check to platform by @MengqingCao in #11503
  • Update deploying_with_k8s.md with AMD ROCm GPU example by @AlexHe99 in #11465
  • [Bugfix] Fix TeleChat2ForCausalLM weights mapper by @jeejeelee in #11546
  • [Misc] Abstract out the logic for reading and writing media content by @DarkLight1337 in #11527
  • [Doc] Add xgrammar in doc by @Chen-0210 in #11549
  • [VLM] Support caching in merged multi-modal processor by @DarkLight1337 in #11396
  • [MODEL] Update LoRA modules supported by Jamba by @ErezSC42 in #11209
  • [Misc]Add BNB quantization for MolmoForCausalLM by @jeejeelee in #11551
  • [Misc] Improve BNB loader to handle mixture of sharded and merged weights with same suffix by @Isotr0py in #11566
  • [Bugfix] Fix for ROCM compressed tensor support by @selalipop in #11561
  • [Doc] Update mllama example based on official doc by @heheda12345 in #11567
  • [V1] [4/N] API Server: ZMQ/MP Utilities by @robertgshaw2-redhat in #11541
  • [Bugfix] Last token measurement fix by @rajveerb in #11376
  • [Model] Support InternLM2 Reward models by @Isotr0py in #11571
  • [Model] Remove hardcoded image tokens ids from Pixtral by @ywang96 in #11582
  • [Hardware][AMD]: Replace HIPCC version with more precise ROCm version by @hj-wei in #11515
  • [V1][Minor] Set pin_memory=False for token_ids_cpu tensor by @WoosukKwon in #11581
  • [Doc] Minor documentation fixes by @DarkLight1337 in #11580
  • [bugfix] interleaving sliding window for cohere2 model by @youkaichao in #11583
  • [V1] [5/N] API Server: unify Detokenizer and EngineCore input by @robertgshaw2-redhat in #11545
  • [Doc] Convert list tables to MyST by @DarkLight1337 in #11594
  • [v1][bugfix] fix cudagraph with inplace buffer assignment by @youkaichao in #11596
  • [Misc] Use registry-based initialization for KV cache transfer connector. by @KuntaiDu in #11481
  • Remove print statement in DeepseekScalingRotaryEmbedding by @mgoin in #11604
  • [v1] fix compilation cache by @youkaichao in #11598
  • [Docker] bump up neuron sdk v2.21 by @liangfu in #11593
  • [Build][Kernel] Update CUTLASS to v3.6.0 by @tlrmchlsmth in #11607
  • [CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels by @bigPYJ1151 in #11618
  • [platforms] enable platform plugins by @youkaichao in #11602
  • [VLM] Abstract out multi-modal data parsing in merged processor by @DarkLight1337 in #11620
  • [V1] [6/N] API Server: Better Shutdown by @robertgshaw2-redhat in #11586
  • [Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseModel by @whyiug in #11631
  • [benchmark] Remove dependency for H100 benchmark step by @khluu in #11572
  • [Model][LoRA]LoRA support added for MolmoForCausalLM by @ayylemao in #11439
  • [Bugfix] Fix OpenAI parallel sampling when using xgrammar by @mgoin in #11637
  • [Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) by @JohnGiorgi in #6909
  • [Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. by @sakunkun in #11565
  • [V1] Simpify vision block hash for prefix caching by removing offset from hash by @heheda12345 in #11646
  • [V1][VLM] V1 support for selected single-image models. by @ywang96 in #11632
  • [Benchmark] Add benchmark script for CPU offloading by @ApostaC in #11533
  • [Bugfix][Refactor] Unify model management in frontend by @joerunde in #11660
  • [VLM] Add max-count checking in data parser for single image models by @DarkLight1337 in #11661
  • [Misc] Optimize Qwen2-VL LoRA test by @jeejeelee in #11663
  • [Misc] Replace space with - in the file names by @houseroad in #11667
  • [Doc] Fix typo by @serihiro in #11666
  • [V1] Implement Cascade Attention by @WoosukKwon in #11635
  • [VLM] Move supported limits and max tokens to merged multi-modal processor by @DarkLight1337 in #11669
  • [VLM][Bugfix] Multi-modal processor compatible with V1 multi-input by @DarkLight1337 in #11674
  • [mypy] Pass type checking in vllm/inputs by @CloseChoice in #11680
  • [VLM] Merged multi-modal processor for LLaVA-NeXT by @DarkLight1337 in #11682
  • According to vllm.EngineArgs, the name should be distributed_executor_backend by @chunyang-wen in #11689
  • [Bugfix] Free cross attention block table for preempted-for-recompute sequence group. by @kathyyu-google in #10013
  • [V1][Minor] Optimize token_ids_cpu copy by @WoosukKwon in #11692
  • [Bugfix] Change kv scaling factor by param json on nvidia gpu by @bjmsong in #11688
  • Resolve race conditions in Marlin kernel by @wchen61 in #11493
  • [Misc] Minimum requirements for SageMaker compatibility by @nathan-az in #11576
  • Update default max_num_batch_tokens for chunked prefill by @SachinVarghese in #11694
  • [Bugfix] Check chain_speculative_sampling before calling it by @houseroad in #11673
  • [perf-benchmark] Fix dependency for steps in benchmark pipeline by @khluu in #11710
  • [Model] Whisper model implementation by @aurickq in #11280
  • [V1] Simplify Shutdown by @robertgshaw2-redhat in #11659
  • [Bugfix] Fix ColumnParallelLinearWithLoRA slice by @zinccat in #11708
  • [V1] Improve TP>1 Error Handling + Stack Trace by @robertgshaw2-redhat in #11721
  • [Misc]Add BNB quantization for Qwen2VL by @jeejeelee in #11719
  • Update requirements-tpu.txt to support python 3.9 and 3.11 by @mgoin in #11695
  • [V1] Chore: cruft removal by @robertgshaw2-redhat in #11724
  • log GPU blocks num for MultiprocExecutor by @WangErXiao in #11656
  • Update tool_calling.md by @Bryce1010 in #11701
  • Update bnb.md with example for OpenAI by @bet0x in #11718
  • [V1] Add RayExecutor support for AsyncLLM (api server) by @jikunshang in #11712
  • [V1] Add kv cache utils tests. by @xcnick in #11513
  • [Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture by @yanburman in #11233
  • [VLM] Merged multi-modal processors for LLaVA-NeXT-Video and LLaVA-OneVision by @DarkLight1337 in #11717
  • [Bugfix] Fix precision error in LLaVA-NeXT feature size calculation by @DarkLight1337 in #11735
  • [Model] Remove unnecessary weight initialization logic by @DarkLight1337 in #11736
  • [Bugfix][V1] Fix test_kv_cache_utils.py by @jeejeelee in #11738
  • [MISC] Replace c10::optional with std::optional by @houseroad in #11730
  • [distributed] remove pynccl's redundant stream by @cennn in #11744
  • fix: [doc] fix typo by @RuixiangMa in #11751
  • [Frontend] Improve StreamingResponse Exception Handling by @robertgshaw2-redhat in #11752
  • [distributed] remove pynccl's redundant change_state by @cennn in #11749
  • [Doc] [1/N] Reorganize Getting Started section by @DarkLight1337 in #11645
  • [Bugfix] Remove block size constraint by @comaniac in #11723
  • [V1] Add BlockTable class by @WoosukKwon in #11693
  • [Misc] Fix typo for valid_tool_parses by @ruisearch42 in #11753
  • [V1] Refactor get_executor_cls by @ruisearch42 in #11754
  • [mypy] Forward pass function type hints in lora by @lucas-tucker in #11740
  • k8s-config: Update the secret to use stringData by @surajssd in #11679
  • [VLM] Separate out profiling-related logic by @DarkLight1337 in #11746
  • [Doc][2/N] Reorganize Models and Usage sections by @DarkLight1337 in #11755
  • [Bugfix] Fix max image size for LLaVA-Onevision by @ywang96 in #11769
  • [doc] explain how to add interleaving sliding window support by @youkaichao in #11771
  • [Bugfix][V1] Fix molmo text-only inputs by @jeejeelee in #11676
  • [Kernel] Move attn_type to Attention.init() by @heheda12345 in #11690
  • [V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision by @ywang96 in #11685
  • [Bugfix] Fix LLaVA-NeXT feature size precision error (for real) by @DarkLight1337 in #11772
  • [Model] Future-proof Qwen2-Audio multi-modal processor by @DarkLight1337 in #11776
  • [XPU] Make pp group initilized for pipeline-parallelism by @ys950902 in #11648
  • [Doc][3/N] Reorganize Serving section by @DarkLight1337 in #11766
  • [Kernel][LoRA]Punica prefill kernels fusion by @jeejeelee in #11234
  • [Bugfix] Update attention interface in Whisper by @ywang96 in #11784
  • [CI] Fix neuron CI and run offline tests by @liangfu in #11779
  • fix init error for MessageQueue when n_local_reader is zero by @XiaobingSuper in #11768
  • [Doc] Create a vulnerability management team by @russellb in #9925
  • [CI][CPU] adding build number to docker image name by @zhouyuan in #11788
  • [V1][Doc] Update V1 support for LLaVa-NeXT-Video by @ywang96 in #11798
  • [Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calculation by @DarkLight1337 in #11800
  • [doc] add doc to explain how to use uv by @youkaichao in #11773
  • [V1] Support audio language models on V1 by @ywang96 in #11733
  • [doc] update how pip can install nightly wheels by @youkaichao in #11806
  • [Doc] Add note to gte-Qwen2 models by @DarkLight1337 in #11808
  • [optimization] remove python function call for custom op by @youkaichao in #11750
  • [Bugfix] update the prefix for qwen2 by @jiangjiadi in #11795
  • [Doc]Add documentation for using EAGLE in vLLM by @sroy745 in #11417
  • [Bugfix] Significant performance drop on CPUs with --num-scheduler-steps > 1 by @DamonFool in #11794
  • [Doc] Group examples into categories by @hmellor in #11782
  • [Bugfix] Fix image input for Pixtral-HF by @DarkLight1337 in #11741
  • [Misc] sort torch profiler table by kernel timing by @divakar-amd in #11813
  • Remove the duplicate imports of MultiModalKwargs and PlaceholderRange… by @WangErXiao in #11824
  • Fixed docker build for ppc64le by @npanpaliya in #11518
  • [OpenVINO] Fixed Docker.openvino build by @ilya-lavrenov in #11732
  • [Bugfix] Add checks for LoRA and CPU offload by @jeejeelee in #11810
  • [Docs] reorganize sponsorship page by @simon-mo in #11639
  • [Bug] Fix pickling of ModelConfig when RunAI Model Streamer is used by @DarkLight1337 in #11825
  • [misc] improve memory profiling by @youkaichao in #11809
  • [doc] update wheels url by @youkaichao in #11830
  • [Docs] Update sponsor name: 'Novita' to 'Novita AI' by @simon-mo in #11833
  • [Hardware][Apple] Native support for macOS Apple Silicon by @wallashss in #11696
  • [torch.compile] consider relevant code in compilation cache by @youkaichao in #11614
  • [VLM] Reorganize profiling/processing-related code by @DarkLight1337 in #11812
  • [Doc] Move examples into categories by @hmellor in #11840
  • [Doc][4/N] Reorganize API Reference by @DarkLight1337 in #11843
  • [CI/Build][Bugfix] Fix CPU CI image clean up by @bigPYJ1151 in #11836
  • [Bugfix][XPU] fix silu_and_mul by @yma11 in #11823
  • [Misc] Move some model utils into vision file by @DarkLight1337 in #11848
  • [Doc] Expand Multimodal API Reference by @DarkLight1337 in #11852
  • [Misc]add some explanations for BlockHashType by @WangErXiao in #11847
  • [TPU][Quantization] TPU W8A8 by @robertgshaw2-redhat in #11785
  • [Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models by @rasmith in #11698
  • [Docs] Add Google Cloud Meetup by @simon-mo in #11864
  • [CI] Turn on basic correctness tests for V1 by @tlrmchlsmth in #10864
  • treat do_lower_case in the same way as the sentence-transformers library by @maxdebayser in #11815
  • [Doc] Recommend uv and python 3.12 for quickstart guide by @mgoin in #11849
  • [Misc] Move print_*_once from utils to logger by @DarkLight1337 in #11298
  • [Doc] Intended links Python multiprocessing library by @guspan-tanadi in #11878
  • [perf]fix current stream by @youkaichao in #11870
  • [Bugfix] Override dunder methods of placeholder modules by @DarkLight1337 in #11882
  • [Bugfix] fix beam search input errors and latency benchmark script by @yeqcharlotte in #11875
  • [Doc] Add model development API Reference by @DarkLight1337 in #11884
  • [platform] Allow platform specify attention backend by @wangxiyuan in #11609
  • [ci]try to fix flaky multi-step tests by @youkaichao in #11894
  • [Misc] Provide correct Pixtral-HF chat template by @DarkLight1337 in #11891
  • [Docs] Add Modal to deployment frameworks by @charlesfrye in #11907
  • [Doc][5/N] Move Community and API Reference to the bottom by @DarkLight1337 in #11896
  • [VLM] Enable tokenized inputs for merged multi-modal processor by @DarkLight1337 in #11900
  • [Doc] Show default pooling method in a table by @DarkLight1337 in #11904
  • [torch.compile] Hide KV cache behind torch.compile boundary by @heheda12345 in #11677
  • [Bugfix] Validate lora adapters to avoid crashing server by @joerunde in #11727
  • [BUGFIX] Fix UnspecifiedPlatform package name by @jikunshang in #11916
  • [ci] fix gh200 tests by @youkaichao in #11919
  • [optimization] remove python function call for custom activation op by @cennn in #11885
  • [platform] support pytorch custom op pluggable by @wangxiyuan in #11328
  • Replace "online inference" with "online serving" by @hmellor in #11923
  • [ci] Fix sampler tests by @youkaichao in #11922
  • [Doc] [1/N] Initial guide for merged multi-modal processor by @DarkLight1337 in #11925
  • [platform] support custom torch.compile backend key by @wangxiyuan in #11318
  • [Doc] Rename offline inference examples by @hmellor in #11927
  • [Docs] Fix docstring in get_ip function by @KuntaiDu in #11932
  • [Doc] Docstring fix in benchmark_long_document_qa_throughput.py by @KuntaiDu in #11933
  • [Hardware][CPU] Support MOE models on x86 CPU by @bigPYJ1151 in #11831
  • [Misc] Clean up debug code in Deepseek-V3 by @Isotr0py in #11930
  • [Misc] Update benchmark_prefix_caching.py fixed example usage by @remimin in #11920
  • [Bugfix] Check that number of images matches number of <|image|> tokens with mllama by @tjohnson31415 in #11939
  • [mypy] Fix mypy warnings in api_server.py by @frreiss in #11941
  • [ci] fix broken distributed-tests-4-gpus by @youkaichao in #11937
  • [Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design by @llsj14 in #11672
  • [Bugfix] fused_experts_impl wrong compute type for float32 by @shaochangxu in #11921
  • [CI/Build] Move model-specific multi-modal processing tests by @DarkLight1337 in #11934
  • [Doc] Basic guide for writing unit tests for new models by @DarkLight1337 in #11951
  • [Bugfix] Fix RobertaModel loading by @NickLucche in #11940
  • [Model] Add cogagent model support vLLM by @sixsixcoder in #11742
  • [V1] Avoid sending text prompt to core engine by @ywang96 in #11963
  • [CI/Build] Add markdown linter by @rafvasq in #11857
  • [Model] Initialize support for Deepseek-VL2 models by @Isotr0py in #11578
  • [Hardware][CPU] Multi-LoRA implementation for the CPU backend by @Akshat-Tripathi in #11100
  • [Hardware][TPU] workaround fix for MoE on TPU by @avshalomman in #11764
  • [V1][Core][1/n] Logging and Metrics by @robertgshaw2-redhat in #11962
  • [Model] Support GGUF models newly added in transformers 4.46.0 by @Isotr0py in #9685
  • [V1] [2/n] Logging and Metrics - OutputProcessor Abstraction by @robertgshaw2-redhat in #11973
  • [MISC] fix typo in kv transfer send recv test by @yyccli in #11983
  • [Bug] Fix usage of .transpose() and .view() consecutively. by @liaoyanqing666 in #11979
  • [CI][Spec Decode] fix: broken test for EAGLE model by @llsj14 in #11972
  • [Misc] Fix Deepseek V2 fp8 kv-scale remapping by @Concurrensee in #11947
  • [Misc]Minor Changes about Worker by @noemotiovon in #11555
  • [platform] add ray_device_key by @youkaichao in #11948
  • Fix Max Token ID for Qwen-VL-Chat by @alex-jw-brooks in #11980
  • [Kernel] Attention.forward with unified_attention when use_direct_call=True by @heheda12345 in #11967
  • [Doc][V1] Update model implementation guide for V1 support by @ywang96 in #11998
  • [Doc] Organise installation documentation into categories and tabs by @hmellor in #11935
  • [platform] add device_control env var by @youkaichao in #12009
  • [Platform] Move get_punica_wrapper() function to Platform by @shen-shanshan in #11516
  • bugfix: Fix signature mismatch in benchmark's get_tokenizer function by @e1ijah1 in #11982
  • [Doc] Fix build from source and installation link in README.md by @Yikun in #12013
  • [Bugfix] Fix deepseekv3 gate bias error by @SunflowerAries in #12002
  • [Docs] Add Sky Computing Lab to project intro by @WoosukKwon in #12019
  • [Hardware][Gaudi][Bugfix] Fix set_forward_context arguments and CI test execution by @kzawora-intel in #12014
  • [Doc] Update Quantization Hardware Support Documentation by @tjtanaa in #12025
  • [HPU][misc] add comments for explanation by @youkaichao in #12034
  • [Bugfix] Fix various bugs in multi-modal processor by @DarkLight1337 in #12031
  • [Kernel] Revert the API change of Attention.forward by @heheda12345 in #12038
  • [Platform] Add output for Attention Backend by @wangxiyuan in #11981
  • [Bugfix][Kernel] Give unique name to BlockSparseFlashAttention by @heheda12345 in #12040
  • Explain where the engine args go when using Docker by @hmellor in #12041
  • [Doc]: Update the Json Example of the Engine Arguments document by @maang-h in #12045
  • [Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping by @jeejeelee in #11924
  • [Kernel] Support MulAndSilu by @jeejeelee in #11624
  • [HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in setup.py by @kzawora-intel in #12046
  • [Platform] Refactor current_memory_usage() function in DeviceMemoryProfiler to Platform by @shen-shanshan in #11369
  • [V1][BugFix] Fix edge case in VLM scheduling by @WoosukKwon in #12065
  • [Misc] Add multipstep chunked-prefill support for FlashInfer by @elfiegg in #10467
  • [core] Turn off GPU communication overlap for Ray executor by @ruisearch42 in #12051
  • [core] platform agnostic executor via collective_rpc by @youkaichao in #11256
  • [Doc] Update examples to remove SparseAutoModelForCausalLM by @kylesayrs in #12062
  • [V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager by @heheda12345 in #12003
  • Fix: cases with empty sparsity config by @rahul-tuli in #12057
  • Type-fix: make execute_model output type optional by @youngkent in #12020
  • [Platform] Do not raise error if _Backend is not found by @wangxiyuan in #12023
  • [Model]: Support internlm3 by @RunningLeon in #12037
  • Misc: allow to use proxy in HTTPConnection by @zhouyuan in #12042
  • [Misc][Quark] Upstream Quark format to VLLM by @kewang-xlnx in #10765
  • [Doc]: Update OpenAI-Compatible Server documents by @maang-h in #12082
  • [Bugfix] use right truncation for non-generative tasks by @joerunde in #12050
  • [V1][Core] Autotune encoder cache budget by @ywang96 in #11895
  • [Bugfix] Fix _get_lora_device for HQQ marlin by @varun-sundar-rabindranath in #12090
  • Allow hip sources to be directly included when compiling for rocm. by @tvirolai-amd in #12087
  • [Core] Default to using per_token quantization for fp8 when cutlass is supported. by @elfiegg in #8651
  • [Doc] Add documentation for specifying model architecture by @DarkLight1337 in #12105
  • Various cosmetic/comment fixes by @mgoin in #12089
  • [Bugfix] Remove hardcoded head_size=256 for Deepseek v2 and v3 by @Isotr0py in #12067
  • Support torchrun and SPMD-style offline inference by @youkaichao in #12071
  • [core] LLM.collective_rpc interface and RLHF example by @youkaichao in #12084
  • [Bugfix] Fix max image feature size for Llava-one-vision by @ywang96 in #12104
  • [misc] Add LoRA kernel micro benchmarks by @varun-sundar-rabindranath in #11579
  • [Model] Add support for deepseek-vl2-tiny model by @Isotr0py in #12068
  • [Bugfix] Set enforce_eager automatically for mllama by @heheda12345 in #12127
  • [Bugfix] Fix a path bug in disaggregated prefill example script. by @KuntaiDu in #12121
  • [CI]add genai-perf benchmark in nightly benchmark by @jikunshang in #10704
  • [Doc] Add instructions on using Podman when SELinux is active by @terrytangyuan in #12136
  • [Bugfix] Revert PR #11435: Fix issues in CPU build Dockerfile. Fixes #9182 by @terrytangyuan in #12135
  • [BugFix] add more is not None check in VllmConfig.post_init by @heheda12345 in #12138
  • [Misc] Add deepseek_vl2 chat template by @Isotr0py in #12143
  • [ROCm][MoE] moe tuning support for rocm by @divakar-amd in #12049
  • [V1] Move more control of kv cache initialization from model_executor to EngineCore by @heheda12345 in #11960
  • [Misc][LoRA] Improve the readability of LoRA error messages during loading by @jeejeelee in #12102
  • [CI/Build][CPU][Bugfix] Fix CPU CI by @bigPYJ1151 in #12150
  • [core] allow callable in collective_rpc by @youkaichao in #12151
  • [Bugfix] Fix score api for missing max_model_len validation by @wallashss in #12119
  • [Bugfix] Mistral tokenizer encode accept list of str by @jikunshang in #12149
  • [AMD][FP8] Using MI300 FP8 format on ROCm for block_quant by @gshtras in #12134
  • [torch.compile] disable logging when cache is disabled by @youkaichao in #12043
  • [misc] fix cross-node TP by @youkaichao in #12166
  • [AMD][CI/Build][Bugfix] updated pytorch stale wheel path by using stable wheel by @hongxiayang in #12172
  • [core] further polish memory profiling by @youkaichao in #12126
  • [Docs] Fix broken link in SECURITY.md by @russellb in #12175
  • [Model] Port deepseek-vl2 processor and remove deepseek_vl2 dependency by @Isotr0py in #12169
  • [core] clean up executor class hierarchy between v1 and v0 by @youkaichao in #12171
  • [Misc] Support register quantization method out-of-tree by @ice-tong in #11969
  • [V1] Collect env var for usage stats by @simon-mo in #12115
  • [BUGFIX] Move scores to float32 in case of running xgrammar on cpu by @madamczykhabana in #12152
  • [Bugfix] Fix multi-modal processors for transformers 4.48 by @DarkLight1337 in #12187
  • [torch.compile] store inductor compiled Python file by @youkaichao in #12182
  • benchmark_serving support --served-model-name param by @gujingit in #12109
  • [Misc] Add BNB support to GLM4-V model by @Isotr0py in #12184
  • [V1] Add V1 support of Qwen2-VL by @ywang96 in #12128
  • [Model] Support for fairseq2 Llama by @MartinGleize in #11442
  • [Bugfix] Fix num_heads value for simple connector when tp enabled by @ShangmingCai in #12074
  • [torch.compile] fix sym_tensor_indices by @youkaichao in #12191
  • Move linting to pre-commit by @hmellor in #11975
  • [DOC] Fix typo in SingleStepOutputProcessor docstring and assert message by @terrytangyuan in #12194
  • [DOC] Add missing docstring for additional args in LLMEngine.add_request() by @terrytangyuan in #12195
  • [Bugfix] Fix incorrect types in LayerwiseProfileResults by @terrytangyuan in #12196
  • [Model] Add Qwen2 PRM model support by @Isotr0py in #12202
  • [Core] Interface for accessing model from VllmRunner by @DarkLight1337 in #10353
  • [misc] add placeholder format.sh by @youkaichao in #12206
  • [CI/Build] Remove dummy CI steps by @DarkLight1337 in #12208
  • [CI/Build] Make pre-commit faster by @DarkLight1337 in #12212
  • [Model] Upgrade Aria to transformers 4.48 by @DarkLight1337 in #12203
  • [misc] print a message to suggest how to bypass commit hooks by @youkaichao in #12217
  • [core][bugfix] configure env var during import vllm by @youkaichao in #12209
  • [V1] Remove _get_cache_block_size by @heheda12345 in #12214
  • [Misc] Pass attention to impl backend by @wangxiyuan in #12218
  • [Bugfix] Fix HfExampleModels.find_hf_info by @DarkLight1337 in #12223
  • [CI] Pass local python version explicitly to pre-commit mypy.sh by @heheda12345 in #12224
  • [Misc] Update CODEOWNERS by @ywang96 in #12229
  • fix: update platform detection for M-series arm based MacBook processors by @isikhi in #12227
  • [misc] add cuda runtime version to usage data by @youkaichao in #12190
  • [bugfix] catch xgrammar unsupported array constraints by @Jason-CKY in #12210
  • [Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) by @jinzhen-lin in #12222
  • Add quantization and guided decoding CODEOWNERS by @mgoin in #12228
  • [AMD][Build] Porting dockerfiles from the ROCm/vllm fork by @gshtras in #11777
  • [BugFix] Fix GGUF tp>1 models when vocab_size is not divisible by 64 by @NickLucche in #12230
  • [ci/build] disable failed and flaky tests by @youkaichao in #12240
  • [Misc] Rename MultiModalInputsV2 -> MultiModalInputs by @DarkLight1337 in #12244
  • [Misc]Add BNB quantization for PaliGemmaForConditionalGeneration by @jeejeelee in #12237
  • [Misc] Remove redundant TypeVar from base model by @DarkLight1337 in #12248
  • [Bugfix] Fix mm_limits access for merged multi-modal processor by @DarkLight1337 in #12252
  • [torch.compile] transparent compilation with more logging by @youkaichao in #12246
  • [V1][Bugfix] Fix data item ordering in mixed-modality inference by @ywang96 in #12259
  • [Bugfix] Remove comments re: pytorch for outlines + compressed-tensors dependencies by @tdoublep in #12260
  • [Platform] improve platforms getattr by @MengqingCao in #12264
  • [ci/build] add nightly torch for test by @youkaichao in #12270
  • [Bugfix] fix race condition that leads to wrong order of token returned by @joennlae in #10802
  • [Kernel] fix moe_align_block_size error condition by @jinzhen-lin in #12239
  • [v1][stats][1/n] Add RequestStatsUpdate and RequestStats types by @rickyyx in #10907
  • [Bugfix] Multi-sequence broken by @andylolu2 in #11898
  • [Misc] Remove experimental dep from tracing.py by @codefromthecrypt in #12007
  • [Misc] Set default backend to SDPA for get_vit_attn_backend by @wangxiyuan in #12235
  • [Core] Free CPU pinned memory on environment cleanup by @janimo in #10477
  • [bugfix] moe tuning. rm is_navi() by @divakar-amd in #12273
  • [BUGFIX] When skip_tokenize_init and multistep are set, execution crashes by @maleksan85 in #12277
  • [Documentation][AMD] Add information about prebuilt ROCm vLLM docker for perf validation purpose by @hongxiayang in #12281
  • [VLM] Simplify post-processing of replacement info by @DarkLight1337 in #12269
  • [ci/lint] Add back default arg for pre-commit by @khluu in #12279
  • [CI] add docker volume prune to neuron CI by @liangfu in #12291
  • [Ci/Build] Fix mypy errors on main by @DarkLight1337 in #12296
  • [Benchmark] More accurate TPOT calc in benchmark_serving.py by @njhill in #12288
  • [core] separate builder init and builder prepare for each batch by @youkaichao in #12253
  • [Build] update requirements of no-device by @MengqingCao in #12299
  • [Core] Support fully transparent sleep mode by @youkaichao in #11743
  • [VLM] Avoid unnecessary tokenization by @DarkLight1337 in #12310
  • [Model][Bugfix]: correct Aria model output by @xffxff in #12309
  • [Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 by @ywang96 in #12313
  • [Doc] Add docs for prompt replacement by @DarkLight1337 in #12318
  • [Misc] Fix the error in the tip for the --lora-modules parameter by @WangErXiao in #12319
  • [Misc] Improve the readability of BNB error messages by @jeejeelee in #12320
  • [Hardware][Gaudi][Bugfix] Fix HPU tensor parallelism, enable multiprocessing executor by @kzawora-intel in #12167
  • [Core] Support reset_prefix_cache by @comaniac in #12284
  • [Frontend][V1] Online serving performance improvements by @njhill in #12287
  • [AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD by @rasmith in #12282
  • [Bugfix] Fixing AMD LoRA CI test. by @Alexei-V-Ivanov-AMD in #12329
  • [Docs] Update FP8 KV Cache documentation by @mgoin in #12238
  • [Docs] Document vulnerability disclosure process by @russellb in #12326
  • [V1] Add uncache_blocks by @comaniac in #12333
  • [doc] explain common errors around torch.compile by @youkaichao in #12340
  • [Hardware][Gaudi][BugFix] Fix dataclass error due to triton package update by @zhenwei-intel in #12338
  • [Bugfix] Fix k_proj's bias for whisper self attention by @Isotr0py in #12342
  • [Kernel] Flash Attention 3 Support by @LucasWilkinson in #12093
  • [Doc] Troubleshooting errors during model inspection by @DarkLight1337 in #12351
  • [V1] Simplify M-RoPE by @ywang96 in #12352
  • [Bugfix] Fix broken internvl2 inference with v1 by @Isotr0py in #12360
  • [core] add wake_up doc and some sanity check by @youkaichao in #12361
  • [torch.compile] decouple compile sizes and cudagraph sizes by @youkaichao in #12243
  • [FP8][Kernel] Dynamic kv cache scaling factors computation by @gshtras in #11906
  • [TPU] Update TPU CI to use torchxla nightly on 20250122 by @lsy323 in #12334
  • [Docs] Document Phi-4 support by @Isotr0py in #12362
  • [BugFix] Fix parameter names and process_after_weight_loading for W4A16 MoE Group Act Order by @dsikka in #11528
  • [Misc] Fix OpenAI API Compatibility Issues in Benchmark Script by @jsato8094 in #12357
  • [Docs] Add meetup slides by @WoosukKwon in #12345
  • [Docs] Update spec decode + structured output in compat matrix by @russellb in #12373
  • [V1][Frontend] Coalesce bunched RequestOutputs by @njhill in #12298
  • Set weights_only=True when using torch.load() by @russellb in #12366
  • [Bugfix] Path join when building local path for S3 clone by @omer-dayan in #12353
  • Update compressed-tensors version by @dsikka in #12367
  • [V1] Increase default batch size for H100/H200 by @WoosukKwon in #12369
  • [perf] fix perf regression from #12253 by @youkaichao in #12380
  • [Misc] Use VisionArena Dataset for VLM Benchmarking by @ywang96 in #12389
  • [ci/build] fix wheel size check by @youkaichao in #12396
  • [Hardware][Gaudi][Doc] Add missing step in setup instructions by @MohitIntel in #12382
  • [ci/build] sync default value for wheel size by @youkaichao in #12398
  • [Misc] Enable proxy support in benchmark script by @jsato8094 in #12356
  • [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build by @LucasWilkinson in #12375
  • [Misc] Remove deprecated code by @DarkLight1337 in #12383
  • [Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). by @LucasWilkinson in #12405
  • [Bugfix][Kernel] Fix moe align block issue for mixtral by @ElizaWszola in #12413
  • [Bugfix] Fix BLIP-2 processing by @DarkLight1337 in #12412
  • [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 by @divakar-amd in #12408
  • [Misc] Add FA2 support to ViT MHA layer by @Isotr0py in #12355
  • [TPU][CI] Update torchxla version in requirement-tpu.txt by @lsy323 in #12422
  • [Misc][Bugfix] FA3 support to ViT MHA layer by @ywang96 in #12435
  • [V1][Perf] Reduce scheduling overhead in model runner after cuda sync by @youngkent in #12094
  • [V1][Bugfix] Fix assertion when mm hashing is turned off by @ywang96 in #12439
  • [Misc] Revert FA on ViT #12355 and #12435 by @ywang96 in #12445
  • [Frontend] Set server's maximum number of generated tokens using generation_config.json by @mhendrey in #12242
  • [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 by @tlrmchlsmth in #12417
  • [Bugfix/CI] Fix broken kernels/test_mha.py by @tlrmchlsmth in #12450
  • [Bugfix][Kernel] Fix perf regression caused by PR #12405 by @LucasWilkinson in #12434
  • [Build/CI] Fix libcuda.so linkage by @tlrmchlsmth in #12424
  • [Frontend] Rerank API (Jina- and Cohere-compatible API) by @K-Mistele in #12376
  • [DOC] Add link to vLLM blog by @terrytangyuan in #12460
  • [V1] Avoid list creation in input preparation by @WoosukKwon in #12457
  • [Frontend] Support scores endpoint in run_batch by @pooyadavoodi in #12430
  • [Bugfix] Fix Granite 3.0 MoE model loading by @DarkLight1337 in #12446

New Contributors

  • @Chen-0210 made their first contribution in #11549
  • @ErezSC42 made their first contribution in #11209
  • @selalipop made their first contribution in #11561
  • @rajveerb made their first contribution in #11376
  • @hj-wei made their first contribution in #11515
  • @ayylemao made their first contribution in #11439
  • @JohnGiorgi made their first contribution in #6909
  • @sakunkun made their first contribution in #11565
  • @ApostaC made their first contribution in #11533
  • @houseroad made their first contribution in #11667
  • @serihiro made their first contribution in #11666
  • @CloseChoice made their first contribution in #11680
  • @chunyang-wen made their first contribution in #11689
  • @kathyyu-google made their first contribution in #10013
  • @bjmsong made their first contribution in #11688
  • @nathan-az made their first contribution in #11576
  • @SachinVarghese made their first contribution in #11694
  • @zinccat made their first contribution in #11708
  • @WangErXiao made their first contribution in #11656
  • @Bryce1010 made their first contribution in #11701
  • @bet0x made their first contribution in #11718
  • @yanburman made their first contribution in #11233
  • @RuixiangMa made their first contribution in #11751
  • @surajssd made their first contribution in #11679
  • @ys950902 made their first contribution in #11648
  • @XiaobingSuper made their first contribution in #11768
  • @jiangjiadi made their first contribution in #11795
  • @guspan-tanadi made their first contribution in #11878
  • @yeqcharlotte made their first contribution in #11875
  • @charlesfrye made their first contribution in #11907
  • @remimin made their first contribution in #11920
  • @frreiss made their first contribution in #11941
  • @shaochangxu made their first contribution in #11921
  • @Akshat-Tripathi made their first contribution in #11100
  • @liaoyanqing666 made their first contribution in #11979
  • @Concurrensee made their first contribution in #11947
  • @shen-shanshan made their first contribution in #11516
  • @e1ijah1 made their first contribution in #11982
  • @Yikun made their first contribution in #12013
  • @SunflowerAries made their first contribution in #12002
  • @maang-h made their first contribution in #12045
  • @rahul-tuli made their first contribution in #12057
  • @youngkent made their first contribution in #12020
  • @RunningLeon made their first contribution in #12037
  • @kewang-xlnx made their first contribution in #10765
  • @tvirolai-amd made their first contribution in #12087
  • @ice-tong made their first contribution in #11969
  • @madamczykhabana made their first contribution in #12152
  • @gujingit made their first contribution in #12109
  • @MartinGleize made their first contribution in #11442
  • @isikhi made their first contribution in #12227
  • @Jason-CKY made their first contribution in #12210
  • @andylolu2 made their first contribution in #11898
  • @codefromthecrypt made their first contribution in #12007
  • @zhenwei-intel made their first contribution in #12338
  • @MohitIntel made their first contribution in #12382
  • @mhendrey made their first contribution in #12242

Full Changelog: v0.6.6...v0.7.0
