Highlights
- Significant progress on the V1 engine refactor and multimodal support: New model executable interfaces for text-only and multimodal models, multiprocessing, improved configuration handling, and profiling enhancements (#10374, #10570, #11074, #11076, #10382, #10665, #10564, #11125, #11185, #11242).
- Major improvements in `torch.compile` integration: support for all attention backends, encoder-based models, dynamic FP8 fusion, shape specialization fixes, and performance optimizations (#10558, #10613, #10121, #10383, #10399, #10406, #10437, #10460, #10552, #10622, #10722, #10620, #10906, #11108, #11059, #11005, #10838, #11081, #11110).
- Expanded model support, including Aria, Cross Encoders, GLM-4, OLMo November 2024, Telechat2, LoRA improvements, and multimodal Granite models (#10514, #10400, #10561, #10503, #10311, #10291, #9057, #10418, #5064).
- xgrammar is now the default guided decoding backend (#10785)
- Improved hardware enablement for AMD ROCm, ARM AARCH64, TPU prefix caching, XPU AWQ/GPTQ, and various CPU/Gaudi/HPU/NVIDIA enhancements (#10254, #9228, #10307, #10107, #10667, #10565, #10239, #11016, #9735, #10355, #10700).
- Note: the default temperature for ChatCompletionRequest changed from 0.7 to 1.0 to align with OpenAI (#11219); a sketch of pinning this and the new guided decoding default explicitly follows below.
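Two of the highlights above change user-visible defaults. A minimal sketch of pinning both through the OpenAI-compatible server, assuming a server already running on localhost:8000; the served model name is a placeholder, and the `guided_decoding_backend` / `guided_json` request fields are passed via `extra_body` as illustrative vLLM-specific extras rather than verbatim from this release:

```python
# Sketch only: assumes a vLLM OpenAI-compatible server is already running
# (e.g. `vllm serve <model>` on localhost:8000); model name is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="<served-model-name>",  # placeholder
    messages=[{"role": "user", "content": 'Reply with a JSON object like {"ok": true}'}],
    # The server-side default temperature is now 1.0 (was 0.7, #11219);
    # pass it explicitly if you relied on the old default.
    temperature=0.7,
    extra_body={
        # xgrammar is now the default guided decoding backend (#10785);
        # override per request if you need the previous outlines behavior.
        "guided_decoding_backend": "outlines",
        "guided_json": {"type": "object", "properties": {"ok": {"type": "boolean"}}},
    },
)
print(resp.choices[0].message.content)
```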
Model Support
- Added Aria (#10514), Cross Encoder (#10400), GLM-4 (#10561), OLMo (#10503), Telechat2 (#10311), Cohere R7B (#11203), GritLM embeddings (#10816)
- LoRA support for Internlm2, glm-4v, Pixtral-HF (#5064, #10418, #10795).
- Improved bitsandbytes (BNB) quantization support for multiple models (#10795, #10842, #10682, #10549)
- Expanded multimodal support (#10291, #11142).
Hardware Support
- AMD ROCm GGUF quantization (#10254), ARM AARCH64 enablement (#9228), TPU prefix caching (#10307), XPU AWQ/GPTQ (#10107), CPU/Gaudi/HPU enhancements (#10355, #10667, #10565, #10239, #11016, #9735, #10541, #10394, #10700).
Performance & Scheduling
- Prefix-cache aware scheduling (#10128), sliding window support (#10462), disaggregated prefill enhancements (#10502, #10884), evictor optimization (#7209).
Benchmark & Frontend
- Benchmark structured outputs and vision datasets (#10804, #10557, #10880, #10547).
- Frontend: Automatic chat format detection (#9919), input_audio support (#11027; see the sketch below), CLI --version (#10369), extra fields in requests (#10463).
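For the input_audio addition (#11027), a hedged sketch of a chat request using OpenAI-style audio content parts; the endpoint, model name, and audio file are placeholders, and the exact field shape is assumed to follow the OpenAI `input_audio` format the PR title refers to:

```python
# Sketch only: assumes the chat endpoint accepts OpenAI-style `input_audio`
# content parts (#11027); model name and audio file are placeholders.
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = client.chat.completions.create(
    model="<audio-capable-model>",  # e.g. an audio-capable model served by vLLM
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this clip."},
            {
                "type": "input_audio",
                "input_audio": {"data": audio_b64, "format": "wav"},
            },
        ],
    }],
)
print(resp.choices[0].message.content)
```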
Documentation & Plugins
- Architecture overview (#10368), Helm chart (#9199), KubeAI integration (#10837), plugin system docs (#10372; a plugin sketch follows below), disaggregated prefilling (#11197), structured outputs (#9943), usage section (#10827).
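As context for the plugin system docs (#10372), a minimal sketch of what an out-of-tree general plugin can look like; the package name, module layout, entry-point group, and registration call are assumptions drawn from the plugin documentation, not verbatim from this release:

```python
# my_vllm_plugin/__init__.py -- sketch of an out-of-tree plugin, assuming the
# `vllm.general_plugins` entry-point group and ModelRegistry registration API
# described in the plugin system docs.
#
# pyproject.toml excerpt (assumed entry-point wiring):
#   [project.entry-points."vllm.general_plugins"]
#   my_vllm_plugin = "my_vllm_plugin:register"


def register():
    """Called by vLLM at startup when the plugin is installed."""
    from vllm import ModelRegistry

    if "MyLlamaForCausalLM" not in ModelRegistry.get_supported_archs():
        # Hypothetical out-of-tree model class, registered under its HF
        # architecture name so vLLM can resolve it from the model config.
        from my_vllm_plugin.my_llama import MyLlamaForCausalLM

        ModelRegistry.register_model("MyLlamaForCausalLM", MyLlamaForCausalLM)
```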
Bugfixes & Misc
- Numerous bug fixes and miscellaneous improvements across the engine, frontend, and kernels; see the complete list under "What's Changed" below.
What's Changed
- Add default value to avoid Falcon crash (#5363) by @wchen61 in #10347
- [Misc] Fix import error in tensorizer tests and cleanup some code by @DarkLight1337 in #10349
- [Doc] Remove float32 choice from --lora-dtype by @xyang16 in #10348
- [Bugfix] Fix fully sharded LoRA bug by @jeejeelee in #10352
- [Misc] Fix some help info of arg_utils to improve readability by @ShangmingCai in #10362
- [core][misc] keep compatibility for old-style classes by @youkaichao in #10356
- [Bugfix] Ensure special tokens are properly filtered out for guided structured output with MistralTokenizer by @gcalmettes in #10363
- [Misc] Bump up test_fused_moe tolerance by @ElizaWszola in #10364
- [Misc] bump mistral common version by @simon-mo in #10367
- [Docs] Add Nebius as sponsors by @simon-mo in #10371
- [Frontend] Add --version flag to CLI by @russellb in #10369
- [Doc] Move PR template content to docs by @russellb in #10159
- [Docs] Misc updates to TPU installation instructions by @mikegre-google in #10165
- [Frontend] Automatic detection of chat content format from AST by @DarkLight1337 in #9919
- [doc] add doc for the plugin system by @youkaichao in #10372
- [misc][plugin] improve log messages by @youkaichao in #10386
- [BugFix] [Kernel] Fix GPU SEGV occurring in fused_moe kernel by @rasmith in #10385
- [Misc] Update benchmark to support image_url file or http by @kakao-steve-ai in #10287
- [Misc] Medusa supports custom bias by @skylee-01 in #10361
- [Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled by @imkero in #10388
- [V1] Add code owners for V1 by @WoosukKwon in #10397
- [2/N][torch.compile] make compilation cfg part of vllm cfg by @youkaichao in #10383
- [V1] Refactor model executable interface for all text-only language models by @ywang96 in #10374
- [CI/Build] Fix IDC hpu [Device not found] issue by @xuechendi in #10384
- [Bugfix][Hardware][CPU] Fix CPU embedding runner with tensor parallel by @Isotr0py in #10394
- [platforms] refactor cpu code by @youkaichao in #10402
- [Hardware] [HPU] add `mark_step` for hpu by @jikunshang in #10239
- [Bugfix] Fix mrope_position_delta in non-last prefill chunk by @imkero in #10403
- [Misc] Enhance offline_inference to support user-configurable paramet… by @wchen61 in #10392
- [Misc] Add uninitialized params tracking for `AutoWeightsLoader` by @Isotr0py in #10327
- [Bugfix] Ignore ray reinit error when current platform is ROCm or XPU by @HollowMan6 in #10375
- [4/N][torch.compile] clean up set_torch_compile_backend by @youkaichao in #10401
- [VLM] Report multi_modal_placeholders in output by @lk-chen in #10407
- [Model] Remove redundant softmax when using PoolingType.STEP by @Maybewuss in #10415
- [Model][LoRA]LoRA support added for glm-4v by @B-201 in #10418
- [Model] Remove transformers attention porting in VITs by @Isotr0py in #10414
- [Doc] Update doc for LoRA support in GLM-4V by @B-201 in #10425
- [5/N][torch.compile] torch.jit.script --> torch.compile by @youkaichao in #10406
- [Doc] Add documentation for Structured Outputs by @ismael-dm in #9943
- Fix open_collective value in FUNDING.yml by @andrew in #10426
- [Model][Bugfix] Support TP for PixtralHF ViT by @mgoin in #10405
- [Hardware][XPU] AWQ/GPTQ support for xpu backend by @yma11 in #10107
- [Kernel] Explicitly specify other value in tl.load calls by @angusYuhao in #9014
- [Kernel] Initial Machete W4A8 support + Refactors by @LucasWilkinson in #9855
- [3/N][torch.compile] consolidate custom op logging by @youkaichao in #10399
- [ci][bugfix] fix kernel tests by @youkaichao in #10431
- [misc] Allow partial prefix benchmarking & random input generation for prefix benchmarking by @rickyyx in #9929
- [ci/build] Have dependabot ignore all patch update by @khluu in #10436
- [Bugfix]Fix Phi-3 BNB online quantization by @jeejeelee in #10417
- [Platform][Refactor] Extract func `get_default_attn_backend` to `Platform` by @MengqingCao in #10358
- Add openai.beta.chat.completions.parse example to structured_outputs.rst by @mgoin in #10433
- [Bugfix] Guard for negative counter metrics to prevent crash by @tjohnson31415 in #10430
- [Misc] Avoid misleading warning messages by @jeejeelee in #10438
- [Doc] Add the start of an arch overview page by @russellb in #10368
- [misc][plugin] improve plugin loading by @youkaichao in #10443
- [CI][CPU] adding numa node number as container name suffix by @zhouyuan in #10441
- [BugFix] Fix hermes tool parser output error stream arguments in some cases (#10395) by @xiyuan-lee in #10398
- [Pixtral-Large] Pixtral actually has no bias in vision-lang adapter by @patrickvonplaten in #10449
- Fix: Build error seen on Power Architecture by @mikejuliet13 in #10421
- [Doc] fix link for page that was renamed by @russellb in #10455
- [6/N] torch.compile rollout to users by @youkaichao in #10437
- [Core] Avoid metrics log noise when idle by @russellb in #8868
- [Model][Quantization] HQQ support through Marlin kernel expansion by @ElizaWszola in #9766
- Change granite chat template to keep json list formatting for tool calls by @maxdebayser in #10452
- [CI/Build] Update Dockerfile.rocm by @Alexei-V-Ivanov-AMD in #10434
- [Bugfix] Marlin 2:4 temp fix for large M dim (>256) by @LucasWilkinson in #10464
- [Misc] Add setitem for LazyDict by @liuyanyi in #10469
- [Bugfix] Fix Mamba model initialization and MLP Speculator weights loading by @Isotr0py in #10456
- [Bugfix] Enforce no chunked prefill for embedding models by @DarkLight1337 in #10470
- [CI/Build] Add sphinx/rst linter for docs by @rafvasq in #10366
- [CI/Build] Support compilation with local cutlass path (#10423) by @wchen61 in #10424
- [ci/build] Combine nightly and optional by @khluu in #10465
- [model] Reduce medusa weight by @skylee-01 in #10454
- [Bugfix] Handle conflicts between modern and legacy fields by @DarkLight1337 in #10471
- [Platforms] Refactor xpu code by @MengqingCao in #10468
- [Hardware][CPU] Support chunked-prefill and prefix-caching on CPU by @bigPYJ1151 in #10355
- [platforms] restore xpu check for parallel config by @youkaichao in #10479
- [perf bench] H200 development by @simon-mo in #9768
- [7/N] torch.compile, reduce compilation time by @youkaichao in #10460
- [Bugfix]: allow extra fields in requests to openai compatible server by @gcalmettes in #10463
- [TPU] Implement prefix caching for TPUs by @WoosukKwon in #10307
- [torch.compile] limit inductor threads and lazy import quant by @youkaichao in #10482
- [Core] Add Sliding Window Support with Flashinfer by @pavanimajety in #10462
- [Platforms] Add `device_type` in `Platform` by @MengqingCao in #10508
- [torch.compile] PostGradPassManager, Inductor code caching fix, fix_functionalization pass refactor + tests by @ProExpertProg in #10273
- [Misc] Increase default video fetch timeout by @DarkLight1337 in #10495
- [platforms] improve error message for unspecified platforms by @youkaichao in #10520
- [Doc] fix a small typo in docstring of llama_tool_parser by @FerdinandZhong in #10513
- [Model] Add Support for Multimodal Granite Models by @alex-jw-brooks in #10291
- fix the issue that len(tokenizer(prompt)["input_ids"]) > prompt_len by @sywangyi in #10524
- [Model] Expose `dynamic_image_size` as mm_processor_kwargs for InternVL2 models by @Isotr0py in #10518
- [Bugfix] Embedding model pooling_type equals ALL and multi input's bug by @BBuf in #10494
- [Bug]: When apply continue_final_message for OpenAI server, the "echo":false is ignored by @chaunceyjiang in #10180
- [Kernel] Register punica ops directly by @jeejeelee in #10522
- [Misc] Suppress duplicated logging regarding multimodal input pipeline by @ywang96 in #10530
- [Bugfix] Allow token ID-only inputs in Qwen2-Audio by @DarkLight1337 in #10536
- [8/N] enable cli flag without a space by @youkaichao in #10529
- [V1] Fix Compilation config & Enable CUDA graph by default by @WoosukKwon in #10528
- [CI][Installation] Avoid uploading CUDA 11.8 wheel by @cermeng in #10535
- [misc] improve error message by @youkaichao in #10553
- [Minor] Revert change in offline inference example by @WoosukKwon in #10545
- Add small example to metrics.rst by @mgoin in #10550
- [Benchmark] Add new H100 machine by @simon-mo in #10547
- [9/N] torch.compile LLM usage by @youkaichao in #10552
- [Minor] Fix line-too-long by @WoosukKwon in #10563
- [platforms] absorb worker cls difference into platforms folder by @youkaichao in #10555
- [Bugfix] Fix Phi-3 BNB quantization with tensor parallel by @Isotr0py in #9948
- Remove token-adding chat embedding params by @noamgat in #10551
- [bugfix] fix full graph tests by @youkaichao in #10581
- [torch.compile] support all attention backends by @youkaichao in #10558
- [v1] Refactor KVCacheManager for more hash input than token ids by @rickyyx in #10507
- support bitsandbytes quantization with qwen model by @zixuanzhang226 in #10549
- [Core] remove temporary local variables in LLMEngine.init by @russellb in #10577
- [V1] EngineCore supports profiling by @Abatom in #10564
- [bugfix] fix cpu tests by @youkaichao in #10585
- [Bugfix][Frontend] Update Llama Chat Templates to also support Non-Tool use by @tjohnson31415 in #10164
- [Core] Fix broken log configuration by @russellb in #10458
- [Misc] Add pynccl wrappers for all_gather and reduce_scatter by @tlrmchlsmth in #9432
- [core] gemma2 full context length support by @youkaichao in #10584
- [Bugfix] 500 Internal Server Error when tool_choice is incorrect. by @shenoyvvarun in #10567
- [Model] Fix Baichuan BNB online quantization by @CNTRYROA in #10572
- Update default max_num_batch_tokens for chunked prefill to 2048 by @mgoin in #10544
- [Kernel][Hardware][AMD] Add support for GGUF quantization on ROCm by @kliuae in #10254
- Prefix Cache Aware Scheduling [1/n] by @rickyyx in #10128
- [2/N] Proper handling of placeholders in merged multi-modal processor by @DarkLight1337 in #10485
- [Bugfix][Hardware][CPU] Fix `multi_modal_kwargs` broadcast for CPU tensor parallel by @Isotr0py in #10541
- [Platforms] Refactor openvino code by @statelesshz in #10573
- [CI/Build] For ppc64le, disabled tests for now and addressed space issues by @npanpaliya in #10538
- [Bugfix] Avoid import AttentionMetadata explicitly in Mllama and fix openvino import by @Isotr0py in #10593
- [bugfix] Fix example/tensorize_vllm_model tests by @jeejeelee in #10595
- [Bugfix] Fix the LoRA weight sharding in ColumnParallelLinearWithLoRA by @jeejeelee in #10450
- [CI/Build] Print running script to enhance CI log readability by @jeejeelee in #10594
- Revert "[CI/Build] Print running script to enhance CI log readability" by @youkaichao in #10601
- [model][utils] add extract_layer_index utility function by @youkaichao in #10599
- [doc] update the code to add models by @youkaichao in #10603
- [Doc] Update README.md with Ray Summit talk links by @zhuohan123 in #10610
- Support Cross encoder models by @maxdebayser in #10400
- [Refactor][MISC] del redundant code in ParallelConfig.postinit by @MengqingCao in #10614
- [torch.compile] support encoder based models by @youkaichao in #10613
- [Doc] Add encoder-based models to Supported Models page by @DarkLight1337 in #10616
- [torch.compile] force inductor threads by @jeejeelee in #10620
- [torch.compile] add warning for unsupported models by @youkaichao in #10622
- [misc] add torch.compile compatibility check by @youkaichao in #10618
- [misc] move functions to config.py by @youkaichao in #10624
- [Model] Support `is_causal` HF config field for Qwen2 model by @DarkLight1337 in #10621
- [Doc] Super tiny little typo fix by @fzyzcjy in #10633
- [Bug]: Authorization ignored when root_path is set by @chaunceyjiang in #10606
- [Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices by @wallashss in #9850
- [Docs] Add Snowflake Slides by @simon-mo in #10641
- [Model]: Add support for Aria model by @xffxff in #10514
- [Model] Enable optional prefix when loading embedding models by @DarkLight1337 in #10639
- [Doc] Fix typos in docs by @DarkLight1337 in #10636
- [Model] Add OLMo November 2024 model by @2015aroras in #10503
- [misc] do not read HOST_IP by @youkaichao in #10644
- [bugfix] fix aria model and add torch.compile by @youkaichao in #10645
- [Feature] vLLM ARM Enablement for AARCH64 CPUs by @sanketkaleoss in #9228
- [v1] EngineArgs for better config handling for v1 by @rickyyx in #10382
- custom allreduce + torch.compile by @SageMoore in #10121
- [Misc] Remove outdated init protocols by @DarkLight1337 in #10655
- [ci] add vllm_test_utils by @youkaichao in #10659
- [V1] Enable profile for LLMEngine by @jikunshang in #10665
- [Bugfix] Fix for Spec model TP + Chunked Prefill by @andoorve in #10232
- [Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson by @conroy-cheers in #9735
- [Bugfix] Fix using `-O[0,3]` with LLM entrypoint by @mgoin in #10677
- [Bugfix] Check bnb_4bit_quant_storage for bitsandbytes by @mgoin in #10642
- [V1] Refactor model executable interface for multimodal models by @ywang96 in #10570
- [Kernel] Remove hard-dependencies of Speculative decode to CUDA workers by @xuechendi in #10587
- [V1] Update interface for idefics3 by @ywang96 in #10680
- [Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. by @jeongin601 in #10198
- [bugfix] fix the default value of llm_int8_threshold in BitsAndBytesConfig by @yansh97 in #10657
- [Hardware][Gaudi]add get_name method for HPUAttentionBackend by @jikunshang in #10667
- [Misc]Further reduce BNB static variable by @jeejeelee in #10597
- [Cleanup][Kernel] Remove if-else with identical branches in marlin 2:4 by @tlrmchlsmth in #10687
- [Model] Support telechat2 by @shunxing12345 in #10311
- [Bugfix][Hardware][CPU] Fix intel-omp version to avoid segfault by @bigPYJ1151 in #10700
- [V1] Update interface for mistral-format Pixtral by @ywang96 in #10703
- [ci] fix slow tests by @youkaichao in #10698
- [torch.compile] fix shape specialization by @youkaichao in #10722
- [Bugfix] Fix GGUF inference with FP16 unquantized checkpoint by @Isotr0py in #10675
- [Bugfix][Mamba] Fix Multistep on Mamba-like models by @mzusman in #10705
- [Bugfix] Ignore `lm_head` when loading embedding models by @DarkLight1337 in #10719
- [Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server by @tomeras91 in #10635
- [misc] upgrade filelock version by @youkaichao in #10731
- [Model] support bitsandbytes quantization with minicpm3 model by @zixuanzhang226 in #10682
- [Doc] Update model in arch_overview.rst to match comment by @spacewander in #10701
- [Bug][CLI] Allow users to disable prefix caching explicitly by @rickyyx in #10724
- [V1] Do not allocate beyond the max_model_len by @WoosukKwon in #10730
- [Kernel] Update vllm-flash-attn version by @WoosukKwon in #10736
- Update requirements-tpu by @richardsliu in #10726
- [Model] Added GLM-4 series hf format model support vllm==0.6.4 by @sixsixcoder in #10561
- [Kernel] Update vllm-flash-attn version by @WoosukKwon in #10742
- [V1] Optimize the CPU overheads in FlashAttention custom op by @WoosukKwon in #10733
- [Model] Add Internlm2 LoRA support by @Isotr0py in #5064
- [Model] Clean up MiniCPMV by @DarkLight1337 in #10751
- [Misc] typo find in sampling_metadata.py by @noooop in #10740
- [Bugfix] Fix Idefics3 bug by @jeejeelee in #10778
- [platform] Add verify_quantization in platform. by @wangxiyuan in #10757
- [Bugfix] Fix OpenVino/Neuron `driver_worker` init by @NickLucche in #10779
- [Model] Refactor Molmo weights loading to use AutoWeightsLoader by @Isotr0py in #10771
- Interleaving sliding window for Ministral-8B-Instruct-2410 by @patrickvonplaten in #10591
- [doc] format fix by @wangxiyuan in #10789
- [Model] Replace embedding models with pooling adapter by @DarkLight1337 in #10769
- [Misc] Improve type annotations for `support_torch_compile` by @DarkLight1337 in #10763
- [Misc] Rename embedding classes to pooling by @DarkLight1337 in #10801
- [doc] add warning about comparing hf and vllm outputs by @youkaichao in #10805
- [Misc] Adding `MMMU-Pro` vision dataset to serving benchmark by @ywang96 in #10804
- [Core] Implement disagg prefill by StatelessProcessGroup by @KuntaiDu in #10502
- [Model] Add BNB support to Llava and Pixtral-HF by @Isotr0py in #10795
- [core] Avoid metrics log noise when idle - include speculative decodi… by @cduk in #10809
- [Kernel] Use `out` in flash_attn_varlen_func by @WoosukKwon in #10811
- Fill TorchSDPAAttentionMetadata seq_lens_field for prefill by @maxdebayser in #10799
- [misc] remove xverse modeling file by @youkaichao in #10814
- [doc]Update config docstring by @wangxiyuan in #10732
- [Model]: add some tests for aria model by @xffxff in #10770
- [CI/Build] Update `mistral_common` version for tests and docs by @DarkLight1337 in #10825
- [misc] use out argument for flash attention by @youkaichao in #10822
- [Misc][LoRA] Move the implementation of lora bias to punica.py by @jeejeelee in #10829
- [Misc][XPU] Avoid torch compile for XPU platform by @yma11 in #10747
- Fix openvino on GPU by @janimo in #10793
- [Model] Add TP and BNB quantization support to LlavaMultiModalProjector by @Isotr0py in #10834
- [Bugfix] Prevent benchmark_throughput.py from using duplicated random prompts by @mgoin in #10753
- [Model] support bitsandbytes quantization with minicpm model by @zixuanzhang226 in #10842
- [Bugfix] Fix QKVParallelLinearWithShardedLora bias bug by @jeejeelee in #10844
- [core][distributed] add pynccl broadcast by @youkaichao in #10843
- [torch.compile] remove compilation_context and simplify code by @youkaichao in #10838
- [Doc] Add github links for source code references by @russellb in #10672
- [Misc] Remove deprecated names by @DarkLight1337 in #10817
- [Core][Performance] Add XGrammar support for guided decoding and set it as default by @aarnphm in #10785
- [Speculative Decoding] Move indices to device before filtering output by @zhengy001 in #10850
- [V1] VLM - Run the mm_mapper preprocessor in the frontend process by @alexm-neuralmagic in #10640
- [MISC][XPU] quick fix for XPU CI by @yma11 in #10859
- [Bugfix] Only require XGrammar on x86 by @mgoin in #10865
- [Bugfix][Frontend] correctly record prefill and decode time metrics by @tomeras91 in #10853
- [Build][Bugfix] Using the correct type hint by @gshtras in #10866
- [Benchmark] Benchmark structured output with datasets by @xuechendi in #10557
- [CI/Build] Replace mean with torch.all in test_pynccl.py by @tlrmchlsmth in #10876
- Drop ROCm load format check by @wangxiyuan in #10767
- [ci/build] Change queue name for Release jobs by @khluu in #10875
- [ci/build] Job to build and push release image by @khluu in #10877
- [bugfix] fixed parameter “n” not work when set parameter “bestof” > 1 by @o2363286 in #10854
- [ci/build] Update vLLM postmerge ECR repo by @khluu in #10887
- [LoRA] Change lora_tokenizers capacity by @xyang16 in #10796
- [Model] Consolidate ViTs attention implementation without mask by @Isotr0py in #10893
- Benchmark serving structured output by @xuechendi in #10880
- [CI/Build] improve python-only dev setup by @dtrifiro in #9621
- [V1] Fix when max_model_len is not divisible by block_size by @WoosukKwon in #10903
- [benchmark] Make H100 benchmark optional by @khluu in #10908
- [Bugfix] Fallback to outlines for complex json schemas by @mgoin in #10899
- [Doc] Create a new "Usage" section by @DarkLight1337 in #10827
- [Bugfix] Fix BNB loader target_modules by @jeejeelee in #10720
- [Misc] Update llama 3.2 template to support system prompt with images by @tjohnson31415 in #10901
- [Misc][LoRA] Clean up the function interface of Punica by @jeejeelee in #10917
- [CI/Build] Bump test transformers version by @Isotr0py in #10106
- [Misc][Gaudi] Avoid torch.compile and enable lazy collectives by default for HPU lazy backend by @kzawora-intel in #10897
- [ci][build] add tests for python only compilation by @youkaichao in #10915
- [torch.compile] use size tuning for specific sizes by @youkaichao in #10933
- [torch.compile] add logging for compilation time by @youkaichao in #10941
- [CI/Build] Fix broken multimodal test by @DarkLight1337 in #10950
- [torch.compile] fix deprecated code by @youkaichao in #10948
- [Core] Support Lark grammars for XGrammar by @mgoin in #10870
- [Doc] add KubeAI to serving integrations by @samos123 in #10837
- [misc] fix typo by @youkaichao in #10960
- [ci] fix broken tests by @youkaichao in #10956
- [Core] Cleanup startup logging a bit by @russellb in #10961
- [Bugfix] Fix test-pipeline.yaml by @jeejeelee in #10973
- [Model] Implement merged input processor for LLaVA model by @DarkLight1337 in #10676
- [Build] Fix for the Wswitch-bool clang warning by @gshtras in #10060
- [Misc][LoRA] Refactor and clean MergedQKVParallelLinearWithLora implementation by @Isotr0py in #10958
- [Model] Composite weight loading for multimodal Qwen2 by @DarkLight1337 in #10944
- [Doc] Explicitly state that InternVL 2.5 is supported by @DarkLight1337 in #10978
- [Model] Update multi-modal processor to support Mantis(LLaVA) model by @DarkLight1337 in #10711
- [Doc] Explicitly state that PP isn't compatible with speculative decoding yet by @DarkLight1337 in #10975
- [BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None by @xffxff in #10928
- [core][executor] simplify instance id by @youkaichao in #10976
- [core][misc] remove use_dummy driver for _run_workers by @youkaichao in #10920
- [torch.compile] allow candidate compile sizes by @youkaichao in #10984
- [V1] Initial support of multimodal models for V1 re-arch by @ywang96 in #10699
- [torch.compile][misc] fix comments by @youkaichao in #10993
- [misc] clean up and unify logging by @youkaichao in #10999
- [Doc][V1] Add V1 support column for multimodal models by @ywang96 in #10998
- [torch.compile] add dynamo time tracking by @youkaichao in #11005
- [V1] Fix Detokenizer loading in `AsyncLLM` by @ywang96 in #10997
- [Core] Require xgrammar >= 0.1.6 by @russellb in #11021
- [Platform] Move `async output` check to platform by @wangxiyuan in #10768
- [V1] Input Batch Relocation by @varun-sundar-rabindranath in #10962
- [ci/build] Recompile CI dependencies list with Python 3.12 by @khluu in #11013
- [V1] Further reduce CPU overheads in flash-attn by @WoosukKwon in #10989
- [Misc][LoRA] Abstract PunicaWrapper by @jeejeelee in #10955
- [Model] Implement merged input processor for Phi-3-Vision models by @Isotr0py in #10977
- [Bugfix][Hardware][Gaudi] Bump vllm_hpu_extension version by @kzawora-intel in #11028
- [v1] fix use compile sizes by @youkaichao in #11000
- [Neuron] Upgrade neuron to 2.20.2 by @xendo in #11016
- [ROCm][bugfix] Setting the value for the speculative decoding worker class on rocm platform by @gshtras in #11035
- Build tpu image in release pipeline by @richardsliu in #10936
- [V1] Do not store `None` in self.generators by @WoosukKwon in #11038
- [Docs] Add dedicated tool calling page to docs by @mgoin in #10554
- [Model] Add has_weight to RMSNorm and re-enable weights loading tracker for Mamba by @Isotr0py in #10739
- [Bugfix] Fix usage of `deprecated` decorator by @DarkLight1337 in #11025
- [Frontend] Use request id from header by @joerunde in #10968
- [Pixtral] Improve loading by @patrickvonplaten in #11040
- [V1] Multiprocessing Tensor Parallel Support for v1 by @tlrmchlsmth in #9856
- monitor metrics of tokens per step using cudagraph batchsizes by @youkaichao in #11031
- [Bugfix] Fix xgrammar failing to read a vocab_size from LlavaConfig on PixtralHF. by @sjuxax in #11043
- Update README.md by @dmoliveira in #11034
- [Bugfix] cuda error running llama 3.2 by @GeneDer in #11047
- Add example of helm chart for vllm deployment on k8s by @mfournioux in #9199
- [Bugfix] Handle <|tool_call|> token in granite tool parser by @tjohnson31415 in #11039
- [Misc][LoRA] Add PEFTHelper for LoRA by @jeejeelee in #11003
- [Bugfix] Backport request id validation to v0 by @joerunde in #11036
- [BUG] Remove token param #10921 by @flaviabeo in #11022
- [Core] Update to outlines >= 0.1.8 by @russellb in #10576
- [torch.compile] add a flag to track batchsize statistics by @youkaichao in #11059
- [V1][Bugfix] Always set enable_chunked_prefill = True for V1 by @WoosukKwon in #11061
- [Bugfix] Fix Mamba multistep by @tlrmchlsmth in #11071
- [Misc] LoRA + Chunked Prefill by @aurickq in #9057
- [Model] PP support for Mamba-like models by @mzusman in #10992
- Fix streaming for granite tool call when <|tool_call|> is present by @maxdebayser in #11069
- [CI/Build] Check transformers v4.47 by @DarkLight1337 in #10991
- [ci/build] Fix AMD CI dependencies by @khluu in #11087
- [ci/build] Fix entrypoints test and pin outlines version by @khluu in #11088
- [Core] v1: Use atexit to handle engine core client shutdown by @russellb in #11076
- [Bugfix] Fix Idefics3 fails during multi-image inference by @B-201 in #11080
- [Bugfix]: Clamp `-inf` logprob values in prompt_logprobs by @rafvasq in #11073
- [Misc] Split up pooling tasks by @DarkLight1337 in #10820
- [Doc] Update docs to refer to pooling models by @DarkLight1337 in #11093
- [CI/Build] Enable prefix caching test for AMD by @hissu-hyvarinen in #11098
- [Doc] Installed version of llmcompressor for int8/fp8 quantization by @bingps in #11103
- [torch.compile] use depyf to dump torch.compile internals by @youkaichao in #10972
- [V1] Use input_ids as input for text-only models by @WoosukKwon in #11032
- [torch.compile] remove graph logging in ci by @youkaichao in #11110
- [core] Bump ray to use _overlap_gpu_communication in compiled graph tests by @ruisearch42 in #10410
- [CI/Build] Split up VLM tests by @DarkLight1337 in #11083
- [V1][Core] Remove should_shutdown to simplify core process termination by @tlrmchlsmth in #11113
- [V1] VLM preprocessor hashing by @alexm-neuralmagic in #11020
- [Bugfix] Multiple fixes to tool streaming with hermes and mistral by @cedonley in #10979
- [Docs] Add media kit by @simon-mo in #11121
- Update link to LlamaStack remote vLLM guide in serving_with_llamastack.rst by @terrytangyuan in #11112
- [Core] cleanup zmq ipc sockets on exit by @russellb in #11115
- [Model] Add support for embedding model GritLM by @pooyadavoodi in #10816
- [V1] Use more persistent buffers to optimize input preparation overheads by @WoosukKwon in #11111
- [Hardware][Intel-Gaudi] Enable LoRA support for Intel Gaudi (HPU) by @SanjuCSudhakaran in #10565
- [core][distributed] initialization from StatelessProcessGroup by @youkaichao in #10986
- [Misc][LoRA] Ensure Lora Adapter requests return adapter name by @Jeffwan in #11094
- [V1] Fix torch profiling for offline inference by @ywang96 in #11125
- fix(docs): typo in helm install instructions by @ramonziai in #11141
- [Bugfix] Quick fix to make Pixtral-HF load correctly again after 39e227c. by @sjuxax in #11024
- [Misc] Validate grammar and fail early by @comaniac in #11119
- Fix logging of the vLLM Config by @JArnoldAMD in #11143
- [Bugfix] Fix value unpack error of simple connector for KVCache transfer. by @ShangmingCai in #11058
- [Misc][V1] Fix type in v1 prefix caching by @comaniac in #11151
- [torch.compile] Dynamic fp8 + rms_norm fusion by @ProExpertProg in #10906
- [Bugfix] Use runner_type instead of task in GritLM by @pooyadavoodi in #11144
- [Bugfix] Update starcoder2 to remap k/v scale names for kv_cache quantization by @dsikka in #11148
- [ROCm][AMD] Disable auto enabling chunked prefill on ROCm by @gshtras in #11146
- [Bugfix][V1] Fix 'NoneType' object has no attribute 'hash_value' by @comaniac in #11157
- [core] clean up cudagraph batchsize padding logic by @youkaichao in #10996
- PaliGemma 2 support by @janimo in #11142
- [Bugfix][CI][CPU] add missing datasets package to requirements-cpu.txt by @bigPYJ1151 in #11159
- [Frontend] Separate pooling APIs in offline inference by @DarkLight1337 in #11129
- [V1][VLM] Fix edge case bug for InternVL2 by @ywang96 in #11165
- [Refactor]A simple device-related refactor by @noemotiovon in #11163
- [Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching by @llsj14 in #8240
- [Bugfix] using len(tokenizer) instead of tokenizer.vocab_size in AllowedTokenIdsLogitsProcessor by @zhangjf-nlp in #11156
- [Misc] Add tokenizer_mode param to benchmark_serving.py by @alexm-neuralmagic in #11174
- [Doc] Reorganize online pooling APIs by @DarkLight1337 in #11172
- [Bugfix][Hardware][CPU] Enable Gemma2 with SDPA on CPU backend by @janimo in #11169
- [Distributed] Allow the placement group more time to wait for resources to be ready by @Jeffwan in #11138
- [Core] V1: Use multiprocessing by default by @russellb in #11074
- [V1][Bugfix] Fix EngineCoreProc profile by @tlrmchlsmth in #11185
- [Bugfix][V1] Re-compute an entire block when fully cache hit by @comaniac in #11186
- update compressed-tensors to latest version by @dhuangnm in #11183
- [Core] Update outlines and increase its threadpool size by @russellb in #11140
- [V1][Bugfix] Fix V1 TP trust-remote-code by @tlrmchlsmth in #11182
- [Misc] Minor improvements to the readability of PunicaWrapperBase by @jeejeelee in #11200
- [Frontend] Add `logits_processors` as an extra completion argument by @bradhilton in #11150
- [VLM] Fully dynamic prompt replacement in merged input processor by @DarkLight1337 in #11199
- Enable mypy checking on V1 code by @markmc in #11105
- [Performance][Core] Optimize the performance of evictor v1 and v2 by applying a priority queue and lazy deletion by @llsj14 in #7209
- [Misc] Upgrade bitsandbytes to the latest version 0.45.0 by @jeejeelee in #11201
- [torch.compile] allow tracking forward time by @youkaichao in #11081
- [Misc] Clean up multi-modal processor by @DarkLight1337 in #11207
- [Bugfix] Fix error handling of unsupported sliding window by @DarkLight1337 in #11213
- [Doc] add documentation for disaggregated prefilling by @KuntaiDu in #11197
- [Core] Support disaggregated prefill with Mooncake Transfer Engine by @ShangmingCai in #10884
- [V1][Minor] Cache np arange to reduce input preparation overhead by @WoosukKwon in #11214
- Update deploying_with_k8s.rst by @AlexHe99 in #10922
- fix block-size description by @chenqianfzh in #10938
- [Bugfix] Fix the default value for temperature in ChatCompletionRequest by @yansh97 in #11219
- [CI/Build] simplify Dockerfile build for ARM64 / GH200 by @cennn in #11212
- [Model] Support Cohere2ForCausalLM (Cohere R7B) by @janimo in #11203
- [Model] Refactor Ultravox to use merged input processor by @Isotr0py in #11198
- [Doc] Reorder vision language examples in alphabet order by @Isotr0py in #11228
- [misc] Layerwise profile updates by @varun-sundar-rabindranath in #10242
- [core] overhaul memory profiling and fix backward compatibility by @youkaichao in #10511
- [Docs] hint to enable use of GPU performance counters in profiling tools for multi-node distributed serving by @bk-TurbaAI in #11235
- [ci][tests] add gh200 tests by @youkaichao in #11244
- [torch.compile] fast inductor by @youkaichao in #11108
- fix gh200 tests on main by @youkaichao in #11246
- [CI] Add test case with JSON schema using references + use xgrammar by default with OpenAI parse by @mgoin in #10935
- [Frontend] Add OpenAI API support for input_audio by @kylehh in #11027
- [V1][VLM] Proper memory profiling for image language models by @ywang96 in #11210
- [Platform] platform agnostic for EngineArgs initialization by @wangxiyuan in #11225
- [V1][Core] Use weakref.finalize instead of atexit by @tlrmchlsmth in #11242
- [Misc] Kernel Benchmark for `RMSNorm` by @ywang96 in #11241
- [Misc] Allow passing logits_soft_cap for xformers backend by @Isotr0py in #11252
- [Bugfix] Fix request cancellation without polling by @joerunde in #11190
New Contributors
- @wchen61 made their first contribution in #10347
- @kakao-steve-ai made their first contribution in #10287
- @Maybewuss made their first contribution in #10415
- @ismael-dm made their first contribution in #9943
- @andrew made their first contribution in #10426
- @angusYuhao made their first contribution in #9014
- @xiyuan-lee made their first contribution in #10398
- @mikejuliet13 made their first contribution in #10421
- @BBuf made their first contribution in #10494
- @zixuanzhang226 made their first contribution in #10549
- @shenoyvvarun made their first contribution in #10567
- @CNTRYROA made their first contribution in #10572
- @npanpaliya made their first contribution in #10538
- @xffxff made their first contribution in #10514
- @2015aroras made their first contribution in #10503
- @sanketkaleoss made their first contribution in #9228
- @conroy-cheers made their first contribution in #9735
- @jeongin601 made their first contribution in #10198
- @shunxing12345 made their first contribution in #10311
- @spacewander made their first contribution in #10701
- @wangxiyuan made their first contribution in #10757
- @cduk made their first contribution in #10809
- @o2363286 made their first contribution in #10854
- @sjuxax made their first contribution in #11043
- @dmoliveira made their first contribution in #11034
- @mfournioux made their first contribution in #9199
- @bingps made their first contribution in #11103
- @cedonley made their first contribution in #10979
- @SanjuCSudhakaran made their first contribution in #10565
- @ramonziai made their first contribution in #11141
- @noemotiovon made their first contribution in #11163
- @zhangjf-nlp made their first contribution in #11156
- @dhuangnm made their first contribution in #11183
- @bradhilton made their first contribution in #11150
- @AlexHe99 made their first contribution in #10922
- @cennn made their first contribution in #11212
- @bk-TurbaAI made their first contribution in #11235
- @kylehh made their first contribution in #11027
Full Changelog: v0.6.4...v0.6.5