Highlights
- Significant progress on the V1 engine refactor and multimodal support: New model executable interfaces for text-only and multimodal models, multiprocessing, improved configuration handling, and profiling enhancements (#10374, #10570, #11074, #11076, #10382, #10665, #10564, #11125, #11185, #11242).
- Major improvements in `torch.compile` integration: support for all attention backends, encoder-based models, dynamic FP8 fusion, shape specialization fixes, and performance optimizations (#10558, #10613, #10121, #10383, #10399, #10406, #10437, #10460, #10552, #10622, #10722, #10620, #10906, #11108, #11059, #11005, #10838, #11081, #11110).
- Expanded model support, including Aria, Cross Encoders, GLM-4, OLMo November 2024, Telechat2, LoRA improvements, and multimodal Granite models (#10514, #10400, #10561, #10503, #10311, #10291, #9057, #10418, #5064).
- xgrammar is now the default guided decoding backend (#10785)
- Improved hardware enablement for AMD ROCm, ARM AARCH64, TPU prefix caching, XPU AWQ/GPTQ, and various CPU/Gaudi/HPU/NVIDIA enhancements (#10254, #9228, #10307, #10107, #10667, #10565, #10239, #11016, #9735, #10355, #10700).
- Note: the default temperature for ChatCompletionRequest changed from 0.7 to 1.0 to align with OpenAI (#11219); a sketch of pinning this and the new guided decoding default explicitly follows below.
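Two of the highlights above change user-visible defaults. A minimal sketch of pinning both through the OpenAI-compatible server, assuming a server already running on localhost:8000; the served model name is a placeholder, and the `guided_decoding_backend` / `guided_json` request fields are passed via `extra_body` as illustrative vLLM-specific extras rather than verbatim from this release:

```python
# Sketch only: assumes a vLLM OpenAI-compatible server is already running
# (e.g. `vllm serve <model>` on localhost:8000); model name is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="<served-model-name>",  # placeholder
    messages=[{"role": "user", "content": 'Reply with a JSON object like {"ok": true}'}],
    # The server-side default temperature is now 1.0 (was 0.7, #11219);
    # pass it explicitly if you relied on the old default.
    temperature=0.7,
    extra_body={
        # xgrammar is now the default guided decoding backend (#10785);
        # override per request if you need the previous outlines behavior.
        "guided_decoding_backend": "outlines",
        "guided_json": {"type": "object", "properties": {"ok": {"type": "boolean"}}},
    },
)
print(resp.choices[0].message.content)
```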
Model Support
- Added Aria (#10514), Cross Encoder (#10400), GLM-4 (#10561), OLMo (#10503), Telechat2 (#10311), Cohere R7B (#11203), GritLM embeddings (#10816)
- LoRA support for Internlm2, glm-4v, Pixtral-HF (#5064, #10418, #10795).
- Improved bitsandbytes (BNB) quantization support for multiple models (#10795, #10842, #10682, #10549)
- Expanded multimodal support (#10291, #11142).
Hardware Support
- AMD ROCm GGUF quantization (#10254), ARM AARCH64 enablement (#9228), TPU prefix caching (#10307), XPU AWQ/GPTQ (#10107), CPU/Gaudi/HPU enhancements (#10355, #10667, #10565, #10239, #11016, #9735, #10541, #10394, #10700).
Performance & Scheduling
- Prefix-cache aware scheduling (#10128), sliding window support (#10462), disaggregated prefill enhancements (#10502, #10884), evictor optimization (#7209).
Benchmark & Frontend
- Benchmark structured outputs and vision datasets (#10804, #10557, #10880, #10547).
- Frontend: Automatic chat format detection (#9919), input_audio support (#11027; see the sketch below), CLI --version (#10369), extra fields in requests (#10463).
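For the input_audio addition (#11027), a hedged sketch of a chat request using OpenAI-style audio content parts; the endpoint, model name, and audio file are placeholders, and the exact field shape is assumed to follow the OpenAI `input_audio` format the PR title refers to:

```python
# Sketch only: assumes the chat endpoint accepts OpenAI-style `input_audio`
# content parts (#11027); model name and audio file are placeholders.
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = client.chat.completions.create(
    model="<audio-capable-model>",  # e.g. an audio-capable model served by vLLM
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this clip."},
            {
                "type": "input_audio",
                "input_audio": {"data": audio_b64, "format": "wav"},
            },
        ],
    }],
)
print(resp.choices[0].message.content)
```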
Documentation & Plugins
- Architecture overview (#10368), Helm chart (#9199), KubeAI integration (#10837), plugin system docs (#10372; a plugin sketch follows below), disaggregated prefilling (#11197), structured outputs (#9943), usage section (#10827).
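As context for the plugin system docs (#10372), a minimal sketch of what an out-of-tree general plugin can look like; the package name, module layout, entry-point group, and registration call are assumptions drawn from the plugin documentation, not verbatim from this release:

```python
# my_vllm_plugin/__init__.py -- sketch of an out-of-tree plugin, assuming the
# `vllm.general_plugins` entry-point group and ModelRegistry registration API
# described in the plugin system docs.
#
# pyproject.toml excerpt (assumed entry-point wiring):
#   [project.entry-points."vllm.general_plugins"]
#   my_vllm_plugin = "my_vllm_plugin:register"


def register():
    """Called by vLLM at startup when the plugin is installed."""
    from vllm import ModelRegistry

    if "MyLlamaForCausalLM" not in ModelRegistry.get_supported_archs():
        # Hypothetical out-of-tree model class, registered under its HF
        # architecture name so vLLM can resolve it from the model config.
        from my_vllm_plugin.my_llama import MyLlamaForCausalLM

        ModelRegistry.register_model("MyLlamaForCausalLM", MyLlamaForCausalLM)
```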
Bugfixes & Misc
- Numerous bug fixes and miscellaneous improvements across the engine, frontend, and kernels; see the complete list under "What's Changed" below.
What's Changed
- Add default value to avoid Falcon crash (#5363) by @wchen61 in #10347
- [Misc] Fix import error in tensorizer tests and cleanup some code by @DarkLight1337 in #10349
- [Doc] Remove float32 choice from --lora-dtype by @xyang16 in #10348
- [Bugfix] Fix fully sharded LoRA bug by @jeejeelee in #10352
- [Misc] Fix some help info of arg_utils to improve readability by @ShangmingCai in #10362
- [core][misc] keep compatibility for old-style classes by @youkaichao in #10356
- [Bugfix] Ensure special tokens are properly filtered out for guided structured output with MistralTokenizer by @gcalmettes in #10363
- [Misc] Bump up test_fused_moe tolerance by @ElizaWszola in #10364
- [Misc] bump mistral common version by @simon-mo in #10367
- [Docs] Add Nebius as sponsors by @simon-mo in #10371
- [Frontend] Add --version flag to CLI by @russellb in #10369
- [Doc] Move PR template content to docs by @russellb in #10159
- [Docs] Misc updates to TPU installation instructions by @mikegre-google in #10165
- [Frontend] Automatic detection of chat content format from AST by @DarkLight1337 in #9919
- [doc] add doc for the plugin system by @youkaichao in #10372
- [misc][plugin] improve log messages by @youkaichao in #10386
- [BugFix] [Kernel] Fix GPU SEGV occurring in fused_moe kernel by @rasmith in #10385
- [Misc] Update benchmark to support image_url file or http by @kakao-steve-ai in #10287
- [Misc] Medusa supports custom bias by @skylee-01 in #10361
- [Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled by @imkero in #10388
- [V1] Add code owners for V1 by @WoosukKwon in #10397
- [2/N][torch.compile] make compilation cfg part of vllm cfg by @youkaichao in #10383
- [V1] Refactor model executable interface for all text-only language models by @ywang96 in #10374
- [CI/Build] Fix IDC hpu [Device not found] issue by @xuechendi in #10384
- [Bugfix][Hardware][CPU] Fix CPU embedding runner with tensor parallel by @Isotr0py in #10394
- [platforms] refactor cpu code by @youkaichao in #10402
- [Hardware] [HPU] add `mark_step` for hpu by @jikunshang in #10239
- [Bugfix] Fix mrope_position_delta in non-last prefill chunk by @imkero in #10403
- [Misc] Enhance offline_inference to support user-configurable paramet… by @wchen61 in #10392
- [Misc] Add uninitialized params tracking for `AutoWeightsLoader` by @Isotr0py in #10327
- [Bugfix] Ignore ray reinit error when current platform is ROCm or XPU by @HollowMan6 in #10375
- [4/N][torch.compile] clean up set_torch_compile_backend by @youkaichao in #10401
- [VLM] Report multi_modal_placeholders in output by @lk-chen in #10407
- [Model] Remove redundant softmax when using PoolingType.STEP by @Maybewuss in #10415
- [Model][LoRA]LoRA support added for glm-4v by @B-201 in #10418
- [Model] Remove transformers attention porting in VITs by @Isotr0py in #10414
- [Doc] Update doc for LoRA support in GLM-4V by @B-201 in #10425
- [5/N][torch.compile] torch.jit.script --> torch.compile by @youkaichao in #10406
- [Doc] Add documentation for Structured Outputs by @ismael-dm in #9943
- Fix open_collective value in FUNDING.yml by @andrew in #10426
- [Model][Bugfix] Support TP for PixtralHF ViT by @mgoin in #10405
- [Hardware][XPU] AWQ/GPTQ support for xpu backend by @yma11 in #10107
- [Kernel] Explicitly specify other value in tl.load calls by @angusYuhao in #9014
- [Kernel] Initial Machete W4A8 support + Refactors by @LucasWilkinson in #9855
- [3/N][torch.compile] consolidate custom op logging by @youkaichao in #10399
- [ci][bugfix] fix kernel tests by @youkaichao in #10431
- [misc] Allow partial prefix benchmarking & random input generation for prefix benchmarking by @rickyyx in #9929
- [ci/build] Have dependabot ignore all patch update by @khluu in #10436
- [Bugfix]Fix Phi-3 BNB online quantization by @jeejeelee in #10417
- [Platform][Refactor] Extract func `get_default_attn_backend` to `Platform` by @MengqingCao in #10358
- Add openai.beta.chat.completions.parse example to structured_outputs.rst by @mgoin in #10433
- [Bugfix] Guard for negative counter metrics to prevent crash by @tjohnson31415 in #10430
- [Misc] Avoid misleading warning messages by @jeejeelee in #10438
- [Doc] Add the start of an arch overview page by @russellb in #10368
- [misc][plugin] improve plugin loading by @youkaichao in #10443
- [CI][CPU] adding numa node number as container name suffix by @zhouyuan in #10441
- [BugFix] Fix hermes tool parser output error stream arguments in some cases (#10395) by @xiyuan-lee in #10398
- [Pixtral-Large] Pixtral actually has no bias in vision-lang adapter by @patrickvonplaten in #10449
- Fix: Build error seen on Power Architecture by @mikejuliet13 in #10421
- [Doc] fix link for page that was renamed by @russellb in #10455
- [6/N] torch.compile rollout to users by @youkaichao in #10437
- [Core] Avoid metrics log noise when idle by @russellb in #8868
- [Model][Quantization] HQQ support through Marlin kernel expansion by @ElizaWszola in #9766
- Change granite chat template to keep json list formatting for tool calls by @maxdebayser in #10452
- [CI/Build] Update Dockerfile.rocm by @Alexei-V-Ivanov-AMD in #10434
- [Bugfix] Marlin 2:4 temp fix for large M dim (>256) by @LucasWilkinson in #10464
- [Misc] Add setitem for LazyDict by @liuyanyi in #10469
- [Bugfix] Fix Mamba model initialization and MLP Speculator weights loading by @Isotr0py in #10456
- [Bugfix] Enforce no chunked prefill for embedding models by @DarkLight1337 in #10470
- [CI/Build] Add sphinx/rst linter for docs by @rafvasq in #10366
- [CI/Build] Support compilation with local cutlass path (#10423) by @wchen61 in #10424
- [ci/build] Combine nightly and optional by @khluu in #10465
- [model] Reduce medusa weight by @skylee-01 in #10454
- [Bugfix] Handle conflicts between modern and legacy fields by @DarkLight1337 in #10471
- [Platforms] Refactor xpu code by @MengqingCao in #10468
- [Hardware][CPU] Support chunked-prefill and prefix-caching on CPU by @bigPYJ1151 in #10355
- [platforms] restore xpu check for parallel config by @youkaichao in #10479
- [perf bench] H200 development by @simon-mo in #9768
- [7/N] torch.compile, reduce compilation time by @youkaichao in #10460
- [Bugfix]: allow extra fields in requests to openai compatible server by @gcalmettes in #10463
- [TPU] Implement prefix caching for TPUs by @WoosukKwon in #10307
- [torch.compile] limit inductor threads and lazy import quant by @youkaichao in #10482
- [Core] Add Sliding Window Support with Flashinfer by @pavanimajety in #10462
- [Platforms] Add `device_type` in `Platform` by @MengqingCao in #10508
- [torch.compile] PostGradPassManager, Inductor code caching fix, fix_functionalization pass refactor + tests by @ProExpertProg in #10273
- [Misc] Increase default video fetch timeout by @DarkLight1337 in #10495
- [platforms] improve error message for unspecified platforms by @youkaichao in #10520
- [Doc] fix a small typo in docstring of llama_tool_parser by @FerdinandZhong in #10513
- [Model] Add Support for Multimodal Granite Models by @alex-jw-brooks in #10291
- fix the issue that len(tokenizer(prompt)["input_ids"]) > prompt_len by @sywangyi in #10524
- [Model] Expose `dynamic_image_size` as mm_processor_kwargs for InternVL2 models by @Isotr0py in #10518
- [Bugfix] Embedding model pooling_type equals ALL and multi input's bug by @BBuf in #10494
- [Bug]: When apply continue_final_message for OpenAI server, the "echo":false is ignored by @chaunceyjiang in #10180
- [Kernel] Register punica ops directly by @jeejeelee in #10522
- [Misc] Suppress duplicated logging regarding multimodal input pipeline by @ywang96 in #10530
- [Bugfix] Allow token ID-only inputs in Qwen2-Audio by @DarkLight1337 in #10536
- [8/N] enable cli flag without a space by @youkaichao in #10529
- [V1] Fix Compilation config & Enable CUDA graph by default by @WoosukKwon in #10528
- [CI][Installation] Avoid uploading CUDA 11.8 wheel by @cermeng in #10535
- [misc] improve error message by @youkaichao in #10553
- [Minor] Revert change in offline inference example by @WoosukKwon in #10545
- Add small example to metrics.rst by @mgoin in #10550
- [Benchmark] Add new H100 machine by @simon-mo in #10547
- [9/N] torch.compile LLM usage by @youkaichao in #10552
- [Minor] Fix line-too-long by @WoosukKwon in #10563
- [platforms] absorb worker cls difference into platforms folder by @youkaichao in #10555
- [Bugfix] Fix Phi-3 BNB quantization with tensor parallel by @Isotr0py in #9948
- Remove token-adding chat embedding params by @noamgat in #10551
- [bugfix] fix full graph tests by @youkaichao in #10581
- [torch.compile] support all attention backends by @youkaichao in #10558
- [v1] Refactor KVCacheManager for more hash input than token ids by @rickyyx in #10507
- support bitsandbytes quantization with qwen model by @zixuanzhang226 in #10549
- [Core] remove temporary local variables in LLMEngine.init by @russellb in #10577
- [V1] EngineCore supports profiling by @Abatom in #10564
- [bugfix] fix cpu tests by @youkaichao in #10585
- [Bugfix][Frontend] Update Llama Chat Templates to also support Non-Tool use by @tjohnson31415 in #10164
- [Core] Fix broken log configuration by @russellb in #10458
- [Misc] Add pynccl wrappers for all_gather and reduce_scatter by @tlrmchlsmth in #9432
- [core] gemma2 full context length support by @youkaichao in #10584
- [Bugfix] 500 Internal Server Error when tool_choice is incorrect. by @shenoyvvarun in #10567
- [Model] Fix Baichuan BNB online quantization by @CNTRYROA in #10572
- Update default max_num_batch_tokens for chunked prefill to 2048 by @mgoin in #10544
- [Kernel][Hardware][AMD] Add support for GGUF quantization on ROCm by @kliuae in #10254
- Prefix Cache Aware Scheduling [1/n] by @rickyyx in #10128
- [2/N] Proper handling of placeholders in merged multi-modal processor by @DarkLight1337 in #10485
- [Bugfix][Hardware][CPU] Fix `multi_modal_kwargs` broadcast for CPU tensor parallel by @Isotr0py in #10541
- [Platforms] Refactor openvino code by @statelesshz in #10573
- [CI/Build] For ppc64le, disabled tests for now and addressed space issues by @npanpaliya in #10538
- [Bugfix] Avoid import AttentionMetadata explicitly in Mllama and fix openvino import by @Isotr0py in #10593
- [bugfix] Fix example/tensorize_vllm_model tests by @jeejeelee in #10595
- [Bugfix] Fix the LoRA weight sharding in ColumnParallelLinearWithLoRA by @jeejeelee in #10450
- [CI/Build] Print running script to enhance CI log readability by @jeejeelee in #10594
- Revert "[CI/Build] Print running script to enhance CI log readability" by @youkaichao in #10601
- [model][utils] add extract_layer_index utility function by @youkaichao in #10599
- [doc] update the code to add models by @youkaichao in #10603
- [Doc] Update README.md with Ray Summit talk links by @zhuohan123 in #10610
- Support Cross encoder models by @maxdebayser in #10400
- [Refactor][MISC] del redundant code in ParallelConfig.postinit by @MengqingCao in #10614
- [torch.compile] support encoder based models by @youkaichao in #10613
- [Doc] Add encoder-based models to Supported Models page by @DarkLight1337 in #10616
- [torch.compile] force inductor threads by @jeejeelee in #10620
- [torch.compile] add warning for unsupported models by @youkaichao in #10622
- [misc] add torch.compile compatibility check by @youkaichao in #10618
- [misc] move functions to config.py by @youkaichao in #10624
- [Model] Support `is_causal` HF config field for Qwen2 model by @DarkLight1337 in #10621
- [Doc] Super tiny little typo fix by @fzyzcjy in #10633
- [Bug]: Authorization ignored when root_path is set by @chaunceyjiang in #10606
- [Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices by @wallashss in #9850
- [Docs] Add Snowflake Slides by @simon-mo in #10641
- [Model]: Add support for Aria model by @xffxff in #10514
- [Model] Enable optional prefix when loading embedding models by @DarkLight1337 in #10639
- [Doc] Fix typos in docs by @DarkLight1337 in #10636
- [Model] Add OLMo November 2024 model by @2015aroras in #10503
- [misc] do not read HOST_IP by @youkaichao in #10644
- [bugfix] fix aria model and add torch.compile by @youkaichao in #10645
- [Feature] vLLM ARM Enablement for AARCH64 CPUs by @sanketkaleoss in #9228
- [v1] EngineArgs for better config handling for v1 by @rickyyx in #10382
- custom allreduce + torch.compile by @SageMoore in #10121
- [Misc] Remove outdated init protocols by @DarkLight1337 in #10655
- [ci] add vllm_test_utils by @youkaichao in #10659
- [V1] Enable profile for LLMEngine by @jikunshang in #10665
- [Bugfix] Fix for Spec model TP + Chunked Prefill by @andoorve in #10232
- [Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson by @conroy-cheers in #9735
- [Bugfix] Fix using `-O[0,3]` with LLM entrypoint by @mgoin in #10677
- [Bugfix] Check bnb_4bit_quant_storage for bitsandbytes by @mgoin in #10642
- [V1] Refactor model executable interface for multimodal models by @ywang96 in #10570
- [Kernel] Remove hard-dependencies of Speculative decode to CUDA workers by @xuechendi in #10587
- [V1] Update interface for idefics3 by @ywang96 in #10680
- [Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. by @jeongin601 in #10198
- [bugfix] fix the default value of llm_int8_threshold in BitsAndBytesConfig by @yansh97 in #10657
- [Hardware][Gaudi]add get_name method for HPUAttentionBackend by @jikunshang in #10667
- [Misc]Further reduce BNB static variable by @jeejeelee in #10597
- [Cleanup][Kernel] Remove if-else with identical branches in marlin 2:4 by @tlrmchlsmth in #10687
- [Model] Support telechat2 by @shunxing12345 in #10311
- [Bugfix][Hardware][CPU] Fix intel-omp version to avoid segfault by @bigPYJ1151 in #10700
- [V1] Update interface for mistral-format Pixtral by @ywang96 in #10703
- [ci] fix slow tests by @youkaichao in #10698
- [torch.compile] fix shape specialization by @youkaichao in #10722
- [Bugfix] Fix GGUF inference with FP16 unquantized checkpoint by @Isotr0py in #10675
- [Bugfix][Mamba] Fix Multistep on Mamba-like models by @mzusman in #10705
- [Bugfix] Ignore `lm_head` when loading embedding models by @DarkLight1337 in #10719
- [Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server by @tomeras91 in #10635
- [misc] upgrade filelock version by @youkaichao in #10731
- [Model] support bitsandbytes quantization with minicpm3 model by @zixuanzhang226 in #10682
- [Doc] Update model in arch_overview.rst to match comment by @spacewander in #10701
- [Bug][CLI] Allow users to disable prefix caching explicitly by @rickyyx in #10724
- [V1] Do not allocate beyond the max_model_len by @WoosukKwon in #10730
- [Kernel] Update vllm-flash-attn version by @WoosukKwon in #10736
- Update requirements-tpu by @richardsliu in #10726
- [Model] Added GLM-4 series hf format model support vllm==0.6.4 by @sixsixcoder in #10561
- [Kernel] Update vllm-flash-attn version by @WoosukKwon in #10742
- [V1] Optimize the CPU overheads in FlashAttention custom op by @WoosukKwon in #10733
- [Model] Add Internlm2 LoRA support by @Isotr0py in #5064
- [Model] Clean up MiniCPMV by @DarkLight1337 in #10751
- [Misc] typo find in sampling_metadata.py by @noooop in #10740
- [Bugfix] Fix Idefics3 bug by @jeejeelee in #10778
- [platform] Add verify_quantization in platform. by @wangxiyuan in #10757
- [Bugfix] Fix OpenVino/Neuron `driver_worker` init by @NickLucche in #10779
- [Model] Refactor Molmo weights loading to use AutoWeightsLoader by @Isotr0py in #10771
- Interleaving sliding window for Ministral-8B-Instruct-2410 by @patrickvonplaten in #10591
- [doc] format fix by @wangxiyuan in #10789
- [Model] Replace embedding models with pooling adapter by @DarkLight1337 in #10769
- [Misc] Improve type annotations for `support_torch_compile` by @DarkLight1337 in #10763
- [Misc] Rename embedding classes to pooling by @DarkLight1337 in #10801
- [doc] add warning about comparing hf and vllm outputs by @youkaichao in #10805
- [Misc] Adding `MMMU-Pro` vision dataset to serving benchmark by @ywang96 in #10804
- [Core] Implement disagg prefill by StatelessProcessGroup by @KuntaiDu in #10502
- [Model] Add BNB support to Llava and Pixtral-HF by @Isotr0py in #10795
- [core] Avoid metrics log noise when idle - include speculative decodi… by @cduk in #10809
- [Kernel] Use `out` in flash_attn_varlen_func by @WoosukKwon in #10811
- Fill TorchSDPAAttentionMetadata seq_lens_field for prefill by @maxdebayser in #10799
- [misc] remove xverse modeling file by @youkaichao in #10814
- [doc]Update config docstring by @wangxiyuan in #10732
- [Model]: add some tests for aria model by @xffxff in #10770
- [CI/Build] Update `mistral_common` version for tests and docs by @DarkLight1337 in #10825
- [misc] use out argument for flash attention by @youkaichao in #10822
- [Misc][LoRA] Move the implementation of lora bias to punica.py by @jeejeelee in #10829
- [Misc][XPU] Avoid torch compile for XPU platform by @yma11 in #10747
- Fix openvino on GPU by @janimo in #10793
- [Model] Add TP and BNB quantization support to LlavaMultiModalProjector by @Isotr0py in #10834
- [Bugfix] Prevent benchmark_throughput.py from using duplicated random prompts by @mgoin in #10753
- [Model] support bitsandbytes quantization with minicpm model by @zixuanzhang226 in #10842
- [Bugfix] Fix QKVParallelLinearWithShardedLora bias bug by @jeejeelee in #10844
- [core][distributed] add pynccl broadcast by @youkaichao in #10843
- [torch.compile] remove compilation_context and simplify code by @youkaichao in #10838
- [Doc] Add github links for source code references by @russellb in #10672
- [Misc] Remove deprecated names by @DarkLight1337 in #10817
- [Core][Performance] Add XGrammar support for guided decoding and set it as default by @aarnphm in #10785
- [Speculative Decoding] Move indices to device before filtering output by @zhengy001 in #10850
- [V1] VLM - Run the mm_mapper preprocessor in the frontend process by @alexm-neuralmagic in #10640
- [MISC][XPU] quick fix for XPU CI by @yma11 in #10859
- [Bugfix] Only require XGrammar on x86 by @mgoin in #10865
- [Bugfix][Frontend] correctly record prefill and decode time metrics by @tomeras91 in #10853
- [Build][Bugfix] Using the correct type hint by @gshtras in #10866
- [Benchmark] Benchmark structured output with datasets by @xuechendi in #10557
- [CI/Build] Replace mean with torch.all in test_pynccl.py by @tlrmchlsmth in #10876
- Drop ROCm load format check by @wangxiyuan in #10767
- [ci/build] Change queue name for Release jobs by @khluu in #10875
- [ci/build] Job to build and push release image by @khluu in #10877
- [bugfix] fixed parameter “n” not work when set parameter “bestof” > 1 by @o2363286 in #10854
- [ci/build] Update vLLM postmerge ECR repo by @khluu in #10887
- [LoRA] Change lora_tokenizers capacity by @xyang16 in #10796
- [Model] Consolidate ViTs attention implementation without mask by @Isotr0py in #10893
- Benchmark serving structured output by @xuechendi in #10880
- [CI/Build] improve python-only dev setup by @dtrifiro in #9621
- [V1] Fix when max_model_len is not divisible by block_size by @WoosukKwon in #10903
- [benchmark] Make H100 benchmark optional by @khluu in #10908
- [Bugfix] Fallback to outlines for complex json schemas by @mgoin in #10899
- [Doc] Create a new "Usage" section by @DarkLight1337 in #10827
- [Bugfix] Fix BNB loader target_modules by @jeejeelee in #10720
- [Misc] Update llama 3.2 template to support system prompt with images by @tjohnson31415 in #10901
- [Misc][LoRA] Clean up the function interface of Punica by @jeejeelee in #10917
- [CI/Build] Bump test transformers version by @Isotr0py in #10106
- [Misc][Gaudi] Avoid torch.compile and enable lazy collectives by default for HPU lazy backend by @kzawora-intel in #10897
- [ci][build] add tests for python only compilation by @youkaichao in #10915
- [torch.compile] use size tuning for specific sizes by @youkaichao in #10933
- [torch.compile] add logging for compilation time by @youkaichao in #10941
- [CI/Build] Fix broken multimodal test by @DarkLight1337 in #10950
- [torch.compile] fix deprecated code by @youkaichao in #10948
- [Core] Support Lark grammars for XGrammar by @mgoin in #10870
- [Doc] add KubeAI to serving integrations by @samos123 in #10837
- [misc] fix typo by @youkaichao in #10960
- [ci] fix broken tests by @youkaichao in #10956
- [Core] Cleanup startup logging a bit by @russellb in #10961
- [Bugfix] Fix test-pipeline.yaml by @jeejeelee in #10973
- [Model] Implement merged input processor for LLaVA model by @DarkLight1337 in #10676
- [Build] Fix for the Wswitch-bool clang warning by @gshtras in #10060
- [Misc][LoRA] Refactor and clean MergedQKVParallelLinearWithLora implementation by @Isotr0py in #10958
- [Model] Composite weight loading for multimodal Qwen2 by @DarkLight1337 in #10944
- [Doc] Explicitly state that InternVL 2.5 is supported by @DarkLight1337 in #10978
- [Model] Update multi-modal processor to support Mantis(LLaVA) model by @DarkLight1337 in #10711
- [Doc] Explicitly state that PP isn't compatible with speculative decoding yet by @DarkLight1337 in #10975
- [BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None by @xffxff in #10928
- [core][executor] simplify instance id by @youkaichao in #10976
- [core][misc] remove use_dummy driver for _run_workers by @youkaichao in #10920
- [torch.compile] allow candidate compile sizes by @youkaichao in #10984
- [V1] Initial support of multimodal models for V1 re-arch by @ywang96 in #10699
- [torch.compile][misc] fix comments by @youkaichao in #10993
- [misc] clean up and unify logging by @youkaichao in #10999
- [Doc][V1] Add V1 support column for multimodal models by @ywang96 in #10998
- [torch.compile] add dynamo time tracking by @youkaichao in #11005
- [V1] Fix Detokenizer loading in `AsyncLLM` by @ywang96 in #10997
- [Core] Require xgrammar >= 0.1.6 by @russellb in #11021
- [Platform] Move `async output` check to platform by @wangxiyuan in #10768
- [V1] Input Batch Relocation by @varun-sundar-rabindranath in #10962
- [ci/build] Recompile CI dependencies list with Python 3.12 by @khluu in #11013
- [V1] Further reduce CPU overheads in flash-attn by @WoosukKwon in #10989
- [Misc][LoRA] Abstract PunicaWrapper by @jeejeelee in #10955
- [Model] Implement merged input processor for Phi-3-Vision models by @Isotr0py in #10977
- [Bugfix][Hardware][Gaudi] Bump vllm_hpu_extension version by @kzawora-intel in #11028
- [v1] fix use compile sizes by @youkaichao in #11000
- [Neuron] Upgrade neuron to 2.20.2 by @xendo in #11016
- [ROCm][bugfix] Setting the value for the speculative decoding worker class on rocm platform by @gshtras in #11035
- Build tpu image in release pipeline by @richardsliu in #10936
- [V1] Do not store `None` in self.generators by @WoosukKwon in #11038
- [Docs] Add dedicated tool calling page to docs by @mgoin in #10554
- [Model] Add has_weight to RMSNorm and re-enable weights loading tracker for Mamba by @Isotr0py in #10739
- [Bugfix] Fix usage of `deprecated` decorator by @DarkLight1337 in #11025
- [Frontend] Use request id from header by @joerunde in #10968
- [Pixtral] Improve loading by @patrickvonplaten in #11040
- [V1] Multiprocessing Tensor Parallel Support for v1 by @tlrmchlsmth in #9856
- monitor metrics of tokens per step using cudagraph batchsizes by @youkaichao in #11031
- [Bugfix] Fix xgrammar failing to read a vocab_size from LlavaConfig on PixtralHF. by @sjuxax in #11043
- Update README.md by @dmoliveira in #11034
- [Bugfix] cuda error running llama 3.2 by @GeneDer in #11047
- Add example of helm chart for vllm deployment on k8s by @mfournioux in #9199
- [Bugfix] Handle <|tool_call|> token in granite tool parser by @tjohnson31415 in #11039
- [Misc][LoRA] Add PEFTHelper for LoRA by @jeejeelee in #11003
- [Bugfix] Backport request id validation to v0 by @joerunde in #11036
- [BUG] Remove token param #10921 by @flaviabeo in #11022
- [Core] Update to outlines >= 0.1.8 by @russellb in #10576
- [torch.compile] add a flag to track batchsize statistics by @youkaichao in #11059
- [V1][Bugfix] Always set enable_chunked_prefill = True for V1 by @WoosukKwon in #11061
- [Bugfix] Fix Mamba multistep by @tlrmchlsmth in #11071
- [Misc] LoRA + Chunked Prefill by @aurickq in #9057
- [Model] PP support for Mamba-like models by @mzusman in #10992
- Fix streaming for granite tool call when <|tool_call|> is present by @maxdebayser in #11069
- [CI/Build] Check transformers v4.47 by @DarkLight1337 in #10991
- [ci/build] Fix AMD CI dependencies by @khluu in #11087
- [ci/build] Fix entrypoints test and pin outlines version by @khluu in #11088
- [Core] v1: Use atexit to handle engine core client shutdown by @russellb in #11076
- [Bugfix] Fix Idefics3 fails during multi-image inference by @B-201 in #11080
- [Bugfix]: Clamp `-inf` logprob values in prompt_logprobs by @rafvasq in #11073
- [Misc] Split up pooling tasks by @DarkLight1337 in #10820
- [Doc] Update docs to refer to pooling models by @DarkLight1337 in #11093
- [CI/Build] Enable prefix caching test for AMD by @hissu-hyvarinen in #11098
- [Doc] Installed version of llmcompressor for int8/fp8 quantization by @bingps in #11103
- [torch.compile] use depyf to dump torch.compile internals by @youkaichao in #10972
- [V1] Use input_ids as input for text-only models by @WoosukKwon in #11032
- [torch.compile] remove graph logging in ci by @youkaichao in #11110
- [core] Bump ray to use _overlap_gpu_communication in compiled graph tests by @ruisearch42 in #10410
- [CI/Build] Split up VLM tests by @DarkLight1337 in #11083
- [V1][Core] Remove should_shutdown to simplify core process termination by @tlrmchlsmth in #11113
- [V1] VLM preprocessor hashing by @alexm-neuralmagic in #11020
- [Bugfix] Multiple fixes to tool streaming with hermes and mistral by @cedonley in #10979
- [Docs] Add media kit by @simon-mo in #11121
- Update link to LlamaStack remote vLLM guide in serving_with_llamastack.rst by @terrytangyuan in #11112
- [Core] cleanup zmq ipc sockets on exit by @russellb in #11115
- [Model] Add support for embedding model GritLM by @pooyadavoodi in #10816
- [V1] Use more persistent buffers to optimize input preparation overheads by @WoosukKwon in #11111
- [Hardware][Intel-Gaudi] Enable LoRA support for Intel Gaudi (HPU) by @SanjuCSudhakaran in #10565
- [core][distributed] initialization from StatelessProcessGroup by @youkaichao in #10986
- [Misc][LoRA] Ensure Lora Adapter requests return adapter name by @Jeffwan in #11094
- [V1] Fix torch profiling for offline inference by @ywang96 in #11125
- fix(docs): typo in helm install instructions by @ramonziai in #11141
- [Bugfix] Quick fix to make Pixtral-HF load correctly again after 39e227c. by @sjuxax in #11024
- [Misc] Validate grammar and fail early by @comaniac in #11119
- Fix logging of the vLLM Config by @JArnoldAMD in #11143
- [Bugfix] Fix value unpack error of simple connector for KVCache transfer. by @ShangmingCai in #11058
- [Misc][V1] Fix type in v1 prefix caching by @comaniac in #11151
- [torch.compile] Dynamic fp8 + rms_norm fusion by @ProExpertProg in #10906
- [Bugfix] Use runner_type instead of task in GritLM by @pooyadavoodi in #11144
- [Bugfix] Update starcoder2 to remap k/v scale names for kv_cache quantization by @dsikka in #11148
- [ROCm][AMD] Disable auto enabling chunked prefill on ROCm by @gshtras in #11146
- [Bugfix][V1] Fix 'NoneType' object has no attribute 'hash_value' by @comaniac in #11157
- [core] clean up cudagraph batchsize padding logic by @youkaichao in #10996
- PaliGemma 2 support by @janimo in #11142
- [Bugfix][CI][CPU] add missing datasets package to requirements-cpu.txt by @bigPYJ1151 in #11159
- [Frontend] Separate pooling APIs in offline inference by @DarkLight1337 in #11129
- [V1][VLM] Fix edge case bug for InternVL2 by @ywang96 in #11165
- [Refactor]A simple device-related refactor by @noemotiovon in #11163
- [Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching by @llsj14 in #8240
- [Bugfix] using len(tokenizer) instead of tokenizer.vocab_size in AllowedTokenIdsLogitsProcessor by @zhangjf-nlp in #11156
- [Misc] Add tokenizer_mode param to benchmark_serving.py by @alexm-neuralmagic in #11174
- [Doc] Reorganize online pooling APIs by @DarkLight1337 in #11172
- [Bugfix][Hardware][CPU] Enable Gemma2 with SDPA on CPU backend by @janimo in #11169
- [Distributed] Allow the placement group more time to wait for resources to be ready by @Jeffwan in #11138
- [Core] V1: Use multiprocessing by default by @russellb in #11074
- [V1][Bugfix] Fix EngineCoreProc profile by @tlrmchlsmth in #11185
- [Bugfix][V1] Re-compute an entire block when fully cache hit by @comaniac in #11186
- update compressed-tensors to latest version by @dhuangnm in #11183
- [Core] Update outlines and increase its threadpool size by @russellb in #11140
- [V1][Bugfix] Fix V1 TP trust-remote-code by @tlrmchlsmth in #11182
- [Misc] Minor improvements to the readability of PunicaWrapperBase by @jeejeelee in #11200
- [Frontend] Add `logits_processors` as an extra completion argument by @bradhilton in #11150
- [VLM] Fully dynamic prompt replacement in merged input processor by @DarkLight1337 in #11199
- Enable mypy checking on V1 code by @markmc in #11105
- [Performance][Core] Optimize the performance of evictor v1 and v2 by applying a priority queue and lazy deletion by @llsj14 in #7209
- [Misc] Upgrade bitsandbytes to the latest version 0.45.0 by @jeejeelee in #11201
- [torch.compile] allow tracking forward time by @youkaichao in #11081
- [Misc] Clean up multi-modal processor by @DarkLight1337 in #11207
- [Bugfix] Fix error handling of unsupported sliding window by @DarkLight1337 in #11213
- [Doc] add documentation for disaggregated prefilling by @KuntaiDu in #11197
- [Core] Support disaggregated prefill with Mooncake Transfer Engine by @ShangmingCai in #10884
- [V1][Minor] Cache np arange to reduce input preparation overhead by @WoosukKwon in #11214
- Update deploying_with_k8s.rst by @AlexHe99 in #10922
- fix block-size description by @chenqianfzh in #10938
- [Bugfix] Fix the default value for temperature in ChatCompletionRequest by @yansh97 in #11219
- [CI/Build] simplify Dockerfile build for ARM64 / GH200 by @cennn in #11212
- [Model] Support Cohere2ForCausalLM (Cohere R7B) by @janimo in #11203
- [Model] Refactor Ultravox to use merged input processor by @Isotr0py in #11198
- [Doc] Reorder vision language examples in alphabet order by @Isotr0py in #11228
- [misc] Layerwise profile updates by @varun-sundar-rabindranath in #10242
- [core] overhaul memory profiling and fix backward compatibility by @youkaichao in #10511
- [Docs] hint to enable use of GPU performance counters in profiling tools for multi-node distributed serving by @bk-TurbaAI in #11235
- [ci][tests] add gh200 tests by @youkaichao in #11244
- [torch.compile] fast inductor by @youkaichao in #11108
- fix gh200 tests on main by @youkaichao in #11246
- [CI] Add test case with JSON schema using references + use xgrammar by default with OpenAI parse by @mgoin in #10935
- [Frontend] Add OpenAI API support for input_audio by @kylehh in #11027
- [V1][VLM] Proper memory profiling for image language models by @ywang96 in #11210
- [Platform] platform agnostic for EngineArgs initialization by @wangxiyuan in #11225
- [V1][Core] Use weakref.finalize instead of atexit by @tlrmchlsmth in #11242
- [Misc] Kernel Benchmark for `RMSNorm` by @ywang96 in #11241
- [Misc] Allow passing logits_soft_cap for xformers backend by @Isotr0py in #11252
- [Bugfix] Fix request cancellation without polling by @joerunde in #11190
New Contributors
- @wchen61 made their first contribution in #10347
- @kakao-steve-ai made their first contribution in #10287
- @Maybewuss made their first contribution in #10415
- @ismael-dm made their first contribution in #9943
- @andrew made their first contribution in #10426
- @angusYuhao made their first contribution in #9014
- @xiyuan-lee made their first contribution in #10398
- @mikejuliet13 made their first contribution in #10421
- @BBuf made their first contribution in #10494
- @zixuanzhang226 made their first contribution in #10549
- @shenoyvvarun made their first contribution in #10567
- @CNTRYROA made their first contribution in #10572
- @npanpaliya made their first contribution in #10538
- @xffxff made their first contribution in #10514
- @2015aroras made their first contribution in #10503
- @sanketkaleoss made their first contribution in #9228
- @conroy-cheers made their first contribution in #9735
- @jeongin601 made their first contribution in #10198
- @shunxing12345 made their first contribution in #10311
- @spacewander made their first contribution in #10701
- @wangxiyuan made their first contribution in #10757
- @cduk made their first contribution in #10809
- @o2363286 made their first contribution in #10854
- @sjuxax made their first contribution in #11043
- @dmoliveira made their first contribution in #11034
- @mfournioux made their first contribution in #9199
- @bingps made their first contribution in #11103
- @cedonley made their first contribution in #10979
- @SanjuCSudhakaran made their first contribution in #10565
- @ramonziai made their first contribution in #11141
- @noemotiovon made their first contribution in #11163
- @zhangjf-nlp made their first contribution in #11156
- @dhuangnm made their first contribution in #11183
- @bradhilton made their first contribution in #11150
- @AlexHe99 made their first contribution in #10922
- @cennn made their first contribution in #11212
- @bk-TurbaAI made their first contribution in #11235
- @kylehh made their first contribution in #11027
Full Changelog: v0.6.4...v0.6.5