vllm-project/vllm v0.6.5


Highlights

Model Support

Hardware Support

Performance & Scheduling

  • Prefix-cache aware scheduling (#10128), sliding window support (#10462), disaggregated prefill enhancements (#10502, #10884), evictor optimization (#7209); see the sketch below.
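The scheduling work above is surfaced through existing engine flags. As a minimal, hedged sketch (assuming the standard `vllm.LLM` offline entrypoint; the model name below is only a placeholder), automatic prefix caching from #10128 can be exercised like this:

```python
# Minimal sketch: automatic prefix caching with the offline LLM API.
# Assumes the standard vllm.LLM entrypoint; the model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,                # prefix-cache aware scheduling (#10128)
)

shared_prefix = "You are a concise assistant.\n\n"
prompts = [shared_prefix + q for q in ("What is a KV cache?", "What is chunked prefill?")]

# Requests that share the same prefix can reuse cached KV blocks
# instead of recomputing the prefill for that prefix.
for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)
```

The same behavior is exposed on the OpenAI-compatible server through the `--enable-prefix-caching` flag.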

Benchmark & Frontend

Documentation & Plugins

Bugfixes & Misc

What's Changed

  • Add default value to avoid Falcon crash (#5363) by @wchen61 in #10347
  • [Misc] Fix import error in tensorizer tests and cleanup some code by @DarkLight1337 in #10349
  • [Doc] Remove float32 choice from --lora-dtype by @xyang16 in #10348
  • [Bugfix] Fix fully sharded LoRA bug by @jeejeelee in #10352
  • [Misc] Fix some help info of arg_utils to improve readability by @ShangmingCai in #10362
  • [core][misc] keep compatibility for old-style classes by @youkaichao in #10356
  • [Bugfix] Ensure special tokens are properly filtered out for guided structured output with MistralTokenizer by @gcalmettes in #10363
  • [Misc] Bump up test_fused_moe tolerance by @ElizaWszola in #10364
  • [Misc] bump mistral common version by @simon-mo in #10367
  • [Docs] Add Nebius as sponsors by @simon-mo in #10371
  • [Frontend] Add --version flag to CLI by @russellb in #10369
  • [Doc] Move PR template content to docs by @russellb in #10159
  • [Docs] Misc updates to TPU installation instructions by @mikegre-google in #10165
  • [Frontend] Automatic detection of chat content format from AST by @DarkLight1337 in #9919
  • [doc] add doc for the plugin system by @youkaichao in #10372
  • [misc][plugin] improve log messages by @youkaichao in #10386
  • [BugFix] [Kernel] Fix GPU SEGV occurring in fused_moe kernel by @rasmith in #10385
  • [Misc] Update benchmark to support image_url file or http by @kakao-steve-ai in #10287
  • [Misc] Medusa supports custom bias by @skylee-01 in #10361
  • [Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled by @imkero in #10388
  • [V1] Add code owners for V1 by @WoosukKwon in #10397
  • [2/N][torch.compile] make compilation cfg part of vllm cfg by @youkaichao in #10383
  • [V1] Refactor model executable interface for all text-only language models by @ywang96 in #10374
  • [CI/Build] Fix IDC hpu [Device not found] issue by @xuechendi in #10384
  • [Bugfix][Hardware][CPU] Fix CPU embedding runner with tensor parallel by @Isotr0py in #10394
  • [platforms] refactor cpu code by @youkaichao in #10402
  • [Hardware] [HPU] Add mark_step for HPU by @jikunshang in #10239
  • [Bugfix] Fix mrope_position_delta in non-last prefill chunk by @imkero in #10403
  • [Misc] Enhance offline_inference to support user-configurable paramet… by @wchen61 in #10392
  • [Misc] Add uninitialized params tracking for AutoWeightsLoader by @Isotr0py in #10327
  • [Bugfix] Ignore ray reinit error when current platform is ROCm or XPU by @HollowMan6 in #10375
  • [4/N][torch.compile] clean up set_torch_compile_backend by @youkaichao in #10401
  • [VLM] Report multi_modal_placeholders in output by @lk-chen in #10407
  • [Model] Remove redundant softmax when using PoolingType.STEP by @Maybewuss in #10415
  • [Model][LoRA] LoRA support added for glm-4v by @B-201 in #10418
  • [Model] Remove transformers attention porting in VITs by @Isotr0py in #10414
  • [Doc] Update doc for LoRA support in GLM-4V by @B-201 in #10425
  • [5/N][torch.compile] torch.jit.script --> torch.compile by @youkaichao in #10406
  • [Doc] Add documentation for Structured Outputs by @ismael-dm in #9943
  • Fix open_collective value in FUNDING.yml by @andrew in #10426
  • [Model][Bugfix] Support TP for PixtralHF ViT by @mgoin in #10405
  • [Hardware][XPU] AWQ/GPTQ support for xpu backend by @yma11 in #10107
  • [Kernel] Explicitly specify other value in tl.load calls by @angusYuhao in #9014
  • [Kernel] Initial Machete W4A8 support + Refactors by @LucasWilkinson in #9855
  • [3/N][torch.compile] consolidate custom op logging by @youkaichao in #10399
  • [ci][bugfix] fix kernel tests by @youkaichao in #10431
  • [misc] Allow partial prefix benchmarking & random input generation for prefix benchmarking by @rickyyx in #9929
  • [ci/build] Have dependabot ignore all patch update by @khluu in #10436
  • [Bugfix] Fix Phi-3 BNB online quantization by @jeejeelee in #10417
  • [Platform][Refactor] Extract func get_default_attn_backend to Platform by @MengqingCao in #10358
  • Add openai.beta.chat.completions.parse example to structured_outputs.rst by @mgoin in #10433
  • [Bugfix] Guard for negative counter metrics to prevent crash by @tjohnson31415 in #10430
  • [Misc] Avoid misleading warning messages by @jeejeelee in #10438
  • [Doc] Add the start of an arch overview page by @russellb in #10368
  • [misc][plugin] improve plugin loading by @youkaichao in #10443
  • [CI][CPU] adding numa node number as container name suffix by @zhouyuan in #10441
  • [BugFix] Fix hermes tool parser output error stream arguments in some cases (#10395) by @xiyuan-lee in #10398
  • [Pixtral-Large] Pixtral actually has no bias in vision-lang adapter by @patrickvonplaten in #10449
  • Fix: Build error seen on Power Architecture by @mikejuliet13 in #10421
  • [Doc] fix link for page that was renamed by @russellb in #10455
  • [6/N] torch.compile rollout to users by @youkaichao in #10437
  • [Core] Avoid metrics log noise when idle by @russellb in #8868
  • [Model][Quantization] HQQ support through Marlin kernel expansion by @ElizaWszola in #9766
  • Change granite chat template to keep json list formatting for tool calls by @maxdebayser in #10452
  • [CI/Build] Update Dockerfile.rocm by @Alexei-V-Ivanov-AMD in #10434
  • [Bugfix] Marlin 2:4 temp fix for large M dim (>256) by @LucasWilkinson in #10464
  • [Misc] Add setitem for LazyDict by @liuyanyi in #10469
  • [Bugfix] Fix Mamba model initialization and MLP Speculator weights loading by @Isotr0py in #10456
  • [Bugfix] Enforce no chunked prefill for embedding models by @DarkLight1337 in #10470
  • [CI/Build] Add sphinx/rst linter for docs by @rafvasq in #10366
  • [CI/Build] Support compilation with local cutlass path (#10423) by @wchen61 in #10424
  • [ci/build] Combine nightly and optional by @khluu in #10465
  • [model] Reduce medusa weight by @skylee-01 in #10454
  • [Bugfix] Handle conflicts between modern and legacy fields by @DarkLight1337 in #10471
  • [Platforms] Refactor xpu code by @MengqingCao in #10468
  • [Hardware][CPU] Support chunked-prefill and prefix-caching on CPU by @bigPYJ1151 in #10355
  • [platforms] restore xpu check for parallel config by @youkaichao in #10479
  • [perf bench] H200 development by @simon-mo in #9768
  • [7/N] torch.compile, reduce compilation time by @youkaichao in #10460
  • [Bugfix]: allow extra fields in requests to openai compatible server by @gcalmettes in #10463
  • [TPU] Implement prefix caching for TPUs by @WoosukKwon in #10307
  • [torch.compile] limit inductor threads and lazy import quant by @youkaichao in #10482
  • [Core] Add Sliding Window Support with Flashinfer by @pavanimajety in #10462
  • [Platforms] Add device_type in Platform by @MengqingCao in #10508
  • [torch.compile] PostGradPassManager, Inductor code caching fix, fix_functionalization pass refactor + tests by @ProExpertProg in #10273
  • [Misc] Increase default video fetch timeout by @DarkLight1337 in #10495
  • [platforms] improve error message for unspecified platforms by @youkaichao in #10520
  • [Doc] fix a small typo in docstring of llama_tool_parser by @FerdinandZhong in #10513
  • [Model] Add Support for Multimodal Granite Models by @alex-jw-brooks in #10291
  • Fix the issue where len(tokenizer(prompt)["input_ids"]) > prompt_len by @sywangyi in #10524
  • [Model] Expose dynamic_image_size as mm_processor_kwargs for InternVL2 models by @Isotr0py in #10518
  • [Bugfix] Embedding model pooling_type equals ALL and multi input's bug by @BBuf in #10494
  • [Bug]: When apply continue_final_message for OpenAI server, the "echo":false is ignored by @chaunceyjiang in #10180
  • [Kernel] Register punica ops directly by @jeejeelee in #10522
  • [Misc] Suppress duplicated logging regarding multimodal input pipeline by @ywang96 in #10530
  • [Bugfix] Allow token ID-only inputs in Qwen2-Audio by @DarkLight1337 in #10536
  • [8/N] enable cli flag without a space by @youkaichao in #10529
  • [V1] Fix Compilation config & Enable CUDA graph by default by @WoosukKwon in #10528
  • [CI][Installation] Avoid uploading CUDA 11.8 wheel by @cermeng in #10535
  • [misc] improve error message by @youkaichao in #10553
  • [Minor] Revert change in offline inference example by @WoosukKwon in #10545
  • Add small example to metrics.rst by @mgoin in #10550
  • [Benchmark] Add new H100 machine by @simon-mo in #10547
  • [9/N] torch.compile LLM usage by @youkaichao in #10552
  • [Minor] Fix line-too-long by @WoosukKwon in #10563
  • [platforms] absorb worker cls difference into platforms folder by @youkaichao in #10555
  • [Bugfix] Fix Phi-3 BNB quantization with tensor parallel by @Isotr0py in #9948
  • Remove token-adding chat embedding params by @noamgat in #10551
  • [bugfix] fix full graph tests by @youkaichao in #10581
  • [torch.compile] support all attention backends by @youkaichao in #10558
  • [v1] Refactor KVCacheManager for more hash input than token ids by @rickyyx in #10507
  • support bitsandbytes quantization with qwen model by @zixuanzhang226 in #10549
  • [Core] remove temporary local variables in LLMEngine.init by @russellb in #10577
  • [V1] EngineCore supports profiling by @Abatom in #10564
  • [bugfix] fix cpu tests by @youkaichao in #10585
  • [Bugfix][Frontend] Update Llama Chat Templates to also support Non-Tool use by @tjohnson31415 in #10164
  • [Core] Fix broken log configuration by @russellb in #10458
  • [Misc] Add pynccl wrappers for all_gather and reduce_scatter by @tlrmchlsmth in #9432
  • [core] gemma2 full context length support by @youkaichao in #10584
  • [Bugfix] 500 Internal Server Error when tool_choice is incorrect. by @shenoyvvarun in #10567
  • [Model] Fix Baichuan BNB online quantization by @CNTRYROA in #10572
  • Update default max_num_batch_tokens for chunked prefill to 2048 by @mgoin in #10544
  • [Kernel][Hardware][AMD] Add support for GGUF quantization on ROCm by @kliuae in #10254
  • Prefix Cache Aware Scheduling [1/n] by @rickyyx in #10128
  • [2/N] Proper handling of placeholders in merged multi-modal processor by @DarkLight1337 in #10485
  • [Bugfix][Hardware][CPU] Fix multi_modal_kwargs broadcast for CPU tensor parallel by @Isotr0py in #10541
  • [Platforms] Refactor openvino code by @statelesshz in #10573
  • [CI/Build] For ppc64le, disabled tests for now and addressed space issues by @npanpaliya in #10538
  • [Bugfix] Avoid import AttentionMetadata explicitly in Mllama and fix openvino import by @Isotr0py in #10593
  • [bugfix] Fix example/tensorize_vllm_model tests by @jeejeelee in #10595
  • [Bugfix] Fix the LoRA weight sharding in ColumnParallelLinearWithLoRA by @jeejeelee in #10450
  • [CI/Build] Print running script to enhance CI log readability by @jeejeelee in #10594
  • Revert "[CI/Build] Print running script to enhance CI log readability" by @youkaichao in #10601
  • [model][utils] add extract_layer_index utility function by @youkaichao in #10599
  • [doc] update the code to add models by @youkaichao in #10603
  • [Doc] Update README.md with Ray Summit talk links by @zhuohan123 in #10610
  • Support Cross encoder models by @maxdebayser in #10400
  • [Refactor][MISC] del redundant code in ParallelConfig.postinit by @MengqingCao in #10614
  • [torch.compile] support encoder based models by @youkaichao in #10613
  • [Doc] Add encoder-based models to Supported Models page by @DarkLight1337 in #10616
  • [torch.compile] force inductor threads by @jeejeelee in #10620
  • [torch.compile] add warning for unsupported models by @youkaichao in #10622
  • [misc] add torch.compile compatibility check by @youkaichao in #10618
  • [misc] move functions to config.py by @youkaichao in #10624
  • [Model] Support is_causal HF config field for Qwen2 model by @DarkLight1337 in #10621
  • [Doc] Super tiny little typo fix by @fzyzcjy in #10633
  • [Bug]: Authorization ignored when root_path is set by @chaunceyjiang in #10606
  • [Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices by @wallashss in #9850
  • [Docs] Add Snowflake Slides by @simon-mo in #10641
  • [Model]: Add support for Aria model by @xffxff in #10514
  • [Model] Enable optional prefix when loading embedding models by @DarkLight1337 in #10639
  • [Doc] Fix typos in docs by @DarkLight1337 in #10636
  • [Model] Add OLMo November 2024 model by @2015aroras in #10503
  • [misc] do not read HOST_IP by @youkaichao in #10644
  • [bugfix] fix aria model and add torch.compile by @youkaichao in #10645
  • [Feature] vLLM ARM Enablement for AARCH64 CPUs by @sanketkaleoss in #9228
  • [v1] EngineArgs for better config handling for v1 by @rickyyx in #10382
  • custom allreduce + torch.compile by @SageMoore in #10121
  • [Misc] Remove outdated init protocols by @DarkLight1337 in #10655
  • [ci] add vllm_test_utils by @youkaichao in #10659
  • [V1] Enable profile for LLMEngine by @jikunshang in #10665
  • [Bugfix] Fix for Spec model TP + Chunked Prefill by @andoorve in #10232
  • [Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson by @conroy-cheers in #9735
  • [Bugfix] Fix using -O[0,3] with LLM entrypoint by @mgoin in #10677
  • [Bugfix] Check bnb_4bit_quant_storage for bitsandbytes by @mgoin in #10642
  • [V1] Refactor model executable interface for multimodal models by @ywang96 in #10570
  • [Kernel] Remove hard-dependencies of Speculative decode to CUDA workers by @xuechendi in #10587
  • [V1] Update interface for idefics3 by @ywang96 in #10680
  • [Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. by @jeongin601 in #10198
  • [bugfix] fix the default value of llm_int8_threshold in BitsAndBytesConfig by @yansh97 in #10657
  • [Hardware][Gaudi]add get_name method for HPUAttentionBackend by @jikunshang in #10667
  • [Misc] Further reduce BNB static variable by @jeejeelee in #10597
  • [Cleanup][Kernel] Remove if-else with identical branches in marlin 2:4 by @tlrmchlsmth in #10687
  • [Model] Support telechat2 by @shunxing12345 in #10311
  • [Bugfix][Hardware][CPU] Fix intel-omp version to avoid segfault by @bigPYJ1151 in #10700
  • [V1] Update interface for mistral-format Pixtral by @ywang96 in #10703
  • [ci] fix slow tests by @youkaichao in #10698
  • [torch.compile] fix shape specialization by @youkaichao in #10722
  • [Bugfix] Fix GGUF inference with FP16 unquantized checkpoint by @Isotr0py in #10675
  • [Bugfix][Mamba] Fix Multistep on Mamba-like models by @mzusman in #10705
  • [Bugfix] Ignore lm_head when loading embedding models by @DarkLight1337 in #10719
  • [Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server by @tomeras91 in #10635
  • [misc] upgrade filelock version by @youkaichao in #10731
  • [Model] support bitsandbytes quantization with minicpm3 model by @zixuanzhang226 in #10682
  • [Doc] Update model in arch_overview.rst to match comment by @spacewander in #10701
  • [Bug][CLI] Allow users to disable prefix caching explicitly by @rickyyx in #10724
  • [V1] Do not allocate beyond the max_model_len by @WoosukKwon in #10730
  • [Kernel] Update vllm-flash-attn version by @WoosukKwon in #10736
  • Update requirements-tpu by @richardsliu in #10726
  • [Model] Added GLM-4 series hf format model support vllm==0.6.4 by @sixsixcoder in #10561
  • [Kernel] Update vllm-flash-attn version by @WoosukKwon in #10742
  • [V1] Optimize the CPU overheads in FlashAttention custom op by @WoosukKwon in #10733
  • [Model] Add Internlm2 LoRA support by @Isotr0py in #5064
  • [Model] Clean up MiniCPMV by @DarkLight1337 in #10751
  • [Misc] typo find in sampling_metadata.py by @noooop in #10740
  • [Bugfix] Fix Idefics3 bug by @jeejeelee in #10778
  • [platform] Add verify_quantization in platform. by @wangxiyuan in #10757
  • [Bugfix] Fix OpenVino/Neuron driver_worker init by @NickLucche in #10779
  • [Model] Refactor Molmo weights loading to use AutoWeightsLoader by @Isotr0py in #10771
  • Interleaving sliding window for Ministral-8B-Instruct-2410 by @patrickvonplaten in #10591
  • [doc] format fix by @wangxiyuan in #10789
  • [Model] Replace embedding models with pooling adapter by @DarkLight1337 in #10769
  • [Misc] Improve type annotations for support_torch_compile by @DarkLight1337 in #10763
  • [Misc] Rename embedding classes to pooling by @DarkLight1337 in #10801
  • [doc] add warning about comparing hf and vllm outputs by @youkaichao in #10805
  • [Misc] Adding MMMU-Pro vision dataset to serving benchmark by @ywang96 in #10804
  • [Core] Implement disagg prefill by StatelessProcessGroup by @KuntaiDu in #10502
  • [Model] Add BNB support to Llava and Pixtral-HF by @Isotr0py in #10795
  • [core] Avoid metrics log noise when idle - include speculative decodi… by @cduk in #10809
  • [Kernel] Use out in flash_attn_varlen_func by @WoosukKwon in #10811
  • Fill TorchSDPAAttentionMetadata seq_lens_field for prefill by @maxdebayser in #10799
  • [misc] remove xverse modeling file by @youkaichao in #10814
  • [doc] Update config docstring by @wangxiyuan in #10732
  • [Model]: add some tests for aria model by @xffxff in #10770
  • [CI/Build] Update mistral_common version for tests and docs by @DarkLight1337 in #10825
  • [misc] use out argument for flash attention by @youkaichao in #10822
  • [Misc][LoRA] Move the implementation of lora bias to punica.py by @jeejeelee in #10829
  • [Misc][XPU] Avoid torch compile for XPU platform by @yma11 in #10747
  • Fix openvino on GPU by @janimo in #10793
  • [Model] Add TP and BNB quantization support to LlavaMultiModalProjector by @Isotr0py in #10834
  • [Bugfix] Prevent benchmark_throughput.py from using duplicated random prompts by @mgoin in #10753
  • [Model] support bitsandbytes quantization with minicpm model by @zixuanzhang226 in #10842
  • [Bugfix] Fix QKVParallelLinearWithShardedLora bias bug by @jeejeelee in #10844
  • [core][distributed] add pynccl broadcast by @youkaichao in #10843
  • [torch.compile] remove compilation_context and simplify code by @youkaichao in #10838
  • [Doc] Add github links for source code references by @russellb in #10672
  • [Misc] Remove deprecated names by @DarkLight1337 in #10817
  • [Core][Performance] Add XGrammar support for guided decoding and set it as default by @aarnphm in #10785 (see the structured-output example after this list)
  • [Speculative Decoding] Move indices to device before filtering output by @zhengy001 in #10850
  • [V1] VLM - Run the mm_mapper preprocessor in the frontend process by @alexm-neuralmagic in #10640
  • [MISC][XPU] quick fix for XPU CI by @yma11 in #10859
  • [Bugfix] Only require XGrammar on x86 by @mgoin in #10865
  • [Bugfix][Frontend] correctly record prefill and decode time metrics by @tomeras91 in #10853
  • [Build][Bugfix] Using the correct type hint by @gshtras in #10866
  • [Benchmark] Benchmark structured output with datasets by @xuechendi in #10557
  • [CI/Build] Replace mean with torch.all in test_pynccl.py by @tlrmchlsmth in #10876
  • Drop ROCm load format check by @wangxiyuan in #10767
  • [ci/build] Change queue name for Release jobs by @khluu in #10875
  • [ci/build] Job to build and push release image by @khluu in #10877
  • [Bugfix] Fix parameter “n” not taking effect when “best_of” > 1 is set by @o2363286 in #10854
  • [ci/build] Update vLLM postmerge ECR repo by @khluu in #10887
  • [LoRA] Change lora_tokenizers capacity by @xyang16 in #10796
  • [Model] Consolidate ViTs attention implementation without mask by @Isotr0py in #10893
  • Benchmark serving structured output by @xuechendi in #10880
  • [CI/Build] improve python-only dev setup by @dtrifiro in #9621
  • [V1] Fix when max_model_len is not divisible by block_size by @WoosukKwon in #10903
  • [benchmark] Make H100 benchmark optional by @khluu in #10908
  • [Bugfix] Fallback to outlines for complex json schemas by @mgoin in #10899
  • [Doc] Create a new "Usage" section by @DarkLight1337 in #10827
  • [Bugfix] Fix BNB loader target_modules by @jeejeelee in #10720
  • [Misc] Update llama 3.2 template to support system prompt with images by @tjohnson31415 in #10901
  • [Misc][LoRA] Clean up the function interface of Punica by @jeejeelee in #10917
  • [CI/Build] Bump test transformers version by @Isotr0py in #10106
  • [Misc][Gaudi] Avoid torch.compile and enable lazy collectives by default for HPU lazy backend by @kzawora-intel in #10897
  • [ci][build] add tests for python only compilation by @youkaichao in #10915
  • [torch.compile] use size tuning for specific sizes by @youkaichao in #10933
  • [torch.compile] add logging for compilation time by @youkaichao in #10941
  • [CI/Build] Fix broken multimodal test by @DarkLight1337 in #10950
  • [torch.compile] fix deprecated code by @youkaichao in #10948
  • [Core] Support Lark grammars for XGrammar by @mgoin in #10870
  • [Doc] add KubeAI to serving integrations by @samos123 in #10837
  • [misc] fix typo by @youkaichao in #10960
  • [ci] fix broken tests by @youkaichao in #10956
  • [Core] Cleanup startup logging a bit by @russellb in #10961
  • [Bugfix] Fix test-pipeline.yaml by @jeejeelee in #10973
  • [Model] Implement merged input processor for LLaVA model by @DarkLight1337 in #10676
  • [Build] Fix for the Wswitch-bool clang warning by @gshtras in #10060
  • [Misc][LoRA] Refactor and clean MergedQKVParallelLinearWithLora implementation by @Isotr0py in #10958
  • [Model] Composite weight loading for multimodal Qwen2 by @DarkLight1337 in #10944
  • [Doc] Explicitly state that InternVL 2.5 is supported by @DarkLight1337 in #10978
  • [Model] Update multi-modal processor to support Mantis(LLaVA) model by @DarkLight1337 in #10711
  • [Doc] Explicitly state that PP isn't compatible with speculative decoding yet by @DarkLight1337 in #10975
  • [BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None by @xffxff in #10928
  • [core][executor] simplify instance id by @youkaichao in #10976
  • [core][misc] remove use_dummy driver for _run_workers by @youkaichao in #10920
  • [torch.compile] allow candidate compile sizes by @youkaichao in #10984
  • [V1] Initial support of multimodal models for V1 re-arch by @ywang96 in #10699
  • [torch.compile][misc] fix comments by @youkaichao in #10993
  • [misc] clean up and unify logging by @youkaichao in #10999
  • [Doc][V1] Add V1 support column for multimodal models by @ywang96 in #10998
  • [torch.compile] add dynamo time tracking by @youkaichao in #11005
  • [V1] Fix Detokenizer loading in AsyncLLM by @ywang96 in #10997
  • [Core] Require xgrammar >= 0.1.6 by @russellb in #11021
  • [Platform] Move async output check to platform by @wangxiyuan in #10768
  • [V1] Input Batch Relocation by @varun-sundar-rabindranath in #10962
  • [ci/build] Recompile CI dependencies list with Python 3.12 by @khluu in #11013
  • [V1] Further reduce CPU overheads in flash-attn by @WoosukKwon in #10989
  • [Misc][LoRA] Abstract PunicaWrapper by @jeejeelee in #10955
  • [Model] Implement merged input processor for Phi-3-Vision models by @Isotr0py in #10977
  • [Bugfix][Hardware][Gaudi] Bump vllm_hpu_extension version by @kzawora-intel in #11028
  • [v1] fix use compile sizes by @youkaichao in #11000
  • [Neuron] Upgrade neuron to 2.20.2 by @xendo in #11016
  • [ROCm][bugfix] Setting the value for the speculative decoding worker class on the ROCm platform by @gshtras in #11035
  • Build tpu image in release pipeline by @richardsliu in #10936
  • [V1] Do not store None in self.generators by @WoosukKwon in #11038
  • [Docs] Add dedicated tool calling page to docs by @mgoin in #10554
  • [Model] Add has_weight to RMSNorm and re-enable weights loading tracker for Mamba by @Isotr0py in #10739
  • [Bugfix] Fix usage of deprecated decorator by @DarkLight1337 in #11025
  • [Frontend] Use request id from header by @joerunde in #10968
  • [Pixtral] Improve loading by @patrickvonplaten in #11040
  • [V1] Multiprocessing Tensor Parallel Support for v1 by @tlrmchlsmth in #9856
  • monitor metrics of tokens per step using cudagraph batchsizes by @youkaichao in #11031
  • [Bugfix] Fix xgrammar failing to read a vocab_size from LlavaConfig on PixtralHF. by @sjuxax in #11043
  • Update README.md by @dmoliveira in #11034
  • [Bugfix] cuda error running llama 3.2 by @GeneDer in #11047
  • Add example of helm chart for vllm deployment on k8s by @mfournioux in #9199
  • [Bugfix] Handle <|tool_call|> token in granite tool parser by @tjohnson31415 in #11039
  • [Misc][LoRA] Add PEFTHelper for LoRA by @jeejeelee in #11003
  • [Bugfix] Backport request id validation to v0 by @joerunde in #11036
  • [BUG] Remove token param #10921 by @flaviabeo in #11022
  • [Core] Update to outlines >= 0.1.8 by @russellb in #10576
  • [torch.compile] add a flag to track batchsize statistics by @youkaichao in #11059
  • [V1][Bugfix] Always set enable_chunked_prefill = True for V1 by @WoosukKwon in #11061
  • [Bugfix] Fix Mamba multistep by @tlrmchlsmth in #11071
  • [Misc] LoRA + Chunked Prefill by @aurickq in #9057
  • [Model] PP support for Mamba-like models by @mzusman in #10992
  • Fix streaming for granite tool call when <|tool_call|> is present by @maxdebayser in #11069
  • [CI/Build] Check transformers v4.47 by @DarkLight1337 in #10991
  • [ci/build] Fix AMD CI dependencies by @khluu in #11087
  • [ci/build] Fix entrypoints test and pin outlines version by @khluu in #11088
  • [Core] v1: Use atexit to handle engine core client shutdown by @russellb in #11076
  • [Bugfix] Fix Idefics3 fails during multi-image inference by @B-201 in #11080
  • [Bugfix]: Clamp -inf logprob values in prompt_logprobs by @rafvasq in #11073
  • [Misc] Split up pooling tasks by @DarkLight1337 in #10820
  • [Doc] Update docs to refer to pooling models by @DarkLight1337 in #11093
  • [CI/Build] Enable prefix caching test for AMD by @hissu-hyvarinen in #11098
  • [Doc] Installed version of llmcompressor for int8/fp8 quantization by @bingps in #11103
  • [torch.compile] use depyf to dump torch.compile internals by @youkaichao in #10972
  • [V1] Use input_ids as input for text-only models by @WoosukKwon in #11032
  • [torch.compile] remove graph logging in ci by @youkaichao in #11110
  • [core] Bump ray to use _overlap_gpu_communication in compiled graph tests by @ruisearch42 in #10410
  • [CI/Build] Split up VLM tests by @DarkLight1337 in #11083
  • [V1][Core] Remove should_shutdown to simplify core process termination by @tlrmchlsmth in #11113
  • [V1] VLM preprocessor hashing by @alexm-neuralmagic in #11020
  • [Bugfix] Multiple fixes to tool streaming with hermes and mistral by @cedonley in #10979
  • [Docs] Add media kit by @simon-mo in #11121
  • Update link to LlamaStack remote vLLM guide in serving_with_llamastack.rst by @terrytangyuan in #11112
  • [Core] cleanup zmq ipc sockets on exit by @russellb in #11115
  • [Model] Add support for embedding model GritLM by @pooyadavoodi in #10816
  • [V1] Use more persistent buffers to optimize input preparation overheads by @WoosukKwon in #11111
  • [Hardware][Intel-Gaudi] Enable LoRA support for Intel Gaudi (HPU) by @SanjuCSudhakaran in #10565
  • [core][distributed] initialization from StatelessProcessGroup by @youkaichao in #10986
  • [Misc][LoRA] Ensure Lora Adapter requests return adapter name by @Jeffwan in #11094
  • [V1] Fix torch profiling for offline inference by @ywang96 in #11125
  • fix(docs): typo in helm install instructions by @ramonziai in #11141
  • [Bugfix] Quick fix to make Pixtral-HF load correctly again after 39e227c. by @sjuxax in #11024
  • [Misc] Validate grammar and fail early by @comaniac in #11119
  • Fix logging of the vLLM Config by @JArnoldAMD in #11143
  • [Bugfix] Fix value unpack error of simple connector for KVCache transfer. by @ShangmingCai in #11058
  • [Misc][V1] Fix type in v1 prefix caching by @comaniac in #11151
  • [torch.compile] Dynamic fp8 + rms_norm fusion by @ProExpertProg in #10906
  • [Bugfix] Use runner_type instead of task in GritLM by @pooyadavoodi in #11144
  • [Bugfix] Update starcoder2 to remap k/v scale names for kv_cache quantization by @dsikka in #11148
  • [ROCm][AMD] Disable auto enabling chunked prefill on ROCm by @gshtras in #11146
  • [Bugfix][V1] Fix 'NoneType' object has no attribute 'hash_value' by @comaniac in #11157
  • [core] clean up cudagraph batchsize padding logic by @youkaichao in #10996
  • PaliGemma 2 support by @janimo in #11142
  • [Bugfix][CI][CPU] add missing datasets package to requirements-cpu.txt by @bigPYJ1151 in #11159
  • [Frontend] Separate pooling APIs in offline inference by @DarkLight1337 in #11129
  • [V1][VLM] Fix edge case bug for InternVL2 by @ywang96 in #11165
  • [Refactor] A simple device-related refactor by @noemotiovon in #11163
  • [Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching by @llsj14 in #8240
  • [Bugfix] using len(tokenizer) instead of tokenizer.vocab_size in AllowedTokenIdsLogitsProcessor by @zhangjf-nlp in #11156
  • [Misc] Add tokenizer_mode param to benchmark_serving.py by @alexm-neuralmagic in #11174
  • [Doc] Reorganize online pooling APIs by @DarkLight1337 in #11172
  • [Bugfix][Hardware][CPU] Enable Gemma2 with SDPA on CPU backend by @janimo in #11169
  • [Distributed] Allow the placement group more time to wait for resources to be ready by @Jeffwan in #11138
  • [Core] V1: Use multiprocessing by default by @russellb in #11074
  • [V1][Bugfix] Fix EngineCoreProc profile by @tlrmchlsmth in #11185
  • [Bugfix][V1] Re-compute an entire block when fully cache hit by @comaniac in #11186
  • update compressed-tensors to latest version by @dhuangnm in #11183
  • [Core] Update outlines and increase its threadpool size by @russellb in #11140
  • [V1][Bugfix] Fix V1 TP trust-remote-code by @tlrmchlsmth in #11182
  • [Misc] Minor improvements to the readability of PunicaWrapperBase by @jeejeelee in #11200
  • [Frontend] Add logits_processors as an extra completion argument by @bradhilton in #11150
  • [VLM] Fully dynamic prompt replacement in merged input processor by @DarkLight1337 in #11199
  • Enable mypy checking on V1 code by @markmc in #11105
  • [Performance][Core] Optimize the performance of evictor v1 and v2 by applying a priority queue and lazy deletion by @llsj14 in #7209
  • [Misc] Upgrade bitsandbytes to the latest version 0.45.0 by @jeejeelee in #11201
  • [torch.compile] allow tracking forward time by @youkaichao in #11081
  • [Misc] Clean up multi-modal processor by @DarkLight1337 in #11207
  • [Bugfix] Fix error handling of unsupported sliding window by @DarkLight1337 in #11213
  • [Doc] add documentation for disaggregated prefilling by @KuntaiDu in #11197
  • [Core] Support disaggregated prefill with Mooncake Transfer Engine by @ShangmingCai in #10884
  • [V1][Minor] Cache np arange to reduce input preparation overhead by @WoosukKwon in #11214
  • Update deploying_with_k8s.rst by @AlexHe99 in #10922
  • fix block-size description by @chenqianfzh in #10938
  • [Bugfix] Fix the default value for temperature in ChatCompletionRequest by @yansh97 in #11219
  • [CI/Build] simplify Dockerfile build for ARM64 / GH200 by @cennn in #11212
  • [Model] Support Cohere2ForCausalLM (Cohere R7B) by @janimo in #11203
  • [Model] Refactor Ultravox to use merged input processor by @Isotr0py in #11198
  • [Doc] Reorder vision language examples in alphabet order by @Isotr0py in #11228
  • [misc] Layerwise profile updates by @varun-sundar-rabindranath in #10242
  • [core] overhaul memory profiling and fix backward compatibility by @youkaichao in #10511
  • [Docs] hint to enable use of GPU performance counters in profiling tools for multi-node distributed serving by @bk-TurbaAI in #11235
  • [ci][tests] add gh200 tests by @youkaichao in #11244
  • [torch.compile] fast inductor by @youkaichao in #11108
  • fix gh200 tests on main by @youkaichao in #11246
  • [CI] Add test case with JSON schema using references + use xgrammar by default with OpenAI parse by @mgoin in #10935
  • [Frontend] Add OpenAI API support for input_audio by @kylehh in #11027
  • [V1][VLM] Proper memory profiling for image language models by @ywang96 in #11210
  • [Platform] platform agnostic for EngineArgs initialization by @wangxiyuan in #11225
  • [V1][Core] Use weakref.finalize instead of atexit by @tlrmchlsmth in #11242
  • [Misc] Kernel Benchmark for RMSNorm by @ywang96 in #11241
  • [Misc] Allow passing logits_soft_cap for xformers backend by @Isotr0py in #11252
  • [Bugfix] Fix request cancellation without polling by @joerunde in #11190
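Several of the entries above touch the structured-output path: XGrammar becomes the default guided-decoding backend (#10785), with a fallback to outlines for complex JSON schemas (#10899) and Lark grammar support (#10870). As a minimal, hedged sketch of exercising guided JSON decoding against the OpenAI-compatible server (assuming a server already started with `vllm serve`; the model name is a placeholder):

```python
# Minimal sketch: guided JSON output via the OpenAI-compatible server.
# Assumes a server started with `vllm serve <model>`; the model name is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; use the served model name
    messages=[{"role": "user", "content": "Name a large city and its population as JSON."}],
    extra_body={"guided_json": schema},  # vLLM extension, routed through the guided-decoding backend
)
print(resp.choices[0].message.content)
```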

New Contributors

  • @wchen61 made their first contribution in #10347
  • @kakao-steve-ai made their first contribution in #10287
  • @Maybewuss made their first contribution in #10415
  • @ismael-dm made their first contribution in #9943
  • @andrew made their first contribution in #10426
  • @angusYuhao made their first contribution in #9014
  • @xiyuan-lee made their first contribution in #10398
  • @mikejuliet13 made their first contribution in #10421
  • @BBuf made their first contribution in #10494
  • @zixuanzhang226 made their first contribution in #10549
  • @shenoyvvarun made their first contribution in #10567
  • @CNTRYROA made their first contribution in #10572
  • @npanpaliya made their first contribution in #10538
  • @xffxff made their first contribution in #10514
  • @2015aroras made their first contribution in #10503
  • @sanketkaleoss made their first contribution in #9228
  • @conroy-cheers made their first contribution in #9735
  • @jeongin601 made their first contribution in #10198
  • @shunxing12345 made their first contribution in #10311
  • @spacewander made their first contribution in #10701
  • @wangxiyuan made their first contribution in #10757
  • @cduk made their first contribution in #10809
  • @o2363286 made their first contribution in #10854
  • @sjuxax made their first contribution in #11043
  • @dmoliveira made their first contribution in #11034
  • @mfournioux made their first contribution in #9199
  • @bingps made their first contribution in #11103
  • @cedonley made their first contribution in #10979
  • @SanjuCSudhakaran made their first contribution in #10565
  • @ramonziai made their first contribution in #11141
  • @noemotiovon made their first contribution in #11163
  • @zhangjf-nlp made their first contribution in #11156
  • @dhuangnm made their first contribution in #11183
  • @bradhilton made their first contribution in #11150
  • @AlexHe99 made their first contribution in #10922
  • @cennn made their first contribution in #11212
  • @bk-TurbaAI made their first contribution in #11235
  • @kylehh made their first contribution in #11027

Full Changelog: v0.6.4...v0.6.5
