What's Changed
- [TPU] Fix TPU SMEM OOM by Pallas paged attention kernel by @WoosukKwon in #9350
- [Frontend] merge beam search implementations by @LunrEclipse in #9296
- [Model] Make llama3.2 support multiple and interleaved images by @xiangxu-google in #9095
- [Bugfix] Clean up some cruft in mamba.py by @tlrmchlsmth in #9343
- [Frontend] Clarify model_type error messages by @stevegrubb in #9345
- [Doc] Fix code formatting in spec_decode.rst by @mgoin in #9348
- [Bugfix] Update InternVL input mapper to support image embeds by @hhzhang16 in #9351
- [BugFix] Fix chat API continuous usage stats by @njhill in #9357
- pass ignore_eos parameter to all benchmark_serving calls by @gracehonv in #9349
- [Misc] Directly use compressed-tensors for checkpoint definitions by @mgoin in #8909
- [Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids by @CatherineSue in #9034
- [Bugfix][CI/Build] Fix CUDA 11.8 Build by @LucasWilkinson in #9386
- [Bugfix] Molmo text-only input bug fix by @mrsalehi in #9397
- [Misc] Standardize RoPE handling for Qwen2-VL by @DarkLight1337 in #9250
- [Model] VLM2Vec, the first multimodal embedding model in vLLM by @DarkLight1337 in #9303
- [CI/Build] Test VLM embeddings by @DarkLight1337 in #9406
- [Core] Rename input data types by @DarkLight1337 in #8688
- [Misc] Consolidate example usage of OpenAI client for multimodal models by @ywang96 in #9412
- [Model] Support SDPA attention for Molmo vision backbone by @Isotr0py in #9410
- Support mistral interleaved attn by @patrickvonplaten in #9414
- [Kernel][Model] Improve continuous batching for Jamba and Mamba by @mzusman in #9189
- [Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft by @streaver91 in #9396
- [Performance][Spec Decode] Optimize ngram lookup performance by @LiuXiaoxuanPKU in #9333
- [CI/Build] mypy: Resolve some errors from checking vllm/engine by @russellb in #9267
- [Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel by @tlrmchlsmth in #9425
- [BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels by @rasmith in #9391
- Add notes on the use of Slack by @terrytangyuan in #9442
- [Kernel] Add Exllama as a backend for compressed-tensors by @LucasWilkinson in #9395
- [Misc] Print stack trace using `logger.exception` by @DarkLight1337 in #9461
- [misc] CUDA Time Layerwise Profiler by @LucasWilkinson in #8337
- [Bugfix] Allow prefill of assistant response when using `mistral_common` by @sasha0552 in #9446
- [TPU] Call torch._sync(param) during weight loading by @WoosukKwon in #9437
- [Hardware][CPU] compressed-tensor INT8 W8A8 AZP support by @bigPYJ1151 in #9344
- [Core] Deprecating block manager v1 and make block manager v2 default by @KuntaiDu in #8704
- [CI/Build] remove .github from .dockerignore, add dirty repo check by @dtrifiro in #9375
- [Misc] Remove commit id file by @DarkLight1337 in #9470
- [torch.compile] Fine-grained CustomOp enabling mechanism by @ProExpertProg in #9300
- [Bugfix] Fix support for dimension like integers and ScalarType by @bnellnm in #9299
- [Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script by @wukaixingxp in #9013
- [Bugfix] Print warnings related to `mistral_common` tokenizer only once by @sasha0552 in #9468
- [Hardware][Neuron] Simplify model load for transformers-neuronx library by @sssrijan-amazon in #9380
- Support `BERTModel` (first `encoder-only` embedding model) by @robertgshaw2-neuralmagic in #9056
- [BugFix] Stop silent failures on compressed-tensors parsing by @dsikka in #9381
- [Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage by @joerunde in #9352
- [Qwen2.5] Support bnb quant for Qwen2.5 by @blueyo0 in #9467
- [CI/Build] Use commit hash references for github actions by @russellb in #9430
- [BugFix] Typing fixes to RequestOutput.prompt and beam search by @njhill in #9473
- [Frontend][Feature] Add jamba tool parser by @tomeras91 in #9154
- [BugFix] Fix and simplify completion API usage streaming by @njhill in #9475
- [CI/Build] Fix lint errors in mistral tokenizer by @DarkLight1337 in #9504
- [Bugfix] Fix offline_inference_with_prefix.py by @tlrmchlsmth in #9505
- [Misc] benchmark: Add option to set max concurrency by @russellb in #9390
- [Model] Add user-configurable task for models that support both generation and embedding by @DarkLight1337 in #9424
- [CI/Build] Add error matching config for mypy by @russellb in #9512
- [Model] Support Pixtral models in the HF Transformers format by @mgoin in #9036
- [MISC] Add lora requests to metrics by @coolkp in #9477
- [MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py by @comaniac in #9510
- [Kernel] Add env variable to force flashinfer backend to enable tensor cores by @tdoublep in #9497
- [Bugfix] Fix offline mode when using `mistral_common` by @sasha0552 in #9457
- 🐛 fix torch memory profiling by @joerunde in #9516
- [Frontend] Avoid creating guided decoding LogitsProcessor unnecessarily by @njhill in #9521
- [Doc] update gpu-memory-utilization flag docs by @joerunde in #9507
- [CI/Build] Add error matching for ruff output by @russellb in #9513
- [CI/Build] Configure matcher for actionlint workflow by @russellb in #9511
- [Frontend] Support simpler image input format by @yue-anyscale in #9478
- [Bugfix] Fix missing task for speculative decoding by @DarkLight1337 in #9524
- [Model][Pixtral] Optimizations for input_processor_for_pixtral_hf by @mgoin in #9514
- [Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger by @heheda12345 in #9530
- [Model][Pixtral] Use memory_efficient_attention for PixtralHFVision by @mgoin in #9520
- [Kernel] Support sliding window in flash attention backend by @heheda12345 in #9403
- [Frontend][Misc] Goodput metric support by @Imss27 in #9338
- [CI/Build] Split up decoder-only LM tests by @DarkLight1337 in #9488
- [Doc] Consistent naming of attention backends by @tdoublep in #9498
- [Model] FalconMamba Support by @dhiaEddineRhaiem in #9325
- [Bugfix][Misc]: fix graph capture for decoder by @yudian0504 in #9549
- [BugFix] Use correct python3 binary in Docker.ppc64le entrypoint by @varad-ahirwadkar in #9492
- [Model][Bugfix] Fix batching with multi-image in PixtralHF by @mgoin in #9518
- [Frontend] Reduce frequency of client cancellation checking by @njhill in #7959
- [doc] fix format by @youkaichao in #9562
- [BugFix] Update draft model TP size check to allow matching target TP size by @njhill in #9394
- [Frontend] Don't log duplicate error stacktrace for every request in the batch by @wallashss in #9023
- [CI] Make format checker error message more user-friendly by using emoji by @KuntaiDu in #9564
- 🐛 Fixup more test failures from memory profiling by @joerunde in #9563
- [core] move parallel sampling out from vllm core by @youkaichao in #9302
- [Bugfix]: serialize config instances by value when using --trust-remote-code by @tjohnson31415 in #6751
- [CI/Build] Remove unnecessary `fork_new_process` by @DarkLight1337 in #9484
- [Bugfix][OpenVINO] fix_dockerfile_openvino by @ngrozae in #9552
- [Bugfix]: phi.py get rope_theta from config file by @Falko1 in #9503
- [CI/Build] Replaced some models on tests for smaller ones by @wallashss in #9570
- [Core] Remove evictor_v1 by @KuntaiDu in #9572
- [Doc] Use shell code-blocks and fix section headers by @rafvasq in #9508
- support TP in qwen2 bnb by @chenqianfzh in #9574
- [Hardware][CPU] using current_platform.is_cpu by @wangshuai09 in #9536
- [V1] Implement vLLM V1 [1/N] by @WoosukKwon in #9289
- [CI/Build][LoRA] Temporarily fix long context failure issue by @jeejeelee in #9579
- [Neuron] [Bugfix] Fix neuron startup by @xendo in #9374
- [Model][VLM] Initialize support for Mono-InternVL model by @Isotr0py in #9528
- [Bugfix] Eagle: change config name for fc bias by @gopalsarda in #9580
- [Hardware][Intel CPU][DOC] Update docs for CPU backend by @zhouyuan in #6212
- [Frontend] Support custom request_id from request by @guoyuhong in #9550
- [BugFix] Prevent exporting duplicate OpenTelemetry spans by @ronensc in #9017
- [torch.compile] auto infer dynamic_arg_dims from type annotation by @youkaichao in #9589
- [Bugfix] fix detokenizer shallow copy by @aurickq in #5919
- [Misc] Make benchmarks use EngineArgs by @JArnoldAMD in #9529
- [Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing by @LucasWilkinson in #9487
- [BugFix] Fix metrics error for --num-scheduler-steps > 1 by @yuleil in #8234
- [Doc]: Update tensorizer docs to include vllm[tensorizer] by @sethkimmel3 in #7889
- [Bugfix] Generate exactly input_len tokens in benchmark_throughput by @heheda12345 in #9592
- [Misc] Add an env var VLLM_LOGGING_PREFIX; if set, it will be prepended to all logging messages by @sfc-gh-zhwang in #9590
- [Model] Support E5-V by @DarkLight1337 in #9576
- [Build] Fix `FetchContent` multiple build issue by @ProExpertProg in #9596
- [Hardware][XPU] using current_platform.is_xpu by @MengqingCao in #9605
- [Model] Initialize Florence-2 language backbone support by @Isotr0py in #9555
- [VLM] Post-layernorm override and quant config in vision encoder by @DarkLight1337 in #9217
- [Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs by @alex-jw-brooks in #9612
- [Bugfix] Fix `_init_vision_model` in NVLM_D model by @DarkLight1337 in #9611
- [misc] comment to avoid future confusion about baichuan by @youkaichao in #9620
- [Bugfix] Fix divide by zero when serving Mamba models by @tlrmchlsmth in #9617
- [Misc] Separate total and output tokens in benchmark_throughput.py by @mgoin in #8914
- [torch.compile] Adding torch compile annotations to some models by @CRZbulabula in #9614
- [Frontend] Enable Online Multi-image Support for MLlama by @alex-jw-brooks in #9393
- [Model] Add Qwen2-Audio model support by @faychu in #9248
- [CI/Build] Add bot to close stale issues and PRs by @russellb in #9436
- [Bugfix][Model] Fix Mllama SDPA illegal memory access for batched multi-image by @mgoin in #9626
- [Bugfix] Use "vision_model" prefix for MllamaVisionModel by @mgoin in #9628
- [Bugfix]: Make chat content text allow type content by @vrdn-23 in #9358
- [XPU] avoid triton import for xpu by @yma11 in #9440
- [Bugfix] Fix PP for ChatGLM and Molmo, and weight loading for Qwen2.5-Math-RM by @DarkLight1337 in #9422
- [V1][Bugfix] Clean up requests when aborted by @WoosukKwon in #9629
- [core] simplify seq group code by @youkaichao in #9569
- [torch.compile] Adding torch compile annotations to some models by @CRZbulabula in #9639
- [Kernel] add kernel for FATReLU by @jeejeelee in #9610
- [torch.compile] expanding support and fix allgather compilation by @CRZbulabula in #9637
- [Doc] Move additional tips/notes to the top by @DarkLight1337 in #9647
- [Bugfix]Disable the post_norm layer of the vision encoder for LLaVA models by @litianjian in #9653
- Increase operation per run limit for "Close inactive issues and PRs" workflow by @hmellor in #9661
- [torch.compile] Adding torch compile annotations to some models by @CRZbulabula in #9641
- [CI/Build] Fix VLM test failures when using transformers v4.46 by @DarkLight1337 in #9666
- [Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints by @alex-jw-brooks in #9650
- [Log][Bugfix] Fix default value check for `image_url.detail` by @mgoin in #9663
- [Performance][Kernel] Fused_moe Performance Improvement by @charlifu in #9384
- [Bugfix] Remove xformers requirement for Pixtral by @mgoin in #9597
- [ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test #9675 by @khluu in #9676
- [Model] add a lora module for granite 3.0 MoE models by @willmj in #9673
- [V1] Support sliding window attention by @WoosukKwon in #9679
- [Bugfix] Fix compressed_tensors_moe bad config.strategy by @mgoin in #9677
- [Doc] Improve quickstart documentation by @rafvasq in #9256
- [Bugfix] Fix crash with llama 3.2 vision models and guided decoding by @tjohnson31415 in #9631
- [Bugfix] Streaming continuous_usage_stats default to False by @samos123 in #9709
- [Hardware][openvino] is_openvino --> current_platform.is_openvino by @MengqingCao in #9716
- Fix: MI100 Support By Bypassing Custom Paged Attention by @MErkinSag in #9560
- [Frontend] Bad words sampling parameter by @Alvant in #9717
- [Model] Add classification Task with Qwen2ForSequenceClassification by @kakao-kevin-us in #9704
- [Misc] SpecDecodeWorker supports profiling by @Abatom in #9719
- [core] cudagraph output with tensor weak reference by @youkaichao in #9724
- [Misc] Upgrade to pytorch 2.5 by @bnellnm in #9588
- Fix cache management in "Close inactive issues and PRs" actions workflow by @hmellor in #9734
- [Bugfix] Fix load config when using bools by @madt2709 in #9533
- [Hardware][ROCM] using current_platform.is_rocm by @wangshuai09 in #9642
- [torch.compile] support moe models by @youkaichao in #9632
- Fix beam search eos by @robertgshaw2-neuralmagic in #9627
- [Bugfix] Fix ray instance detect issue by @yma11 in #9439
- [CI/Build] Adopt Mergify for auto-labeling PRs by @russellb in #9259
- [Model][VLM] Add multi-video support for LLaVA-Onevision by @litianjian in #8905
- [torch.compile] Adding "torch compile" annotations to some models by @CRZbulabula in #9758
- [misc] avoid circular import by @youkaichao in #9765
- [torch.compile] add deepseek v2 compile by @youkaichao in #9775
- [Doc] fix third-party model example by @russellb in #9771
- [Model][LoRA]LoRA support added for Qwen by @jeejeelee in #9622
- [Doc] Specify async engine args in docs by @DarkLight1337 in #9726
- [Bugfix] Use temporary directory in registry by @DarkLight1337 in #9721
- [Frontend] re-enable multi-modality input in the new beam search implementation by @FerdinandZhong in #9427
- [Model] Add BNB quantization support for Mllama by @Isotr0py in #9720
- [Hardware] using current_platform.seed_everything by @wangshuai09 in #9785
- [Misc] Add metrics for request queue time, forward time, and execute time by @Abatom in #9659
- Fix the log to correct guide user to install modelscope by @tastelikefeet in #9793
- [Bugfix] Use host argument to bind to interface by @svenseeberg in #9798
- [Misc]: Typo fix: Renaming classes (casualLM -> causalLM) by @yannicks1 in #9801
- [Model] Add LlamaEmbeddingModel as an embedding Implementation of LlamaModel by @jsato8094 in #9806
- [CI][Bugfix] Skip chameleon for transformers 4.46.1 by @mgoin in #9808
- [CI/Build] mergify: fix rules for ci/build label by @russellb in #9804
- [MISC] Set label value to timestamp over 0, to keep track of recent history by @coolkp in #9777
- [Bugfix][Frontend] Guard against bad token ids by @joerunde in #9634
- [Model] tool calling support for ibm-granite/granite-20b-functioncalling by @wseaton in #8339
- [Docs] Add notes about Snowflake Meetup by @simon-mo in #9814
- [Bugfix] Fix prefix strings for quantized VLMs by @mgoin in #9772
- [core][distributed] fix custom allreduce in pytorch 2.5 by @youkaichao in #9815
- Update README.md by @LiuXiaoxuanPKU in #9819
- [Bugfix][VLM] Make apply_fp8_linear work with >2D input by @mgoin in #9812
- [ci/build] Pin CI dependencies version with pip-compile by @khluu in #9810
- [Bugfix] Fix multi nodes TP+PP for XPU by @yma11 in #8884
- [Doc] Add the DCO to CONTRIBUTING.md by @russellb in #9803
- [torch.compile] rework compile control with piecewise cudagraph by @youkaichao in #9715
- [Misc] Specify minimum pynvml version by @jeejeelee in #9827
- [TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA by @WoosukKwon in #9438
- [CI/Build] VLM Test Consolidation by @alex-jw-brooks in #9372
- [Model] Support math-shepherd-mistral-7b-prm model by @Went-Liang in #9697
- [Misc] Add chunked-prefill support on FlashInfer. by @elfiegg in #9781
- [Bugfix][core] replace heartbeat with pid check by @joerunde in #9818
- [Doc] link bug for multistep guided decoding by @joerunde in #9843
- [Neuron] Update Dockerfile.neuron to fix build failure by @hbikki in #9822
- [doc] update pp support by @youkaichao in #9853
- [CI/Build] Simplify exception trace in api server tests by @CRZbulabula in #9787
- [torch.compile] upgrade tests by @youkaichao in #9858
- [Misc][OpenAI] deprecate max_tokens in favor of new max_completion_tokens field for chat completion endpoint by @gcalmettes in #9837
- Revert "[Bugfix] Use host argument to bind to interface (#9798)" by @khluu in #9852
- [Model] Support quantization of Qwen2VisionTransformer for Qwen2-VL by @mgoin in #9817
- [Misc] Remove deprecated arg for cuda graph capture by @ywang96 in #9864
- [Doc] Update Qwen documentation by @jeejeelee in #9869
- [CI/Build] Add Model Tests for Qwen2-VL by @alex-jw-brooks in #9846
- [CI/Build] Adding a forced docker system prune to clean up space by @Alexei-V-Ivanov-AMD in #9849
- [Bugfix] Fix `illegal memory access` error with chunked prefill, prefix caching, block manager v2 and xformers enabled together by @sasha0552 in #9532
- [BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 by @mzusman in #9838
- [ci/build] Configure dependabot to update pip dependencies by @khluu in #9811
- [Bugfix][Frontend] Reject guided decoding in multistep mode by @joerunde in #9892
- [torch.compile] directly register custom op by @youkaichao in #9896
- [Bugfix] Fix layer skip logic with bitsandbytes by @mgoin in #9887
- [torch.compile] rework test plans by @youkaichao in #9866
- [Model] Support bitsandbytes for MiniCPMV by @mgoin in #9891
- [torch.compile] Adding torch compile annotations to some models by @CRZbulabula in #9876
- [Doc] Update multi-input support by @DarkLight1337 in #9906
- [Frontend] Chat-based Embeddings API by @DarkLight1337 in #9759
- [CI/Build] Add Model Tests for PixtralHF by @mgoin in #9813
- [Frontend] Use a proper chat template for VLM2Vec by @DarkLight1337 in #9912
- [Bugfix] Fix edge cases for MistralTokenizer by @tjohnson31415 in #9625
- [Core] Refactor: Clean up unused argument preemption_mode in Scheduler._preempt by @andrejonasson in #9696
- [torch.compile] use interpreter with stable api from pytorch by @youkaichao in #9889
- [Bugfix/Core] Remove assertion for Flashinfer k_scale and v_scale by @pavanimajety in #9861
- [1/N] pass the complete config from engine to executor by @youkaichao in #9933
- [Bugfix] PicklingError on RayTaskError by @GeneDer in #9934
- Bump the patch-update group with 10 updates by @dependabot in #9897
- [Core][VLM] Add precise multi-modal placeholder tracking by @petersalas in #8346
- [ci/build] Have dependabot ignore pinned dependencies by @khluu in #9935
- [Encoder Decoder] Add flash_attn kernel support for encoder-decoder models by @sroy745 in #9559
- [torch.compile] fix cpu broken code by @youkaichao in #9947
- [Docs] Update Granite 3.0 models in supported models table by @njhill in #9930
- [Doc] Updated tpu-installation.rst with more details by @mikegre-google in #9926
- [2/N] executor pass the complete config to worker/modelrunner by @youkaichao in #9938
- [V1] Fix `EngineArgs` refactor on V1 by @robertgshaw2-neuralmagic in #9954
- [bugfix] fix chatglm dummy_data_for_glmv by @youkaichao in #9955
- [3/N] model runner pass the whole config to model by @youkaichao in #9958
- [CI/Build] Quoting around > by @nokados in #9956
- [torch.compile] Adding torch compile annotations to vision-language models by @CRZbulabula in #9946
- [bugfix] fix tests by @youkaichao in #9959
- [V1] Support per-request seed by @njhill in #9945
- [Model] Add support for H2OVL-Mississippi models by @cooleel in #9747
- [V1] Fix Configs by @robertgshaw2-neuralmagic in #9971
- [Bugfix] Fix MiniCPMV and Mllama BNB bug by @jeejeelee in #9917
- [Bugfix]Using the correct type hints by @gshtras in #9885
- [Misc] Compute query_start_loc/seq_start_loc on CPU by @zhengy001 in #9447
- [Bugfix] Fix E2EL mean and median stats by @daitran2k1 in #9984
- [Bugfix][OpenVINO] Fix circular reference #9939 by @MengqingCao in #9974
- [Frontend] Multi-Modality Support for Loading Local Image Files by @chaunceyjiang in #9915
- [4/N] make quant config first-class citizen by @youkaichao in #9978
- [Misc]Reduce BNB static variable by @jeejeelee in #9987
- [Model] factoring out MambaMixer out of Jamba by @mzusman in #8993
- [CI] Basic Integration Test For TPU by @robertgshaw2-neuralmagic in #9968
- [Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests running parallel jobs by @hissu-hyvarinen in #9279
- [Doc] Update VLM doc about loading from local files by @ywang96 in #9999
- [Bugfix] Fix `MQLLMEngine` hanging by @robertgshaw2-neuralmagic in #9973
- [Misc] Refactor benchmark_throughput.py by @lk-chen in #9779
- [Frontend] Add max_tokens prometheus metric by @tomeras91 in #9881
- [Bugfix] Upgrade to pytorch 2.5.1 by @bnellnm in #10001
- [4.5/N] bugfix for quant config in speculative decode by @youkaichao in #10007
- [Bugfix] Respect modules_to_not_convert within awq_marlin by @mgoin in #9895
- [Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep by @tlrmchlsmth in #9994
- [Core] Make encoder-decoder inputs a nested structure to be more composable by @DarkLight1337 in #9604
- [Bugfix] Fixup Mamba by @tlrmchlsmth in #10004
- [BugFix] Lazy import ray by @GeneDer in #10021
- [Misc] vllm CLI flags should be ordered for better user readability by @chaunceyjiang in #10017
- [Frontend] Fix tcp port reservation for api server by @russellb in #10012
- Refactor TPU requirements file and pin build dependencies by @richardsliu in #10010
- [Misc] Add logging for CUDA memory by @yangalan123 in #10027
- [CI/Build] Limit github CI jobs based on files changed by @russellb in #9928
- [Model] Support quantization of PixtralHFTransformer for PixtralHF by @mgoin in #9921
- [Feature] Update benchmark_throughput.py to support image input by @lk-chen in #9851
- [Misc] Modify BNB parameter name by @jeejeelee in #9997
- [CI] Prune tests/models/decoder_only/language/* tests by @mgoin in #9940
- [CI] Prune back the number of tests in tests/kernels/* by @mgoin in #9932
- [bugfix] fix weak ref in piecewise cudagraph and tractable test by @youkaichao in #10048
- [Bugfix] Properly propagate trust_remote_code settings by @zifeitong in #10047
- [Bugfix] Fix pickle of input when async output processing is on by @wallashss in #9931
- [Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode by @llsj14 in #9730
- [v1] reduce graph capture time for piecewise cudagraph by @youkaichao in #10059
- [Misc] Sort the list of embedding models by @DarkLight1337 in #10037
- [Model][OpenVINO] Fix regressions from #8346 by @petersalas in #10045
- [Bugfix] Fix edge-case crash when using chat with the Mistral Tekken Tokenizer by @tjohnson31415 in #10051
- [Bugfix] Gpt-j-6B patch kv_scale to k_scale path by @arakowsk-amd in #10063
- [Bugfix] Remove CustomChatCompletionContentPartParam multimodal input type by @zifeitong in #10054
- [V1] Integrate Piecewise CUDA graphs by @WoosukKwon in #10058
- [distributed] add function to create ipc buffers directly by @youkaichao in #10064
- [CI/Build] drop support for Python 3.8 EOL by @aarnphm in #8464
- [CI/Build] Fix large_gpu_mark reason by @Isotr0py in #10070
- [Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend by @kzawora-intel in #6143
- [Hotfix] Fix ruff errors by @WoosukKwon in #10073
- [Model][LoRA]LoRA support added for LlamaEmbeddingModel by @jeejeelee in #10071
- [Model] Add Idefics3 support by @jeejeelee in #9767
- [Model][LoRA]LoRA support added for Qwen2VLForConditionalGeneration by @ericperfect in #10022
- Remove ScaledActivation for AWQ by @mgoin in #10057
- [CI/Build] Drop Python 3.8 support by @russellb in #10038
- [CI/Build] change conflict PR comment from mergify by @russellb in #10080
- [V1] Make v1 more testable by @joerunde in #9888
- [CI/Build] Always run the ruff workflow by @russellb in #10092
- [core][distributed] add stateless_init_process_group by @youkaichao in #10072
- [Bugfix] Fix FP8 torch._scaled_mm fallback for torch>2.5 with CUDA<12.4 by @mgoin in #10095
- [Misc][XPU] Upgrade to Pytorch 2.5 for xpu backend by @yma11 in #9823
- [Frontend] Adjust try/except blocks in API impl by @njhill in #10056
- [Hardware][CPU] Update torch 2.5 by @bigPYJ1151 in #9911
- [doc] add back Python 3.8 ABI by @youkaichao in #10100
- [V1][BugFix] Fix Generator construction in greedy + seed case by @njhill in #10097
- [Misc] Consolidate ModelConfig code related to HF config by @DarkLight1337 in #10104
- [CI/Build] re-add codespell to CI by @russellb in #10083
- [Doc] Improve benchmark documentation by @rafvasq in #9927
- [Core][Distributed] Refactor ipc buffer init in CustomAllreduce by @hanzhi713 in #10030
- [CI/Build] Improve mypy + python version matrix by @russellb in #10041
- Adds method to read the pooling types from model's files by @flaviabeo in #9506
- [Frontend] Fix multiple values for keyword argument error (#10075) by @DIYer22 in #10076
- [Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target by @bigPYJ1151 in #10108
- [Bugfix] Make image processor respect `mm_processor_kwargs` for Qwen2-VL by @li-plus in #10112
- [Misc] Add Gamma-Distribution Request Generation Support for Serving Benchmark. by @spliii in #10105
- [Frontend] Tool calling parser for Granite 3.0 models by @maxdebayser in #9027
- [Feature] [Spec decode]: Combine chunked prefill with speculative decoding by @NickLucche in #9291
- [CI/Build] Always run mypy by @russellb in #10122
- [CI/Build] Add shell script linting using shellcheck by @russellb in #7925
- [CI/Build] Automate PR body text cleanup by @russellb in #10082
- Bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #9745
- Online video support for VLMs by @litianjian in #10020
- Bump actions/checkout from 4.2.1 to 4.2.2 by @dependabot in #9746
- [Misc] Add environment variables collection in collect_env.py tool by @ycool in #9293
- [V1] Add all_token_ids attribute to Request by @WoosukKwon in #10135
- [V1] Prefix caching (take 2) by @comaniac in #9972
- [CI/Build] Give PR cleanup job PR write access by @russellb in #10139
- [Doc] Update FAQ links in spec_decode.rst by @whyiug in #9662
- [Bugfix] Add error handling when server cannot respond any valid tokens by @DearPlanet in #5895
- [Misc] Fix ImportError causing by triton by @MengqingCao in #9493
- [Doc] Move CONTRIBUTING to docs site by @russellb in #9924
- Fixes a typo about 'max_decode_seq_len' which causes crashes with cuda graph. by @sighingnow in #9285
- Add hf_transfer to testing image by @mgoin in #10096
- [Misc] Fix typo in #5895 by @DarkLight1337 in #10145
- [Bugfix][XPU] Fix xpu tp by introducing XpuCommunicator by @yma11 in #10144
- [Model] Expose size to Idefics3 as mm_processor_kwargs by @Isotr0py in #10146
- [V1]Enable APC by default only for text models by @ywang96 in #10148
- [CI/Build] Update CPU tests to include all "standard" tests by @DarkLight1337 in #5481
- Fix edge case Mistral tokenizer by @patrickvonplaten in #10152
- Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 by @sroy745 in #10136
- [Misc] Improve Web UI by @rafvasq in #10090
- [V1] Fix non-cudagraph op name by @WoosukKwon in #10166
- [CI/Build] Ignore .gitignored files for shellcheck by @ProExpertProg in #10162
- Rename vllm.logging to vllm.logging_utils by @flozi00 in #10134
- [torch.compile] Fuse RMSNorm with quant by @ProExpertProg in #9138
- [Bugfix] Fix SymIntArrayRef expected to contain only concrete integers by @bnellnm in #10170
- [Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case by @rasmith in #9857
- [CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking by @bigPYJ1151 in #6892
- [0/N] Rename `MultiModalInputs` to `MultiModalKwargs` by @DarkLight1337 in #10040
- [Bugfix] Ignore GPTQ quantization of Qwen2-VL visual module by @mgoin in #10169
- [CI/Build] Fix VLM broadcast tests `tensor_parallel_size` passing by @Isotr0py in #10161
- [Doc] Adjust RunLLM location by @DarkLight1337 in #10176
- [5/N] pass the whole config to model by @youkaichao in #9983
- [CI/Build] Add run-hpu-test.sh script by @xuechendi in #10167
- [Bugfix] Enable some fp8 and quantized fullgraph tests by @bnellnm in #10171
- [bugfix] fix broken tests of mlp speculator by @youkaichao in #10177
- [doc] explaining the integration with huggingface by @youkaichao in #10173
- bugfix: fix the bug that stream generate not work by @caijizhuo in #2756
- [Frontend] add `add_request_id` middleware by @cjackal in #9594
- [Frontend][Core] Override HF `config.json` via CLI by @KrishnaM251 in #5836
- [CI/Build] Split up models tests by @DarkLight1337 in #10069
- [ci][build] limit cmake version by @youkaichao in #10188
- [Doc] Fix typo error in CONTRIBUTING.md by @FuryMartin in #10190
- [doc] Polish the integration with huggingface doc by @CRZbulabula in #10195
- [Misc] small fixes to function tracing file path by @ShawnD200 in #9543
- [misc] improve cloudpickle registration and tests by @youkaichao in #10202
- [Doc] Fix typo error in vllm/entrypoints/openai/cli_args.py by @yansh97 in #10196
- [doc] improve debugging code by @youkaichao in #10206
- [6/N] pass whole config to inner model by @youkaichao in #10205
- Bump the patch-update group with 5 updates by @dependabot in #10210
- [Hardware][CPU] Add embedding models support for CPU backend by @Isotr0py in #10193
- [LoRA][Kernel] Remove the unused libentry module by @jeejeelee in #10214
- [V1] Allow `tokenizer_mode` and `trust_remote_code` for Detokenizer by @ywang96 in #10211
- [Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner by @Isotr0py in #10218
- [Metrics] add more metrics by @HarryWu99 in #4464
- [Doc] fix doc string typo in block_manager `swap_out` function by @yyccli in #10212
- [core][distributed] add stateless process group by @youkaichao in #10216
- Bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #10209
- [V1] Fix detokenizer ports by @WoosukKwon in #10224
- [V1] Do not use inductor for piecewise CUDA graphs by @WoosukKwon in #10225
- [v1][torch.compile] support managing cudagraph buffer by @youkaichao in #10203
- [V1] Use custom ops for piecewise CUDA graphs by @WoosukKwon in #10227
- Add docs on serving with Llama Stack by @terrytangyuan in #10183
- [misc][distributed] auto port selection and disable tests by @youkaichao in #10226
- [V1] Enable custom ops with piecewise CUDA graphs by @WoosukKwon in #10228
- Make shutil rename in python_only_dev by @shcheglovnd in #10233
- [V1] `AsyncLLM` Implementation by @robertgshaw2-neuralmagic in #9826
- [doc] update debugging guide by @youkaichao in #10236
- [Doc] Update help text for `--distributed-executor-backend` by @russellb in #10231
- [1/N] torch.compile user interface design by @youkaichao in #10237
- [Misc][LoRA] Replace hardcoded cuda device with configurable argument by @jeejeelee in #10223
- Splitting attention kernel file by @maleksan85 in #10091
- [doc] explain the class hierarchy in vLLM by @youkaichao in #10240
- [CI][CPU]refactor CPU tests to allow to bind with different cores by @zhouyuan in #10222
- [BugFix] Do not raise a `ValueError` when `tool_choice` is set to the supported `none` option and `tools` are not defined. by @gcalmettes in #10000
- [Misc]Fix Idefics3Model argument by @jeejeelee in #10255
- [Bugfix] Fix QwenModel argument by @DamonFool in #10262
- [Frontend] Add per-request number of cached token stats by @zifeitong in #10174
- [V1] Use pickle for serializing EngineCoreRequest & Add multimodal inputs to EngineCoreRequest by @WoosukKwon in #10245
- [Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers by @sroy745 in #9982
- [LoRA] Adds support for bias in LoRA by @followumesh in #5733
- [V1] Enable Inductor when using piecewise CUDA graphs by @WoosukKwon in #10268
- [doc] fix location of runllm widget by @youkaichao in #10266
- [doc] improve debugging doc by @youkaichao in #10270
- Revert "[ci][build] limit cmake version" by @youkaichao in #10271
- [V1] Fix CI tests on V1 engine by @WoosukKwon in #10272
- [core][distributed] use tcp store directly by @youkaichao in #10275
- [V1] Support VLMs with fine-grained scheduling by @WoosukKwon in #9871
- Bump to compressed-tensors v0.8.0 by @dsikka in #10279
- [Doc] Fix typo in arg_utils.py by @xyang16 in #10264
- [Model] Add support for Qwen2-VL video embeddings input & multiple image embeddings input with varied resolutions by @imkero in #10221
- [Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 by @FurtherAI in #9944
- [Core] Flashinfer - Remove advance step size restriction by @pavanimajety in #10282
- [Model][LoRA]LoRA support added for idefics3 by @B-201 in #10281
- [V1] Add missing tokenizer options for `Detokenizer` by @ywang96 in #10288
- [1/N] Initial prototype for multi-modal processor by @DarkLight1337 in #10044
- [Bugfix] bitsandbytes models fail to run pipeline parallel by @HoangCongDuc in #10200
- [Bugfix] Fix tensor parallel for qwen2 classification model by @Isotr0py in #10297
- [misc] error early for old-style class by @youkaichao in #10304
- [Misc] format.sh: Simplify tool_version_check by @russellb in #10305
- [Frontend] Pythonic tool parser by @mdepinet in #9859
- [BugFix]: properly deserialize `tool_calls` iterator before processing by mistral-common when MistralTokenizer is used by @gcalmettes in #9951
- [Model] Add BNB quantization support for Idefics3 by @B-201 in #10310
- [ci][distributed] disable hanging tests by @youkaichao in #10317
- [CI/Build] Fix CPU CI online inference timeout by @Isotr0py in #10314
- [CI/Build] Make shellcheck happy by @DarkLight1337 in #10285
- [Docs] Publish meetup slides by @WoosukKwon in #10331
- Support Roberta embedding models by @maxdebayser in #9387
- [Perf] Reduce peak memory usage of llama by @andoorve in #10339
- [Bugfix] use AF_INET6 instead of AF_INET for OpenAI Compatible Server by @jxpxxzj in #9583
- [Tool parsing] Improve / correct mistral tool parsing by @patrickvonplaten in #10333
- [Bugfix] Fix unable to load some models by @DarkLight1337 in #10312
- [bugfix] Fix static asymmetric quantization case by @ProExpertProg in #10334
- [Misc] Change RedundantReshapesPass and FusionPass logging from info to debug by @tlrmchlsmth in #10308
- [Model] Support Qwen2 embeddings and use tags to select model tests by @DarkLight1337 in #10184
- [Bugfix] Qwen-vl output is inconsistent in speculative decoding by @skylee-01 in #10350
- [Misc] Consolidate pooler config overrides by @DarkLight1337 in #10351
- [Build] skip renaming files for release wheels pipeline by @simon-mo in #9671
New Contributors
- @gracehonv made their first contribution in #9349
- @streaver91 made their first contribution in #9396
- @wukaixingxp made their first contribution in #9013
- @sssrijan-amazon made their first contribution in #9380
- @coolkp made their first contribution in #9477
- @yue-anyscale made their first contribution in #9478
- @dhiaEddineRhaiem made their first contribution in #9325
- @yudian0504 made their first contribution in #9549
- @ngrozae made their first contribution in #9552
- @Falko1 made their first contribution in #9503
- @wangshuai09 made their first contribution in #9536
- @gopalsarda made their first contribution in #9580
- @guoyuhong made their first contribution in #9550
- @JArnoldAMD made their first contribution in #9529
- @yuleil made their first contribution in #8234
- @sethkimmel3 made their first contribution in #7889
- @MengqingCao made their first contribution in #9605
- @CRZbulabula made their first contribution in #9614
- @faychu made their first contribution in #9248
- @vrdn-23 made their first contribution in #9358
- @willmj made their first contribution in #9673
- @samos123 made their first contribution in #9709
- @MErkinSag made their first contribution in #9560
- @Alvant made their first contribution in #9717
- @kakao-kevin-us made their first contribution in #9704
- @madt2709 made their first contribution in #9533
- @FerdinandZhong made their first contribution in #9427
- @svenseeberg made their first contribution in #9798
- @yannicks1 made their first contribution in #9801
- @wseaton made their first contribution in #8339
- @Went-Liang made their first contribution in #9697
- @andrejonasson made their first contribution in #9696
- @GeneDer made their first contribution in #9934
- @mikegre-google made their first contribution in #9926
- @nokados made their first contribution in #9956
- @cooleel made their first contribution in #9747
- @zhengy001 made their first contribution in #9447
- @daitran2k1 made their first contribution in #9984
- @chaunceyjiang made their first contribution in #9915
- @hissu-hyvarinen made their first contribution in #9279
- @lk-chen made their first contribution in #9779
- @yangalan123 made their first contribution in #10027
- @llsj14 made their first contribution in #9730
- @arakowsk-amd made their first contribution in #10063
- @kzawora-intel made their first contribution in #6143
- @DIYer22 made their first contribution in #10076
- @li-plus made their first contribution in #10112
- @spliii made their first contribution in #10105
- @flozi00 made their first contribution in #10134
- @xuechendi made their first contribution in #10167
- @caijizhuo made their first contribution in #2756
- @cjackal made their first contribution in #9594
- @KrishnaM251 made their first contribution in #5836
- @FuryMartin made their first contribution in #10190
- @ShawnD200 made their first contribution in #9543
- @yansh97 made their first contribution in #10196
- @yyccli made their first contribution in #10212
- @shcheglovnd made their first contribution in #10233
- @maleksan85 made their first contribution in #10091
- @followumesh made their first contribution in #5733
- @imkero made their first contribution in #10221
- @B-201 made their first contribution in #10281
- @HoangCongDuc made their first contribution in #10200
- @mdepinet made their first contribution in #9859
- @jxpxxzj made their first contribution in #9583
- @skylee-01 made their first contribution in #10350
Full Changelog: v0.6.3...v0.6.4