vllm-project/vllm v0.6.4

Pre-release · 8 hours ago

What's Changed

  • [TPU] Fix TPU SMEM OOM by Pallas paged attention kernel by @WoosukKwon in #9350
  • [Frontend] merge beam search implementations by @LunrEclipse in #9296
  • [Model] Make llama3.2 support multiple and interleaved images by @xiangxu-google in #9095
  • [Bugfix] Clean up some cruft in mamba.py by @tlrmchlsmth in #9343
  • [Frontend] Clarify model_type error messages by @stevegrubb in #9345
  • [Doc] Fix code formatting in spec_decode.rst by @mgoin in #9348
  • [Bugfix] Update InternVL input mapper to support image embeds by @hhzhang16 in #9351
  • [BugFix] Fix chat API continuous usage stats by @njhill in #9357
  • pass ignore_eos parameter to all benchmark_serving calls by @gracehonv in #9349
  • [Misc] Directly use compressed-tensors for checkpoint definitions by @mgoin in #8909
  • [Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids by @CatherineSue in #9034
  • [Bugfix][CI/Build] Fix CUDA 11.8 Build by @LucasWilkinson in #9386
  • [Bugfix] Molmo text-only input bug fix by @mrsalehi in #9397
  • [Misc] Standardize RoPE handling for Qwen2-VL by @DarkLight1337 in #9250
  • [Model] VLM2Vec, the first multimodal embedding model in vLLM by @DarkLight1337 in #9303
  • [CI/Build] Test VLM embeddings by @DarkLight1337 in #9406
  • [Core] Rename input data types by @DarkLight1337 in #8688
  • [Misc] Consolidate example usage of OpenAI client for multimodal models by @ywang96 in #9412
  • [Model] Support SDPA attention for Molmo vision backbone by @Isotr0py in #9410
  • Support mistral interleaved attn by @patrickvonplaten in #9414
  • [Kernel][Model] Improve continuous batching for Jamba and Mamba by @mzusman in #9189
  • [Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft by @streaver91 in #9396
  • [Performance][Spec Decode] Optimize ngram lookup performance by @LiuXiaoxuanPKU in #9333
  • [CI/Build] mypy: Resolve some errors from checking vllm/engine by @russellb in #9267
  • [Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel by @tlrmchlsmth in #9425
  • [BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels by @rasmith in #9391
  • Add notes on the use of Slack by @terrytangyuan in #9442
  • [Kernel] Add Exllama as a backend for compressed-tensors by @LucasWilkinson in #9395
  • [Misc] Print stack trace using logger.exception by @DarkLight1337 in #9461
  • [misc] CUDA Time Layerwise Profiler by @LucasWilkinson in #8337
  • [Bugfix] Allow prefill of assistant response when using mistral_common by @sasha0552 in #9446
  • [TPU] Call torch._sync(param) during weight loading by @WoosukKwon in #9437
  • [Hardware][CPU] compressed-tensor INT8 W8A8 AZP support by @bigPYJ1151 in #9344
  • [Core] Deprecate block manager v1 and make block manager v2 the default by @KuntaiDu in #8704
  • [CI/Build] remove .github from .dockerignore, add dirty repo check by @dtrifiro in #9375
  • [Misc] Remove commit id file by @DarkLight1337 in #9470
  • [torch.compile] Fine-grained CustomOp enabling mechanism by @ProExpertProg in #9300
  • [Bugfix] Fix support for dimension like integers and ScalarType by @bnellnm in #9299
  • [Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script by @wukaixingxp in #9013
  • [Bugfix] Print warnings related to mistral_common tokenizer only once by @sasha0552 in #9468
  • [Hardware][Neuron] Simplify model load for transformers-neuronx library by @sssrijan-amazon in #9380
  • Support BERTModel (first encoder-only embedding model) by @robertgshaw2-neuralmagic in #9056
  • [BugFix] Stop silent failures on compressed-tensors parsing by @dsikka in #9381
  • [Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage by @joerunde in #9352
  • [Qwen2.5] Support bnb quant for Qwen2.5 by @blueyo0 in #9467
  • [CI/Build] Use commit hash references for github actions by @russellb in #9430
  • [BugFix] Typing fixes to RequestOutput.prompt and beam search by @njhill in #9473
  • [Frontend][Feature] Add jamba tool parser by @tomeras91 in #9154
  • [BugFix] Fix and simplify completion API usage streaming by @njhill in #9475
  • [CI/Build] Fix lint errors in mistral tokenizer by @DarkLight1337 in #9504
  • [Bugfix] Fix offline_inference_with_prefix.py by @tlrmchlsmth in #9505
  • [Misc] benchmark: Add option to set max concurrency by @russellb in #9390
  • [Model] Add user-configurable task for models that support both generation and embedding by @DarkLight1337 in #9424 (usage sketch after this list)
  • [CI/Build] Add error matching config for mypy by @russellb in #9512
  • [Model] Support Pixtral models in the HF Transformers format by @mgoin in #9036
  • [MISC] Add lora requests to metrics by @coolkp in #9477
  • [MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py by @comaniac in #9510
  • [Kernel] Add env variable to force flashinfer backend to enable tensor cores by @tdoublep in #9497
  • [Bugfix] Fix offline mode when using mistral_common by @sasha0552 in #9457
  • 🐛 fix torch memory profiling by @joerunde in #9516
  • [Frontend] Avoid creating guided decoding LogitsProcessor unnecessarily by @njhill in #9521
  • [Doc] update gpu-memory-utilization flag docs by @joerunde in #9507
  • [CI/Build] Add error matching for ruff output by @russellb in #9513
  • [CI/Build] Configure matcher for actionlint workflow by @russellb in #9511
  • [Frontend] Support simpler image input format by @yue-anyscale in #9478
  • [Bugfix] Fix missing task for speculative decoding by @DarkLight1337 in #9524
  • [Model][Pixtral] Optimizations for input_processor_for_pixtral_hf by @mgoin in #9514
  • [Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger by @heheda12345 in #9530
  • [Model][Pixtral] Use memory_efficient_attention for PixtralHFVision by @mgoin in #9520
  • [Kernel] Support sliding window in flash attention backend by @heheda12345 in #9403
  • [Frontend][Misc] Goodput metric support by @Imss27 in #9338
  • [CI/Build] Split up decoder-only LM tests by @DarkLight1337 in #9488
  • [Doc] Consistent naming of attention backends by @tdoublep in #9498
  • [Model] FalconMamba Support by @dhiaEddineRhaiem in #9325
  • [Bugfix][Misc]: fix graph capture for decoder by @yudian0504 in #9549
  • [BugFix] Use correct python3 binary in Docker.ppc64le entrypoint by @varad-ahirwadkar in #9492
  • [Model][Bugfix] Fix batching with multi-image in PixtralHF by @mgoin in #9518
  • [Frontend] Reduce frequency of client cancellation checking by @njhill in #7959
  • [doc] fix format by @youkaichao in #9562
  • [BugFix] Update draft model TP size check to allow matching target TP size by @njhill in #9394
  • [Frontend] Don't log duplicate error stacktrace for every request in the batch by @wallashss in #9023
  • [CI] Make format checker error message more user-friendly by using emoji by @KuntaiDu in #9564
  • 🐛 Fixup more test failures from memory profiling by @joerunde in #9563
  • [core] move parallel sampling out from vllm core by @youkaichao in #9302
  • [Bugfix]: serialize config instances by value when using --trust-remote-code by @tjohnson31415 in #6751
  • [CI/Build] Remove unnecessary fork_new_process by @DarkLight1337 in #9484
  • [Bugfix][OpenVINO] fix_dockerfile_openvino by @ngrozae in #9552
  • [Bugfix]: phi.py get rope_theta from config file by @Falko1 in #9503
  • [CI/Build] Replaced some models on tests for smaller ones by @wallashss in #9570
  • [Core] Remove evictor_v1 by @KuntaiDu in #9572
  • [Doc] Use shell code-blocks and fix section headers by @rafvasq in #9508
  • support TP in qwen2 bnb by @chenqianfzh in #9574
  • [Hardware][CPU] using current_platform.is_cpu by @wangshuai09 in #9536
  • [V1] Implement vLLM V1 [1/N] by @WoosukKwon in #9289
  • [CI/Build][LoRA] Temporarily fix long context failure issue by @jeejeelee in #9579
  • [Neuron] [Bugfix] Fix neuron startup by @xendo in #9374
  • [Model][VLM] Initialize support for Mono-InternVL model by @Isotr0py in #9528
  • [Bugfix] Eagle: change config name for fc bias by @gopalsarda in #9580
  • [Hardware][Intel CPU][DOC] Update docs for CPU backend by @zhouyuan in #6212
  • [Frontend] Support custom request_id from request by @guoyuhong in #9550
  • [BugFix] Prevent exporting duplicate OpenTelemetry spans by @ronensc in #9017
  • [torch.compile] auto infer dynamic_arg_dims from type annotation by @youkaichao in #9589
  • [Bugfix] fix detokenizer shallow copy by @aurickq in #5919
  • [Misc] Make benchmarks use EngineArgs by @JArnoldAMD in #9529
  • [Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing by @LucasWilkinson in #9487
  • [BugFix] Fix metrics error for --num-scheduler-steps > 1 by @yuleil in #8234
  • [Doc]: Update tensorizer docs to include vllm[tensorizer] by @sethkimmel3 in #7889
  • [Bugfix] Generate exactly input_len tokens in benchmark_throughput by @heheda12345 in #9592
  • [Misc] Add an env var VLLM_LOGGING_PREFIX; if set, it is prepended to all logging messages by @sfc-gh-zhwang in #9590 (usage sketch after this list)
  • [Model] Support E5-V by @DarkLight1337 in #9576
  • [Build] Fix FetchContent multiple build issue by @ProExpertProg in #9596
  • [Hardware][XPU] using current_platform.is_xpu by @MengqingCao in #9605
  • [Model] Initialize Florence-2 language backbone support by @Isotr0py in #9555
  • [VLM] Post-layernorm override and quant config in vision encoder by @DarkLight1337 in #9217
  • [Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs by @alex-jw-brooks in #9612 (usage sketch after this list)
  • [Bugfix] Fix _init_vision_model in NVLM_D model by @DarkLight1337 in #9611
  • [misc] comment to avoid future confusion about baichuan by @youkaichao in #9620
  • [Bugfix] Fix divide by zero when serving Mamba models by @tlrmchlsmth in #9617
  • [Misc] Separate total and output tokens in benchmark_throughput.py by @mgoin in #8914
  • [torch.compile] Adding torch compile annotations to some models by @CRZbulabula in #9614
  • [Frontend] Enable Online Multi-image Support for MLlama by @alex-jw-brooks in #9393
  • [Model] Add Qwen2-Audio model support by @faychu in #9248
  • [CI/Build] Add bot to close stale issues and PRs by @russellb in #9436
  • [Bugfix][Model] Fix Mllama SDPA illegal memory access for batched multi-image by @mgoin in #9626
  • [Bugfix] Use "vision_model" prefix for MllamaVisionModel by @mgoin in #9628
  • [Bugfix]: Make chat content text allow type content by @vrdn-23 in #9358
  • [XPU] avoid triton import for xpu by @yma11 in #9440
  • [Bugfix] Fix PP for ChatGLM and Molmo, and weight loading for Qwen2.5-Math-RM by @DarkLight1337 in #9422
  • [V1][Bugfix] Clean up requests when aborted by @WoosukKwon in #9629
  • [core] simplify seq group code by @youkaichao in #9569
  • [torch.compile] Adding torch compile annotations to some models by @CRZbulabula in #9639
  • [Kernel] add kernel for FATReLU by @jeejeelee in #9610
  • [torch.compile] expanding support and fix allgather compilation by @CRZbulabula in #9637
  • [Doc] Move additional tips/notes to the top by @DarkLight1337 in #9647
  • [Bugfix] Disable the post_norm layer of the vision encoder for LLaVA models by @litianjian in #9653
  • Increase operation per run limit for "Close inactive issues and PRs" workflow by @hmellor in #9661
  • [torch.compile] Adding torch compile annotations to some models by @CRZbulabula in #9641
  • [CI/Build] Fix VLM test failures when using transformers v4.46 by @DarkLight1337 in #9666
  • [Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints by @alex-jw-brooks in #9650
  • [Log][Bugfix] Fix default value check for image_url.detail by @mgoin in #9663
  • [Performance][Kernel] Fused_moe Performance Improvement by @charlifu in #9384
  • [Bugfix] Remove xformers requirement for Pixtral by @mgoin in #9597
  • [ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test #9675 by @khluu in #9676
  • [Model] add a lora module for granite 3.0 MoE models by @willmj in #9673
  • [V1] Support sliding window attention by @WoosukKwon in #9679
  • [Bugfix] Fix compressed_tensors_moe bad config.strategy by @mgoin in #9677
  • [Doc] Improve quickstart documentation by @rafvasq in #9256
  • [Bugfix] Fix crash with llama 3.2 vision models and guided decoding by @tjohnson31415 in #9631
  • [Bugfix] Streaming continuous_usage_stats default to False by @samos123 in #9709
  • [Hardware][openvino] is_openvino --> current_platform.is_openvino by @MengqingCao in #9716
  • Fix: MI100 Support By Bypassing Custom Paged Attention by @MErkinSag in #9560
  • [Frontend] Bad words sampling parameter by @Alvant in #9717 (usage sketch after this list)
  • [Model] Add classification Task with Qwen2ForSequenceClassification by @kakao-kevin-us in #9704
  • [Misc] SpecDecodeWorker supports profiling by @Abatom in #9719
  • [core] cudagraph output with tensor weak reference by @youkaichao in #9724
  • [Misc] Upgrade to pytorch 2.5 by @bnellnm in #9588
  • Fix cache management in "Close inactive issues and PRs" actions workflow by @hmellor in #9734
  • [Bugfix] Fix load config when using bools by @madt2709 in #9533
  • [Hardware][ROCM] using current_platform.is_rocm by @wangshuai09 in #9642
  • [torch.compile] support moe models by @youkaichao in #9632
  • Fix beam search eos by @robertgshaw2-neuralmagic in #9627
  • [Bugfix] Fix ray instance detect issue by @yma11 in #9439
  • [CI/Build] Adopt Mergify for auto-labeling PRs by @russellb in #9259
  • [Model][VLM] Add multi-video support for LLaVA-Onevision by @litianjian in #8905
  • [torch.compile] Adding "torch compile" annotations to some models by @CRZbulabula in #9758
  • [misc] avoid circular import by @youkaichao in #9765
  • [torch.compile] add deepseek v2 compile by @youkaichao in #9775
  • [Doc] fix third-party model example by @russellb in #9771
  • [Model][LoRA] LoRA support added for Qwen by @jeejeelee in #9622
  • [Doc] Specify async engine args in docs by @DarkLight1337 in #9726
  • [Bugfix] Use temporary directory in registry by @DarkLight1337 in #9721
  • [Frontend] re-enable multi-modality input in the new beam search implementation by @FerdinandZhong in #9427
  • [Model] Add BNB quantization support for Mllama by @Isotr0py in #9720
  • [Hardware] using current_platform.seed_everything by @wangshuai09 in #9785
  • [Misc] Add metrics for request queue time, forward time, and execute time by @Abatom in #9659
  • Fix the log message to correctly guide users to install modelscope by @tastelikefeet in #9793
  • [Bugfix] Use host argument to bind to interface by @svenseeberg in #9798
  • [Misc]: Typo fix: Renaming classes (casualLM -> causalLM) by @yannicks1 in #9801
  • [Model] Add LlamaEmbeddingModel as an embedding Implementation of LlamaModel by @jsato8094 in #9806
  • [CI][Bugfix] Skip chameleon for transformers 4.46.1 by @mgoin in #9808
  • [CI/Build] mergify: fix rules for ci/build label by @russellb in #9804
  • [MISC] Set label value to timestamp over 0, to keep track of recent history by @coolkp in #9777
  • [Bugfix][Frontend] Guard against bad token ids by @joerunde in #9634
  • [Model] tool calling support for ibm-granite/granite-20b-functioncalling by @wseaton in #8339
  • [Docs] Add notes about Snowflake Meetup by @simon-mo in #9814
  • [Bugfix] Fix prefix strings for quantized VLMs by @mgoin in #9772
  • [core][distributed] fix custom allreduce in pytorch 2.5 by @youkaichao in #9815
  • Update README.md by @LiuXiaoxuanPKU in #9819
  • [Bugfix][VLM] Make apply_fp8_linear work with >2D input by @mgoin in #9812
  • [ci/build] Pin CI dependencies version with pip-compile by @khluu in #9810
  • [Bugfix] Fix multi nodes TP+PP for XPU by @yma11 in #8884
  • [Doc] Add the DCO to CONTRIBUTING.md by @russellb in #9803
  • [torch.compile] rework compile control with piecewise cudagraph by @youkaichao in #9715
  • [Misc] Specify minimum pynvml version by @jeejeelee in #9827
  • [TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA by @WoosukKwon in #9438
  • [CI/Build] VLM Test Consolidation by @alex-jw-brooks in #9372
  • [Model] Support math-shepherd-mistral-7b-prm model by @Went-Liang in #9697
  • [Misc] Add chunked-prefill support on FlashInfer. by @elfiegg in #9781
  • [Bugfix][core] replace heartbeat with pid check by @joerunde in #9818
  • [Doc] link bug for multistep guided decoding by @joerunde in #9843
  • [Neuron] Update Dockerfile.neuron to fix build failure by @hbikki in #9822
  • [doc] update pp support by @youkaichao in #9853
  • [CI/Build] Simplify exception trace in api server tests by @CRZbulabula in #9787
  • [torch.compile] upgrade tests by @youkaichao in #9858
  • [Misc][OpenAI] deprecate max_tokens in favor of new max_completion_tokens field for chat completion endpoint by @gcalmettes in #9837 (usage sketch after this list)
  • Revert "[Bugfix] Use host argument to bind to interface (#9798)" by @khluu in #9852
  • [Model] Support quantization of Qwen2VisionTransformer for Qwen2-VL by @mgoin in #9817
  • [Misc] Remove deprecated arg for cuda graph capture by @ywang96 in #9864
  • [Doc] Update Qwen documentation by @jeejeelee in #9869
  • [CI/Build] Add Model Tests for Qwen2-VL by @alex-jw-brooks in #9846
  • [CI/Build] Adding a forced docker system prune to clean up space by @Alexei-V-Ivanov-AMD in #9849
  • [Bugfix] Fix illegal memory access error with chunked prefill, prefix caching, block manager v2 and xformers enabled together by @sasha0552 in #9532
  • [BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 by @mzusman in #9838
  • [ci/build] Configure dependabot to update pip dependencies by @khluu in #9811
  • [Bugfix][Frontend] Reject guided decoding in multistep mode by @joerunde in #9892
  • [torch.compile] directly register custom op by @youkaichao in #9896
  • [Bugfix] Fix layer skip logic with bitsandbytes by @mgoin in #9887
  • [torch.compile] rework test plans by @youkaichao in #9866
  • [Model] Support bitsandbytes for MiniCPMV by @mgoin in #9891
  • [torch.compile] Adding torch compile annotations to some models by @CRZbulabula in #9876
  • [Doc] Update multi-input support by @DarkLight1337 in #9906
  • [Frontend] Chat-based Embeddings API by @DarkLight1337 in #9759
  • [CI/Build] Add Model Tests for PixtralHF by @mgoin in #9813
  • [Frontend] Use a proper chat template for VLM2Vec by @DarkLight1337 in #9912
  • [Bugfix] Fix edge cases for MistralTokenizer by @tjohnson31415 in #9625
  • [Core] Refactor: Clean up unused argument preemption_mode in Scheduler._preempt by @andrejonasson in #9696
  • [torch.compile] use interpreter with stable api from pytorch by @youkaichao in #9889
  • [Bugfix/Core] Remove assertion for Flashinfer k_scale and v_scale by @pavanimajety in #9861
  • [1/N] pass the complete config from engine to executor by @youkaichao in #9933
  • [Bugfix] PicklingError on RayTaskError by @GeneDer in #9934
  • Bump the patch-update group with 10 updates by @dependabot in #9897
  • [Core][VLM] Add precise multi-modal placeholder tracking by @petersalas in #8346
  • [ci/build] Have dependabot ignore pinned dependencies by @khluu in #9935
  • [Encoder Decoder] Add flash_attn kernel support for encoder-decoder models by @sroy745 in #9559
  • [torch.compile] fix cpu broken code by @youkaichao in #9947
  • [Docs] Update Granite 3.0 models in supported models table by @njhill in #9930
  • [Doc] Updated tpu-installation.rst with more details by @mikegre-google in #9926
  • [2/N] executor pass the complete config to worker/modelrunner by @youkaichao in #9938
  • [V1] Fix EngineArgs refactor on V1 by @robertgshaw2-neuralmagic in #9954
  • [bugfix] fix chatglm dummy_data_for_glmv by @youkaichao in #9955
  • [3/N] model runner pass the whole config to model by @youkaichao in #9958
  • [CI/Build] Quoting around > by @nokados in #9956
  • [torch.compile] Adding torch compile annotations to vision-language models by @CRZbulabula in #9946
  • [bugfix] fix tests by @youkaichao in #9959
  • [V1] Support per-request seed by @njhill in #9945
  • [Model] Add support for H2OVL-Mississippi models by @cooleel in #9747
  • [V1] Fix Configs by @robertgshaw2-neuralmagic in #9971
  • [Bugfix] Fix MiniCPMV and Mllama BNB bug by @jeejeelee in #9917
  • [Bugfix] Using the correct type hints by @gshtras in #9885
  • [Misc] Compute query_start_loc/seq_start_loc on CPU by @zhengy001 in #9447
  • [Bugfix] Fix E2EL mean and median stats by @daitran2k1 in #9984
  • [Bugfix][OpenVINO] Fix circular reference #9939 by @MengqingCao in #9974
  • [Frontend] Multi-Modality Support for Loading Local Image Files by @chaunceyjiang in #9915
  • [4/N] make quant config first-class citizen by @youkaichao in #9978
  • [Misc] Reduce BNB static variable by @jeejeelee in #9987
  • [Model] Factor MambaMixer out of Jamba by @mzusman in #8993
  • [CI] Basic Integration Test For TPU by @robertgshaw2-neuralmagic in #9968
  • [Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests running parallel jobs by @hissu-hyvarinen in #9279
  • [Doc] Update VLM doc about loading from local files by @ywang96 in #9999
  • [Bugfix] Fix MQLLMEngine hanging by @robertgshaw2-neuralmagic in #9973
  • [Misc] Refactor benchmark_throughput.py by @lk-chen in #9779
  • [Frontend] Add max_tokens prometheus metric by @tomeras91 in #9881
  • [Bugfix] Upgrade to pytorch 2.5.1 by @bnellnm in #10001
  • [4.5/N] bugfix for quant config in speculative decode by @youkaichao in #10007
  • [Bugfix] Respect modules_to_not_convert within awq_marlin by @mgoin in #9895
  • [Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep by @tlrmchlsmth in #9994
  • [Core] Make encoder-decoder inputs a nested structure to be more composable by @DarkLight1337 in #9604
  • [Bugfix] Fixup Mamba by @tlrmchlsmth in #10004
  • [BugFix] Lazy import ray by @GeneDer in #10021
  • [Misc] vllm CLI flags should be ordered for better user readability by @chaunceyjiang in #10017
  • [Frontend] Fix tcp port reservation for api server by @russellb in #10012
  • Refactor TPU requirements file and pin build dependencies by @richardsliu in #10010
  • [Misc] Add logging for CUDA memory by @yangalan123 in #10027
  • [CI/Build] Limit github CI jobs based on files changed by @russellb in #9928
  • [Model] Support quantization of PixtralHFTransformer for PixtralHF by @mgoin in #9921
  • [Feature] Update benchmark_throughput.py to support image input by @lk-chen in #9851
  • [Misc] Modify BNB parameter name by @jeejeelee in #9997
  • [CI] Prune tests/models/decoder_only/language/* tests by @mgoin in #9940
  • [CI] Prune back the number of tests in tests/kernels/* by @mgoin in #9932
  • [bugfix] fix weak ref in piecewise cudagraph and tractable test by @youkaichao in #10048
  • [Bugfix] Properly propagate trust_remote_code settings by @zifeitong in #10047
  • [Bugfix] Fix pickle of input when async output processing is on by @wallashss in #9931
  • [Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode by @llsj14 in #9730
  • [v1] reduce graph capture time for piecewise cudagraph by @youkaichao in #10059
  • [Misc] Sort the list of embedding models by @DarkLight1337 in #10037
  • [Model][OpenVINO] Fix regressions from #8346 by @petersalas in #10045
  • [Bugfix] Fix edge-case crash when using chat with the Mistral Tekken Tokenizer by @tjohnson31415 in #10051
  • [Bugfix] Gpt-j-6B patch kv_scale to k_scale path by @arakowsk-amd in #10063
  • [Bugfix] Remove CustomChatCompletionContentPartParam multimodal input type by @zifeitong in #10054
  • [V1] Integrate Piecewise CUDA graphs by @WoosukKwon in #10058
  • [distributed] add function to create ipc buffers directly by @youkaichao in #10064
  • [CI/Build] drop support for Python 3.8 EOL by @aarnphm in #8464
  • [CI/Build] Fix large_gpu_mark reason by @Isotr0py in #10070
  • [Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend by @kzawora-intel in #6143
  • [Hotfix] Fix ruff errors by @WoosukKwon in #10073
  • [Model][LoRA] LoRA support added for LlamaEmbeddingModel by @jeejeelee in #10071
  • [Model] Add Idefics3 support by @jeejeelee in #9767
  • [Model][LoRA] LoRA support added for Qwen2VLForConditionalGeneration by @ericperfect in #10022
  • Remove ScaledActivation for AWQ by @mgoin in #10057
  • [CI/Build] Drop Python 3.8 support by @russellb in #10038
  • [CI/Build] change conflict PR comment from mergify by @russellb in #10080
  • [V1] Make v1 more testable by @joerunde in #9888
  • [CI/Build] Always run the ruff workflow by @russellb in #10092
  • [core][distributed] add stateless_init_process_group by @youkaichao in #10072
  • [Bugfix] Fix FP8 torch._scaled_mm fallback for torch>2.5 with CUDA<12.4 by @mgoin in #10095
  • [Misc][XPU] Upgrade to Pytorch 2.5 for xpu backend by @yma11 in #9823
  • [Frontend] Adjust try/except blocks in API impl by @njhill in #10056
  • [Hardware][CPU] Update torch 2.5 by @bigPYJ1151 in #9911
  • [doc] add back Python 3.8 ABI by @youkaichao in #10100
  • [V1][BugFix] Fix Generator construction in greedy + seed case by @njhill in #10097
  • [Misc] Consolidate ModelConfig code related to HF config by @DarkLight1337 in #10104
  • [CI/Build] re-add codespell to CI by @russellb in #10083
  • [Doc] Improve benchmark documentation by @rafvasq in #9927
  • [Core][Distributed] Refactor ipc buffer init in CustomAllreduce by @hanzhi713 in #10030
  • [CI/Build] Improve mypy + python version matrix by @russellb in #10041
  • Adds method to read the pooling types from model's files by @flaviabeo in #9506
  • [Frontend] Fix multiple values for keyword argument error (#10075) by @DIYer22 in #10076
  • [Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target by @bigPYJ1151 in #10108
  • [Bugfix] Make image processor respect mm_processor_kwargs for Qwen2-VL by @li-plus in #10112
  • [Misc] Add Gamma-Distribution Request Generation Support for Serving Benchmark. by @spliii in #10105
  • [Frontend] Tool calling parser for Granite 3.0 models by @maxdebayser in #9027
  • [Feature] [Spec decode]: Combine chunked prefill with speculative decoding by @NickLucche in #9291
  • [CI/Build] Always run mypy by @russellb in #10122
  • [CI/Build] Add shell script linting using shellcheck by @russellb in #7925
  • [CI/Build] Automate PR body text cleanup by @russellb in #10082
  • Bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #9745
  • Online video support for VLMs by @litianjian in #10020
  • Bump actions/checkout from 4.2.1 to 4.2.2 by @dependabot in #9746
  • [Misc] Add environment variables collection in collect_env.py tool by @ycool in #9293
  • [V1] Add all_token_ids attribute to Request by @WoosukKwon in #10135
  • [V1] Prefix caching (take 2) by @comaniac in #9972
  • [CI/Build] Give PR cleanup job PR write access by @russellb in #10139
  • [Doc] Update FAQ links in spec_decode.rst by @whyiug in #9662
  • [Bugfix] Add error handling when server cannot respond any valid tokens by @DearPlanet in #5895
  • [Misc] Fix ImportError causing by triton by @MengqingCao in #9493
  • [Doc] Move CONTRIBUTING to docs site by @russellb in #9924
  • Fixes a typo about 'max_decode_seq_len' which causes crashes with cuda graph. by @sighingnow in #9285
  • Add hf_transfer to testing image by @mgoin in #10096
  • [Misc] Fix typo in #5895 by @DarkLight1337 in #10145
  • [Bugfix][XPU] Fix xpu tp by introducing XpuCommunicator by @yma11 in #10144
  • [Model] Expose size to Idefics3 as mm_processor_kwargs by @Isotr0py in #10146
  • [V1] Enable APC by default only for text models by @ywang96 in #10148
  • [CI/Build] Update CPU tests to include all "standard" tests by @DarkLight1337 in #5481
  • Fix edge case Mistral tokenizer by @patrickvonplaten in #10152
  • Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 by @sroy745 in #10136
  • [Misc] Improve Web UI by @rafvasq in #10090
  • [V1] Fix non-cudagraph op name by @WoosukKwon in #10166
  • [CI/Build] Ignore .gitignored files for shellcheck by @ProExpertProg in #10162
  • Rename vllm.logging to vllm.logging_utils by @flozi00 in #10134
  • [torch.compile] Fuse RMSNorm with quant by @ProExpertProg in #9138
  • [Bugfix] Fix SymIntArrayRef expected to contain only concrete integers by @bnellnm in #10170
  • [Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case by @rasmith in #9857
  • [CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking by @bigPYJ1151 in #6892
  • [0/N] Rename MultiModalInputs to MultiModalKwargs by @DarkLight1337 in #10040
  • [Bugfix] Ignore GPTQ quantization of Qwen2-VL visual module by @mgoin in #10169
  • [CI/Build] Fix VLM broadcast tests tensor_parallel_size passing by @Isotr0py in #10161
  • [Doc] Adjust RunLLM location by @DarkLight1337 in #10176
  • [5/N] pass the whole config to model by @youkaichao in #9983
  • [CI/Build] Add run-hpu-test.sh script by @xuechendi in #10167
  • [Bugfix] Enable some fp8 and quantized fullgraph tests by @bnellnm in #10171
  • [bugfix] fix broken tests of mlp speculator by @youkaichao in #10177
  • [doc] explaining the integration with huggingface by @youkaichao in #10173
  • bugfix: fix streaming generation not working by @caijizhuo in #2756
  • [Frontend] add add_request_id middleware by @cjackal in #9594
  • [Frontend][Core] Override HF config.json via CLI by @KrishnaM251 in #5836
  • [CI/Build] Split up models tests by @DarkLight1337 in #10069
  • [ci][build] limit cmake version by @youkaichao in #10188
  • [Doc] Fix typo error in CONTRIBUTING.md by @FuryMartin in #10190
  • [doc] Polish the integration with huggingface doc by @CRZbulabula in #10195
  • [Misc] small fixes to function tracing file path by @ShawnD200 in #9543
  • [misc] improve cloudpickle registration and tests by @youkaichao in #10202
  • [Doc] Fix typo error in vllm/entrypoints/openai/cli_args.py by @yansh97 in #10196
  • [doc] improve debugging code by @youkaichao in #10206
  • [6/N] pass whole config to inner model by @youkaichao in #10205
  • Bump the patch-update group with 5 updates by @dependabot in #10210
  • [Hardware][CPU] Add embedding models support for CPU backend by @Isotr0py in #10193
  • [LoRA][Kernel] Remove the unused libentry module by @jeejeelee in #10214
  • [V1] Allow tokenizer_mode and trust_remote_code for Detokenizer by @ywang96 in #10211
  • [Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner by @Isotr0py in #10218
  • [Metrics] add more metrics by @HarryWu99 in #4464
  • [Doc] fix doc string typo in block_manager swap_out function by @yyccli in #10212
  • [core][distributed] add stateless process group by @youkaichao in #10216
  • Bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #10209
  • [V1] Fix detokenizer ports by @WoosukKwon in #10224
  • [V1] Do not use inductor for piecewise CUDA graphs by @WoosukKwon in #10225
  • [v1][torch.compile] support managing cudagraph buffer by @youkaichao in #10203
  • [V1] Use custom ops for piecewise CUDA graphs by @WoosukKwon in #10227
  • Add docs on serving with Llama Stack by @terrytangyuan in #10183
  • [misc][distributed] auto port selection and disable tests by @youkaichao in #10226
  • [V1] Enable custom ops with piecewise CUDA graphs by @WoosukKwon in #10228
  • Make shutil rename in python_only_dev by @shcheglovnd in #10233
  • [V1] AsyncLLM Implementation by @robertgshaw2-neuralmagic in #9826
  • [doc] update debugging guide by @youkaichao in #10236
  • [Doc] Update help text for --distributed-executor-backend by @russellb in #10231
  • [1/N] torch.compile user interface design by @youkaichao in #10237
  • [Misc][LoRA] Replace hardcoded cuda device with configurable argument by @jeejeelee in #10223
  • Splitting attention kernel file by @maleksan85 in #10091
  • [doc] explain the class hierarchy in vLLM by @youkaichao in #10240
  • [CI][CPU] Refactor CPU tests to allow binding to different cores by @zhouyuan in #10222
  • [BugFix] Do not raise a ValueError when tool_choice is set to the supported none option and tools are not defined. by @gcalmettes in #10000
  • [Misc] Fix Idefics3Model argument by @jeejeelee in #10255
  • [Bugfix] Fix QwenModel argument by @DamonFool in #10262
  • [Frontend] Add per-request number of cached token stats by @zifeitong in #10174
  • [V1] Use pickle for serializing EngineCoreRequest & Add multimodal inputs to EngineCoreRequest by @WoosukKwon in #10245
  • [Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers by @sroy745 in #9982
  • [LoRA] Adds support for bias in LoRA by @followumesh in #5733
  • [V1] Enable Inductor when using piecewise CUDA graphs by @WoosukKwon in #10268
  • [doc] fix location of runllm widget by @youkaichao in #10266
  • [doc] improve debugging doc by @youkaichao in #10270
  • Revert "[ci][build] limit cmake version" by @youkaichao in #10271
  • [V1] Fix CI tests on V1 engine by @WoosukKwon in #10272
  • [core][distributed] use tcp store directly by @youkaichao in #10275
  • [V1] Support VLMs with fine-grained scheduling by @WoosukKwon in #9871
  • Bump to compressed-tensors v0.8.0 by @dsikka in #10279
  • [Doc] Fix typo in arg_utils.py by @xyang16 in #10264
  • [Model] Add support for Qwen2-VL video embeddings input & multiple image embeddings input with varied resolutions by @imkero in #10221
  • [Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 by @FurtherAI in #9944
  • [Core] Flashinfer - Remove advance step size restriction by @pavanimajety in #10282
  • [Model][LoRA] LoRA support added for idefics3 by @B-201 in #10281
  • [V1] Add missing tokenizer options for Detokenizer by @ywang96 in #10288
  • [1/N] Initial prototype for multi-modal processor by @DarkLight1337 in #10044
  • [Bugfix] bitsandbytes models fail to run pipeline parallel by @HoangCongDuc in #10200
  • [Bugfix] Fix tensor parallel for qwen2 classification model by @Isotr0py in #10297
  • [misc] error early for old-style class by @youkaichao in #10304
  • [Misc] format.sh: Simplify tool_version_check by @russellb in #10305
  • [Frontend] Pythonic tool parser by @mdepinet in #9859
  • [BugFix]: properly deserialize tool_calls iterator before processing by mistral-common when MistralTokenizer is used by @gcalmettes in #9951
  • [Model] Add BNB quantization support for Idefics3 by @B-201 in #10310
  • [ci][distributed] disable hanging tests by @youkaichao in #10317
  • [CI/Build] Fix CPU CI online inference timeout by @Isotr0py in #10314
  • [CI/Build] Make shellcheck happy by @DarkLight1337 in #10285
  • [Docs] Publish meetup slides by @WoosukKwon in #10331
  • Support Roberta embedding models by @maxdebayser in #9387
  • [Perf] Reduce peak memory usage of llama by @andoorve in #10339
  • [Bugfix] use AF_INET6 instead of AF_INET for OpenAI Compatible Server by @jxpxxzj in #9583
  • [Tool parsing] Improve / correct mistral tool parsing by @patrickvonplaten in #10333
  • [Bugfix] Fix unable to load some models by @DarkLight1337 in #10312
  • [bugfix] Fix static asymmetric quantization case by @ProExpertProg in #10334
  • [Misc] Change RedundantReshapesPass and FusionPass logging from info to debug by @tlrmchlsmth in #10308
  • [Model] Support Qwen2 embeddings and use tags to select model tests by @DarkLight1337 in #10184
  • [Bugfix] Qwen-vl output is inconsistent in speculative decoding by @skylee-01 in #10350
  • [Misc] Consolidate pooler config overrides by @DarkLight1337 in #10351
  • [Build] skip renaming files for release wheels pipeline by @simon-mo in #9671
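
A few of the user-facing changes above lend themselves to short usage sketches. For the user-configurable task (#9424), here is a minimal sketch of selecting the embedding task on a model that supports both generation and embedding; the model name is a placeholder and the exact set of accepted task values is an assumption:

```python
from vllm import LLM

# Placeholder checkpoint; any model that supports both generation and embedding works here.
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embedding")

# With the embedding task selected, encode() returns embedding outputs instead of generated text.
outputs = llm.encode(["vLLM is a fast and easy-to-use library for LLM inference."])
print(len(outputs[0].outputs.embedding))
```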
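For VLLM_LOGGING_PREFIX (#9590), a sketch assuming the variable is read when vLLM configures its loggers at import time, so it must be set beforehand:

```python
import os

# Assumption: set the prefix before importing vllm so it is picked up when logging is configured.
os.environ["VLLM_LOGGING_PREFIX"] = "[node-0] "

from vllm import LLM  # noqa: E402  # subsequent vLLM log lines carry the "[node-0] " prefix
```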
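For the Qwen2-VL min_pixels / max_pixels processor overrides (#9612), a sketch with illustrative values; the bounds cap how many visual tokens each image produces, and the model name is an assumption:

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    # Illustrative bounds: each image is resized so its pixel count stays within this range.
    mm_processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 1280 * 28 * 28},
)
```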
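For the bad-words sampling parameter (#9717), a sketch assuming SamplingParams accepts a bad_words list of strings to suppress during generation; the model and word list are placeholders:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # small placeholder model for illustration

params = SamplingParams(
    max_tokens=32,
    bad_words=["forbidden", "another phrase"],  # sequences the sampler should never produce
)
outputs = llm.generate(["Once upon a time"], params)
print(outputs[0].outputs[0].text)
```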
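For the max_tokens -> max_completion_tokens deprecation on the chat completion endpoint (#9837), a sketch against a vLLM OpenAI-compatible server assumed to be running on localhost:8000; the model name is a placeholder:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
    max_completion_tokens=64,  # preferred over the deprecated max_tokens
)
print(resp.choices[0].message.content)
```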

New Contributors

  • @gracehonv made their first contribution in #9349
  • @streaver91 made their first contribution in #9396
  • @wukaixingxp made their first contribution in #9013
  • @sssrijan-amazon made their first contribution in #9380
  • @coolkp made their first contribution in #9477
  • @yue-anyscale made their first contribution in #9478
  • @dhiaEddineRhaiem made their first contribution in #9325
  • @yudian0504 made their first contribution in #9549
  • @ngrozae made their first contribution in #9552
  • @Falko1 made their first contribution in #9503
  • @wangshuai09 made their first contribution in #9536
  • @gopalsarda made their first contribution in #9580
  • @guoyuhong made their first contribution in #9550
  • @JArnoldAMD made their first contribution in #9529
  • @yuleil made their first contribution in #8234
  • @sethkimmel3 made their first contribution in #7889
  • @MengqingCao made their first contribution in #9605
  • @CRZbulabula made their first contribution in #9614
  • @faychu made their first contribution in #9248
  • @vrdn-23 made their first contribution in #9358
  • @willmj made their first contribution in #9673
  • @samos123 made their first contribution in #9709
  • @MErkinSag made their first contribution in #9560
  • @Alvant made their first contribution in #9717
  • @kakao-kevin-us made their first contribution in #9704
  • @madt2709 made their first contribution in #9533
  • @FerdinandZhong made their first contribution in #9427
  • @svenseeberg made their first contribution in #9798
  • @yannicks1 made their first contribution in #9801
  • @wseaton made their first contribution in #8339
  • @Went-Liang made their first contribution in #9697
  • @andrejonasson made their first contribution in #9696
  • @GeneDer made their first contribution in #9934
  • @mikegre-google made their first contribution in #9926
  • @nokados made their first contribution in #9956
  • @cooleel made their first contribution in #9747
  • @zhengy001 made their first contribution in #9447
  • @daitran2k1 made their first contribution in #9984
  • @chaunceyjiang made their first contribution in #9915
  • @hissu-hyvarinen made their first contribution in #9279
  • @lk-chen made their first contribution in #9779
  • @yangalan123 made their first contribution in #10027
  • @llsj14 made their first contribution in #9730
  • @arakowsk-amd made their first contribution in #10063
  • @kzawora-intel made their first contribution in #6143
  • @DIYer22 made their first contribution in #10076
  • @li-plus made their first contribution in #10112
  • @spliii made their first contribution in #10105
  • @flozi00 made their first contribution in #10134
  • @xuechendi made their first contribution in #10167
  • @caijizhuo made their first contribution in #2756
  • @cjackal made their first contribution in #9594
  • @KrishnaM251 made their first contribution in #5836
  • @FuryMartin made their first contribution in #10190
  • @ShawnD200 made their first contribution in #9543
  • @yansh97 made their first contribution in #10196
  • @yyccli made their first contribution in #10212
  • @shcheglovnd made their first contribution in #10233
  • @maleksan85 made their first contribution in #10091
  • @followumesh made their first contribution in #5733
  • @imkero made their first contribution in #10221
  • @B-201 made their first contribution in #10281
  • @HoangCongDuc made their first contribution in #10200
  • @mdepinet made their first contribution in #9859
  • @jxpxxzj made their first contribution in #9583
  • @skylee-01 made their first contribution in #10350

Full Changelog: v0.6.3...v0.6.4
