vllm-project/vllm v0.6.3

Highlights

Model Support

  • New Models: Mamba (#6484), Granite MoE (#8206), GLM-4v (#9242), Molmo (#9016), NVLM-D (#9045), Qwen2.5-Math-RM-72B (#8896)
  • Expanded functionality:
    • Add Gemma2 embedding model (#9004)
    • Support input embeddings for qwen2vl (#8856), minicpmv (#9237)
    • LoRA:
      • LoRA support for MiniCPMV2.5 (#7199), MiniCPMV2.6 (#8943)
      • Expand lora modules for mixtral (#9008)
    • Pipeline parallelism support for the remaining text and embedding models (#7168, #9090)
    • Expanded bitsandbytes quantization support for Falcon, OPT, Gemma, Gemma2, and Phi (#9148)
    • Tool use (see the example after this list):
      • Add support for Llama 3.1 and 3.2 tool use (#8343)
      • Support tool calling for InternLM2.5 (#8405)
  • Out-of-tree support enhancements: explicit interface for vLLM models and support for OOT embedding models (#9108)
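The new tool-use support is exposed through the OpenAI-compatible chat API. Below is a minimal, illustrative sketch using the standard `openai` client against a locally served model; it assumes the server was launched with tool calling enabled and a tool-call parser matching the model (see the tool-use docs for the exact flags), and the tool definition itself is hypothetical.

```python
# Sketch only: OpenAI-style tool calling against a vLLM server.
# Assumes the server was started with tool calling enabled and a parser
# matching the model; the model name and tool below are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model the server loaded
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)
```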

Documentation

  • New compatibility matrix for mutually exclusive features (#8512)
  • Reorganized installation docs; note that we now publish a per-commit Docker image (#8931)

Hardware Support

  • Cross-attention and Encoder-Decoder models support on x86 CPU backend (#9089)
  • Support AWQ for CPU backend (#7515)
  • Add async output processor for xpu (#8897)
  • Add on-device sampling support for Neuron (#8746)

Architectural Enhancements

  • Progress in refactoring vLLM's core:
    • Removed batch expansion from speculative decoding (#8839, #9298)
    • Block manager V2 is now the default, an internal refactoring toward a cleaner and better tested code path (#8678)
    • Moving beam search from the core to the API level (#9105, #9087, #9117, #8928)
    • Move guided decoding params into sampling params (#8252)
  • Torch Compile:
    • You can now set the environment variable VLLM_TORCH_COMPILE_LEVEL to control the level of torch.compile compilation and integration (#9058). Along with various improvements (#8982, #9258, #906, #8875), VLLM_TORCH_COMPILE_LEVEL=3 turns on Inductor's full-graph compilation without vLLM's custom ops.
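For illustration, a minimal sketch of opting into the torch.compile integration from Python; the model name is arbitrary, and the exact semantics of each compilation level may change between commits, so treat the level value as an assumption taken from the note above.

```python
# Sketch only: enable vLLM's torch.compile integration via the env var above.
# Level 3 is assumed (per the release note) to mean Inductor full-graph
# compilation without vLLM's custom ops; semantics may differ per commit.
import os

os.environ["VLLM_TORCH_COMPILE_LEVEL"] = "3"  # set before constructing the engine

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any supported model
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```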

Others

  • Performance enhancements that turn on multi-step scheduling by default (#8804, #8645, #8378)
  • Enhancements towards priority scheduling (#8965, #8956, #8850)
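As a reference for the priority-scheduling work, here is a minimal sketch; the `scheduling_policy` engine argument and the per-request `priority` keyword are taken from the PRs above, and their exact names and ordering semantics (lower vs. higher priority first) should be treated as assumptions.

```python
# Sketch only: opting into priority scheduling (argument names and ordering
# semantics assumed from the PRs above; verify against the EngineArgs docs).
from vllm import EngineArgs, LLMEngine, SamplingParams

engine = LLMEngine.from_engine_args(
    EngineArgs(model="facebook/opt-125m", scheduling_policy="priority")
)

params = SamplingParams(max_tokens=16)
engine.add_request("urgent", "Summarize vLLM v0.6.3 in one line.", params, priority=0)
engine.add_request("background", "Write a short poem about GPUs.", params, priority=10)

while engine.has_unfinished_requests():
    for out in engine.step():
        if out.finished:
            print(out.request_id, out.outputs[0].text)
```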

What's Changed

  • [Misc] Update config loading for Qwen2-VL and remove Granite by @ywang96 in #8837
  • [Build/CI] Upgrade to gcc 10 in the base build Docker image by @tlrmchlsmth in #8814
  • [Docs] Add README to the build docker image by @mgoin in #8825
  • [CI/Build] Fix missing ci dependencies by @fyuan1316 in #8834
  • [misc][installation] build from source without compilation by @youkaichao in #8818
  • [ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM by @khluu in #8872
  • [Bugfix] Include encoder prompts len to non-stream api usage response by @Pernekhan in #8861
  • [Misc] Change dummy profiling and BOS fallback warns to log once by @mgoin in #8820
  • [Bugfix] Fix print_warning_once's line info by @tlrmchlsmth in #8867
  • fix validation: Only set tool_choice auto if at least one tool is provided by @chiragjn in #8568
  • [Bugfix] Fixup advance_step.cu warning by @tlrmchlsmth in #8815
  • [BugFix] Fix test breakages from transformers 4.45 upgrade by @njhill in #8829
  • [Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility by @DarkLight1337 in #8764
  • [Feature] Add support for Llama 3.1 and 3.2 tool use by @maxdebayser in #8343
  • [Core] Rename PromptInputs and inputs with backward compatibility by @DarkLight1337 in #8876
  • [misc] fix collect env by @youkaichao in #8894
  • [MISC] Fix invalid escape sequence '\' by @panpan0000 in #8830
  • [Bugfix][VLM] Fix Fuyu batching inference with max_num_seqs>1 by @Isotr0py in #8892
  • [TPU] Update pallas.py to support trillium by @bvrockwell in #8871
  • [torch.compile] use empty tensor instead of None for profiling by @youkaichao in #8875
  • [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method by @ProExpertProg in #7271
  • [Bugfix] fix for deepseek w4a16 by @LucasWilkinson in #8906
  • [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path by @varun-sundar-rabindranath in #8378
  • [misc][distributed] add VLLM_SKIP_P2P_CHECK flag by @youkaichao in #8911
  • [Core] Priority-based scheduling in async engine by @schoennenbeck in #8850
  • [misc] fix wheel name by @youkaichao in #8919
  • [Bugfix][Intel] Fix XPU Dockerfile Build by @tylertitsworth in #7824
  • [Misc] Remove vLLM patch of BaichuanTokenizer by @DarkLight1337 in #8921
  • [Bugfix] Fix code for downloading models from modelscope by @tastelikefeet in #8443
  • [Bugfix] Fix PP for Multi-Step by @varun-sundar-rabindranath in #8887
  • [CI/Build] Update models tests & examples by @DarkLight1337 in #8874
  • [Frontend] Make beam search emulator temperature modifiable by @nFunctor in #8928
  • [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 by @heheda12345 in #8891
  • [doc] organize installation doc and expose per-commit docker by @youkaichao in #8931
  • [Core] Improve choice of Python multiprocessing method by @russellb in #8823
  • [Bugfix] Block manager v2 with preemption and lookahead slots by @sroy745 in #8824
  • [Bugfix] Fix Marlin MoE act order when is_k_full == False by @ElizaWszola in #8741
  • [CI/Build] Add test decorator for minimum GPU memory by @DarkLight1337 in #8925
  • [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching by @tlrmchlsmth in #8930
  • [Model] Support Qwen2.5-Math-RM-72B by @zhuzilin in #8896
  • [Model][LoRA]LoRA support added for MiniCPMV2.5 by @jeejeelee in #7199
  • [BugFix] Fix seeded random sampling with encoder-decoder models by @njhill in #8870
  • [Misc] Fix typo in BlockSpaceManagerV1 by @juncheoll in #8944
  • [Frontend] Added support for HF's new continue_final_message parameter by @danieljannai21 in #8942
  • [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model by @mzusman in #8533
  • [Model] support input embeddings for qwen2vl by @whyiug in #8856
  • [Misc][CI/Build] Include cv2 via mistral_common[opencv] by @ywang96 in #8951
  • [Model][LoRA]LoRA support added for MiniCPMV2.6 by @jeejeelee in #8943
  • [Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg by @Isotr0py in #8946
  • [Core] Make scheduling policy settable via EngineArgs by @schoennenbeck in #8956
  • [Misc] Adjust max_position_embeddings for LoRA compatibility by @jeejeelee in #8957
  • [ci] Add CODEOWNERS for test directories by @khluu in #8795
  • [CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. by @LiuXiaoxuanPKU in #8975
  • [Frontend][Core] Move guided decoding params into sampling params by @joerunde in #8252
  • [CI/Build] Fix machete generated kernel files ordering by @khluu in #8976
  • [torch.compile] fix tensor alias by @youkaichao in #8982
  • [Misc] add process_weights_after_loading for DummyLoader by @divakar-amd in #8969
  • [Bugfix] Fix Fuyu tensor parallel inference by @Isotr0py in #8986
  • [Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders by @alex-jw-brooks in #8991
  • [Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API by @schoennenbeck in #8965
  • [Doc] Update list of supported models by @DarkLight1337 in #8987
  • Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows by @vlsav in #8997
  • [Spec Decode] (1/2) Remove batch expansion by @LiuXiaoxuanPKU in #8839
  • [Core] Combined support for multi-step scheduling, chunked prefill & prefix caching by @afeldman-nm in #8804
  • [Misc] Update Default Image Mapper Error Log by @alex-jw-brooks in #8977
  • [Core] CUDA Graphs for Multi-Step + Chunked-Prefill by @varun-sundar-rabindranath in #8645
  • [OpenVINO] Enable GPU support for OpenVINO vLLM backend by @sshlyapn in #8192
  • [Model] Adding Granite MoE. by @shawntan in #8206
  • [Doc] Update Granite model docs by @njhill in #9025
  • [Bugfix] example template should not add parallel_tool_prompt if tools is none by @tjohnson31415 in #9007
  • [Misc] log when using default MoE config by @divakar-amd in #8971
  • [BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser by @gcalmettes in #9020
  • [Core] Make BlockSpaceManagerV2 the default BlockManager to use. by @sroy745 in #8678
  • [Frontend] [Neuron] Parse literals out of override-neuron-config by @xendo in #8959
  • [misc] add forward context for attention by @youkaichao in #9029
  • Fix failing spec decode test by @sroy745 in #9054
  • [Bugfix] Weight loading fix for OPT model by @domenVres in #9042
  • [Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model by @sydnash in #8405
  • [CI/Build] Per file CUDA Archs (improve wheel size and dev build times) by @LucasWilkinson in #8845
  • [Misc] Enable multi-step output streaming by default by @mgoin in #9047
  • [Models] Add remaining model PP support by @andoorve in #7168
  • [Misc] Move registry to its own file by @DarkLight1337 in #9064
  • [Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL by @whyiug in #9071
  • [Bugfix] Flash attention arches not getting set properly by @LucasWilkinson in #9062
  • [Model] add a bunch of supported lora modules for mixtral by @prashantgupta24 in #9008
  • Remove AMD Ray Summit Banner by @simon-mo in #9075
  • [Hardware][PowerPC] Make oneDNN dependency optional for Power by @varad-ahirwadkar in #9039
  • [Core][VLM] Test registration for OOT multimodal models by @ywang96 in #8717
  • Adds truncate_prompt_tokens param for embeddings creation by @flaviabeo in #8999
  • [Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE by @ElizaWszola in #8973
  • [CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang by @KuntaiDu in #7412
  • [Misc] Improved prefix cache example by @Imss27 in #9077
  • [Misc] Add random seed for prefix cache benchmark by @Imss27 in #9081
  • [Misc] Fix CI lint by @comaniac in #9085
  • [Hardware][Neuron] Add on-device sampling support for Neuron by @chongmni-aws in #8746
  • [torch.compile] improve allreduce registration by @youkaichao in #9061
  • [Doc] Update README.md with Ray summit slides by @zhuohan123 in #9088
  • [Bugfix] use blockmanagerv1 for encoder-decoder by @heheda12345 in #9084
  • [Bugfix] Fixes for Phi3v and Ultravox Multimodal EmbeddingInputs Support by @hhzhang16 in #8979
  • [Model] Support Gemma2 embedding model by @xyang16 in #9004
  • [Bugfix] Deprecate registration of custom configs to huggingface by @heheda12345 in #9083
  • [Bugfix] Fix order of arguments matters in config.yaml by @Imss27 in #8960
  • [core] use forward context for flash infer by @youkaichao in #9097
  • [Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model by @tjtanaa in #9101
  • [Frontend] API support for beam search by @LunrEclipse in #9087
  • [Misc] Remove user-facing error for removed VLM args by @DarkLight1337 in #9104
  • [Model] PP support for embedding models and update docs by @DarkLight1337 in #9090
  • [Bugfix] fix tool_parser error handling when serve a model not support it by @liuyanyi in #8709
  • [Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling by @varun-sundar-rabindranath in #9038
  • [Bugfix][Hardware][CPU] Fix CPU model input for decode by @Isotr0py in #9044
  • [BugFix][Core] Fix BlockManagerV2 when Encoder Input is None by @sroy745 in #9103
  • [core] remove beam search from the core by @youkaichao in #9105
  • [Model] Explicit interface for vLLM models and support OOT embedding models by @DarkLight1337 in #9108
  • [Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend by @Isotr0py in #9089
  • [Core] Refactor GGUF parameters packing and forwarding by @Isotr0py in #8859
  • [Model] Support NVLM-D and fix QK Norm in InternViT by @DarkLight1337 in #9045
  • [Doc]: Add deploying_with_k8s guide by @haitwang-cloud in #8451
  • [CI/Build] Add linting for github actions workflows by @russellb in #7876
  • [Doc] Include performance benchmark in README by @KuntaiDu in #9135
  • [misc] fix comment and variable name by @youkaichao in #9139
  • Add Slack to README by @simon-mo in #9137
  • [misc] update utils to support comparing multiple settings by @youkaichao in #9140
  • [Intel GPU] Fix xpu decode input by @jikunshang in #9145
  • [misc] improve ux on readme by @youkaichao in #9147
  • [Frontend] API support for beam search for MQLLMEngine by @LunrEclipse in #9117
  • [Core][Frontend] Add Support for Inference Time mm_processor_kwargs by @alex-jw-brooks in #9131
  • [Frontend] Add Early Validation For Chat Template / Tool Call Parser by @alex-jw-brooks in #9151
  • [CI/Build] Add examples folder into Docker image so that we can leverage the templates*.jinja when serving models by @panpan0000 in #8758
  • [Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing by @dtrifiro in #8537
  • [Doc] Update vlm.rst to include an example on videos by @sayakpaul in #9155
  • [Doc] Improve contributing and installation documentation by @rafvasq in #9132
  • [Bugfix] Try to handle older versions of pytorch by @bnellnm in #9086
  • mypy: check additional directories by @russellb in #9162
  • Add lm-eval directly to requirements-test.txt by @mgoin in #9161
  • support bitsandbytes quantization with more models by @chenqianfzh in #9148
  • Add classifiers in setup.py by @terrytangyuan in #9171
  • Update link to KServe deployment guide by @terrytangyuan in #9173
  • [Misc] Improve validation errors around best_of and n by @tjohnson31415 in #9167
  • [Bugfix][Doc] Report neuron error in output by @joerowell in #9159
  • [Model] Remap FP8 kv_scale in CommandR and DBRX by @hliuca in #9174
  • [Frontend] Log the maximum supported concurrency by @AlpinDale in #8831
  • [Bugfix] Optimize composite weight loading and fix EAGLE weight loading by @DarkLight1337 in #9160
  • [ci][test] use load dummy for testing by @youkaichao in #9165
  • [Doc] Fix VLM prompt placeholder sample bug by @ycool in #9170
  • [Bugfix] Fix lora loading for Compressed Tensors in #9120 by @fahadh4ilyas in #9179
  • [Bugfix] Access get_vocab instead of vocab in tool parsers by @DarkLight1337 in #9188
  • Add Dependabot configuration for GitHub Actions updates by @EwoutH in #1217
  • [Hardware][CPU] Support AWQ for CPU backend by @bigPYJ1151 in #7515
  • [CI/Build] mypy: check vllm/entrypoints by @russellb in #9194
  • [CI/Build] Update Dockerfile install+deploy image to ubuntu 22.04 by @mgoin in #9130
  • [Core] Fix invalid args to _process_request by @russellb in #9201
  • [misc] improve model support check in another process by @youkaichao in #9208
  • [Bugfix] Fix Weight Loading Multiple GPU Test - Large Models by @mgoin in #9213
  • [Bugfix] Machete garbage results for some models (large K dim) by @LucasWilkinson in #9212
  • [Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 by @sroy745 in #9149
  • [Bugfix] Fix lm_head weights tying with lora for llama by @Isotr0py in #9227
  • [Model] support input image embedding for minicpmv by @whyiug in #9237
  • [OpenVINO] Use torch 2.4.0 and newer optimum version by @ilya-lavrenov in #9121
  • [Bugfix] Fix Machete unittests failing with NotImplementedError by @LucasWilkinson in #9218
  • [Doc] Improve debugging documentation by @rafvasq in #9204
  • [CI/Build] Make the Dockerfile.cpu file's PIP_EXTRA_INDEX_URL Configurable as a Build Argument by @jyono in #9252
  • Suggest codeowners for the core components by @simon-mo in #9210
  • [torch.compile] integration with compilation control by @youkaichao in #9058
  • Bump actions/github-script from 6 to 7 by @dependabot in #9197
  • Bump actions/checkout from 3 to 4 by @dependabot in #9196
  • Bump actions/setup-python from 3 to 5 by @dependabot in #9195
  • [ci/build] Add placeholder command for custom models test and add comments by @khluu in #9262
  • [torch.compile] generic decorators by @youkaichao in #9258
  • [Doc][Neuron] add note to neuron documentation about resolving triton issue by @omrishiv in #9257
  • [Misc] Fix sampling from sonnet for long context case by @Imss27 in #9235
  • [misc] hide best_of from engine by @youkaichao in #9261
  • [Misc] Collect model support info in a single process per model by @DarkLight1337 in #9233
  • [Misc][LoRA] Support loading LoRA weights for target_modules in reg format by @jeejeelee in #9275
  • [Bugfix] Fix priority in multiprocessing engine by @schoennenbeck in #9277
  • [Model] Support Mamba by @tlrmchlsmth in #6484
  • [Kernel] adding fused moe kernel config for L40S TP4 by @bringlein in #9245
  • [Model] Add GLM-4v support and meet vllm==0.6.2 by @sixsixcoder in #9242
  • [Doc] Remove outdated comment to avoid misunderstanding by @homeffjy in #9287
  • [Doc] Compatibility matrix for mutual exclusive features by @wallashss in #8512
  • [Bugfix][CI/Build] Fix docker build where CUDA archs < 7.0 are being detected by @LucasWilkinson in #9254
  • [Bugfix] Sets is_first_step_output for TPUModelRunner by @allenwang28 in #9202
  • [bugfix] fix f-string for error by @prashantgupta24 in #9295
  • [BugFix] Fix tool call finish reason in streaming case by @maxdebayser in #9209
  • [SpecDec] Remove Batch Expansion (2/3) by @LiuXiaoxuanPKU in #9298
  • [Bugfix] Fix bug of xformer prefill for encoder-decoder by @xiangxu-google in #9026
  • [Misc][Installation] Improve source installation script and related documentation by @cermeng in #9309
  • [Bugfix]Fix MiniCPM's LoRA bug by @jeejeelee in #9286
  • [CI] Fix merge conflict by @LiuXiaoxuanPKU in #9317
  • [Bugfix] Bandaid fix for speculative decoding tests by @tlrmchlsmth in #9327
  • [Model] Molmo vLLM Integration by @mrsalehi in #9016
  • [Hardware][intel GPU] add async output process for xpu by @jikunshang in #8897
  • [CI/Build] setuptools-scm fixes by @dtrifiro in #8900
  • [Docs] Remove PDF build from Readthedocs by @simon-mo in #9347

New Contributors

  • @fyuan1316 made their first contribution in #8834
  • @panpan0000 made their first contribution in #8830
  • @bvrockwell made their first contribution in #8871
  • @tylertitsworth made their first contribution in #7824
  • @tastelikefeet made their first contribution in #8443
  • @nFunctor made their first contribution in #8928
  • @zhuzilin made their first contribution in #8896
  • @juncheoll made their first contribution in #8944
  • @vlsav made their first contribution in #8997
  • @sshlyapn made their first contribution in #8192
  • @gcalmettes made their first contribution in #9020
  • @xendo made their first contribution in #8959
  • @domenVres made their first contribution in #9042
  • @sydnash made their first contribution in #8405
  • @varad-ahirwadkar made their first contribution in #9039
  • @flaviabeo made their first contribution in #8999
  • @chongmni-aws made their first contribution in #8746
  • @hhzhang16 made their first contribution in #8979
  • @xyang16 made their first contribution in #9004
  • @LunrEclipse made their first contribution in #9087
  • @sayakpaul made their first contribution in #9155
  • @joerowell made their first contribution in #9159
  • @AlpinDale made their first contribution in #8831
  • @ycool made their first contribution in #9170
  • @fahadh4ilyas made their first contribution in #9179
  • @EwoutH made their first contribution in #1217
  • @jyono made their first contribution in #9252
  • @dependabot made their first contribution in #9197
  • @bringlein made their first contribution in #9245
  • @sixsixcoder made their first contribution in #9242
  • @homeffjy made their first contribution in #9287
  • @allenwang28 made their first contribution in #9202
  • @cermeng made their first contribution in #9309
  • @mrsalehi made their first contribution in #9016

Full Changelog: v0.6.2...v0.6.3
