vllm-project/vllm v0.6.3

Highlights

Model Support

  • New Models: Mamba (#6484), Granite MoE (#8206), GLM-4v (#9242), Molmo (#9016), NVLM-D (#9045), Qwen2.5-Math-RM-72B (#8896)
  • Expanded functionality:
    • Add Gemma2 embedding model (#9004)
    • Support input embeddings for qwen2vl (#8856), minicpmv (#9237)
    • LoRA:
      • LoRA support for MiniCPMV2.5 (#7199), MiniCPMV2.6 (#8943)
      • Expand lora modules for mixtral (#9008)
    • Pipeline parallelism support for the remaining text and embedding models (#7168, #9090)
    • Expanded bitsandbytes quantization support for Falcon, OPT, Gemma, Gemma2, and Phi (#9148)
    • Tool use (see the example after this list):
      • Add support for Llama 3.1 and 3.2 tool use (#8343)
      • Support tool calling for InternLM2.5 (#8405)
  • Out-of-tree support enhancements: explicit interface for vLLM models and support for OOT embedding models (#9108)
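The new tool-use support is exposed through the OpenAI-compatible chat API. Below is a minimal, illustrative sketch using the standard `openai` client against a locally served model; it assumes the server was launched with tool calling enabled and a tool-call parser matching the model (see the tool-use docs for the exact flags), and the tool definition itself is hypothetical.

```python
# Sketch only: OpenAI-style tool calling against a vLLM server.
# Assumes the server was started with tool calling enabled and a parser
# matching the model; the model name and tool below are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model the server loaded
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)
```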

Documentation

  • New compatibility matrix for mutually exclusive features (#8512)
  • Reorganized installation docs; note that we now publish a per-commit Docker image (#8931)

Hardware Support

  • Cross-attention and Encoder-Decoder models support on x86 CPU backend (#9089)
  • Support AWQ for CPU backend (#7515)
  • Add async output processor for xpu (#8897)
  • Add on-device sampling support for Neuron (#8746)

Architectural Enhancements

  • Progress in refactoring vLLM's core:
    • Removed batch expansion from speculative decoding (#8839, #9298)
    • Block manager V2 is now the default, an internal refactoring toward a cleaner and better tested code path (#8678)
    • Moving beam search from the core to the API level (#9105, #9087, #9117, #8928)
    • Move guided decoding params into sampling params (#8252)
  • Torch Compile:
    • You can now set the environment variable VLLM_TORCH_COMPILE_LEVEL to control the level of torch.compile compilation and integration (#9058). Along with various improvements (#8982, #9258, #906, #8875), VLLM_TORCH_COMPILE_LEVEL=3 turns on Inductor's full-graph compilation without vLLM's custom ops.
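For illustration, a minimal sketch of opting into the torch.compile integration from Python; the model name is arbitrary, and the exact semantics of each compilation level may change between commits, so treat the level value as an assumption taken from the note above.

```python
# Sketch only: enable vLLM's torch.compile integration via the env var above.
# Level 3 is assumed (per the release note) to mean Inductor full-graph
# compilation without vLLM's custom ops; semantics may differ per commit.
import os

os.environ["VLLM_TORCH_COMPILE_LEVEL"] = "3"  # set before constructing the engine

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any supported model
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```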

Others

  • Performance enhancements that turn on multi-step scheduling by default (#8804, #8645, #8378)
  • Enhancements towards priority scheduling (#8965, #8956, #8850)
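As a reference for the priority-scheduling work, here is a minimal sketch; the `scheduling_policy` engine argument and the per-request `priority` keyword are taken from the PRs above, and their exact names and ordering semantics (lower vs. higher priority first) should be treated as assumptions.

```python
# Sketch only: opting into priority scheduling (argument names and ordering
# semantics assumed from the PRs above; verify against the EngineArgs docs).
from vllm import EngineArgs, LLMEngine, SamplingParams

engine = LLMEngine.from_engine_args(
    EngineArgs(model="facebook/opt-125m", scheduling_policy="priority")
)

params = SamplingParams(max_tokens=16)
engine.add_request("urgent", "Summarize vLLM v0.6.3 in one line.", params, priority=0)
engine.add_request("background", "Write a short poem about GPUs.", params, priority=10)

while engine.has_unfinished_requests():
    for out in engine.step():
        if out.finished:
            print(out.request_id, out.outputs[0].text)
```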

What's Changed

  • [Misc] Update config loading for Qwen2-VL and remove Granite by @ywang96 in #8837
  • [Build/CI] Upgrade to gcc 10 in the base build Docker image by @tlrmchlsmth in #8814
  • [Docs] Add README to the build docker image by @mgoin in #8825
  • [CI/Build] Fix missing ci dependencies by @fyuan1316 in #8834
  • [misc][installation] build from source without compilation by @youkaichao in #8818
  • [ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM by @khluu in #8872
  • [Bugfix] Include encoder prompts len to non-stream api usage response by @Pernekhan in #8861
  • [Misc] Change dummy profiling and BOS fallback warns to log once by @mgoin in #8820
  • [Bugfix] Fix print_warning_once's line info by @tlrmchlsmth in #8867
  • fix validation: Only set tool_choice auto if at least one tool is provided by @chiragjn in #8568
  • [Bugfix] Fixup advance_step.cu warning by @tlrmchlsmth in #8815
  • [BugFix] Fix test breakages from transformers 4.45 upgrade by @njhill in #8829
  • [Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility by @DarkLight1337 in #8764
  • [Feature] Add support for Llama 3.1 and 3.2 tool use by @maxdebayser in #8343
  • [Core] Rename PromptInputs and inputs with backward compatibility by @DarkLight1337 in #8876
  • [misc] fix collect env by @youkaichao in #8894
  • [MISC] Fix invalid escape sequence '\' by @panpan0000 in #8830
  • [Bugfix][VLM] Fix Fuyu batching inference with max_num_seqs>1 by @Isotr0py in #8892
  • [TPU] Update pallas.py to support trillium by @bvrockwell in #8871
  • [torch.compile] use empty tensor instead of None for profiling by @youkaichao in #8875
  • [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method by @ProExpertProg in #7271
  • [Bugfix] fix for deepseek w4a16 by @LucasWilkinson in #8906
  • [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path by @varun-sundar-rabindranath in #8378
  • [misc][distributed] add VLLM_SKIP_P2P_CHECK flag by @youkaichao in #8911
  • [Core] Priority-based scheduling in async engine by @schoennenbeck in #8850
  • [misc] fix wheel name by @youkaichao in #8919
  • [Bugfix][Intel] Fix XPU Dockerfile Build by @tylertitsworth in #7824
  • [Misc] Remove vLLM patch of BaichuanTokenizer by @DarkLight1337 in #8921
  • [Bugfix] Fix code for downloading models from modelscope by @tastelikefeet in #8443
  • [Bugfix] Fix PP for Multi-Step by @varun-sundar-rabindranath in #8887
  • [CI/Build] Update models tests & examples by @DarkLight1337 in #8874
  • [Frontend] Make beam search emulator temperature modifiable by @nFunctor in #8928
  • [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 by @heheda12345 in #8891
  • [doc] organize installation doc and expose per-commit docker by @youkaichao in #8931
  • [Core] Improve choice of Python multiprocessing method by @russellb in #8823
  • [Bugfix] Block manager v2 with preemption and lookahead slots by @sroy745 in #8824
  • [Bugfix] Fix Marlin MoE act order when is_k_full == False by @ElizaWszola in #8741
  • [CI/Build] Add test decorator for minimum GPU memory by @DarkLight1337 in #8925
  • [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching by @tlrmchlsmth in #8930
  • [Model] Support Qwen2.5-Math-RM-72B by @zhuzilin in #8896
  • [Model][LoRA]LoRA support added for MiniCPMV2.5 by @jeejeelee in #7199
  • [BugFix] Fix seeded random sampling with encoder-decoder models by @njhill in #8870
  • [Misc] Fix typo in BlockSpaceManagerV1 by @juncheoll in #8944
  • [Frontend] Added support for HF's new continue_final_message parameter by @danieljannai21 in #8942
  • [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model by @mzusman in #8533
  • [Model] support input embeddings for qwen2vl by @whyiug in #8856
  • [Misc][CI/Build] Include cv2 via mistral_common[opencv] by @ywang96 in #8951
  • [Model][LoRA]LoRA support added for MiniCPMV2.6 by @jeejeelee in #8943
  • [Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg by @Isotr0py in #8946
  • [Core] Make scheduling policy settable via EngineArgs by @schoennenbeck in #8956
  • [Misc] Adjust max_position_embeddings for LoRA compatibility by @jeejeelee in #8957
  • [ci] Add CODEOWNERS for test directories by @khluu in #8795
  • [CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. by @LiuXiaoxuanPKU in #8975
  • [Frontend][Core] Move guided decoding params into sampling params by @joerunde in #8252
  • [CI/Build] Fix machete generated kernel files ordering by @khluu in #8976
  • [torch.compile] fix tensor alias by @youkaichao in #8982
  • [Misc] add process_weights_after_loading for DummyLoader by @divakar-amd in #8969
  • [Bugfix] Fix Fuyu tensor parallel inference by @Isotr0py in #8986
  • [Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders by @alex-jw-brooks in #8991
  • [Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API by @schoennenbeck in #8965
  • [Doc] Update list of supported models by @DarkLight1337 in #8987
  • Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows by @vlsav in #8997
  • [Spec Decode] (1/2) Remove batch expansion by @LiuXiaoxuanPKU in #8839
  • [Core] Combined support for multi-step scheduling, chunked prefill & prefix caching by @afeldman-nm in #8804
  • [Misc] Update Default Image Mapper Error Log by @alex-jw-brooks in #8977
  • [Core] CUDA Graphs for Multi-Step + Chunked-Prefill by @varun-sundar-rabindranath in #8645
  • [OpenVINO] Enable GPU support for OpenVINO vLLM backend by @sshlyapn in #8192
  • [Model] Adding Granite MoE. by @shawntan in #8206
  • [Doc] Update Granite model docs by @njhill in #9025
  • [Bugfix] example template should not add parallel_tool_prompt if tools is none by @tjohnson31415 in #9007
  • [Misc] log when using default MoE config by @divakar-amd in #8971
  • [BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser by @gcalmettes in #9020
  • [Core] Make BlockSpaceManagerV2 the default BlockManager to use. by @sroy745 in #8678
  • [Frontend] [Neuron] Parse literals out of override-neuron-config by @xendo in #8959
  • [misc] add forward context for attention by @youkaichao in #9029
  • Fix failing spec decode test by @sroy745 in #9054
  • [Bugfix] Weight loading fix for OPT model by @domenVres in #9042
  • [Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model by @sydnash in #8405
  • [CI/Build] Per file CUDA Archs (improve wheel size and dev build times) by @LucasWilkinson in #8845
  • [Misc] Enable multi-step output streaming by default by @mgoin in #9047
  • [Models] Add remaining model PP support by @andoorve in #7168
  • [Misc] Move registry to its own file by @DarkLight1337 in #9064
  • [Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL by @whyiug in #9071
  • [Bugfix] Flash attention arches not getting set properly by @LucasWilkinson in #9062
  • [Model] add a bunch of supported lora modules for mixtral by @prashantgupta24 in #9008
  • Remove AMD Ray Summit Banner by @simon-mo in #9075
  • [Hardware][PowerPC] Make oneDNN dependency optional for Power by @varad-ahirwadkar in #9039
  • [Core][VLM] Test registration for OOT multimodal models by @ywang96 in #8717
  • Adds truncate_prompt_tokens param for embeddings creation by @flaviabeo in #8999
  • [Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE by @ElizaWszola in #8973
  • [CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang by @KuntaiDu in #7412
  • [Misc] Improved prefix cache example by @Imss27 in #9077
  • [Misc] Add random seed for prefix cache benchmark by @Imss27 in #9081
  • [Misc] Fix CI lint by @comaniac in #9085
  • [Hardware][Neuron] Add on-device sampling support for Neuron by @chongmni-aws in #8746
  • [torch.compile] improve allreduce registration by @youkaichao in #9061
  • [Doc] Update README.md with Ray summit slides by @zhuohan123 in #9088
  • [Bugfix] use blockmanagerv1 for encoder-decoder by @heheda12345 in #9084
  • [Bugfix] Fixes for Phi3v and Ultravox Multimodal EmbeddingInputs Support by @hhzhang16 in #8979
  • [Model] Support Gemma2 embedding model by @xyang16 in #9004
  • [Bugfix] Deprecate registration of custom configs to huggingface by @heheda12345 in #9083
  • [Bugfix] Fix order of arguments matters in config.yaml by @Imss27 in #8960
  • [core] use forward context for flash infer by @youkaichao in #9097
  • [Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model by @tjtanaa in #9101
  • [Frontend] API support for beam search by @LunrEclipse in #9087
  • [Misc] Remove user-facing error for removed VLM args by @DarkLight1337 in #9104
  • [Model] PP support for embedding models and update docs by @DarkLight1337 in #9090
  • [Bugfix] fix tool_parser error handling when serve a model not support it by @liuyanyi in #8709
  • [Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling by @varun-sundar-rabindranath in #9038
  • [Bugfix][Hardware][CPU] Fix CPU model input for decode by @Isotr0py in #9044
  • [BugFix][Core] Fix BlockManagerV2 when Encoder Input is None by @sroy745 in #9103
  • [core] remove beam search from the core by @youkaichao in #9105
  • [Model] Explicit interface for vLLM models and support OOT embedding models by @DarkLight1337 in #9108
  • [Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend by @Isotr0py in #9089
  • [Core] Refactor GGUF parameters packing and forwarding by @Isotr0py in #8859
  • [Model] Support NVLM-D and fix QK Norm in InternViT by @DarkLight1337 in #9045
  • [Doc]: Add deploying_with_k8s guide by @haitwang-cloud in #8451
  • [CI/Build] Add linting for github actions workflows by @russellb in #7876
  • [Doc] Include performance benchmark in README by @KuntaiDu in #9135
  • [misc] fix comment and variable name by @youkaichao in #9139
  • Add Slack to README by @simon-mo in #9137
  • [misc] update utils to support comparing multiple settings by @youkaichao in #9140
  • [Intel GPU] Fix xpu decode input by @jikunshang in #9145
  • [misc] improve ux on readme by @youkaichao in #9147
  • [Frontend] API support for beam search for MQLLMEngine by @LunrEclipse in #9117
  • [Core][Frontend] Add Support for Inference Time mm_processor_kwargs by @alex-jw-brooks in #9131
  • [Frontend] Add Early Validation For Chat Template / Tool Call Parser by @alex-jw-brooks in #9151
  • [CI/Build] Add examples folder into Docker image so that we can leverage the templates*.jinja when serving models by @panpan0000 in #8758
  • [Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing by @dtrifiro in #8537
  • [Doc] Update vlm.rst to include an example on videos by @sayakpaul in #9155
  • [Doc] Improve contributing and installation documentation by @rafvasq in #9132
  • [Bugfix] Try to handle older versions of pytorch by @bnellnm in #9086
  • mypy: check additional directories by @russellb in #9162
  • Add lm-eval directly to requirements-test.txt by @mgoin in #9161
  • support bitsandbytes quantization with more models by @chenqianfzh in #9148
  • Add classifiers in setup.py by @terrytangyuan in #9171
  • Update link to KServe deployment guide by @terrytangyuan in #9173
  • [Misc] Improve validation errors around best_of and n by @tjohnson31415 in #9167
  • [Bugfix][Doc] Report neuron error in output by @joerowell in #9159
  • [Model] Remap FP8 kv_scale in CommandR and DBRX by @hliuca in #9174
  • [Frontend] Log the maximum supported concurrency by @AlpinDale in #8831
  • [Bugfix] Optimize composite weight loading and fix EAGLE weight loading by @DarkLight1337 in #9160
  • [ci][test] use load dummy for testing by @youkaichao in #9165
  • [Doc] Fix VLM prompt placeholder sample bug by @ycool in #9170
  • [Bugfix] Fix lora loading for Compressed Tensors in #9120 by @fahadh4ilyas in #9179
  • [Bugfix] Access get_vocab instead of vocab in tool parsers by @DarkLight1337 in #9188
  • Add Dependabot configuration for GitHub Actions updates by @EwoutH in #1217
  • [Hardware][CPU] Support AWQ for CPU backend by @bigPYJ1151 in #7515
  • [CI/Build] mypy: check vllm/entrypoints by @russellb in #9194
  • [CI/Build] Update Dockerfile install+deploy image to ubuntu 22.04 by @mgoin in #9130
  • [Core] Fix invalid args to _process_request by @russellb in #9201
  • [misc] improve model support check in another process by @youkaichao in #9208
  • [Bugfix] Fix Weight Loading Multiple GPU Test - Large Models by @mgoin in #9213
  • [Bugfix] Machete garbage results for some models (large K dim) by @LucasWilkinson in #9212
  • [Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 by @sroy745 in #9149
  • [Bugfix] Fix lm_head weights tying with lora for llama by @Isotr0py in #9227
  • [Model] support input image embedding for minicpmv by @whyiug in #9237
  • [OpenVINO] Use torch 2.4.0 and newer optimum version by @ilya-lavrenov in #9121
  • [Bugfix] Fix Machete unittests failing with NotImplementedError by @LucasWilkinson in #9218
  • [Doc] Improve debugging documentation by @rafvasq in #9204
  • [CI/Build] Make the Dockerfile.cpu file's PIP_EXTRA_INDEX_URL Configurable as a Build Argument by @jyono in #9252
  • Suggest codeowners for the core components by @simon-mo in #9210
  • [torch.compile] integration with compilation control by @youkaichao in #9058
  • Bump actions/github-script from 6 to 7 by @dependabot in #9197
  • Bump actions/checkout from 3 to 4 by @dependabot in #9196
  • Bump actions/setup-python from 3 to 5 by @dependabot in #9195
  • [ci/build] Add placeholder command for custom models test and add comments by @khluu in #9262
  • [torch.compile] generic decorators by @youkaichao in #9258
  • [Doc][Neuron] add note to neuron documentation about resolving triton issue by @omrishiv in #9257
  • [Misc] Fix sampling from sonnet for long context case by @Imss27 in #9235
  • [misc] hide best_of from engine by @youkaichao in #9261
  • [Misc] Collect model support info in a single process per model by @DarkLight1337 in #9233
  • [Misc][LoRA] Support loading LoRA weights for target_modules in reg format by @jeejeelee in #9275
  • [Bugfix] Fix priority in multiprocessing engine by @schoennenbeck in #9277
  • [Model] Support Mamba by @tlrmchlsmth in #6484
  • [Kernel] adding fused moe kernel config for L40S TP4 by @bringlein in #9245
  • [Model] Add GLM-4v support and meet vllm==0.6.2 by @sixsixcoder in #9242
  • [Doc] Remove outdated comment to avoid misunderstanding by @homeffjy in #9287
  • [Doc] Compatibility matrix for mutual exclusive features by @wallashss in #8512
  • [Bugfix][CI/Build] Fix docker build where CUDA archs < 7.0 are being detected by @LucasWilkinson in #9254
  • [Bugfix] Sets is_first_step_output for TPUModelRunner by @allenwang28 in #9202
  • [bugfix] fix f-string for error by @prashantgupta24 in #9295
  • [BugFix] Fix tool call finish reason in streaming case by @maxdebayser in #9209
  • [SpecDec] Remove Batch Expansion (2/3) by @LiuXiaoxuanPKU in #9298
  • [Bugfix] Fix bug of xformer prefill for encoder-decoder by @xiangxu-google in #9026
  • [Misc][Installation] Improve source installation script and related documentation by @cermeng in #9309
  • [Bugfix]Fix MiniCPM's LoRA bug by @jeejeelee in #9286
  • [CI] Fix merge conflict by @LiuXiaoxuanPKU in #9317
  • [Bugfix] Bandaid fix for speculative decoding tests by @tlrmchlsmth in #9327
  • [Model] Molmo vLLM Integration by @mrsalehi in #9016
  • [Hardware][intel GPU] add async output process for xpu by @jikunshang in #8897
  • [CI/Build] setuptools-scm fixes by @dtrifiro in #8900
  • [Docs] Remove PDF build from Readthedocs by @simon-mo in #9347

New Contributors

  • @fyuan1316 made their first contribution in #8834
  • @panpan0000 made their first contribution in #8830
  • @bvrockwell made their first contribution in #8871
  • @tylertitsworth made their first contribution in #7824
  • @tastelikefeet made their first contribution in #8443
  • @nFunctor made their first contribution in #8928
  • @zhuzilin made their first contribution in #8896
  • @juncheoll made their first contribution in #8944
  • @vlsav made their first contribution in #8997
  • @sshlyapn made their first contribution in #8192
  • @gcalmettes made their first contribution in #9020
  • @xendo made their first contribution in #8959
  • @domenVres made their first contribution in #9042
  • @sydnash made their first contribution in #8405
  • @varad-ahirwadkar made their first contribution in #9039
  • @flaviabeo made their first contribution in #8999
  • @chongmni-aws made their first contribution in #8746
  • @hhzhang16 made their first contribution in #8979
  • @xyang16 made their first contribution in #9004
  • @LunrEclipse made their first contribution in #9087
  • @sayakpaul made their first contribution in #9155
  • @joerowell made their first contribution in #9159
  • @AlpinDale made their first contribution in #8831
  • @ycool made their first contribution in #9170
  • @fahadh4ilyas made their first contribution in #9179
  • @EwoutH made their first contribution in #1217
  • @jyono made their first contribution in #9252
  • @dependabot made their first contribution in #9197
  • @bringlein made their first contribution in #9245
  • @sixsixcoder made their first contribution in #9242
  • @homeffjy made their first contribution in #9287
  • @allenwang28 made their first contribution in #9202
  • @cermeng made their first contribution in #9309
  • @mrsalehi made their first contribution in #9016

Full Changelog: v0.6.2...v0.6.3
