## What's Changed
- [Misc] Update config loading for Qwen2-VL and remove Granite by @ywang96 in #8837
- [Build/CI] Upgrade to gcc 10 in the base build Docker image by @tlrmchlsmth in #8814
- [Docs] Add README to the build docker image by @mgoin in #8825
- [CI/Build] Fix missing ci dependencies by @fyuan1316 in #8834
- [misc][installation] build from source without compilation by @youkaichao in #8818
- [ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM by @khluu in #8872
- [Bugfix] Include encoder prompts len to non-stream api usage response by @Pernekhan in #8861
- [Misc] Change dummy profiling and BOS fallback warns to log once by @mgoin in #8820
- [Bugfix] Fix print_warning_once's line info by @tlrmchlsmth in #8867
- fix validation: Only set tool_choice `auto` if at least one tool is provided by @chiragjn in #8568
- [Bugfix] Fixup advance_step.cu warning by @tlrmchlsmth in #8815
- [BugFix] Fix test breakages from transformers 4.45 upgrade by @njhill in #8829
- [Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility by @DarkLight1337 in #8764
- [Feature] Add support for Llama 3.1 and 3.2 tool use by @maxdebayser in #8343
- [Core] Rename `PromptInputs` and `inputs` with backward compatibility by @DarkLight1337 in #8876
- [misc] fix collect env by @youkaichao in #8894
- [MISC] Fix invalid escape sequence `'\'` by @panpan0000 in #8830
- [Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` by @Isotr0py in #8892
- [TPU] Update pallas.py to support trillium by @bvrockwell in #8871
- [torch.compile] use empty tensor instead of None for profiling by @youkaichao in #8875
- [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method by @ProExpertProg in #7271
- [Bugfix] fix for deepseek w4a16 by @LucasWilkinson in #8906
- [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path by @varun-sundar-rabindranath in #8378
- [misc][distributed] add VLLM_SKIP_P2P_CHECK flag by @youkaichao in #8911
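  A minimal sketch of how the new flag might be used (the flag name comes from the PR title in #8911; the exact behavior and launch command are assumptions — check the vLLM docs for your build):
  ```shell
  # Assumption: setting the flag to 1 skips the GPU peer-to-peer
  # capability probe (useful when you already know your GPUs support P2P).
  export VLLM_SKIP_P2P_CHECK=1
  # ...then launch the server as usual, e.g.:
  # vllm serve <your-model>
  ```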
- [Core] Priority-based scheduling in async engine by @schoennenbeck in #8850
- [misc] fix wheel name by @youkaichao in #8919
- [Bugfix][Intel] Fix XPU Dockerfile Build by @tylertitsworth in #7824
- [Misc] Remove vLLM patch of `BaichuanTokenizer` by @DarkLight1337 in #8921
- [Bugfix] Fix code for downloading models from modelscope by @tastelikefeet in #8443
- [Bugfix] Fix PP for Multi-Step by @varun-sundar-rabindranath in #8887
- [CI/Build] Update models tests & examples by @DarkLight1337 in #8874
- [Frontend] Make beam search emulator temperature modifiable by @nFunctor in #8928
- [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 by @heheda12345 in #8891
- [doc] organize installation doc and expose per-commit docker by @youkaichao in #8931
- [Core] Improve choice of Python multiprocessing method by @russellb in #8823
- [Bugfix] Block manager v2 with preemption and lookahead slots by @sroy745 in #8824
- [Bugfix] Fix Marlin MoE act order when is_k_full == False by @ElizaWszola in #8741
- [CI/Build] Add test decorator for minimum GPU memory by @DarkLight1337 in #8925
- [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching by @tlrmchlsmth in #8930
- [Model] Support Qwen2.5-Math-RM-72B by @zhuzilin in #8896
- [Model][LoRA] LoRA support added for MiniCPMV2.5 by @jeejeelee in #7199
- [BugFix] Fix seeded random sampling with encoder-decoder models by @njhill in #8870
- [Misc] Fix typo in BlockSpaceManagerV1 by @juncheoll in #8944
- [Frontend] Added support for HF's new `continue_final_message` parameter by @danieljannai21 in #8942
- [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model by @mzusman in #8533
- [Model] support input embeddings for qwen2vl by @whyiug in #8856
- [Misc][CI/Build] Include `cv2` via `mistral_common[opencv]` by @ywang96 in #8951
- [Model][LoRA] LoRA support added for MiniCPMV2.6 by @jeejeelee in #8943
- [Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg by @Isotr0py in #8946
- [Core] Make scheduling policy settable via EngineArgs by @schoennenbeck in #8956
- [Misc] Adjust max_position_embeddings for LoRA compatibility by @jeejeelee in #8957
- [ci] Add CODEOWNERS for test directories by @khluu in #8795
- [CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. by @LiuXiaoxuanPKU in #8975
- [Frontend][Core] Move guided decoding params into sampling params by @joerunde in #8252
- [CI/Build] Fix machete generated kernel files ordering by @khluu in #8976
- [torch.compile] fix tensor alias by @youkaichao in #8982
- [Misc] add process_weights_after_loading for DummyLoader by @divakar-amd in #8969
- [Bugfix] Fix Fuyu tensor parallel inference by @Isotr0py in #8986
- [Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders by @alex-jw-brooks in #8991
- [Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API by @schoennenbeck in #8965
- [Doc] Update list of supported models by @DarkLight1337 in #8987
- Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows by @vlsav in #8997
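  The Windows-compatibility change above boils down to passing an explicit encoding when reading and writing JSON; a minimal sketch of the idea (function names are illustrative, not the actual `benchmark_serving.py` API):
  ```python
  import json

  def write_results(path: str, results: dict) -> None:
      # Explicit UTF-8 plus ensure_ascii=False keeps non-ASCII content
      # intact regardless of the platform's default locale encoding
      # (on Windows the default is often not UTF-8).
      with open(path, "w", encoding="utf-8") as f:
          json.dump(results, f, ensure_ascii=False)

  def read_dataset(path: str) -> dict:
      # Mirror the explicit encoding on the read side.
      with open(path, "r", encoding="utf-8") as f:
          return json.load(f)
  ```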
- [Spec Decode] (1/2) Remove batch expansion by @LiuXiaoxuanPKU in #8839
- [Core] Combined support for multi-step scheduling, chunked prefill & prefix caching by @afeldman-nm in #8804
- [Misc] Update Default Image Mapper Error Log by @alex-jw-brooks in #8977
- [Core] CUDA Graphs for Multi-Step + Chunked-Prefill by @varun-sundar-rabindranath in #8645
- [OpenVINO] Enable GPU support for OpenVINO vLLM backend by @sshlyapn in #8192
- [Model] Adding Granite MoE. by @shawntan in #8206
- [Doc] Update Granite model docs by @njhill in #9025
- [Bugfix] example template should not add parallel_tool_prompt if tools is none by @tjohnson31415 in #9007
- [Misc] log when using default MoE config by @divakar-amd in #8971
- [BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser by @gcalmettes in #9020
- [Core] Make BlockSpaceManagerV2 the default BlockManager to use. by @sroy745 in #8678
- [Frontend] [Neuron] Parse literals out of override-neuron-config by @xendo in #8959
- [misc] add forward context for attention by @youkaichao in #9029
- Fix failing spec decode test by @sroy745 in #9054
- [Bugfix] Weight loading fix for OPT model by @domenVres in #9042
- [Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model by @sydnash in #8405
- [CI/Build] Per file CUDA Archs (improve wheel size and dev build times) by @LucasWilkinson in #8845
- [Misc] Enable multi-step output streaming by default by @mgoin in #9047
- [Models] Add remaining model PP support by @andoorve in #7168
- [Misc] Move registry to its own file by @DarkLight1337 in #9064
- [Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL by @whyiug in #9071
- [Bugfix] Flash attention arches not getting set properly by @LucasWilkinson in #9062
- [Model] add a bunch of supported lora modules for mixtral by @prashantgupta24 in #9008
- Remove AMD Ray Summit Banner by @simon-mo in #9075
- [Hardware][PowerPC] Make oneDNN dependency optional for Power by @varad-ahirwadkar in #9039
- [Core][VLM] Test registration for OOT multimodal models by @ywang96 in #8717
- Adds truncate_prompt_tokens param for embeddings creation by @flaviabeo in #8999
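  A sketch of what a request using the new parameter might look like (the payload shape is an assumption based on the OpenAI-style embeddings endpoint; only the parameter name `truncate_prompt_tokens` comes from the PR):
  ```python
  # Hypothetical embeddings request body; "truncate_prompt_tokens" is the
  # parameter added in #8999, the other fields follow the OpenAI schema.
  payload = {
      "model": "intfloat/e5-mistral-7b-instruct",  # illustrative model name
      "input": "a very long document " * 1000,
      "truncate_prompt_tokens": 512,  # cap the prompt at 512 tokens server-side
  }
  ```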
- [Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE by @ElizaWszola in #8973
- [CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang by @KuntaiDu in #7412
- [Misc] Improved prefix cache example by @Imss27 in #9077
- [Misc] Add random seed for prefix cache benchmark by @Imss27 in #9081
- [Misc] Fix CI lint by @comaniac in #9085
- [Hardware][Neuron] Add on-device sampling support for Neuron by @chongmni-aws in #8746
- [torch.compile] improve allreduce registration by @youkaichao in #9061
- [Doc] Update README.md with Ray summit slides by @zhuohan123 in #9088
- [Bugfix] use blockmanagerv1 for encoder-decoder by @heheda12345 in #9084
- [Bugfix] Fixes for Phi3v and Ultravox Multimodal EmbeddingInputs Support by @hhzhang16 in #8979
- [Model] Support Gemma2 embedding model by @xyang16 in #9004
- [Bugfix] Deprecate registration of custom configs to huggingface by @heheda12345 in #9083
- [Bugfix] Fix order of arguments matters in config.yaml by @Imss27 in #8960
- [core] use forward context for flash infer by @youkaichao in #9097
- [Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model by @tjtanaa in #9101
- [Frontend] API support for beam search by @LunrEclipse in #9087
- [Misc] Remove user-facing error for removed VLM args by @DarkLight1337 in #9104
- [Model] PP support for embedding models and update docs by @DarkLight1337 in #9090
- [Bugfix] fix tool_parser error handling when serve a model not support it by @liuyanyi in #8709
- [Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling by @varun-sundar-rabindranath in #9038
- [Bugfix][Hardware][CPU] Fix CPU model input for decode by @Isotr0py in #9044
- [BugFix][Core] Fix BlockManagerV2 when Encoder Input is None by @sroy745 in #9103
- [core] remove beam search from the core by @youkaichao in #9105
- [Model] Explicit interface for vLLM models and support OOT embedding models by @DarkLight1337 in #9108
- [Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend by @Isotr0py in #9089
- [Core] Refactor GGUF parameters packing and forwarding by @Isotr0py in #8859
- [Model] Support NVLM-D and fix QK Norm in InternViT by @DarkLight1337 in #9045
- [Doc]: Add deploying_with_k8s guide by @haitwang-cloud in #8451
- [CI/Build] Add linting for github actions workflows by @russellb in #7876
- [Doc] Include performance benchmark in README by @KuntaiDu in #9135
- [misc] fix comment and variable name by @youkaichao in #9139
- Add Slack to README by @simon-mo in #9137
- [misc] update utils to support comparing multiple settings by @youkaichao in #9140
- [Intel GPU] Fix xpu decode input by @jikunshang in #9145
- [misc] improve ux on readme by @youkaichao in #9147
- [Frontend] API support for beam search for MQLLMEngine by @LunrEclipse in #9117
- [Core][Frontend] Add Support for Inference Time mm_processor_kwargs by @alex-jw-brooks in #9131
- [Frontend] Add Early Validation For Chat Template / Tool Call Parser by @alex-jw-brooks in #9151
- [CI/Build] Add examples folder into Docker image so that we can leverage the templates*.jinja when serving models by @panpan0000 in #8758
- [Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing by @dtrifiro in #8537
- [Doc] Update vlm.rst to include an example on videos by @sayakpaul in #9155
- [Doc] Improve contributing and installation documentation by @rafvasq in #9132
- [Bugfix] Try to handle older versions of pytorch by @bnellnm in #9086
- mypy: check additional directories by @russellb in #9162
- Add `lm-eval` directly to requirements-test.txt by @mgoin in #9161
- support bitsandbytes quantization with more models by @chenqianfzh in #9148
- Add classifiers in setup.py by @terrytangyuan in #9171
- Update link to KServe deployment guide by @terrytangyuan in #9173
- [Misc] Improve validation errors around best_of and n by @tjohnson31415 in #9167
- [Bugfix][Doc] Report neuron error in output by @joerowell in #9159
- [Model] Remap FP8 kv_scale in CommandR and DBRX by @hliuca in #9174
- [Frontend] Log the maximum supported concurrency by @AlpinDale in #8831
- [Bugfix] Optimize composite weight loading and fix EAGLE weight loading by @DarkLight1337 in #9160
- [ci][test] use load dummy for testing by @youkaichao in #9165
- [Doc] Fix VLM prompt placeholder sample bug by @ycool in #9170
- [Bugfix] Fix lora loading for Compressed Tensors in #9120 by @fahadh4ilyas in #9179
- [Bugfix] Access `get_vocab` instead of `vocab` in tool parsers by @DarkLight1337 in #9188
- Add Dependabot configuration for GitHub Actions updates by @EwoutH in #1217
- [Hardware][CPU] Support AWQ for CPU backend by @bigPYJ1151 in #7515
- [CI/Build] mypy: check vllm/entrypoints by @russellb in #9194
- [CI/Build] Update Dockerfile install+deploy image to ubuntu 22.04 by @mgoin in #9130
- [Core] Fix invalid args to _process_request by @russellb in #9201
- [misc] improve model support check in another process by @youkaichao in #9208
- [Bugfix] Fix Weight Loading Multiple GPU Test - Large Models by @mgoin in #9213
- [Bugfix] Machete garbage results for some models (large K dim) by @LucasWilkinson in #9212
- [Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 by @sroy745 in #9149
- [Bugfix] Fix lm_head weights tying with lora for llama by @Isotr0py in #9227
- [Model] support input image embedding for minicpmv by @whyiug in #9237
- [OpenVINO] Use torch 2.4.0 and newer optimum version by @ilya-lavrenov in #9121
- [Bugfix] Fix Machete unittests failing with `NotImplementedError` by @LucasWilkinson in #9218
- [Doc] Improve debugging documentation by @rafvasq in #9204
- [CI/Build] Make the `Dockerfile.cpu` file's `PIP_EXTRA_INDEX_URL` configurable as a build argument by @jyono in #9252
- Suggest codeowners for the core components by @simon-mo in #9210
- [torch.compile] integration with compilation control by @youkaichao in #9058
- Bump actions/github-script from 6 to 7 by @dependabot in #9197
- Bump actions/checkout from 3 to 4 by @dependabot in #9196
- Bump actions/setup-python from 3 to 5 by @dependabot in #9195
- [ci/build] Add placeholder command for custom models test and add comments by @khluu in #9262
- [torch.compile] generic decorators by @youkaichao in #9258
- [Doc][Neuron] add note to neuron documentation about resolving triton issue by @omrishiv in #9257
- [Misc] Fix sampling from sonnet for long context case by @Imss27 in #9235
- [misc] hide best_of from engine by @youkaichao in #9261
- [Misc] Collect model support info in a single process per model by @DarkLight1337 in #9233
- [Misc][LoRA] Support loading LoRA weights for target_modules in reg format by @jeejeelee in #9275
- [Bugfix] Fix priority in multiprocessing engine by @schoennenbeck in #9277
- [Model] Support Mamba by @tlrmchlsmth in #6484
- [Kernel] adding fused moe kernel config for L40S TP4 by @bringlein in #9245
- [Model] Add GLM-4v support and meet vllm==0.6.2 by @sixsixcoder in #9242
- [Doc] Remove outdated comment to avoid misunderstanding by @homeffjy in #9287
- [Doc] Compatibility matrix for mutual exclusive features by @wallashss in #8512
- [Bugfix][CI/Build] Fix docker build where CUDA archs < 7.0 are being detected by @LucasWilkinson in #9254
- [Bugfix] Sets `is_first_step_output` for TPUModelRunner by @allenwang28 in #9202
- [bugfix] fix f-string for error by @prashantgupta24 in #9295
- [BugFix] Fix tool call finish reason in streaming case by @maxdebayser in #9209
- [SpecDec] Remove Batch Expansion (2/3) by @LiuXiaoxuanPKU in #9298
- [Bugfix] Fix bug of xformer prefill for encoder-decoder by @xiangxu-google in #9026
- [Misc][Installation] Improve source installation script and related documentation by @cermeng in #9309
- [Bugfix]Fix MiniCPM's LoRA bug by @jeejeelee in #9286
- [CI] Fix merge conflict by @LiuXiaoxuanPKU in #9317
- [Bugfix] Bandaid fix for speculative decoding tests by @tlrmchlsmth in #9327
- [Model] Molmo vLLM Integration by @mrsalehi in #9016
- [Hardware][intel GPU] add async output process for xpu by @jikunshang in #8897
- [CI/Build] setuptools-scm fixes by @dtrifiro in #8900
- [Docs] Remove PDF build from Readtehdocs by @simon-mo in #9347
## New Contributors
- @fyuan1316 made their first contribution in #8834
- @panpan0000 made their first contribution in #8830
- @bvrockwell made their first contribution in #8871
- @tylertitsworth made their first contribution in #7824
- @tastelikefeet made their first contribution in #8443
- @nFunctor made their first contribution in #8928
- @zhuzilin made their first contribution in #8896
- @juncheoll made their first contribution in #8944
- @vlsav made their first contribution in #8997
- @sshlyapn made their first contribution in #8192
- @gcalmettes made their first contribution in #9020
- @xendo made their first contribution in #8959
- @domenVres made their first contribution in #9042
- @sydnash made their first contribution in #8405
- @varad-ahirwadkar made their first contribution in #9039
- @flaviabeo made their first contribution in #8999
- @chongmni-aws made their first contribution in #8746
- @hhzhang16 made their first contribution in #8979
- @xyang16 made their first contribution in #9004
- @LunrEclipse made their first contribution in #9087
- @sayakpaul made their first contribution in #9155
- @joerowell made their first contribution in #9159
- @AlpinDale made their first contribution in #8831
- @ycool made their first contribution in #9170
- @fahadh4ilyas made their first contribution in #9179
- @EwoutH made their first contribution in #1217
- @jyono made their first contribution in #9252
- @dependabot made their first contribution in #9197
- @bringlein made their first contribution in #9245
- @sixsixcoder made their first contribution in #9242
- @homeffjy made their first contribution in #9287
- @allenwang28 made their first contribution in #9202
- @cermeng made their first contribution in #9309
- @mrsalehi made their first contribution in #9016
Full Changelog: v0.6.2...v0.6.3