vllm v0.5.0

Highlights

Production Features

Hardware Support

  • Improvements to the Intel CPU CI (#4113, #5241)
  • Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047)

Others

  • Debugging tips documentation (#5409, #5430)
  • Dynamic Per-Token Activation Quantization (#5037)
  • Customizable RoPE theta (#5197); see the sketch below
  • Enable passing multiple LoRA adapters at once to generate() (#5300); see the sketch below
  • OpenAI tools support named functions (#5032); see the sketch below
  • Support stream_options for OpenAI protocol (#5319, #5135); see the sketch below
  • Update Outlines Integration from FSM to Guide (#4109)
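
For the customizable RoPE theta change (#5197), here is a minimal sketch of how an override might be passed. It assumes the release exposes the RoPE base frequency as a rope_theta engine argument alongside the existing rope_scaling override; the model name and value below are placeholders, not taken from the release notes.

```python
from vllm import LLM, SamplingParams

# Assumption: #5197 exposes the RoPE base frequency override as an engine
# argument named rope_theta; the model and value below are placeholders.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # placeholder model
    rope_theta=1_000_000.0,            # override the model's default RoPE base
)

outputs = llm.generate(
    ["Long-context prompt goes here."],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```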
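
For passing multiple LoRA adapters at once to generate() (#5300), a sketch assuming lora_request now accepts a list matched one-to-one with the prompts; the base model, adapter names, and paths are placeholders.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)  # placeholder model

prompts = [
    "Translate to French: Hello, world.",
    "Translate to German: Hello, world.",
]
# Assumption per #5300: one LoRARequest per prompt, matched by position.
lora_requests = [
    LoRARequest("french-adapter", 1, "/path/to/french-adapter"),  # placeholder path
    LoRARequest("german-adapter", 2, "/path/to/german-adapter"),  # placeholder path
]

outputs = llm.generate(
    prompts,
    SamplingParams(max_tokens=64),
    lora_request=lora_requests,
)
for out in outputs:
    print(out.outputs[0].text)
```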
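
For named-function tool support in the OpenAI frontend (#5032), a sketch using the standard openai client against a locally running vLLM OpenAI-compatible server; the tool definition, served model name, and port are placeholders.

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running on localhost:8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # placeholder tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="my-served-model",  # placeholder served model name
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    # #5032: tool_choice naming a specific function, rather than "auto".
    tool_choice={"type": "function", "function": {"name": "get_weather"}},
)
print(resp.choices[0].message.tool_calls)
```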
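
For stream_options (#5319, #5135), a sketch of requesting token usage on a streamed chat completion via the openai client; the server address and served model name are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="my-served-model",  # placeholder served model name
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
    # #5135 / #5319: stream_options for chat and text completions.
    stream_options={"include_usage": True},
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="")
    if chunk.usage is not None:
        # With include_usage, the final chunk reports aggregate token counts.
        print("\nusage:", chunk.usage)
```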

What's Changed

  • [CI/Build] CMakeLists: build all extensions' cmake targets at the same time by @dtrifiro in #5034
  • [Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU by @tlrmchlsmth in #5137
  • [Kernel] Update Cutlass fp8 configs by @varun-sundar-rabindranath in #5144
  • [Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py by @dashanji in #5151
  • [Bugfix] Fix call to init_logger in openai server by @NadavShmayo in #4765
  • [Feature][Kernel] Support bitsandbytes quantization and QLoRA by @chenqianfzh in #4776
  • [Bugfix] Remove deprecated @abstractproperty by @zhuohan123 in #5174
  • [Bugfix]: Fix issues related to prefix caching example (#5177) by @Delviet in #5180
  • [BugFix] Prevent LLM.encode for non-generation Models by @robertgshaw2-neuralmagic in #5184
  • Update test_ignore_eos by @simon-mo in #4898
  • [Frontend][OpenAI] Support for returning max_model_len on /v1/models response by @Avinash-Raj in #4643
  • [Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer by @divakar-amd in #4927
  • [Misc] Simplify code and fix type annotations in conftest.py by @DarkLight1337 in #5118
  • [Core] Support image processor by @DarkLight1337 in #4197
  • [Core] Remove unnecessary copies in flash attn backend by @Yard1 in #5138
  • [Kernel] Pass a device pointer into the quantize kernel for the scales by @tlrmchlsmth in #5159
  • [CI/BUILD] enable intel queue for longer CPU tests by @zhouyuan in #4113
  • [Misc]: Implement CPU/GPU swapping in BlockManagerV2 by @Kaiyang-Chen in #3834
  • New CI template on AWS stack by @khluu in #5110
  • [FRONTEND] OpenAI tools support named functions by @br3no in #5032
  • [Bugfix] Support prompt_logprobs==0 by @toslunar in #5217
  • [Bugfix] Add warmup for prefix caching example by @zhuohan123 in #5235
  • [Kernel] Enhance MoE benchmarking & tuning script by @WoosukKwon in #4921
  • [Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend by @afeldman-nm in #5210
  • [Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor by @zifeitong in #5229
  • [CI/Build] Add inputs tests by @DarkLight1337 in #5215
  • [Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend by @DamonFool in #5249
  • [Kernel] Add back batch size 1536 and 3072 to MoE tuning by @WoosukKwon in #5242
  • [CI/Build] Simplify model loading for HfRunner by @DarkLight1337 in #5251
  • [CI/Build] Reducing CPU CI execution time by @bigPYJ1151 in #5241
  • [CI] mark AMD test as softfail to prevent blockage by @simon-mo in #5256
  • [Misc] Add transformers version to collect_env.py by @mgoin in #5259
  • [Misc] update collect env by @youkaichao in #5261
  • [Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True by @zifeitong in #5226
  • [Misc] Add CustomOp interface for device portability by @WoosukKwon in #5255
  • [Misc] Fix docstring of get_attn_backend by @WoosukKwon in #5271
  • [Frontend] OpenAI API server: Add add_special_tokens to ChatCompletionRequest (default False) by @tomeras91 in #5278
  • [CI] Add nightly benchmarks by @simon-mo in #5260
  • [misc] benchmark_serving.py -- add ITL results and tweak TPOT results by @tlrmchlsmth in #5263
  • [Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size by @tlrmchlsmth in #5157
  • [Model] Correct Mixtral FP8 checkpoint loading by @comaniac in #5231
  • [BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM by @DriverSong in #5207
  • [Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 by @pcmoritz in #5238
  • [Docs] Add Sequoia as sponsors by @simon-mo in #5287
  • [Speculative Decoding] Add ProposerWorkerBase abstract class by @njhill in #5252
  • [BugFix] Fix log message about default max model length by @njhill in #5284
  • [Bugfix] Make EngineArgs use named arguments for config construction by @mgoin in #5285
  • [Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. by @wuisawesome in #5290
  • [Misc] Skip for logits_scale == 1.0 by @WoosukKwon in #5291
  • [Docs] Add Ray Summit CFP by @simon-mo in #5295
  • [CI] Disable flash_attn backend for spec decode by @simon-mo in #5286
  • [Frontend][Core] Update Outlines Integration from FSM to Guide by @br3no in #4109
  • [CI/Build] Update vision tests by @DarkLight1337 in #5307
  • Bugfix: fix broken download of models from modelscope by @liuyhwangyh in #5233
  • [Kernel] Retune Mixtral 8x22b configs for FP8 on H100 by @pcmoritz in #5294
  • [Frontend] enable passing multiple LoRA adapters at once to generate() by @mgoldey in #5300
  • [Core] Avoid copying prompt/output tokens if no penalties are used by @Yard1 in #5289
  • [Core] Change LoRA embedding sharding to support loading methods by @Yard1 in #5038
  • [Misc] Missing error message for custom ops import by @DamonFool in #5282
  • [Feature][Frontend]: Add support for stream_options in ChatCompletionRequest by @Etelis in #5135
  • [Misc][Utils] allow get_open_port to be called for multiple times by @youkaichao in #5333
  • [Kernel] Switch fp8 layers to use the CUTLASS kernels by @tlrmchlsmth in #5183
  • Remove Ray health check by @Yard1 in #4693
  • Add missing ignored_seq_groups in _schedule_chunked_prefill by @JamesLim-sy in #5296
  • [Kernel] Dynamic Per-Token Activation Quantization by @dsikka in #5037
  • [Frontend] Add OpenAI Vision API Support by @ywang96 in #5237
  • [Misc] Remove unused cuda_utils.h in CPU backend by @DamonFool in #5345
  • fix DbrxFusedNormAttention missing cache_config by @Calvinnncy97 in #5340
  • [Bug Fix] Fix the support check for FP8 CUTLASS by @cli99 in #5352
  • [Misc] Add args for selecting distributed executor to benchmarks by @BKitor in #5335
  • [ROCm][AMD] Use pytorch sdpa math backend to do naive attention by @hongxiayang in #4965
  • [CI/Test] improve robustness of test by replacing del with context manager (hf_runner) by @youkaichao in #5347
  • [CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) by @youkaichao in #5357
  • [Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale by @mgoin in #5353
  • [Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint by @youkaichao in #5074
  • [Misc][CI/Test] fix flaky test in tests/test_sharded_state_loader.py by @youkaichao in #5361
  • [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops by @bnellnm in #5047
  • [Bugfix] Fix KeyError: 1 When Using LoRA adapters by @BlackBird-Coding in #5164
  • [Misc] Update to comply with the new compressed-tensors config by @dsikka in #5350
  • [Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server by @ywang96 in #5374
  • [misc][typo] fix typo by @youkaichao in #5372
  • [Misc] Improve error message when LoRA parsing fails by @DarkLight1337 in #5194
  • [Model] Initial support for LLaVA-NeXT by @DarkLight1337 in #4199
  • [Feature][Frontend]: Continued stream_options implementation also in CompletionRequest by @Etelis in #5319
  • [Bugfix] Fix LLaVA-NeXT by @DarkLight1337 in #5380
  • [ci] Use small_cpu_queue for doc build by @khluu in #5331
  • [ci] Mount buildkite agent on Docker container to upload benchmark results by @khluu in #5330
  • [Docs] Add Docs on Limitations of VLM Support by @ywang96 in #5383
  • [Docs] Alphabetically sort sponsors by @WoosukKwon in #5386
  • Bump version to v0.5.0 by @simon-mo in #5384
  • [Doc] Add documentation for FP8 W8A8 by @mgoin in #5388
  • [ci] Fix Buildkite agent path by @khluu in #5392
  • [Misc] Various simplifications and typing fixes by @njhill in #5368
  • [Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs by @maor-ps in #5312
  • [Bugfix][Frontend] Cleanup "fix chat logprobs" by @DarkLight1337 in #5026
  • [Doc] add debugging tips by @youkaichao in #5409
  • [Doc][Typo] Fixing Missing Comma by @ywang96 in #5403
  • [Misc] Remove VLLM_BUILD_WITH_NEURON env variable by @WoosukKwon in #5389
  • [CI] docfix by @rkooo567 in #5410
  • [Speculative decoding] Initial spec decode docs by @cadedaniel in #5400
  • [Doc] Add an automatic prefix caching section in vllm documentation by @KuntaiDu in #5324
  • [Docs] [Spec decode] Fix docs error in code example by @cadedaniel in #5427
  • [Bugfix] Fix MultiprocessingGPUExecutor.check_health when world_size == 1 by @jsato8094 in #5254
  • [Bugfix] fix lora_dtype value type in arg_utils.py by @c3-ali in #5398
  • [Frontend] Customizable RoPE theta by @sasha0552 in #5197
  • [Core][Distributed] add same-node detection by @youkaichao in #5369
  • [Core][Doc] Default to multiprocessing for single-node distributed case by @njhill in #5230
  • [Doc] add common case for long waiting time by @youkaichao in #5430

New Contributors

Full Changelog: v0.4.3...v0.5.0
