Highlights
Performance Update
- We are excited to announce a faster vLLM delivering 2x more throughput compared to v0.5.3. The default parameters should already achieve a significant speedup, but we also recommend trying out multi-step scheduling by setting `--num-scheduler-steps 8` in the engine arguments (see the sketch after this list).
- The multi-step scheduler now supports LLMEngine and logprobs (#7789, #7652)
- The asynchronous output processor overlaps the construction of output data structures with GPU work, delivering a 12% throughput increase. (#7049, #7911, #7921, #8050)
- FlashInfer backend for FP8 KV cache (#7798, #7985) and for rejection sampling in speculative decoding (#7244)
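Below is a minimal sketch of enabling multi-step scheduling from the offline `LLM` entry point; the model name is only a placeholder, and the `num_scheduler_steps` keyword is assumed to mirror the `--num-scheduler-steps` engine argument mentioned above.

```python
# Minimal sketch: enabling multi-step scheduling from the offline LLM API.
# The model name is a placeholder; num_scheduler_steps is assumed to mirror
# the --num-scheduler-steps engine argument.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    num_scheduler_steps=8,  # run multiple scheduler steps per scheduling call
)
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```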
Model Support
- Support bitsandbytes 8-bit and FP4 quantized models (#7445)
- New LLMs: Exaone (#7819), Granite (#7436), Phi-3.5-MoE (#7729)
- A new tokenizer mode for Mistral models that uses the native mistral-common package (#7739) (see the sketch after this list)
- Multi-modality: multi-image input support for LLaVA-Next (#7230) and Phi-3-vision (#7783), multiple audio chunks for Ultravox (#7963), and tensor parallel support for ViTs (#7186)
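As an illustration of the new tokenizer mode, here is a sketch combining it with the offline chat API (#8098); the checkpoint name is an example only and is not prescribed by these notes.

```python
# Sketch: opting into the native Mistral tokenizer via tokenizer_mode="mistral".
# The checkpoint name is illustrative; any Mistral-format model should work.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # example Mistral checkpoint
    tokenizer_mode="mistral",  # use mistral-common instead of the HF tokenizer
)
messages = [{"role": "user", "content": "Summarize what a KV cache is in one sentence."}]
outputs = llm.chat(messages, SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```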
Hardware Support
- NVIDIA GPU: extended CUDA graph size for H200 (#7894)
- AMD: Triton implementations of awq_dequantize and awq_gemm to support AWQ (#7386)
- Intel GPU: pipeline parallel support (#7810)
- Neuron: support for context-length and token-generation buckets (#7885, #8062)
- TPU: support for single and multi-host TPUs on GKE (#7613), async output processing (#8011)
Production Features
- OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models! (#5649)
- Add json_schema support from the OpenAI protocol (#7654) (see the sketch after this list)
- Enable chunked prefill and prefix caching together (#7753, #8120)
- Multimodal support in offline chat (#8098), and multiple multi-modal items in the OpenAI frontend (#8049)
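For example, the new json_schema response format can be exercised through any OpenAI client against a locally running server; this is only a sketch, and the base URL, model name, and schema below are assumptions for illustration rather than values fixed by this release.

```python
# Sketch: schema-constrained decoding via the OpenAI-compatible server.
# Assumes a server was started locally (e.g. `vllm serve <model>`);
# the base URL, model name, and schema are illustrative only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

person_schema = {  # hypothetical schema for illustration
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # whichever model the server was launched with
    messages=[{"role": "user", "content": "Invent a fictional person and return JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "person", "schema": person_schema},
    },
)
print(response.choices[0].message.content)
```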
Misc
- Support benchmarking async engine in benchmark_throughput.py (#7964)
- Progress in `torch.compile` integration: avoid Dynamo guard evaluation overhead (#7898), skip compilation during profiling (#7796)
What's Changed
- [Core] Add multi-step support to LLMEngine by @alexm-neuralmagic in #7789
- [Bugfix] Fix run_batch logger by @pooyadavoodi in #7640
- [Frontend] Publish Prometheus metrics in run_batch API by @pooyadavoodi in #7641
- [Frontend] add json_schema support from OpenAI protocol by @rockwotj in #7654
- [misc][core] lazy import outlines by @youkaichao in #7831
- [ci][test] exclude model download time in server start time by @youkaichao in #7834
- [ci][test] fix RemoteOpenAIServer by @youkaichao in #7838
- [Bugfix] Fix Phi-3v crash when input images are of certain sizes by @zifeitong in #7840
- [Model][VLM] Support multi-images inputs for Phi-3-vision models by @Isotr0py in #7783
- [Misc] Remove snapshot_download usage in InternVL2 test by @Isotr0py in #7835
- [misc][cuda] improve pynvml warning by @youkaichao in #7852
- [Spec Decoding] Streamline batch expansion tensor manipulation by @njhill in #7851
- [Bugfix]: Use float32 for base64 embedding by @HollowMan6 in #7855
- [CI/Build] Avoid downloading all HF files in `RemoteOpenAIServer` by @DarkLight1337 in #7836
- [Performance][BlockManagerV2] Mark prefix cache block as computed after schedule by @comaniac in #7822
- [Misc] Update `qqq` to use vLLMParameters by @dsikka in #7805
- [Misc] Update `gptq_marlin_24` to use vLLMParameters by @dsikka in #7762
- [misc] fix custom allreduce p2p cache file generation by @youkaichao in #7853
- [Bugfix] neuron: enable tensor parallelism by @omrishiv in #7562
- [Misc] Update compressed tensors lifecycle to remove `prefix` from `create_weights` by @dsikka in #7825
- [Core] Asynchronous Output Processor by @megha95 in #7049
- [Tests] Disable retries and use context manager for openai client by @njhill in #7565
- [core][torch.compile] not compile for profiling by @youkaichao in #7796
- Revert #7509 by @comaniac in #7887
- [Model] Add Mistral Tokenization to improve robustness and chat encoding by @patrickvonplaten in #7739
- [CI/Build][VLM] Cleanup multiple images inputs model test by @Isotr0py in #7897
- [Hardware][Intel GPU] Add intel GPU pipeline parallel support. by @jikunshang in #7810
- [CI/Build][ROCm] Enabling tensorizer tests for ROCm by @alexeykondrat in #7237
- [Bugfix] Fix phi3v incorrect image_idx when using async engine by @Isotr0py in #7916
- [cuda][misc] error on empty CUDA_VISIBLE_DEVICES by @youkaichao in #7924
- [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel by @dsikka in #7766
- [benchmark] Update TGI version by @philschmid in #7917
- [Model] Add multi-image input support for LLaVA-Next offline inference by @zifeitong in #7230
- [mypy] Enable mypy type checking for `vllm/core` by @jberkhahn in #7229
- [Core][VLM] Stack multimodal tensors to represent multiple images within each prompt by @petersalas in #7902
- [hardware][rocm] allow rocm to override default env var by @youkaichao in #7926
- [Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. by @bnellnm in #7886
- [mypy][CI/Build] Fix mypy errors by @DarkLight1337 in #7929
- [Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) by @alexm-neuralmagic in #7911
- [Performance] Enable chunked prefill and prefix caching together by @comaniac in #7753
- [ci][test] fix pp test failure by @youkaichao in #7945
- [Doc] fix the autoAWQ example by @stas00 in #7937
- [Bugfix][VLM] Fix incompatibility between #7902 and #7230 by @DarkLight1337 in #7948
- [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. by @pavanimajety in #7798
- [Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ by @rasmith in #7386
- [TPU] Upgrade PyTorch XLA nightly by @WoosukKwon in #7967
- [Doc] fix 404 link by @stas00 in #7966
- [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM by @mzusman in #7651
- [Bugfix] Make torch registration of punica ops optional by @bnellnm in #7970
- [torch.compile] avoid Dynamo guard evaluation overhead by @youkaichao in #7898
- Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test by @mgoin in #7961
- [Frontend] Minor optimizations to zmq decoupled front-end by @njhill in #7957
- [torch.compile] remove reset by @youkaichao in #7975
- [VLM][Core] Fix exceptions on ragged NestedTensors by @petersalas in #7974
- Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." by @youkaichao in #7982
- [Bugfix] Unify rank computation across regular decoding and speculative decoding by @jmkuebler in #7899
- [Core] Combine async postprocessor and multi-step by @alexm-neuralmagic in #7921
- [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto by @pavanimajety in #7985
- extend cuda graph size for H200 by @kushanam in #7894
- [Bugfix] Fix incorrect vocal embedding shards for GGUF model in tensor parallelism by @Isotr0py in #7954
- [misc] update tpu int8 to use new vLLM Parameters by @dsikka in #7973
- [Neuron] Adding support for context-length, token-gen buckets. by @hbikki in #7885
- support bitsandbytes 8-bit and FP4 quantized models by @chenqianfzh in #7445
- Add more percentiles and latencies by @wschin in #7759
- [VLM] Disallow overflowing `max_model_len` for multimodal models by @DarkLight1337 in #7998
- [Core] Logprobs support in Multi-step by @afeldman-nm in #7652
- [TPU] Async output processing for TPU by @WoosukKwon in #8011
- [Kernel] changing fused moe kernel chunk size default to 32k by @avshalomman in #7995
- [MODEL] add Exaone model support by @nayohan in #7819
- Support vLLM single and multi-host TPUs on GKE by @richardsliu in #7613
- [Bugfix] Fix import error in Exaone model by @DarkLight1337 in #8034
- [VLM][Model] TP support for ViTs by @ChristopherCho in #7186
- [Core] Increase default `max_num_batched_tokens` for multimodal models by @DarkLight1337 in #8028
- [Frontend]-config-cli-args by @KaunilD in #7737
- [TPU][Bugfix] Fix tpu type api by @WoosukKwon in #8035
- [Model] Adding support for MSFT Phi-3.5-MoE by @wenxcs in #7729
- [Bugfix] Address #8009 and add model test for flashinfer fp8 kv cache. by @pavanimajety in #8013
- [Bugfix] Fix import error in Phi-3.5-MoE by @DarkLight1337 in #8052
- [Bugfix] Fix ModelScope models in v0.5.5 by @NickLucche in #8037
- [BugFix][Core] Multistep Fix Crash on Request Cancellation by @robertgshaw2-neuralmagic in #8059
- [Frontend][VLM] Add support for multiple multi-modal items in the OpenAI frontend by @ywang96 in #8049
- [Misc] Optional installation of audio related packages by @ywang96 in #8063
- [Model] Adding Granite model. by @shawntan in #7436
- [SpecDecode][Kernel] Use Flashinfer for Rejection Sampling in Speculative Decoding by @LiuXiaoxuanPKU in #7244
- [TPU] Align worker index with node boundary by @WoosukKwon in #7932
- [Core][Bugfix] Accept GGUF model without .gguf extension by @Isotr0py in #8056
- [Bugfix] Fix internlm2 tensor parallel inference by @Isotr0py in #8055
- [Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. by @noooop in #7874
- [Bugfix] Fix single output condition in output processor by @WoosukKwon in #7881
- [Bugfix][VLM] Add fallback to SDPA for ViT model running on CPU backend by @Isotr0py in #8061
- [Performance] Enable chunked prefill and prefix caching together by @comaniac in #8120
- [CI] Only PR reviewers/committers can trigger CI on PR by @khluu in #8124
- [Core] Optimize Async + Multi-step by @alexm-neuralmagic in #8050
- [Misc] Raise a more informative exception in add/remove_logger by @Yard1 in #7750
- [CI/Build] fix: Add the +empty tag to the version only when the VLLM_TARGET_DEVICE envvar was explicitly set to "empty" by @tomeras91 in #8118
- [ci] Fix GHA workflow by @khluu in #8129
- [TPU][Bugfix] Fix next_token_ids shape by @WoosukKwon in #8128
- [CI] Change PR remainder to avoid at-mentions by @simon-mo in #8134
- [Misc] Update `GPTQ` to use `vLLMParameters` by @dsikka in #7976
- [Benchmark] Add `--async-engine` option to benchmark_throughput.py by @njhill in #7964
- [TPU][Bugfix] Use XLA rank for persistent cache path by @WoosukKwon in #8137
- [Misc] Update fbgemmfp8 to use `vLLMParameters` by @dsikka in #7972
- [Model] Add Ultravox support for multiple audio chunks by @petersalas in #7963
- [Frontend] Multimodal support in offline chat by @DarkLight1337 in #8098
- chore: Update check-wheel-size.py to read VLLM_MAX_SIZE_MB from env by @haitwang-cloud in #8103
- [Bugfix] remove post_layernorm in siglip by @wnma3mz in #8106
- [MISC] Consolidate FP8 kv-cache tests by @comaniac in #8131
- [CI/Build][ROCm] Enabling LoRA tests on ROCm by @alexeykondrat in #7369
- [CI] Change test input in Gemma LoRA test by @WoosukKwon in #8163
- [Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models by @K-Mistele in #5649
- [MISC] Replace input token throughput with total token throughput by @comaniac in #8164
- [Neuron] Adding support for adding/ overriding neuron configuration a… by @hbikki in #8062
- Bump version to v0.6.0 by @simon-mo in #8166
New Contributors
- @rockwotj made their first contribution in #7654
- @HollowMan6 made their first contribution in #7855
- @patrickvonplaten made their first contribution in #7739
- @philschmid made their first contribution in #7917
- @jberkhahn made their first contribution in #7229
- @pavanimajety made their first contribution in #7798
- @rasmith made their first contribution in #7386
- @jmkuebler made their first contribution in #7899
- @kushanam made their first contribution in #7894
- @hbikki made their first contribution in #7885
- @wschin made their first contribution in #7759
- @nayohan made their first contribution in #7819
- @richardsliu made their first contribution in #7613
- @KaunilD made their first contribution in #7737
- @wenxcs made their first contribution in #7729
- @NickLucche made their first contribution in #8037
- @shawntan made their first contribution in #7436
- @noooop made their first contribution in #7874
- @haitwang-cloud made their first contribution in #8103
- @wnma3mz made their first contribution in #8106
- @K-Mistele made their first contribution in #5649
Full Changelog: v0.5.5...v0.6.0