vllm-project/vllm v0.7.1


Highlights

This release features MLA optimization for the DeepSeek family of models. Compared to v0.7.0, released this Monday, it offers ~3x the generation throughput, ~10x the memory capacity for tokens, and horizontal context scalability with pipeline parallelism.
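
As a rough illustration of the pipeline-parallel path mentioned above, the sketch below loads a DeepSeek-family checkpoint through the offline `LLM` entrypoint. The checkpoint name, parallel sizes, and sampling settings are assumptions chosen for the example, not configurations taken from this release.

```python
# Minimal sketch, not from the release notes: offline inference with a
# DeepSeek-family model, combining tensor and pipeline parallelism.
# The checkpoint and parallel sizes below are assumptions for illustration;
# pick values that match your cluster.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # assumed checkpoint for the example
    tensor_parallel_size=8,           # split each layer across 8 GPUs
    pipeline_parallel_size=2,         # add a second pipeline stage for more KV-cache room
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(
    ["Explain multi-head latent attention in one paragraph."], params
)
print(outputs[0].outputs[0].text)
```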

V1

For the V1 architecture, we

Models

  • New Model: MiniCPM-o (text outputs only) (#12069)
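
A minimal sketch of loading the new model offline, assuming the Hugging Face checkpoint id `openbmb/MiniCPM-o-2_6`. Only the text path is shown, matching the "text outputs only" note; image and audio inputs would go through vLLM's usual multimodal input mechanism, whose exact prompt format for this model is documented in vLLM's multimodal examples.

```python
# Minimal sketch, assuming the checkpoint id openbmb/MiniCPM-o-2_6;
# only text-in/text-out is shown here.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openbmb/MiniCPM-o-2_6",  # assumed checkpoint id for the example
    trust_remote_code=True,         # MiniCPM-o ships custom modeling code
    max_model_len=4096,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
out = llm.generate(["Summarize what MiniCPM-o is in one sentence."], params)
print(out[0].outputs[0].text)
```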

Hardware

  • Neuron: NKI-based flash-attention kernel with paged KV cache (#11277)
  • AMD: Llama 3.2 support upstreamed (#12421)

Others

  • Support overriding the generation config via engine arguments (#12409)
  • Support reasoning content in the API for DeepSeek-R1 (#12473); see the sketch below
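
A minimal sketch of reading the new reasoning field through the OpenAI-compatible client, assuming a vLLM server is already running with reasoning output enabled for a DeepSeek-R1 model (e.g. started with `--enable-reasoning --reasoning-parser deepseek_r1`). The base URL, API key, and model name are placeholders.

```python
# Minimal sketch, assuming a vLLM OpenAI-compatible server is already running
# locally with reasoning output enabled, e.g.:
#   vllm serve deepseek-ai/DeepSeek-R1 --enable-reasoning --reasoning-parser deepseek_r1
# The base URL, API key, and model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "What is 9.11 minus 9.8?"}],
)

message = resp.choices[0].message
# With the reasoning parser enabled, the chain-of-thought is surfaced in a
# separate field from the final answer; use getattr in case the field is absent.
print("reasoning:", getattr(message, "reasoning_content", None))
print("answer:", message.content)
```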

What's Changed

New Contributors

Full Changelog: v0.7.0...v0.7.1
