## Highlights
This release features MLA optimization for the DeepSeek family of models. Compared to v0.7.0, released this Monday, we offer ~3x the generation throughput, ~10x the memory capacity for tokens, and horizontal context scalability with pipeline parallelism.
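As a rough illustration of the longer-context, pipeline-parallel setup this enables, here is a minimal sketch using the offline API; the model ID, parallel sizes, and context length are illustrative assumptions, not a benchmarked configuration from this release.

```python
# Illustrative sketch, not a benchmark recipe. Assumes a multi-GPU host and
# a vLLM build with pipeline parallelism enabled for the offline API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",  # any MLA-based DeepSeek-family model
    trust_remote_code=True,
    tensor_parallel_size=2,    # shard weights across GPUs within a node
    pipeline_parallel_size=2,  # add stages to scale context horizontally
    max_model_len=32768,       # MLA's compact KV cache leaves room for long contexts
)

outputs = llm.generate(
    ["Summarize multi-head latent attention in one paragraph."],
    SamplingParams(temperature=0.6, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```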
### V1
For the V1 architecture, we:
- Added a design document for zero-overhead prefix caching (#12598)
- Added metrics and enhanced logging for the V1 engine (#12569, #12561, #12416, #12516, #12530, #12478); a sketch of scraping these metrics follows below
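For a quick look at the Prometheus side of these changes, here is a minimal sketch of scraping the OpenAI-compatible server's standard `/metrics` endpoint; the specific metric names in the comments are assumptions inferred from the PRs in this release, not guaranteed identifiers.

```python
# Minimal sketch: assumes a server is already running locally via
# `vllm serve <model>`. Metric names below (e.g. vllm:gpu_cache_usage_perc,
# vllm:time_to_first_token_seconds) are assumptions based on #12561/#12530.
import requests

metrics = requests.get("http://localhost:8000/metrics").text
for line in metrics.splitlines():
    if line.startswith("vllm:"):  # skip Prometheus HELP/TYPE comment lines
        print(line)
```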
### Hardware
- Neuron: NKI-based flash-attention kernel with paged KV cache (#11277)
- AMD: llama 3.2 support upstreaming (#12421)
### Others
- Support overriding the generation config in engine arguments (#12409)
- Support reasoning content in the API for DeepSeek-R1 (#12473); see the sketch after this list
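For the R1 reasoning support, here is a sketch of reading the separated reasoning channel through the OpenAI-compatible API; the server flags in the comment (including the generation-config override) and the `reasoning_content` field name are assumptions based on the linked PRs, not verbatim from this changelog.

```python
# Sketch only. Assumes a server launched along the lines of:
#   vllm serve deepseek-ai/DeepSeek-R1 --enable-reasoning \
#     --reasoning-parser deepseek_r1 \
#     --override-generation-config '{"temperature": 0.6}'
# (flag names are assumptions tied to #12473 and #12409).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "What is 9.11 minus 9.8?"}],
)
msg = resp.choices[0].message
# The chain-of-thought arrives separately from the final answer.
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)
```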
## What's Changed
- [Bugfix] Fix missing seq_start_loc in xformers prefill metadata by @Isotr0py in #12464
- [V1][Minor] Minor optimizations for update_from_output by @WoosukKwon in #12454
- [Bugfix] Fix gpt2 GGUF inference by @Isotr0py in #12467
- [Build] Only build 9.0a for scaled_mm and sparse kernels by @LucasWilkinson in #12339
- [V1][Metrics] Add initial Prometheus logger by @markmc in #12416
- [V1][CI/Test] Do basic test for top-p & top-k sampling by @WoosukKwon in #12469
- [FlashInfer] Upgrade to 0.2.0 by @abmfy in #11194
- [Feature] [Spec decode]: Enable MLPSpeculator/Medusa and `prompt_logprobs` with ChunkedPrefill by @NickLucche in #10132
- Update `pre-commit` hooks by @hmellor in #12475
- [Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache by @liangfu in #11277
- Fix bad path in prometheus example by @mgoin in #12481
- [CI/Build] Fixed the xla nightly issue report in #12451 by @hosseinsarshar in #12453
- [FEATURE] Enables offline /score for embedding models by @gmarinho2 in #12021
- [CI] fix pre-commit error by @MengqingCao in #12494
- Update README.md with V1 alpha release by @ywang96 in #12495
- [V1] Include Engine Version in Logs by @robertgshaw2-redhat in #12496
- [Core] Make raw_request optional in ServingCompletion by @schoennenbeck in #12503
- [VLM] Merged multi-modal processor and V1 support for Qwen-VL by @DarkLight1337 in #12504
- [Doc] Fix typo for x86 CPU installation by @waltforme in #12514
- [V1][Metrics] Hook up IterationStats for Prometheus metrics by @markmc in #12478
- Replace missed warning_once for rerank API by @mgoin in #12472
- Do not run `suggestion` `pre-commit` hook multiple times by @hmellor in #12521
- [V1][Metrics] Add per-request prompt/generation_tokens histograms by @markmc in #12516
- [Kernel] Pipe attn_logits_soft_cap through paged attention TPU kernels by @fenghuizhang in #12482
- [TPU] Add example for profiling TPU inference by @mgoin in #12531
- [Frontend] Support reasoning content for deepseek r1 by @gaocegege in #12473
- [Doc] Convert docs to use colon fences by @hmellor in #12471
- [V1][Metrics] Add TTFT and TPOT histograms by @markmc in #12530
- Bugfix for whisper quantization due to fake k_proj bias by @mgoin in #12524
- [V1] Improve Error Message for Unsupported Config by @robertgshaw2-redhat in #12535
- Fix the pydantic logging validator by @maxdebayser in #12420
- [Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense by @tjohnson31415 in #12347
- [Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM by @HwwwwwwwH in #12069
- [Frontend] Support override generation config in args by @liuyanyi in #12409
- [Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. by @pavanimajety in #11787
- [Kernel] add triton fused moe kernel for gptq/awq by @jinzhen-lin in #12185
- Revert "[Build/CI] Fix libcuda.so linkage" by @tlrmchlsmth in #12552
- [V1][BugFix] Free encoder cache for aborted requests by @WoosukKwon in #12545
- [Misc][MoE] add Deepseek-V3 moe tuning support by @divakar-amd in #12558
- [V1][Metrics] Add GPU cache usage % gauge by @markmc in #12561
- Set `?device={device}` when changing tab in installation guides by @hmellor in #12560
- [Misc] fix typo: add missing space in lora adapter error message by @Beim in #12564
- [Kernel] Triton Configs for Fp8 Block Quantization by @robertgshaw2-redhat in #11589
- [CPU][PPC] Updated torch, torchvision, torchaudio dependencies by @npanpaliya in #12555
- [V1][Log] Add max request concurrency log to V1 by @mgoin in #12569
- [Kernel] Update `cutlass_scaled_mm` to support 2d group (blockwise) scaling by @LucasWilkinson in #11868
- [ROCm][AMD][Model] llama 3.2 support upstreaming by @maleksan85 in #12421
- [Attention] MLA decode optimizations by @LucasWilkinson in #12528
- [Bugfix] Gracefully handle huggingface hub http error by @ywang96 in #12571
- Add favicon to docs by @hmellor in #12611
- [BugFix] Fix Torch.Compile For DeepSeek by @robertgshaw2-redhat in #12594
- [Git] Automatically sign-off commits by @comaniac in #12595
- [Docs][V1] Prefix caching design by @comaniac in #12598
- [v1][Bugfix] Add extra_keys to block_hash for prefix caching by @heheda12345 in #12603
- [release] Add input step to ask for Release version by @khluu in #12631
- [Bugfix] Revert MoE Triton Config Default by @robertgshaw2-redhat in #12629
- [Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 by @tlrmchlsmth in #12587
- [Feature] Fix guided decoding blocking bitmask memcpy by @xpbowler in #12563
- [Doc] Improve installation signposting by @hmellor in #12575
- [Doc] int4 w4a16 example by @brian-dellabetta in #12585
- [V1] Bugfix: Validate Model Input Length by @robertgshaw2-redhat in #12600
- [BugFix] fix wrong output when using lora and num_scheduler_steps=8 by @sleepwalker2017 in #11161
- Fix target matching for fused layers with compressed-tensors by @eldarkurtic in #12617
- [ci] Upgrade transformers to 4.48.2 in CI dependencies by @khluu in #12599
- [Bugfix/CI] Fixup benchmark_moe.py by @tlrmchlsmth in #12562
- Fix: Respect `sparsity_config.ignore` in Cutlass Integration by @rahul-tuli in #12517
- [Attention] Deepseek v3 MLA support with FP8 compute by @LucasWilkinson in #12601
- [CI/Build] Add label automation for structured-output, speculative-decoding, v1 by @russellb in #12280
- Disable chunked prefill and/or prefix caching when MLA is enabled by @simon-mo in #12642
## New Contributors
- @abmfy made their first contribution in #11194
- @hosseinsarshar made their first contribution in #12453
- @gmarinho2 made their first contribution in #12021
- @waltforme made their first contribution in #12514
- @fenghuizhang made their first contribution in #12482
- @gaocegege made their first contribution in #12473
- @Beim made their first contribution in #12564
- @xpbowler made their first contribution in #12563
- @brian-dellabetta made their first contribution in #12585
- @sleepwalker2017 made their first contribution in #11161
- @eldarkurtic made their first contribution in #12617
**Full Changelog**: v0.7.0...v0.7.1