What's Changed
- [FEAT] Support reset prefix cache by specified device by @maobaolong in #15003
- [BugFix][V1] Update stats.py by @WrRan in #15139
- [V1][TPU] Change kv cache shape. by @vanbasten23 in #15145
- [FrontEnd][Perf] `merge_async_iterators` fast-path for single-prompt requests by @njhill in #15150
- [Docs] Announce Ollama and Singapore Meetups by @simon-mo in #15161
- [V1] TPU - Tensor parallel MP support by @alexm-redhat in #15059
- [BugFix] Lazily import XgrammarBackend to avoid early cuda init by @njhill in #15171
- [Doc] Clarify run vllm only on one node in distributed inference by @ruisearch42 in #15148
- Fix broken tests by @jovsa in #14713
- [Bugfix] Fix embedding assignment for InternVL-based models by @DarkLight1337 in #15086
- fix "Total generated tokens:" is 0 if using --backend tgi and --endpo… by @sywangyi in #14673
- [V1][TPU] Support V1 Sampler for ragged attention by @NickLucche in #14227
- [Benchmark] Allow oversample request in benchmark dataset by @JenZhao in #15170
- [Core][V0] Add guidance backend for structured output by @russellb in #14589
- [Doc] Update Mistral Small 3.1/Pixtral example by @ywang96 in #15184
- [Misc] support --disable-uvicorn-access-log parameters by @chaunceyjiang in #14754
- [Attention] Flash Attention 3 - fp8 by @mickaelseznec in #14570
- [Doc] Update README.md by @DarkLight1337 in #15187
- Enable CUDA graph support for llama 3.2 vision by @mritterfigma in #14917
- typo: Update config.py by @WrRan in #15189
- [Frontend][Bugfix] support prefill decode disaggregation on deepseek by @billishyahao in #14824
- [release] Tag vllm-cpu with latest upon new version released by @khluu in #15193
- Fixing Imprecise Type Annotations by @WrRan in #15192
- [macOS] Upgrade pytorch to 2.6.0 by @linktohack in #15129
- [Bugfix] Multi-video inference on LLaVA-Onevision by @DarkLight1337 in #15082
- Add user forum to README by @hmellor in #15220
- Fix env vars for running Ray distributed backend on GKE by @richardsliu in #15166
- Replace `misc` issues with link to forum by @hmellor in #15226
- [ci] feat: make the test_torchrun_example run with tp=2, external_dp=2 by @vermouth1992 in #15172
- [Bugfix] fix V1 Engine crash while handling requests with duplicate request id by @JasonJ2021 in #15043
- [V1] Add flag to disable cascade attention by @WoosukKwon in #15243
- Enforce that TP > 1 is not supported for Mamba2 if Quantization is Enabled. by @fabianlim in #14617
- [V1] Scheduler Refactoring [1/N] - Add Scheduler Interface by @WoosukKwon in #15250
- [CI/Build] LoRA : make add_lora_test safer by @varun-sundar-rabindranath in #15181
- Fix CUDA kernel index data type in vllm/csrc/quantization/fused_kernels/layernorm_utils.cuh +10 by @houseroad in #15159
- [Misc] Clean up the BitsAndBytes arguments by @jeejeelee in #15140
- [ROCM] Upgrade torch to 2.6 by @SageMoore in #15244
- [Bugfix] Fix incorrect qwen2.5-vl attention mask pre-computation by @Isotr0py in #15200
- Mention `extra_body` as a way to pass vLLM-only parameters using the OpenAI client by @hmellor in #15240 (see the usage sketch after this list)
- [V1][TPU] Speed up top-k on TPU by using torch.topk by @hyeygit in #15242
- [Bugfix] detect alibi and revert to FA2 by @tjohnson31415 in #15231
- [Model] RE: Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies by @cyang49 in #14857
- [Docs] Trim the latest news in README by @WoosukKwon in #15261
- [Misc] Better RayExecutor and multiprocessing compatibility by @comaniac in #14705
- Add an example for reproducibility by @WoosukKwon in #15262
- [Hardware][TPU] Add check for no additional graph compilation during runtime by @lsy323 in #14710
- [V1] Enable Triton(ROCm) Attention backend for Nvidia GPUs by @Isotr0py in #14071
- [Doc] Update LWS docs by @Edwinhr716 in #15163
- [V1] Avoid redundant input processing in n>1 case by @njhill in #14985
- [Feature] specify model in config.yaml by @wayzeng in #14855
- [Bugfix] Add int8 torch dtype for KVCache by @shen-shanshan in #15260
- [Misc] Add attention mask pre-computation optimization back to Qwen2.5-VL by @Isotr0py in #15273
- [Bugfix] Fix incorrect resolving order for transformers fallback by @Isotr0py in #15279
- [V1] Fix wrong import path of get_flash_attn_version by @lhtin in #15280
- [Bugfix] Fix broken kernel test due to missing rename for v1 Triton backend by @Isotr0py in #15282
- [Misc] Add cProfile helpers by @russellb in #15074
- [v1] Refactor KVCacheConfig by @heheda12345 in #14079
- [Bugfix][VLM] fix llava processor by @MengqingCao in #15285
- Revert "[Feature] specify model in config.yaml (#14855)" by @DarkLight1337 in #15293
- [TPU][V1] MHA Pallas backend by @NickLucche in #15288
- [Build/CI] Fix env var typo by @russellb in #15305
- [Misc] Increase RayDistributedExecutor RAY_CGRAPH_get_timeout by @ruisearch42 in #15301
- [Bugfix][V0] Multi-sequence logprobs streaming edge case by @andylolu2 in #15259
- [FEAT] [ROCm]: Add AITER RMS Norm (Layer Norm) Feature by @tjtanaa in #14959
- [Doc] add load_format items in docs by @wwl2755 in #14804
- [Bugfix] Fix torch.compile raise FileNotFoundError by @jeejeelee in #15278
- [Bugfix] LoRA V0 - Fix case where `max_num_seqs` is between cudagraph capture sizes by @varun-sundar-rabindranath in #15308
- [Model] Support Tele-FLM Model by @atone in #15023
- [V1] Add `disable-any-whitespace` option support for xgrammar by @russellb in #15316
- [BugFix][Typing] Fix Imprecise Type Annotations by @WrRan in #15208
- Remove openvino support in favor of external plugin by @russellb in #15339
- [doc] Add back previous news by @heheda12345 in #15331
- Fix v1 supported oracle for worker-cls and worker-extension-cls by @hijkzzz in #15324
- [V1][Usage] Refactor speculative decoding configuration and tests by @ShangmingCai in #14434
- [ci/build] update torch nightly version for GH200 by @youkaichao in #15135
- [ci/build] fix broken tests in LLM.collective_rpc by @youkaichao in #15350
- [Misc] Add tuned R1 w8a8 and MoE configs for NVIDIA L20 by @DefTruth in #15322
- [Bugfix] fix torch.compiled cache hash error by @DefTruth in #14953
- [V1][Spec Decode] Respect prompt_lookup_max by @WoosukKwon in #15348
- [V1][Spec Decode] Use better defaults for N-gram by @WoosukKwon in #15358
- [Frontend] Support tool calling and reasoning parser by @WangErXiao in #14511
- [Misc][Doc] Add note regarding loading `generation_config` by default by @ywang96 in #15281
- [V1] Enable V1 Fp8 cache for FA3 in the oracle by @LucasWilkinson in #15191
- [Fix] [torch.compile] Improve UUID system for custom passes by @ProExpertProg in #15249
- Fix non-contiguous input passed to Marlin kernel by @Qubitium in #15319
- [Misc] Upgrade BNB version by @jeejeelee in #15183
- [Misc] Remove ignore_reinit_error for ray.init() by @ruisearch42 in #15373
- [Bugfix][V1] Avoid importing PreTrainedModel by @HollowMan6 in #15366
- [Misc] Update guided decoding logs to debug by @sfbemerk in #15310
- Revert "[CI/Build] Use uv python for docker rather than ppa:deadsnakess/ppa (#13569)" by @simon-mo in #15377
- [Kernel] allow non-contiguous input for marlin kernel by @jinzhen-lin in #14658
- Fix zmq IPv6 URL format error by @russellb in #15341
- [Bugfix] Fix chat template loading by @DarkLight1337 in #15143
- [distributed] fix dp group by @youkaichao in #15355
- [Core] Integrate `fastsafetensors` loader for loading model weights by @manish-sethi in #10647
- [Core] Don't force uppercase for VLLM_LOGGING_LEVEL by @russellb in #15306
- [V1][Minor] fix comments by @Chen-0210 in #15392
- [MISC] Refine no available block debug msg by @yiliu30 in #15076
- [V1] Aggregate chunked prompt logprobs in model runner by @njhill in #14875
- [Hardware][Gaudi][Feature] Enable Dynamic MoE for Mixtral by @zhenwei-intel in #12303
- [DOC] Add Kubernetes deployment guide with CPUs by @terrytangyuan in #14865
- [Doc] Update docs on handling OOM by @DarkLight1337 in #15357
- [V1][Perf] Simpler request output queues by @njhill in #15156
- [BugFix][V1] Quick fix for min_tokens with multiple EOS by @njhill in #15407
- [Hardware][TPU] Skip failed compilation test by @lsy323 in #15421
- [Build] Cython compilation support fix by @gshtras in #14296
- [ROCm][Kernel] MoE weights padding by @gshtras in #14454
- [V1][Spec Decode] Enable spec decode for top-p & top-k sampling by @WoosukKwon in #15063
- [Minor][Spec Decode] Remove compiled_softmax by @WoosukKwon in #15416
- Add pipeline parallel support to `TransformersModel` by @hmellor in #12832
- [Misc] Remove LoRA log by @jeejeelee in #15388
- Revert "Fix non-contiguous input passed to Marlin kernel (#15319)" by @tlrmchlsmth in #15398
- [Bugfix] Fixed the issue of not being able to input video and image simultaneously by @chaunceyjiang in #15387
- [V1] guidance backend for structured output + `auto` fallback mode by @russellb in #14779
- [V1][Spec Decode] Update target_logits in place for rejection sampling by @WoosukKwon in #15427
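
A minimal usage sketch of the `extra_body` pattern referenced in #15240: vLLM-only sampling parameters can be passed through the official OpenAI Python client by placing them in `extra_body`. The base URL, model name, and parameter values below are illustrative placeholders, not taken from the PR.

```python
from openai import OpenAI

# Point the OpenAI client at a running vLLM OpenAI-compatible server
# (base URL, API key, and model name here are placeholders).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    # Parameters that only vLLM understands (not part of the OpenAI API)
    # go through extra_body and are forwarded in the request body.
    extra_body={"top_k": 20, "repetition_penalty": 1.05},
)
print(completion.choices[0].message.content)
```
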
New Contributors
- @maobaolong made their first contribution in #15003
- @jovsa made their first contribution in #14713
- @mickaelseznec made their first contribution in #14570
- @mritterfigma made their first contribution in #14917
- @billishyahao made their first contribution in #14824
- @linktohack made their first contribution in #15129
- @vermouth1992 made their first contribution in #15172
- @JasonJ2021 made their first contribution in #15043
- @hyeygit made their first contribution in #15242
- @wayzeng made their first contribution in #14855
- @lhtin made their first contribution in #15280
- @wwl2755 made their first contribution in #14804
- @atone made their first contribution in #15023
- @hijkzzz made their first contribution in #15324
- @sfbemerk made their first contribution in #15310
- @manish-sethi made their first contribution in #10647
- @yiliu30 made their first contribution in #15076
Full Changelog: v0.8.1...v0.8.2