Highlights
This release features approximately 120 commits across 120+ pull requests from 50+ contributors, including 13 new contributors.
Expanded Model Support
This release continues to grow the multimodal model ecosystem with several major additions:
- Added FLUX.2-dev image generation model (#1629).
- Added Bagel multistage img2img support (#1669).
- Added HunyuanVideo-1.5 text-to-video and image-to-video support (#1516).
- Added Voxtral TTS model (#1803, #2026, #2056).
- Added Fish Speech S2 Pro with online serving and voice cloning (#1798).
- Added Dreamid-Omni from ByteDance (#1855).
- Extended NPU support for HunyuanImage3 diffusion model (#1689).
- Added OmniGen2 transformer config loading for HF models (#1934).
Performance Improvements
Multiple optimizations improve throughput, latency, and runtime efficiency:
- Qwen3-Omni code predictor re-prefill + SDPA to eliminate decode hot-path CPU round-trips (#2012).
- Qwen3-TTS high-concurrency throughput & latency boost (#1852).
- Qwen3-TTS Code2Wav triton SnakeBeta kernel and CUDA Graph support (#1797).
- Qwen3-TTS CodePredictor torch.compile with reduce-overhead and dynamic=False (#1913).
- Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls (#1985).
- Simple dynamic TTFA based on Code2Wav load for Qwen3-TTS (#1714).
- Enabled async_scheduling by default for Qwen3-TTS (#1853).
- Fish Speech S2 Pro inference performance improvements (#1859).
- Fixed slow `hasattr` in `CUDAGraphWrapper.__getattr__` (#1982).
- Diffusion timing profiling improvements (#1757).
Inference Infrastructure & Parallelism
New infrastructure capabilities improve scalability and production readiness:
- Model Pipeline Configuration System refactor (Part 1) (#1115).
- vLLM-Omni entrypoint refactoring for cleaner startup flow (#1908).
- Expert parallel for diffusion MoE layers (#1323).
- Sequence parallelism (SP) support for FLUX.2-klein (#1250) and HSDP for Flux family (#1900).
- T5 Tensor Parallelism support (#1881).
- LongCat Sequence Parallelism refactored to use SP Plan (#1772).
- PD disaggregation scaffolding (Split #1303 Part 1) (#1863).
- Coordinator module with unit tests (#1465).
- Refactored the stage/step pipeline (#1368).
- Helm Chart to deploy vLLM-Omni on Kubernetes (#1337).
Text-to-Speech Improvements
Major TTS pipeline improvements for streaming, quality, and new models:
- Streaming audio output via WebSocket for Qwen3-TTS (#1719).
- Gradio demo for Qwen3-TTS online serving (#1231).
- Added `wav` response_format when stream is true in `/v1/audio/speech` (#1819).
- Fixed Base voice clone streaming quality and stop-token crash (#1945).
- Fixed streaming initial chunk: removed the dynamic initial chunk and compute it only on the initial request (#1930).
- Preserved ref_code decoder context for Base ICL in Qwen3-TTS (#1731).
- Restored voice upload API and profiler endpoints reverted by #1719 (#1879).
- Fixed a bug in the CodePredictor CUDA Graph pool (#2059).
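For reference, a minimal sketch of a streaming speech request using the new `wav` response format on `/v1/audio/speech`. The endpoint path and the `stream`/`response_format` fields come from the notes above; the model id and voice name are placeholder assumptions, not values documented in this release.

```python
import json

# Hypothetical request body for the OpenAI-compatible /v1/audio/speech
# endpoint; "qwen3-tts" and "default" are illustrative placeholders.
payload = {
    "model": "qwen3-tts",              # placeholder model id (assumption)
    "input": "Hello from vLLM-Omni.",  # text to synthesize
    "voice": "default",                # placeholder voice name (assumption)
    "stream": True,                    # request chunked audio streaming
    "response_format": "wav",          # wav is now accepted when stream is true (#1819)
}

body = json.dumps(payload)
```

The serialized body can then be POSTed with any HTTP client to a running server, e.g. `curl -X POST -d "$body" http://localhost:8000/v1/audio/speech`.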
Quantization & Hardware Support
- Int8 quantization support for DiT (Z-Image & Qwen-Image) (#1470).
- Added cache-dit support for HunyuanImage3 (#1848) and Flux.2-dev (#1814).
- Enabled CPU offloading and Cache-DiT together on diffusion models (#1723).
- Upgraded cache-dit from 1.2.0 to 1.3.0 (#1834).
- NPU upgrade to v0.17.0 (#1890).
- Updated Bagel modeling to remove CUDA hardcode and added XPU stage_config (#1931).
- Updated GpuMemoryMonitor to DeviceMemoryMonitor for all hardware (#1526).
- ROCm bugfix for device environment issues and CI setup (#1984, #2017).
- Intel CI dispatch in Buildkite folder (#1721).
Frontend & Serving
- ComfyUI video & LoRA support (#1596).
- Rewrote video API for async job lifecycle (#1665).
- Fixed /chat/completion not reading extra_body for diffusion models (#2042).
- Fixed the online server failing to return multiple images (#2007).
- Fixed Ovis Image crash when guidance_scale is set without negative_prompt (#1956).
- Fixed config misalignment between offline and online diffusion inference (#1979).
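To illustrate the extra_body fix for diffusion models (#2042), here is a sketch of how diffusion parameters travel in a `/v1/chat/completions` request. The parameter names (`height`, `width`, `num_inference_steps`) are illustrative assumptions, not a documented schema; with the OpenAI Python client, `extra_body` entries are merged into the top level of the request JSON, which is what the server now reads.

```python
import json

# Illustrative diffusion parameters; these field names are assumptions
# chosen for the example, not a schema documented in this release.
extra_body = {"height": 1024, "width": 1024, "num_inference_steps": 30}

# The OpenAI Python client merges extra_body into the request JSON, so an
# equivalent raw HTTP request carries the fields at the top level:
request = {
    "model": "qwen-image",  # placeholder diffusion model id (assumption)
    "messages": [{"role": "user", "content": "A watercolor fox"}],
    **extra_body,
}
body = json.dumps(request)
```

Before #2042, these merged fields were silently ignored for diffusion models; they are now honored by the online server.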
Reliability, Tooling & Developer Experience
- OmniStage.try_collect() patched with process alive checks (#1560) and Ray alive checks (#1561).
- Nightly Buildkite Pytest test case statistics with HTML report by email (#1674).
- Nightly Benchmark HTML generator and updated EXCEL generator (#1831).
- Added multimodal processing correctness tests for Omni models (#1445).
- Added Qwen3-TTS nightly performance benchmark (#1700) and benchmark scripts (#1573).
- Added Governance section (#1889).
- Rebase to vllm v0.18.0 (#2037, #2038).
- Numerous bug fixes across models, configuration, parallelism, and CI pipelines.
What's Changed
- [Test] Solving the Issue of Whisper Model's GPU Memory Not Being Successfully Cleared and the Occasional Accuracy Problem of the Qwen3-omni Model Test by @yenuo26 in #1744
- [Bagel]: Support multistage img2img by @princepride in #1669
- [BugFix] Enable CPU offloading and Cache-DiT together on Diffusion Model by @yuanheng-zhao in #1723
- [Doc] CLI Args Naming Style Correction by @wtomin in #1750
- [Feature] Add Helm Chart to deploy vLLM-Omni on Kubernetes by @oglok in #1337
- [Fix][Qwen3-TTS] Preserve ref_code decoder context for Base ICL by @Sy0307 in #1731
- Add online serving to Stable Audio Diffusion and introduce `v1/audio/generate` endpoint by @ekagra-ranjan in #1255
- [Enhancement][pytest] Check for process running during start server by @pi314ever in #1559
- [CI]: Add core_model and cpu markers for L1 use case. by @zhumingjue138 in #1709
- [Doc][skip-ci] Update installation instructions by @tzhouam in #1762
- Revert "Add online serving to Stable Audio Diffusion and introduce `v1/audio/generate` endpoint" by @hsliuustc0106 in #1789
- [BUGFIX] Add compatibility for mimo-audio with vLLM 0.17.0 by @qibaoyuan in #1752
- [feat][Qwen3TTS] Simple dynamic TTFA based on Code2Wav load by @JuanPZuluaga in #1714
- [Refactor][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips by @LJH-LBJ in #1758
- [Feat][Qwen3-tts]: Add Gradio demo for online serving by @lishunyang12 in #1231
- [Docs] update async chunk performance diagram by @R2-Y in #1741
- [Feat] Enable expert parallel for diffusion MoE layers by @Semmer2 in #1323
- [Bugfix]: SP attention not enabling when _sp_plan hooks are not applied by @wtomin in #1704
- [skip ci] [Docs] Update WeChat QR code for community support by @david6666666 in #1802
- update GpuMemoryMonitor to DeviceMemoryMonitor for all HW by @xuechendi in #1526
- Add coordinator module and corresponding unit test by @NumberWan in #1465
- [Model]: add FLUX.2-dev model by @nuclearwu in #1629
- [skip ci][Docs] doc fix for example snippets by @SamitHuang in #1811
- [Test] L4 complete diffusion feature test for Qwen-Image-Edit models by @fhfuih in #1682
- [Frontend] ComfyUI video & LoRA support by @fhfuih in #1596
- [Bugfix] Adjust Z-Image Tensor Parallelism Diff Threshold by @wtomin in #1808
- [Bugfix] Expose base_model_paths property in _DiffusionServingModels by @RuixiangMa in #1771
- [Bugfix] Report supported tasks for omni models to skip unnecessary chat init by @linyueqian in #1645
- [Test] Add Qwen3-TTS nightly performance benchmark by @linyueqian in #1700
- Add Qwen3-TTS benchmark scripts by @linyueqian in #1573
- [Test] Skip the qwen3-omni relevant validation for a known issue 1367. by @yenuo26 in #1812
- Fix duplicate get_supported_tasks definition in async_omni.py by @linyueqian in #1825
- [Enhancement] Patch OmniStage.try_collect() with _proc alive checks by @pi314ever in #1560
- [Doc][skip ci] Update readme with Video link for vLLM HK First Meetup by @congw729 in #1833
- [Feat][Qwen3-TTS] Support streaming audio output for websocket by @Sy0307 in #1719
- [Test] Nightly Buildkite Pytest Test Case Statistics And Send HTML Report By Email by @yenuo26 in #1674
- [Enhancement] Patch OmniStage.try_collect() with ray alive checks by @pi314ever in #1561
- [Feat][Diffusion]: Implement Component-Level VRAM Quota and Resource Domain Isolation by @Flink-ddd in #1582
- [Feature]: Enable directly use OmniLLM init AR model by @princepride in #1821
- [Enhancement] Upgrade cache-dit from 1.2.0 to 1.3.0 by @SamitHuang in #1834
- [Bugfix] Modify _resolve_pytest_target to support glob patterns and return multiple paths by @yenuo26 in #1843
- [Feat] add wav response_format when stream is true in /v1/audio/speec… by @lengrongfu in #1819
- [BugFix]: Revert #1582 by @princepride in #1842
- [Feature]: support Flux.2-dev cache_dit by @nuclearwu in #1814
- [skip ci] update readme slides link by @hsliuustc0106 in #1850
- [Model] Extend NPU support for HunyuanImage3 Diffusion Model by @ElleElleWu in #1689
- [Config Refactor][1/2] Model Pipeline Configuration System by @lishunyang12 in #1115
- [Test] Reduce SP & Offloading test cases for L2 by @fhfuih in #1839
- [bugfix] Add Interleaved 2D Rotary Embedding for HunyuanImage3 by @usberkeley in #1784
- [Bugfix] Fix Helios text_encoder embed_tokens all-zeros due to untied weights by @dubin555 in #1728
- Enable async_scheduling by default for Qwen3-TTS by @linyueqian in #1853
- [CI failure] Comment out test_zimage_vae_patch_parallel_tp2 by @Gaohan123 in #1856
- Add Fish Speech S2 Pro support with online serving and voice cloning by @linyueqian in #1798
- [skip CI][Docs] add connector design document by @natureofnature in #1737
- [BugFix] Readme and example runner file for cosyvoice3 missed in refactoring by @divyanshsinghvi in #1685
- [Refactor] Use SP Plan for LongCat Sequence Parallelism by @alex-jw-brooks in #1772
- [CI failed] Disable test for zimage tensor parallelism by @Gaohan123 in #1870
- [Bugfix] Fix SD3.5-medium attn2 uninitialized weights by @lishunyang12 in #1659
- [Bugfix] fix layer-wise offload incompatible with cache-dit by @RuixiangMa in #1786
- [CI failed] Disable Diffusion Tensor Parallelism Test by @Gaohan123 in #1876
- [BugFix]: Fix bagel online inference bug by @princepride in #1804
- [Frontend] Rewrite video API for async job lifecycle by @ieaves in #1665
- [Diffusion] [Model] Dreamid-Omni from bytedance by @Bounty-hunter in #1855
- [Bugfix] Restore voice upload API and profiler endpoints reverted by #1719 by @linyueqian in #1879
- [BugFix] Fix Max Rank Handling in LoRA by @alex-jw-brooks in #1397
- Buildkite hardware ci xpu test by @pi314ever in #1340
- [CI] add multimodal processing correctness tests for Omni models by @zzhuoxin1508 in #1445
- fix: propagate parallel_config through create_default_diffusion by @lishunyang12 in #1878
- [CI pipeline] Re-enable Diffusion Tensor Parallelism Test in pipeline by @Gaohan123 in #1892
- [skip CI][Docs][Benchmark]: clarify vbench parameter behavior and add t2v example by @asukaqaq-s in #1497
- [Bugfix] Fix cpu offload and quantization compatibility by @RuixiangMa in #1473
- [Feat] support SP for FLUX.2-klein by @RuixiangMa in #1250
- [CI]: Add/Fix bagel e2e online/offline test by @princepride in #1895
- [Feat] support HSDP for Flux family by @RuixiangMa in #1900
- Add `Governance` section by @ywang96 in #1889
- Update latest news section in README.md by @ywang96 in #1909
- [Feature] Split #1303 Part 1: PD disaggregation scaffolding by @ahengljh in #1863
- [NPU] Upgrade to v0.17.0 by @gcanlin in #1890
- [Misc] removed qwen3_tts.py as it is out-dated by @lengrongfu in #1926
- [Bug][Qwen3TTS][Streaming] remove dynamic initial chunk and only compute on initial request by @JuanPZuluaga in #1930
- Fix Base voice clone streaming quality and stop-token crash by @linyueqian in #1945
- [Docs] Update WeChat QR code for community support by @david6666666 in #1974
- [skip ci][Docs] Update WeChat QR code (fix filename case) by @david6666666 in #1976
- [Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring by @fake0fan in #1908
- [Bugfix] Set PREEMPTED status when moving requests from running to waiting queue by @gcanlin in #1893
- [Feature] Add cache-dit support for HunyuanImage3 by @Fishermanykx in #1848
- [Feature]: Remove some useless `hf_overrides` in yaml by @princepride in #1898
- [CI] Nightly Benchmark - Add an HTML generator, Update the EXCEL generator. by @congw729 in #1831
- [Bug]: fix CUDA OOM during diffusion post-processing by @lishunyang12 in #1670
- [Optim][Qwen3TTS] big boost model throughput+latency high concurrency by @JuanPZuluaga in #1852
- [CI] [ROCm] Bugfix device environment issue by @tjtanaa in #1984
- [CI]init intel ci dispatch in buildkite folder by @xuechendi in #1721
- Fix OmniGen2 transformer config loading for HF models by @Joshna-Medisetty in #1934
- [Test] L4 complete diffusion feature test for Bagel models by @NumberWan in #1938
- [Performance] diffusion timing by @Bounty-hunter in #1757
- [Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls by @DomBrown in #1985
- [CI] Split BAGEL tests into dummy/real weight tiers (L2/L3) by @princepride in #1998
- [Bugfix] Fix config misalignment between offline and online diffusion inference (Wan2.2, Qwen-Image series) by @SamitHuang in #1979
- Add HF token to H100 jobs by @khluu in #2008
- [Bugfix] Fix Ovis Image crash when guidance_scale is set without negative_prompt by @Dnoob in #1956
- [Bugfix] fix helios video generate use cpu device by @lengrongfu in #1915
- [XPU] update bagel modeling to remove cuda hardcode, add xpu stage_config by @xuechendi in #1931
- [Fix] Fix slow `hasattr` in `CUDAGraphWrapper.__getattr__` by @ZeldaHuang in #1982
- [Bugfix] revert PR#1758 which introduced the accuracy problem of qwen3-omni by @R2-Y in #2009
- [Bugfix] Fix bug where the online server could not return multiple images by @Hu1Lcode in #2007
- [CI] [ROCm] Setup `test-ready.yml` and `test-merge.yml` by @tjtanaa in #2017
- Int8 Quantization Support for DiT (Z-Image & Qwen-Image) by @yjb767868009 in #1470
- [Model] Add Voxtral TTS model by @y123456y78 in #1803
- [Feat] Support T5 Tensor Parallelism by @yuanheng-zhao in #1881
- [Feat][Qwen3TTS][Code2wav] triton SnakeBeta and Cuda Graph by @JuanPZuluaga in #1797
- [Optim][Qwen3TTS][CodePredictor] support torch.compile with reduce-overhead and dynamic False by @JuanPZuluaga in #1913
- [CI] Change Bagel online test environment variable `VLLM_TEST_CLEAN_GPU_MEMORY` to `0` by @princepride in #2032
- [BugFix][Doc] Update voxtral_tts end2end.py & README.md by @y123456y78 in #2026
- [Docs] Add Wan2.1-T2V as supported video generation models by @SamitHuang in #1920
- [Bugfix] Remove duplicated config keyword max batch size by @tzhouam in #1851
- [Test] Implement mock HTTP request handling in benchmark CLI tests by @yenuo26 in #2014
- [CI] Fix test. by @congw729 in #2031
- Refactor pipeline stage/step pipeline by @asukaqaq-s in #1368
- [Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips by @LJH-LBJ in #2012
- [Benchmark] [Diffusion] [Enhancement] Random dataset by @Bounty-hunter in #1657
- [Bugfix] Z-Image CFG threshold should be > 0 instead of > 1 by @RuixiangMa in #1634
- [Voxtral TTS] Remove redundant yaml by @y123456y78 in #2056
- [Bugfix]: fixed ServerDisconnectedError in benchmark test (reapply #1683, fixes #1374) by @NumberWan in #1841
- [Perf] Improve Fish Speech S2 Pro inference performance by @Sy0307 in #1859
- [Voxtral] Improve example by @patrickvonplaten in #2045
- [CI] Uncomment condition for nightly build in YAML by @Gaohan123 in #2057
- [bugfix] /chat/completion doesn't read extra_body for diffusion model by @fhfuih in #2042
- [BugFix][Qwen3TTS] CodePredictor CudaGraph Pool by @JuanPZuluaga in #2059
- [Rebase] Rebase to vllm v0.18.0 by @tzhouam in #2037
- [Doc] Update docs and dockerfiles for rebase of vllm v0.18.0 by @tzhouam in #2038
- [Model] Add HunyuanVideo-1.5 T2V and I2V support by @lishunyang12 in #1516
New Contributors
- @oglok made their first contribution in #1337
- @NumberWan made their first contribution in #1465
- @Flink-ddd made their first contribution in #1582
- @ieaves made their first contribution in #1665
- @ahengljh made their first contribution in #1863
- @Fishermanykx made their first contribution in #1848
- @Joshna-Medisetty made their first contribution in #1934
- @DomBrown made their first contribution in #1985
- @Dnoob made their first contribution in #1956
- @Hu1Lcode made their first contribution in #2007
- @yjb767868009 made their first contribution in #1470
- @y123456y78 made their first contribution in #1803
- @patrickvonplaten made their first contribution in #2045
Full Changelog: v0.17.0rc1...v0.18.0rc1