vllm-project/vllm-omni v0.18.0rc1

Pre-release · 13 hours ago

Highlights

This release spans approximately 120 commits across 120+ pull requests from 50+ contributors, 13 of whom are first-time contributors.

Expanded Model Support

This release continues to grow the multimodal model ecosystem with several major additions:

  • Added FLUX.2-dev image generation model (#1629).
  • Added Bagel multistage img2img support (#1669).
  • Added HunyuanVideo-1.5 text-to-video and image-to-video support (#1516).
  • Added Voxtral TTS model (#1803, #2026, #2056).
  • Added Fish Speech S2 Pro with online serving and voice cloning (#1798).
  • Added Dreamid-Omni from ByteDance (#1855).
  • Extended NPU support for HunyuanImage3 diffusion model (#1689).
  • Added OmniGen2 transformer config loading for HF models (#1934).

Performance Improvements

Multiple optimizations improve throughput, latency, and runtime efficiency:

  • Qwen3-Omni code predictor re-prefill + SDPA to eliminate decode hot-path CPU round-trips (#2012).
  • Qwen3-TTS high-concurrency throughput & latency boost (#1852).
  • Qwen3-TTS Code2Wav triton SnakeBeta kernel and CUDA Graph support (#1797).
  • Qwen3-TTS CodePredictor torch.compile with reduce-overhead and dynamic=False (#1913).
  • Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls (#1985).
  • Simple dynamic TTFA based on Code2Wav load for Qwen3-TTS (#1714).
  • Enabled async_scheduling by default for Qwen3-TTS (#1853).
  • Fish Speech S2 Pro inference performance improvements (#1859).
  • Fix slow hasattr in CUDAGraphWrapper.`__getattr__` (#1982).
  • Diffusion timing profiling improvements (#1757).

Inference Infrastructure & Parallelism

New infrastructure capabilities improve scalability and production readiness:

  • Model Pipeline Configuration System refactor (Part 1) (#1115).
  • vLLM-Omni entrypoint refactoring for cleaner startup flow (#1908).
  • Expert parallel for diffusion MoE layers (#1323).
  • Sequence parallelism (SP) support for FLUX.2-klein (#1250) and HSDP for Flux family (#1900).
  • T5 Tensor Parallelism support (#1881).
  • LongCat Sequence Parallelism refactored to use SP Plan (#1772).
  • PD disaggregation scaffolding (Split #1303 Part 1) (#1863).
  • Coordinator module with unit tests (#1465).
  • Refactored pipeline stage/step pipeline (#1368).
  • Helm Chart to deploy vLLM-Omni on Kubernetes (#1337).

Text-to-Speech Improvements

Major TTS pipeline improvements for streaming, quality, and new models:

  • Streaming audio output via WebSocket for Qwen3-TTS (#1719).
  • Gradio demo for Qwen3-TTS online serving (#1231).
  • Added wav response_format when stream is true in /v1/audio/speech (#1819).
  • Fixed Base voice clone streaming quality and stop-token crash (#1945).
  • Fixed streaming initial-chunk handling: removed the dynamic initial chunk and compute it only on the initial request (#1930).
  • Preserved ref_code decoder context for Base ICL in Qwen3-TTS (#1731).
  • Restored voice upload API and profiler endpoints reverted by #1719 (#1879).
  • BugFix for CodePredictor CudaGraph Pool (#2059).
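
With wav now accepted as a response_format when stream is true (#1819), a client can write audio chunks to disk as they arrive. A minimal sketch with the requests library follows; the /v1/audio/speech path, response_format, and stream fields come from the release notes, while the server URL, model name, and voice are placeholders to adjust for your deployment:

```python
import requests

BASE_URL = "http://localhost:8000"  # placeholder; point at your vLLM-Omni server

def build_speech_request(text: str) -> dict:
    """Request body for /v1/audio/speech; model and voice names are placeholders."""
    return {
        "model": "Qwen3-TTS",
        "input": text,
        "voice": "default",
        "response_format": "wav",  # wav is now accepted when stream=true (#1819)
        "stream": True,
    }

def stream_speech(text: str, out_path: str = "out.wav") -> int:
    """POST the request and stream audio chunks to disk; returns bytes written."""
    written = 0
    with requests.post(f"{BASE_URL}/v1/audio/speech",
                       json=build_speech_request(text), stream=True) as resp:
        resp.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)
                written += len(chunk)
    return written
```

Streaming the response this way lets playback begin before synthesis finishes, which is the point of the TTFA-oriented changes above.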

Quantization & Hardware Support

  • Int8 quantization support for DiT (Z-Image & Qwen-Image) (#1470).
  • Added cache-dit support for HunyuanImage3 (#1848) and Flux.2-dev (#1814).
  • Enabled CPU offloading and Cache-DiT together on diffusion models (#1723).
  • Upgraded cache-dit from 1.2.0 to 1.3.0 (#1834).
  • NPU upgrade to v0.17.0 (#1890).
  • Updated Bagel modeling to remove CUDA hardcode and added XPU stage_config (#1931).
  • Updated GpuMemoryMonitor to DeviceMemoryMonitor for all hardware (#1526).
  • ROCm bugfix for device environment issues and CI setup (#1984, #2017).
  • Intel CI dispatch in Buildkite folder (#1721).

Frontend & Serving

  • ComfyUI video & LoRA support (#1596).
  • Rewrote video API for async job lifecycle (#1665).
  • Fix /chat/completion not reading extra_body for diffusion models (#2042).
  • Fix online server returning multiple images (#2007).
  • Fix Ovis Image crash when guidance_scale is set without negative_prompt (#1956).
  • Fix config misalignment between offline and online diffusion inference (#1979).
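
The extra_body fix (#2042) matters because diffusion-specific knobs ride alongside the standard chat fields in the request JSON. A sketch of how such a body is assembled, mirroring how the OpenAI Python client merges extra_body at the top level; the diffusion parameter names here are illustrative, not the exact server schema:

```python
def diffusion_chat_body(prompt: str, model: str, extra_body: dict) -> dict:
    """Build a /v1/chat/completions body with extra_body keys merged at the
    top level, as the OpenAI Python client does with its extra_body argument."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    body.update(extra_body)  # diffusion knobs sit next to the chat fields
    return body

body = diffusion_chat_body(
    "a watercolor fox",
    model="FLUX.2-dev",
    # Hypothetical diffusion parameters; check your model's accepted fields.
    extra_body={"num_inference_steps": 28, "guidance_scale": 4.0},
)
```

Before #2042, the server ignored these top-level extras for diffusion models, so sampler settings sent this way silently had no effect.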

Reliability, Tooling & Developer Experience

  • OmniStage.try_collect() patched with process alive checks (#1560) and Ray alive checks (#1561).
  • Nightly Buildkite Pytest test case statistics with HTML report by email (#1674).
  • Nightly Benchmark HTML generator and updated EXCEL generator (#1831).
  • Added multimodal processing correctness tests for Omni models (#1445).
  • Added Qwen3-TTS nightly performance benchmark (#1700) and benchmark scripts (#1573).
  • Added Governance section (#1889).
  • Rebase to vllm v0.18.0 (#2037, #2038).
  • Numerous bug fixes across models, configuration, parallelism, and CI pipelines.
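
The alive checks added to OmniStage.try_collect() (#1560, #1561) follow a common pattern: instead of blocking forever on a result queue, poll with a timeout and fail fast if the worker has died. A generic sketch of that pattern, using a thread as a stand-in worker; the real implementation checks subprocess and Ray actor liveness:

```python
import queue
import threading

def try_collect(worker: threading.Thread, q: "queue.Queue", timeout: float = 0.1):
    """Poll the worker's output queue, raising promptly if the worker exits
    without producing a result instead of hanging on an empty queue."""
    while True:
        try:
            return q.get(timeout=timeout)
        except queue.Empty:
            if not worker.is_alive():
                raise RuntimeError("worker exited before producing output")

out_q: "queue.Queue[str]" = queue.Queue()
t = threading.Thread(target=lambda: out_q.put("result"))
t.start()
result = try_collect(t, out_q)  # returns "result"
t.join()
```

The order of operations matters: the queue is drained before the liveness check, so a worker that produced a result and then exited is still collected cleanly.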

What's Changed

  • [Test] Solving the Issue of Whisper Model's GPU Memory Not Being Successfully Cleared and the Occasional Accuracy Problem of the Qwen3-omni Model Test by @yenuo26 in #1744
  • [Bagel]: Support multistage img2img by @princepride in #1669
  • [BugFix] Enable CPU offloading and Cache-DiT together on Diffusion Model by @yuanheng-zhao in #1723
  • [Doc] CLI Args Naming Style Correction by @wtomin in #1750
  • [Feature] Add Helm Chart to deploy vLLM-Omni on Kubernetes by @oglok in #1337
  • [Fix][Qwen3-TTS] Preserve ref_code decoder context for Base ICL by @Sy0307 in #1731
  • Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint by @ekagra-ranjan in #1255
  • [Enhancement][pytest] Check for process running during start server by @pi314ever in #1559
  • [CI]: Add core_model and cpu markers for L1 use case. by @zhumingjue138 in #1709
  • [Doc][skip-ci] Update installation instructions by @tzhouam in #1762
  • Revert "Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint" by @hsliuustc0106 in #1789
  • [BUGFIX] Add compatibility for mimo-audio with vLLM 0.17.0 by @qibaoyuan in #1752
  • [feat][Qwen3TTS] Simple dynamic TTFA based on Code2Wav load by @JuanPZuluaga in #1714
  • [Refactor][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips by @LJH-LBJ in #1758
  • [Feat][Qwen3-tts]: Add Gradio demo for online serving by @lishunyang12 in #1231
  • [Docs] update async chunk performance diagram by @R2-Y in #1741
  • [Feat] Enable expert parallel for diffusion MoE layers by @Semmer2 in #1323
  • [Bugfix]: SP attention not enabling when _sp_plan hooks are not applied by @wtomin in #1704
  • [skip ci] [Docs] Update WeChat QR code for community support by @david6666666 in #1802
  • update GpuMemoryMonitor to DeviceMemoryMonitor for all HW by @xuechendi in #1526
  • Add coordinator module and corresponding unit test by @NumberWan in #1465
  • [Model]: add FLUX.2-dev model by @nuclearwu in #1629
  • [skip ci][Docs] doc fix for example snippets by @SamitHuang in #1811
  • [Test] L4 complete diffusion feature test for Qwen-Image-Edit models by @fhfuih in #1682
  • [Frontend] ComfyUI video & LoRA support by @fhfuih in #1596
  • [Bugfix] Adjust Z-Image Tensor Parallelism Diff Threshold by @wtomin in #1808
  • [Bugfix] Expose base_model_paths property in _DiffusionServingModels by @RuixiangMa in #1771
  • [Bugfix] Report supported tasks for omni models to skip unnecessary chat init by @linyueqian in #1645
  • [Test] Add Qwen3-TTS nightly performance benchmark by @linyueqian in #1700
  • Add Qwen3-TTS benchmark scripts by @linyueqian in #1573
  • [Test] Skip the qwen3-omni relevant validation for a known issue 1367. by @yenuo26 in #1812
  • Fix duplicate get_supported_tasks definition in async_omni.py by @linyueqian in #1825
  • [Enhancement] Patch OmniStage.try_collect() with _proc alive checks by @pi314ever in #1560
  • [Doc][skip ci] Update readme with Video link for vLLM HK First Meetup by @congw729 in #1833
  • [Feat][Qwen3-TTS] Support streaming audio output for websocket by @Sy0307 in #1719
  • [Test] Nightly Buildkite Pytest Test Case Statistics And Send HTML Report By Email by @yenuo26 in #1674
  • [Enhancement] Patch OmniStage.try_collect() with ray alive checks by @pi314ever in #1561
  • [Feat][Diffusion]: Implement Component-Level VRAM Quota and Resource Domain Isolation by @Flink-ddd in #1582
  • [Feature]: Enable directly use OmniLLM init AR model by @princepride in #1821
  • [Enhancement] Upgrade cache-dit from 1.2.0 to 1.3.0 by @SamitHuang in #1834
  • [Bugfix] Modify _resolve_pytest_target to support glob patterns and return multiple paths by @yenuo26 in #1843
  • [Feat] add wav response_format when stream is true in /v1/audio/speec… by @lengrongfu in #1819
  • [BugFix]: Revert #1582 by @princepride in #1842
  • [Feature]: support Flux.2-dev cache_dit by @nuclearwu in #1814
  • [skip ci] update readme slides link by @hsliuustc0106 in #1850
  • [Model] Extend NPU support for HunyuanImage3 Diffusion Model by @ElleElleWu in #1689
  • [Config Refactor][1/2] Model Pipeline Configuration System by @lishunyang12 in #1115
  • [Test] Reduce SP & Offloading test cases for L2 by @fhfuih in #1839
  • [bugfix] Add Interleaved 2D Rotary Embedding for HunyuanImage3 by @usberkeley in #1784
  • [Bugfix] Fix Helios text_encoder embed_tokens all-zeros due to untied weights by @dubin555 in #1728
  • Enable async_scheduling by default for Qwen3-TTS by @linyueqian in #1853
  • [CI failure] Comment out test_zimage_vae_patch_parallel_tp2 by @Gaohan123 in #1856
  • Add Fish Speech S2 Pro support with online serving and voice cloning by @linyueqian in #1798
  • [skip CI][Docs] add connector design document by @natureofnature in #1737
  • [BugFix] Readme and example runner file for cosyvoice3 missed in refactoring by @divyanshsinghvi in #1685
  • [Refactor] Use SP Plan for LongCat Sequence Parallelism by @alex-jw-brooks in #1772
  • [CI failed] Disable test for zimage tensor parallelism by @Gaohan123 in #1870
  • [Bugfix] Fix SD3.5-medium attn2 uninitialized weights by @lishunyang12 in #1659
  • [Bugfix] fix layer-wise offload incompatible with cache-dit by @RuixiangMa in #1786
  • [CI failed] Disable Diffusion Tensor Parallelism Test by @Gaohan123 in #1876
  • [BugFix]: Fix bagel online inference bug by @princepride in #1804
  • [Frontend] Rewrite video API for async job lifecycle by @ieaves in #1665
  • [Diffusion] [Model] Dreamid-Omni from bytedance by @Bounty-hunter in #1855
  • [Bugfix] Restore voice upload API and profiler endpoints reverted by #1719 by @linyueqian in #1879
  • [BugFix] Fix Max Rank Handling in LoRA by @alex-jw-brooks in #1397
  • Buildkite hardware ci xpu test by @pi314ever in #1340
  • [CI] add multimodal processing correctness tests for Omni models by @zzhuoxin1508 in #1445
  • fix: propagate parallel_config through create_default_diffusion by @lishunyang12 in #1878
  • [CI pipeline] Re-enable Diffusion Tensor Parallelism Test in pipeline by @Gaohan123 in #1892
  • [skip CI][Docs][Benchmark]: clarify vbench parameter behavior and add t2v example by @asukaqaq-s in #1497
  • [Bugfix] Fix cpu offload and quantization compatibility by @RuixiangMa in #1473
  • [Feat] support SP for FLUX.2-klein by @RuixiangMa in #1250
  • [CI]: Add/Fix bagel e2e online/offline test by @princepride in #1895
  • [Feat] support HSDP for Flux family by @RuixiangMa in #1900
  • Add Governance section by @ywang96 in #1889
  • Update latest news section in README.md by @ywang96 in #1909
  • [Feature] Split #1303 Part 1: PD disaggregation scaffolding by @ahengljh in #1863
  • [NPU] Upgrade to v0.17.0 by @gcanlin in #1890
  • [Misc] removed qwen3_tts.py as it is out-dated by @lengrongfu in #1926
  • [Bug][Qwen3TTS][Streaming] remove dynamic initial chunk and only compute on initial request by @JuanPZuluaga in #1930
  • Fix Base voice clone streaming quality and stop-token crash by @linyueqian in #1945
  • [Docs] Update WeChat QR code for community support by @david6666666 in #1974
  • [skip ci][Docs] Update WeChat QR code (fix filename case) by @david6666666 in #1976
  • [Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring by @fake0fan in #1908
  • [Bugfix] Set PREEMPTED status when moving requests from running to waiting queue by @gcanlin in #1893
  • [Feature] Add cache-dit support for HunyuanImage3 by @Fishermanykx in #1848
  • [Feature]: Remove some useless hf_overrides in yaml by @princepride in #1898
  • [CI] Nightly Benchmark - Add an HTML generator, Update the EXCEL generator. by @congw729 in #1831
  • [Bug]: fix CUDA OOM during diffusion post-processing by @lishunyang12 in #1670
  • [Optim][Qwen3TTS] big boost model throughput+latency high concurrency by @JuanPZuluaga in #1852
  • [CI] [ROCm] Bugfix device environment issue by @tjtanaa in #1984
  • [CI]init intel ci dispatch in buildkite folder by @xuechendi in #1721
  • Fix OmniGen2 transformer config loading for HF models by @Joshna-Medisetty in #1934
  • [Test] L4 complete diffusion feature test for Bagel models by @NumberWan in #1938
  • [Performance] diffusion timing by @Bounty-hunter in #1757
  • [Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls by @DomBrown in #1985
  • [CI] Split BAGEL tests into dummy/real weight tiers (L2/L3) by @princepride in #1998
  • [Bugfix] Fix config misalignment between offline and online diffusion inference (Wan2.2, Qwen-Image series) by @SamitHuang in #1979
  • Add HF token to H100 jobs by @khluu in #2008
  • [Bugfix] Fix Ovis Image crash when guidance_scale is set without negative_prompt by @Dnoob in #1956
  • [Bugfix] fix helios video generate use cpu device by @lengrongfu in #1915
  • [XPU] update bagel modeling to remove cuda hardcode, add xpu stage_config by @xuechendi in #1931
  • [Fix] Fix slow hasattr in CUDAGraphWrapper.`__getattr__` by @ZeldaHuang in #1982
  • [Bugfix] revert PR#1758 which introduced the accuracy problem of qwen3-omni by @R2-Y in #2009
  • [Bugfix] Fix bug where online server cannot return multiple images by @Hu1Lcode in #2007
  • [CI] [ROCm] Setup test-ready.yml and test-merge.yml by @tjtanaa in #2017
  • Int8 Quantization Support for DiT (Z-Image & Qwen-Image) by @yjb767868009 in #1470
  • [Model] Add Voxtral TTS model by @y123456y78 in #1803
  • [Feat] Support T5 Tensor Parallelism by @yuanheng-zhao in #1881
  • [Feat][Qwen3TTS][Code2wav] triton SnakeBeta and Cuda Graph by @JuanPZuluaga in #1797
  • [Optim][Qwen3TTS][CodePredictor] support torch.compile with reduce-overhead and dynamic False by @JuanPZuluaga in #1913
  • [CI] Change Bagel online test environment variable VLLM_TEST_CLEAN_GPU_MEMORY to 0 by @princepride in #2032
  • [BugFix][Doc]Update voxtral_tts end2end.py & README.md by @y123456y78 in #2026
  • [Docs] Add Wan2.1-T2V as supported video generation models by @SamitHuang in #1920
  • [Bugfix] Remove duplicated config keyword max batch size by @tzhouam in #1851
  • [Test] Implement mock HTTP request handling in benchmark CLI tests by @yenuo26 in #2014
  • [CI] Fix test. by @congw729 in #2031
  • Refactor pipeline stage/step pipeline by @asukaqaq-s in #1368
  • [Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips by @LJH-LBJ in #2012
  • [Benchmark] [Diffusion] [Enhancement] Random dataset by @Bounty-hunter in #1657
  • [Bugfix] Z-Image CFG threshold should be > 0 instead of > 1 by @RuixiangMa in #1634
  • [Voxtral TTS] Remove redundant yaml by @y123456y78 in #2056
  • [Bugfix]: fixed ServerDisconnectedError in benchmark test (reapply #1683, fixes #1374) by @NumberWan in #1841
  • [Perf] Improve Fish Speech S2 Pro inference performance by @Sy0307 in #1859
  • [Voxtral] Improve example by @patrickvonplaten in #2045
  • [CI] Uncomment condition for nightly build in YAML by @Gaohan123 in #2057
  • [bugfix] /chat/completion doesn't read extra_body for diffusion model by @fhfuih in #2042
  • [BugFix][Qwen3TTS] CodePredictor CudaGraph Pool by @JuanPZuluaga in #2059
  • [Rebase] Rebase to vllm v0.18.0 by @tzhouam in #2037
  • [Doc] Update docs and dockerfiles for rebase of vllm v0.18.0 by @tzhouam in #2038
  • [Model] Add HunyuanVideo-1.5 T2V and I2V support by @lishunyang12 in #1516

New Contributors

Full Changelog: v0.17.0rc1...v0.18.0rc1
