vllm-project/vllm-omni v0.18.0rc1

Pre-release · 13 hours ago

Highlights

This release spans approximately 120 commits across 120+ pull requests from 50+ contributors, 13 of whom are first-time contributors.

Expanded Model Support

This release continues to grow the multimodal model ecosystem with several major additions:

  • Added FLUX.2-dev image generation model (#1629).
  • Added Bagel multistage img2img support (#1669).
  • Added HunyuanVideo-1.5 text-to-video and image-to-video support (#1516).
  • Added Voxtral TTS model (#1803, #2026, #2056).
  • Added Fish Speech S2 Pro with online serving and voice cloning (#1798).
  • Added Dreamid-Omni from ByteDance (#1855).
  • Extended NPU support for HunyuanImage3 diffusion model (#1689).
  • Added OmniGen2 transformer config loading for HF models (#1934).

Performance Improvements

Multiple optimizations improve throughput, latency, and runtime efficiency:

  • Qwen3-Omni code predictor re-prefill + SDPA to eliminate decode hot-path CPU round-trips (#2012).
  • Qwen3-TTS high-concurrency throughput & latency boost (#1852).
  • Qwen3-TTS Code2Wav triton SnakeBeta kernel and CUDA Graph support (#1797).
  • Qwen3-TTS CodePredictor torch.compile with reduce-overhead and dynamic=False (#1913).
  • Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls (#1985).
  • Simple dynamic TTFA based on Code2Wav load for Qwen3-TTS (#1714).
  • Enabled async_scheduling by default for Qwen3-TTS (#1853).
  • Fish Speech S2 Pro inference performance improvements (#1859).
  • Fix slow hasattr in CUDAGraphWrapper.`__getattr__` (#1982).
  • Diffusion timing profiling improvements (#1757).

Inference Infrastructure & Parallelism

New infrastructure capabilities improve scalability and production readiness:

  • Model Pipeline Configuration System refactor (Part 1) (#1115).
  • vLLM-Omni entrypoint refactoring for cleaner startup flow (#1908).
  • Expert parallel for diffusion MoE layers (#1323).
  • Sequence parallelism (SP) support for FLUX.2-klein (#1250) and HSDP for Flux family (#1900).
  • T5 Tensor Parallelism support (#1881).
  • LongCat Sequence Parallelism refactored to use SP Plan (#1772).
  • PD disaggregation scaffolding (Split #1303 Part 1) (#1863).
  • Coordinator module with unit tests (#1465).
  • Refactored pipeline stage/step pipeline (#1368).
  • Helm Chart to deploy vLLM-Omni on Kubernetes (#1337).

Text-to-Speech Improvements

Major TTS pipeline improvements for streaming, quality, and new models:

  • Streaming audio output via WebSocket for Qwen3-TTS (#1719).
  • Gradio demo for Qwen3-TTS online serving (#1231).
  • Added wav response_format when stream is true in /v1/audio/speech (#1819).
  • Fixed Base voice clone streaming quality and stop-token crash (#1945).
  • Fixed streaming initial-chunk handling: removed the dynamic initial chunk and compute it only on the initial request (#1930).
  • Preserved ref_code decoder context for Base ICL in Qwen3-TTS (#1731).
  • Restored voice upload API and profiler endpoints reverted by #1719 (#1879).
  • BugFix for CodePredictor CudaGraph Pool (#2059).
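
With wav now accepted as a response_format when stream is true (#1819), a client can write audio chunks to disk as they arrive. A minimal sketch with the requests library follows; the /v1/audio/speech path, response_format, and stream fields come from the release notes, while the server URL, model name, and voice are placeholders to adjust for your deployment:

```python
import requests

BASE_URL = "http://localhost:8000"  # placeholder; point at your vLLM-Omni server

def build_speech_request(text: str) -> dict:
    """Request body for /v1/audio/speech; model and voice names are placeholders."""
    return {
        "model": "Qwen3-TTS",
        "input": text,
        "voice": "default",
        "response_format": "wav",  # wav is now accepted when stream=true (#1819)
        "stream": True,
    }

def stream_speech(text: str, out_path: str = "out.wav") -> int:
    """POST the request and stream audio chunks to disk; returns bytes written."""
    written = 0
    with requests.post(f"{BASE_URL}/v1/audio/speech",
                       json=build_speech_request(text), stream=True) as resp:
        resp.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)
                written += len(chunk)
    return written
```

Streaming the response this way lets playback begin before synthesis finishes, which is the point of the TTFA-oriented changes above.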

Quantization & Hardware Support

  • Int8 quantization support for DiT (Z-Image & Qwen-Image) (#1470).
  • Added cache-dit support for HunyuanImage3 (#1848) and Flux.2-dev (#1814).
  • Enabled CPU offloading and Cache-DiT together on diffusion models (#1723).
  • Upgraded cache-dit from 1.2.0 to 1.3.0 (#1834).
  • NPU upgrade to v0.17.0 (#1890).
  • Updated Bagel modeling to remove CUDA hardcode and added XPU stage_config (#1931).
  • Updated GpuMemoryMonitor to DeviceMemoryMonitor for all hardware (#1526).
  • ROCm bugfix for device environment issues and CI setup (#1984, #2017).
  • Intel CI dispatch in Buildkite folder (#1721).

Frontend & Serving

  • ComfyUI video & LoRA support (#1596).
  • Rewrote video API for async job lifecycle (#1665).
  • Fix /chat/completion not reading extra_body for diffusion models (#2042).
  • Fix online server returning multiple images (#2007).
  • Fix Ovis Image crash when guidance_scale is set without negative_prompt (#1956).
  • Fix config misalignment between offline and online diffusion inference (#1979).
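
The extra_body fix (#2042) matters because diffusion-specific knobs ride alongside the standard chat fields in the request JSON. A sketch of how such a body is assembled, mirroring how the OpenAI Python client merges extra_body at the top level; the diffusion parameter names here are illustrative, not the exact server schema:

```python
def diffusion_chat_body(prompt: str, model: str, extra_body: dict) -> dict:
    """Build a /v1/chat/completions body with extra_body keys merged at the
    top level, as the OpenAI Python client does with its extra_body argument."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    body.update(extra_body)  # diffusion knobs sit next to the chat fields
    return body

body = diffusion_chat_body(
    "a watercolor fox",
    model="FLUX.2-dev",
    # Hypothetical diffusion parameters; check your model's accepted fields.
    extra_body={"num_inference_steps": 28, "guidance_scale": 4.0},
)
```

Before #2042, the server ignored these top-level extras for diffusion models, so sampler settings sent this way silently had no effect.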

Reliability, Tooling & Developer Experience

  • OmniStage.try_collect() patched with process alive checks (#1560) and Ray alive checks (#1561).
  • Nightly Buildkite Pytest test case statistics with HTML report by email (#1674).
  • Nightly Benchmark HTML generator and updated EXCEL generator (#1831).
  • Added multimodal processing correctness tests for Omni models (#1445).
  • Added Qwen3-TTS nightly performance benchmark (#1700) and benchmark scripts (#1573).
  • Added Governance section (#1889).
  • Rebase to vllm v0.18.0 (#2037, #2038).
  • Numerous bug fixes across models, configuration, parallelism, and CI pipelines.
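
The alive checks added to OmniStage.try_collect() (#1560, #1561) follow a common pattern: instead of blocking forever on a result queue, poll with a timeout and fail fast if the worker has died. A generic sketch of that pattern, using a thread as a stand-in worker; the real implementation checks subprocess and Ray actor liveness:

```python
import queue
import threading

def try_collect(worker: threading.Thread, q: "queue.Queue", timeout: float = 0.1):
    """Poll the worker's output queue, raising promptly if the worker exits
    without producing a result instead of hanging on an empty queue."""
    while True:
        try:
            return q.get(timeout=timeout)
        except queue.Empty:
            if not worker.is_alive():
                raise RuntimeError("worker exited before producing output")

out_q: "queue.Queue[str]" = queue.Queue()
t = threading.Thread(target=lambda: out_q.put("result"))
t.start()
result = try_collect(t, out_q)  # returns "result"
t.join()
```

The order of operations matters: the queue is drained before the liveness check, so a worker that produced a result and then exited is still collected cleanly.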

What's Changed

  • [Test] Solving the Issue of Whisper Model's GPU Memory Not Being Successfully Cleared and the Occasional Accuracy Problem of the Qwen3-omni Model Test by @yenuo26 in #1744
  • [Bagel]: Support multistage img2img by @princepride in #1669
  • [BugFix] Enable CPU offloading and Cache-DiT together on Diffusion Model by @yuanheng-zhao in #1723
  • [Doc] CLI Args Naming Style Correction by @wtomin in #1750
  • [Feature] Add Helm Chart to deploy vLLM-Omni on Kubernetes by @oglok in #1337
  • [Fix][Qwen3-TTS] Preserve ref_code decoder context for Base ICL by @Sy0307 in #1731
  • Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint by @ekagra-ranjan in #1255
  • [Enhancement][pytest] Check for process running during start server by @pi314ever in #1559
  • [CI]: Add core_model and cpu markers for L1 use case. by @zhumingjue138 in #1709
  • [Doc][skip-ci] Update installation instructions by @tzhouam in #1762
  • Revert "Add online serving to Stable Audio Diffusion and introduce v1/audio/generate endpoint" by @hsliuustc0106 in #1789
  • [BUGFIX] Add compatibility for mimo-audio with vLLM 0.17.0 by @qibaoyuan in #1752
  • [feat][Qwen3TTS] Simple dynamic TTFA based on Code2Wav load by @JuanPZuluaga in #1714
  • [Refactor][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips by @LJH-LBJ in #1758
  • [Feat][Qwen3-tts]: Add Gradio demo for online serving by @lishunyang12 in #1231
  • [Docs] update async chunk performance diagram by @R2-Y in #1741
  • [Feat] Enable expert parallel for diffusion MoE layers by @Semmer2 in #1323
  • [Bugfix]: SP attention not enabling when _sp_plan hooks are not applied by @wtomin in #1704
  • [skip ci] [Docs] Update WeChat QR code for community support by @david6666666 in #1802
  • update GpuMemoryMonitor to DeviceMemoryMonitor for all HW by @xuechendi in #1526
  • Add coordinator module and corresponding unit test by @NumberWan in #1465
  • [Model]: add FLUX.2-dev model by @nuclearwu in #1629
  • [skip ci][Docs] doc fix for example snippets by @SamitHuang in #1811
  • [Test] L4 complete diffusion feature test for Qwen-Image-Edit models by @fhfuih in #1682
  • [Frontend] ComfyUI video & LoRA support by @fhfuih in #1596
  • [Bugfix] Adjust Z-Image Tensor Parallelism Diff Threshold by @wtomin in #1808
  • [Bugfix] Expose base_model_paths property in _DiffusionServingModels by @RuixiangMa in #1771
  • [Bugfix] Report supported tasks for omni models to skip unnecessary chat init by @linyueqian in #1645
  • [Test] Add Qwen3-TTS nightly performance benchmark by @linyueqian in #1700
  • Add Qwen3-TTS benchmark scripts by @linyueqian in #1573
  • [Test] Skip the qwen3-omni relevant validation for a known issue 1367. by @yenuo26 in #1812
  • Fix duplicate get_supported_tasks definition in async_omni.py by @linyueqian in #1825
  • [Enhancement] Patch OmniStage.try_collect() with _proc alive checks by @pi314ever in #1560
  • [Doc][skip ci] Update readme with Video link for vLLM HK First Meetup by @congw729 in #1833
  • [Feat][Qwen3-TTS] Support streaming audio output for websocket by @Sy0307 in #1719
  • [Test] Nightly Buildkite Pytest Test Case Statistics And Send HTML Report By Email by @yenuo26 in #1674
  • [Enhancement] Patch OmniStage.try_collect() with ray alive checks by @pi314ever in #1561
  • [Feat][Diffusion]: Implement Component-Level VRAM Quota and Resource Domain Isolation by @Flink-ddd in #1582
  • [Feature]: Enable directly use OmniLLM init AR model by @princepride in #1821
  • [Enhancement] Upgrade cache-dit from 1.2.0 to 1.3.0 by @SamitHuang in #1834
  • [Bugfix] Modify _resolve_pytest_target to support glob patterns and return multiple paths by @yenuo26 in #1843
  • [Feat] add wav response_format when stream is true in /v1/audio/speec… by @lengrongfu in #1819
  • [BugFix]: Revert #1582 by @princepride in #1842
  • [Feature]: support Flux.2-dev cache_dit by @nuclearwu in #1814
  • [skip ci] update readme slides link by @hsliuustc0106 in #1850
  • [Model] Extend NPU support for HunyuanImage3 Diffusion Model by @ElleElleWu in #1689
  • [Config Refactor][1/2] Model Pipeline Configuration System by @lishunyang12 in #1115
  • [Test] Reduce SP & Offloading test cases for L2 by @fhfuih in #1839
  • [bugfix] Add Interleaved 2D Rotary Embedding for HunyuanImage3 by @usberkeley in #1784
  • [Bugfix] Fix Helios text_encoder embed_tokens all-zeros due to untied weights by @dubin555 in #1728
  • Enable async_scheduling by default for Qwen3-TTS by @linyueqian in #1853
  • [CI failure] Comment out test_zimage_vae_patch_parallel_tp2 by @Gaohan123 in #1856
  • Add Fish Speech S2 Pro support with online serving and voice cloning by @linyueqian in #1798
  • [skip CI][Docs] add connector design document by @natureofnature in #1737
  • [BugFix] Readme and example runner file for cosyvoice3 missed in refactoring by @divyanshsinghvi in #1685
  • [Refactor] Use SP Plan for LongCat Sequence Parallelism by @alex-jw-brooks in #1772
  • [CI failed] Disable test for zimage tensor parallelism by @Gaohan123 in #1870
  • [Bugfix] Fix SD3.5-medium attn2 uninitialized weights by @lishunyang12 in #1659
  • [Bugfix] fix layer-wise offload incompatible with cache-dit by @RuixiangMa in #1786
  • [CI failed] Disable Diffusion Tensor Parallelism Test by @Gaohan123 in #1876
  • [BugFix]: Fix bagel online inference bug by @princepride in #1804
  • [Frontend] Rewrite video API for async job lifecycle by @ieaves in #1665
  • [Diffusion] [Model] Dreamid-Omni from bytedance by @Bounty-hunter in #1855
  • [Bugfix] Restore voice upload API and profiler endpoints reverted by #1719 by @linyueqian in #1879
  • [BugFix] Fix Max Rank Handling in LoRA by @alex-jw-brooks in #1397
  • Buildkite hardware ci xpu test by @pi314ever in #1340
  • [CI] add multimodal processing correctness tests for Omni models by @zzhuoxin1508 in #1445
  • fix: propagate parallel_config through create_default_diffusion by @lishunyang12 in #1878
  • [CI pipeline] Re-enable Diffusion Tensor Parallelism Test in pipeline by @Gaohan123 in #1892
  • [skip CI][Docs][Benchmark]: clarify vbench parameter behavior and add t2v example by @asukaqaq-s in #1497
  • [Bugfix] Fix cpu offload and quantization compatibility by @RuixiangMa in #1473
  • [Feat] support SP for FLUX.2-klein by @RuixiangMa in #1250
  • [CI]: Add/Fix bagel e2e online/offline test by @princepride in #1895
  • [Feat] support HSDP for Flux family by @RuixiangMa in #1900
  • Add Governance section by @ywang96 in #1889
  • Update latest news section in README.md by @ywang96 in #1909
  • [Feature] Split #1303 Part 1: PD disaggregation scaffolding by @ahengljh in #1863
  • [NPU] Upgrade to v0.17.0 by @gcanlin in #1890
  • [Misc] removed qwen3_tts.py as it is out-dated by @lengrongfu in #1926
  • [Bug][Qwen3TTS][Streaming] remove dynamic initial chunk and only compute on initial request by @JuanPZuluaga in #1930
  • Fix Base voice clone streaming quality and stop-token crash by @linyueqian in #1945
  • [Docs] Update WeChat QR code for community support by @david6666666 in #1974
  • [skip ci][Docs] Update WeChat QR code (fix filename case) by @david6666666 in #1976
  • [Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring by @fake0fan in #1908
  • [Bugfix] Set PREEMPTED status when moving requests from running to waiting queue by @gcanlin in #1893
  • [Feature] Add cache-dit support for HunyuanImage3 by @Fishermanykx in #1848
  • [Feature]: Remove some useless hf_overrides in yaml by @princepride in #1898
  • [CI] Nightly Benchmark - Add an HTML generator, Update the EXCEL generator. by @congw729 in #1831
  • [Bug]: fix CUDA OOM during diffusion post-processing by @lishunyang12 in #1670
  • [Optim][Qwen3TTS] big boost model throughput+latency high concurrency by @JuanPZuluaga in #1852
  • [CI] [ROCm] Bugfix device environment issue by @tjtanaa in #1984
  • [CI]init intel ci dispatch in buildkite folder by @xuechendi in #1721
  • Fix OmniGen2 transformer config loading for HF models by @Joshna-Medisetty in #1934
  • [Test] L4 complete diffusion feature test for Bagel models by @NumberWan in #1938
  • [Performance] diffusion timing by @Bounty-hunter in #1757
  • [Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls by @DomBrown in #1985
  • [CI] Split BAGEL tests into dummy/real weight tiers (L2/L3) by @princepride in #1998
  • [Bugfix] Fix config misalignment between offline and online diffusion inference (Wan2.2, Qwen-Image series) by @SamitHuang in #1979
  • Add HF token to H100 jobs by @khluu in #2008
  • [Bugfix] Fix Ovis Image crash when guidance_scale is set without negative_prompt by @Dnoob in #1956
  • [Bugfix] fix helios video generate use cpu device by @lengrongfu in #1915
  • [XPU] update bagel modeling to remove cuda hardcode, add xpu stage_config by @xuechendi in #1931
  • [Fix] Fix slow hasattr in CUDAGraphWrapper.`__getattr__` by @ZeldaHuang in #1982
  • [Bugfix] revert PR#1758 which introduced the accuracy problem of qwen3-omni by @R2-Y in #2009
  • [Bugfix] Fix bug where online server cannot return multiple images by @Hu1Lcode in #2007
  • [CI] [ROCm] Setup test-ready.yml and test-merge.yml by @tjtanaa in #2017
  • Int8 Quantization Support for DiT (Z-Image & Qwen-Image) by @yjb767868009 in #1470
  • [Model] Add Voxtral TTS model by @y123456y78 in #1803
  • [Feat] Support T5 Tensor Parallelism by @yuanheng-zhao in #1881
  • [Feat][Qwen3TTS][Code2wav] triton SnakeBeta and Cuda Graph by @JuanPZuluaga in #1797
  • [Optim][Qwen3TTS][CodePredictor] support torch.compile with reduce-overhead and dynamic False by @JuanPZuluaga in #1913
  • [CI] Change Bagel online test environment variable VLLM_TEST_CLEAN_GPU_MEMORY to 0 by @princepride in #2032
  • [BugFix][Doc]Update voxtral_tts end2end.py & README.md by @y123456y78 in #2026
  • [Docs] Add Wan2.1-T2V as supported video generation models by @SamitHuang in #1920
  • [Bugfix] Remove duplicated config keyword max batch size by @tzhouam in #1851
  • [Test] Implement mock HTTP request handling in benchmark CLI tests by @yenuo26 in #2014
  • [CI] Fix test. by @congw729 in #2031
  • Refactor pipeline stage/step pipeline by @asukaqaq-s in #1368
  • [Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips by @LJH-LBJ in #2012
  • [Benchmark] [Diffusion] [Enhancement] Random dataset by @Bounty-hunter in #1657
  • [Bugfix] Z-Image CFG threshold should be > 0 instead of > 1 by @RuixiangMa in #1634
  • [Voxtral TTS] Remove redundant yaml by @y123456y78 in #2056
  • [Bugfix]: fixed ServerDisconnectedError in benchmark test (reapply #1683, fixes #1374) by @NumberWan in #1841
  • [Perf] Improve Fish Speech S2 Pro inference performance by @Sy0307 in #1859
  • [Voxtral] Improve example by @patrickvonplaten in #2045
  • [CI] Uncomment condition for nightly build in YAML by @Gaohan123 in #2057
  • [bugfix] /chat/completion doesn't read extra_body for diffusion model by @fhfuih in #2042
  • [BugFix][Qwen3TTS] CodePredictor CudaGraph Pool by @JuanPZuluaga in #2059
  • [Rebase] Rebase to vllm v0.18.0 by @tzhouam in #2037
  • [Doc] Update docs and dockerfiles for rebase of vllm v0.18.0 by @tzhouam in #2038
  • [Model] Add HunyuanVideo-1.5 T2V and I2V support by @lishunyang12 in #1516

New Contributors

Full Changelog: v0.17.0rc1...v0.18.0rc1
