Highlights
vLLM-Omni v0.19.0rc1 is a rebase-and-production-readiness release candidate aligned with upstream vLLM v0.19.0, comprising 71 commits since v0.18.0. It strengthens the runtime and serving stack, expands speech/TTS and diffusion/video capabilities, improves production behavior for the Bagel and Wan pipelines, and broadens deployment coverage across new platforms and distributed execution modes.
Key Improvements
- Rebased to upstream vLLM v0.19.0, while continuing runtime cleanup and stage execution refactors that improve orchestration and production robustness. (#2475, #2006)
- Expanded speech and TTS serving, including new OmniVoice two-stage support, CosyVoice3 online serving, and multiple Qwen3-TTS / Fish Speech quality and latency fixes. (#2463, #2431, #2108, #2446, #2378, #2358)
- Improved diffusion and video generation workflows across Bagel, Wan2.2, FLUX.2-dev, and LTX-2, with lower latency, better forwarding behavior, and stronger production correctness. (#2398, #2422, #2397, #2381, #2459, #2393, #2433, #2260)
- Broadened deployment coverage, adding MUSA platform support, improving XPU readiness, and extending distributed diffusion features such as HSDP and CFG parallelism. (#2337, #2428, #2029, #2021, #1751)
Core Architecture & Runtime
- Rebased the project to upstream vLLM v0.19.0, keeping vLLM-Omni aligned with the latest upstream runtime behavior and APIs. (#2475)
- Continued the stage/runtime refactor by moving stage-side inference into dedicated subprocess-based clients and procs, simplifying orchestration and improving isolation for both AR and diffusion stages. (#2006)
- Added session-based streaming audio input with a realtime WebSocket path for Qwen3-Omni-style workflows, enabling incremental audio input and streamed transcription/output flows. (#2208)
- Added a nightly wheel release index, making it easier to validate and consume nightly builds in testing and pre-release workflows. (#2345)
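The session-based streaming audio path (#2208) feeds incremental audio to the server over a realtime WebSocket. As a minimal sketch of what one client-side frame might look like, assuming a JSON message schema with illustrative field names (`session_id`, `type`, and `audio` are placeholders, not confirmed by these notes):

```python
import base64
import json

def make_audio_chunk_message(session_id: str, pcm_bytes: bytes) -> str:
    """Build a hypothetical JSON frame carrying one incremental audio chunk.

    Field names here are illustrative only; consult PR #2208 and the serving
    docs for the actual realtime WebSocket message schema.
    """
    return json.dumps({
        "session_id": session_id,
        "type": "input_audio.append",  # assumed event name
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

# Encode a short dummy PCM buffer into one frame.
frame = make_audio_chunk_message("sess-1", b"\x00\x01" * 160)
```

Frames like this would be sent over the session's WebSocket connection while transcription/output is streamed back; the actual event names and audio encoding are defined by the implementation, not by this sketch.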
Model Support
- Added OmniVoice two-stage TTS serving support, bringing zero-shot multilingual speech generation into the vLLM-Omni serving stack. (#2463)
- Added and stabilized CosyVoice3 online serving through /v1/audio/speech, including stage config fixes and CI coverage. (#2431)
- Added LTX-2 distilled two-stage inference for both text-to-video and image-to-video production workflows. (#2260)
- Added Wan 2.1 VACE support for conditional video generation workflows, including multiple conditioning modes. (#1885)
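The CosyVoice3 online-serving path above is reached through the OpenAI-style /v1/audio/speech route. A minimal request-body sketch, assuming OpenAI-compatible field names (the model id and voice value are placeholders, not confirmed by these notes):

```python
import json

# Hypothetical request body for the OpenAI-compatible /v1/audio/speech route.
# "model" and "voice" values below are placeholders; check the model card and
# the vLLM-Omni serving docs for the identifiers your deployment expects.
payload = {
    "model": "CosyVoice3",            # placeholder model id
    "input": "Hello from vLLM-Omni.",
    "voice": "default",               # placeholder voice name
    "response_format": "wav",
}
body = json.dumps(payload)
```

With a server running, a body like this could be POSTed to `http://<host>:<port>/v1/audio/speech`; the exact set of accepted fields is determined by the deployment, not by this sketch.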
Audio, Speech & Omni Production Optimization
- Improved Qwen3-TTS repeated custom-voice serving by introducing an in-memory voice cache for reference-audio artifacts, reducing warm-request latency for repeated voices. (#2108)
- Fixed a Fish Speech structured voice-clone conditioning regression so cloned voice quality is restored in the prefill path. (#2446)
- Fixed Qwen3-TTS chunk-boundary handling, case-insensitive speaker lookup, and demo-serving issues to make TTS behavior more reliable in real deployments. (#2378, #2358, #2372)
- Added better benchmark support for Qwen3-TTS Base and VoiceDesign models so serving and HF benchmark paths correctly reflect task-specific request formats. (#2411)
Diffusion, Image & Video Generation
- Improved Wan2.2 runtime efficiency by optimizing rotary embedding behavior and skipping unnecessary cross-attention Ulysses SP paths where appropriate. (#2393, #2459)
- Strengthened Bagel production behavior with earlier KV-ready forwarding, fixes for delayed decoding in AR/DiT workflows, proper single-stage img2img routing, and a dedicated single-stage config. (#2398, #2422, #2397, #2381)
- Added Bagel thinking mode in multi-stage serving, expanding interactive and reasoning-style generation workflows. (#2447)
- Fixed FLUX.2-dev guidance handling so guidance scale is applied correctly during generation. (#2433)
- Added a synchronous /v1/videos/sync endpoint for latency-sensitive benchmarking and direct-response video generation workflows. (#2049)
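The synchronous endpoint returns the generated video directly in the response instead of going through a job/polling flow, which is what makes it useful for benchmarking. A sketch of a request body, assuming illustrative parameter names (`prompt`, `size`, and `num_frames` are assumptions; only the /v1/videos/sync path comes from these notes):

```python
import json

# Sketch of a direct-response request to the synchronous video endpoint
# added in #2049. The field names below are illustrative assumptions;
# consult the endpoint's actual request schema before using them.
request_body = {
    "prompt": "a paper boat drifting down a rain gutter",
    "size": "832x480",    # assumed parameter name
    "num_frames": 49,     # assumed parameter name
}
encoded = json.dumps(request_body).encode("utf-8")
```

Because the call blocks until the video is ready, client timeouts for this route should be sized to worst-case generation latency rather than typical API latencies.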
Quantization & Memory Efficiency
- Added offline AutoRound W4A16 support for diffusion models, improving deployability for memory-constrained setups. (#1777)
- Fixed layer-wise offload incompatibility with HSDP, improving compatibility between memory-saving and distributed execution paths. (#2021)
Platforms, Distributed Execution & Hardware Coverage
- Added MUSA platform support for Moore Threads GPUs, expanding vLLM-Omni beyond the existing CUDA/ROCm/NPU/XPU coverage. (#2337)
- Improved XPU readiness for speech serving by removing CUDA-only assumptions in Voxtral TTS components and adding an XPU stage config. (#2428)
- Expanded distributed diffusion support with HSDP for Qwen-Image-series, Z-Image, and GLM-Image, and added CFG parallel support for HunyuanImage3.0. (#2029, #1751)
- Fixed distributed gather behavior for non-contiguous tensors, improving correctness in CFG-parallel and related distributed paths. (#2367)
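The gather fix matters because collectives that copy flat buffers assume contiguous memory, so a strided view gathers elements in the wrong order. `GroupCoordinator.all_gather` itself is vLLM internals; a NumPy analogue is enough to illustrate the layout pitfall and the ensure-contiguous remedy:

```python
import numpy as np

# A transposed view shares storage with its parent and is not C-contiguous,
# so reading its flat buffer yields the parent's element order, not the
# view's logical order. Making a contiguous copy first (as the #2367 fix
# does before all_gather) restores the expected layout.
x = np.arange(6).reshape(2, 3)
view = x.T                           # strided view, not C-contiguous
fixed = np.ascontiguousarray(view)   # contiguous copy, same logical values

assert not view.flags["C_CONTIGUOUS"]
assert fixed.flags["C_CONTIGUOUS"]
```

The same invariant applies to any framework whose communication primitives operate on raw buffers: normalize layout before handing a tensor to the collective.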
CI, Benchmarks & Documentation
- Refreshed the diffusion documentation structure around feature compatibility, parallelism, cache acceleration, quantization, and serving examples, making the diffusion stack easier to navigate and adopt.
- Expanded CI and E2E coverage for speech, diffusion, and video-serving scenarios, especially around CosyVoice3, Qwen3-TTS benchmarking, and Wan-family validation. (#2431, #2411, #2262)
Note
- v0.19.0rc1 is a release candidate focused on validating the upstream rebase, the refreshed runtime architecture, and the expanded speech/diffusion/platform support before the final v0.19.0 release.
- Some low-signal CI and documentation maintenance changes were intentionally merged into broader themes instead of being listed one-by-one, following the project's recent release-note style.
What's Changed
- [Bugfix][HunyuanImage3.0] Fix default guidance_scale from 1.0 to 4.0 and port GPU MoE ForwardContext fix from NPU by @nussejzz in #2142
- [Feat] support quantization for Flux Kontext by @RuixiangMa in #2184
- [Tests][Qwen3-Omni] Add performance test cases by @amy-why-3459 in #2011
- [Docs] Modify the documentation description for streaming output by @amy-why-3459 in #2300
- Fix: Enable /v1/models endpoint for pure diffusion mode by @majiayu000 in #805
- [skip ci] [Docs]: add CI Failures troubleshooting guide for contributors by @lishunyang12 in #1259
- [Qwen3-Omni][Bugfix] Replace vLLM fused layers with HF-compatible numerics in code predictor by @LJH-LBJ in #2291
- [Feature] [HunyuanImage3] Add TeaCache support for inference acceleration by @nussejzz in #1927
- [Misc] Make gradio an optional dependency and upgrade to >=6.7.0 by @Lidang-Jiang in #2221
- [ROCm] [CI] Migrate to use amd docker hub for ci by @tjtanaa in #2303
- [Feat] add helios fp8 quantization by @lengrongfu in #1916
- [Bugfix] fix: handle Qwen-Image-Layered layered RGBA output for jpeg edits by @david6666666 in #2297
- [Doc] Add transformers version requirement in GLM-Image example doc by @chickeyton in #2265
- [Bugfix] Fix Qwen3TTSConfig init order to be compatible with newer Transformers (5.x) by @RuixiangMa in #2306
- [Test] Add Qwen-tts test cases and unify the style of existing test cases by @yenuo26 in #2195
- [skip ci][Doc] Refine the Diffusion Features User Guide by @wtomin in #1928
- [Bugfix] fix: return 400 for unsupported multi-image edits such as Qwen-Image-Layered by @david6666666 in #2298
- [Bugfix] fix: validate layered image layers range by @david6666666 in #2334
- [skip ci][Docs] reorganize multiple L4 test guidelines by @fhfuih in #2119
- [Diffusion] Refactor CFG parallel for extensibility and performance by @TKONIY in #2063
- Fix Qwen3-TTS Base on NPU running failed by @OrangePure in #2353
- [Test] Fix 4 broken Qwen3-TTS async chunk unit tests by @linyueqian in #2351
- [Test] Add qwen3-omni tests for audio_in_video and one word prompt by @yenuo26 in #2097
- [CI] fix test: use minimum supported layered output count by @david6666666 in #2350
- [CI]test: add wan22 i2v video similarity e2e by @david6666666 in #2262
- [Bugfix] Fix case-sensitivity in Qwen3 TTS speaker name lookup by @reidliu41 in #2358
- Fix Qwen3-TTS gradio demo by @noobHappylife in #2372
- [skip ci] update release 0.18.0 by @hsliuustc0106 in #2380
- [Bugfix] Update Whisper model loading to support multi-GPU configurations and optimize CUDA memory management by @yenuo26 in #2354
- [release] Add nightly wheel release index by @khluu in #2345
- [BugFix] Add BAGEL single-stage diffusion config and fix multiple <im_start><im_end> bug by @princepride in #2381
- [Bugfix] Fix layer-wise offload incompatibility with HSDP by @RuixiangMa in #2021
- [BugFix] qwen3_tts chunk boundary handling logic in initial chunk (IC) by @Fattysand in #2378
- [Feat][Benchmark] Add synchronous video generation endpoint POST /v1/videos/sync for benchmark test by @SamitHuang in #2049
- [Docs] Update WeChat QR code for community support by @david6666666 in #2402
- [CI] [skip ci]Nightly Report Optim by @congw729 in #2406
- [Feature][HunyuanImage3.0] Add cfgP to HunyuanImage3.0 by @nussejzz in #1751
- Fix: ensure input tensor is contiguous in GroupCoordinator.all_gather by @daixinning in #2367
- [Perf] Bagel KV-ready early forwarding and time step consistency for /v1/chat/completions by @natureofnature in #2398
- [Feat] Support step-boundary abort in diffusion by @asukaqaq-s in #1769
- [BugFix]: Fix bagel single-stage img2img fallback to text2img bug by @princepride in #2397
- [Feat] Add MUSA platform support for Moore Threads GPUs by @yeahdongcn in #2337
- Add new committers to governance page by @ywang96 in #2419
- [CI] Tune GPU resources for test by @tjtanaa in #2401
- [Feat] support HSDP for Qwen-image series, Z-Image, GLM-Image by @RuixiangMa in #2029
- [Bugfix] Fix delayed decoding bug for Bagel AR/DIT workflow (L3 test_bagel_img2img error) by @natureofnature in #2422
- [skip ci][Doc] Update RFC template doc by @yuanheng-zhao in #2141
- [Test] Add voice or language test case for Qwen3-omni and Qwen-tts by @yenuo26 in #1844
- [skip ci][Doc] Small fix of Doc by @wtomin in #2400
- [Feat] Add benchmarks for Qwen3-TTS Base/VoiceDesign Model by @JasonJ2021 in #2411
- [CI] [skip ci] Rename & reset timeout mins for nightly L4 tests. by @congw729 in #2251
- [AutoRound] Add offline quantized W4A16 model support by @yiliu30 in #1777
- [Perf] Optimize Wan2.2 rotary embedding by @gcanlin in #2393
- Add VACE support for WAN 2.1 conditional video generation by @tangbinh in #1885
- [skip ci][Bugfix] clean useless log by @R2-Y in #2450
- [Test] Skip tests/e2e/online_serving/test_zimage_expansion.py due to issue #2435 by @zhumingjue138 in #2454
- [Feature] add session based audio streaming input by @Shirley125 in #2208
- Update MRoPE config fallback logic by @vraiti in #2278
- [Docs] Update docs to use vllm-ascend v0.18.0rc1 by @gcanlin in #2453
- [BAGEL] [Feature]: Add thinking mode in Bagel multi-stage serving by @princepride in #2447
- [BugFix][FishSpeech] Fix structured voice clone prefill conditioning by @Sy0307 in #2446
- Refactor StageDiffusionClient and StageEngineCoreClient by @chickeyton in #2006
- [Perf] Skip Wan2.2 cross attn Ulysses SP by @gcanlin in #2459
- [Model] Add two stages inference for model LTX-2 distilled. by @Songrui625 in #2260
- [Cleanup] Replace bare print() with logger and use specific exception types by @Lidang-Jiang in #2228
- [Bugfix] Fix Flux2 Dev Guidance by @alex-jw-brooks in #2433
- [OmniVoice] Add two-stage TTS serving support by @linyueqian in #2463
- [Qwen3TTS] [TTS] [Feat] Refactor voice cache manager by @JuanPZuluaga in #2108
- [CosyVoice3] Add online serving support, fix stage config, and add CI tests by @linyueqian in #2431
- [Rebase] Rebase to vllm v0.19.0 by @tzhouam in #2475
- Voxtral TTS: drop hardcoded CUDA in audio tokenizer; add XPU stage config by @Joshna-Medisetty in #2428
New Contributors
- @chickeyton made their first contribution in #2265
- @TKONIY made their first contribution in #2063
- @OrangePure made their first contribution in #2353
- @noobHappylife made their first contribution in #2372
- @Fattysand made their first contribution in #2378
- @daixinning made their first contribution in #2367
- @yeahdongcn made their first contribution in #2337
- @JasonJ2021 made their first contribution in #2411
- @yiliu30 made their first contribution in #1777
- @tangbinh made their first contribution in #1885
- @vraiti made their first contribution in #2278
- @Songrui625 made their first contribution in #2260
Full Changelog: v0.18.0...v0.19.0rc1