Highlights
vLLM-Omni v0.19.0rc1 is a rebase-and-production-readiness release candidate aligned with upstream vLLM v0.19.0, comprising 71 commits since v0.18.0. It strengthens the runtime and serving stack, expands speech/TTS and diffusion/video capabilities, improves production behavior for the Bagel and Wan pipelines, and broadens deployment coverage across new platforms and distributed execution modes.
Key Improvements
- Rebased to upstream vLLM v0.19.0, while continuing runtime cleanup and stage execution refactors that improve orchestration and production robustness. (#2475, #2006)
- Expanded speech and TTS serving, including new OmniVoice two-stage support, CosyVoice3 online serving, and multiple Qwen3-TTS / Fish Speech quality and latency fixes. (#2463, #2431, #2108, #2446, #2378, #2358)
- Improved diffusion and video generation workflows across Bagel, Wan2.2, FLUX.2-dev, and LTX-2, with lower latency, better forwarding behavior, and stronger production correctness. (#2398, #2422, #2397, #2381, #2459, #2393, #2433, #2260)
- Broadened deployment coverage, adding MUSA platform support, improving XPU readiness, and extending distributed diffusion features such as HSDP and CFG parallelism. (#2337, #2428, #2029, #2021, #1751)
Core Architecture & Runtime
- Rebased the project to upstream vLLM v0.19.0, keeping vLLM-Omni aligned with the latest upstream runtime behavior and APIs. (#2475)
- Continued the stage/runtime refactor by moving stage-side inference into dedicated subprocess-based clients and procs, simplifying orchestration and improving isolation for both AR and diffusion stages. (#2006)
- Added session-based streaming audio input with a realtime WebSocket path for Qwen3-Omni-style workflows, enabling incremental audio input and streamed transcription/output flows. (#2208)
- Added a nightly wheel release index, making it easier to validate and consume nightly builds in testing and pre-release workflows. (#2345)
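The session-based streaming audio path (#2208) feeds incremental audio to the server over a realtime WebSocket. As a minimal sketch of what one client-side frame might look like, assuming a JSON message schema with illustrative field names (`session_id`, `type`, and `audio` are placeholders, not confirmed by these notes):

```python
import base64
import json

def make_audio_chunk_message(session_id: str, pcm_bytes: bytes) -> str:
    """Build a hypothetical JSON frame carrying one incremental audio chunk.

    Field names here are illustrative only; consult PR #2208 and the serving
    docs for the actual realtime WebSocket message schema.
    """
    return json.dumps({
        "session_id": session_id,
        "type": "input_audio.append",  # assumed event name
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

# Encode a short dummy PCM buffer into one frame.
frame = make_audio_chunk_message("sess-1", b"\x00\x01" * 160)
```

Frames like this would be sent over the session's WebSocket connection while transcription/output is streamed back; the actual event names and audio encoding are defined by the implementation, not by this sketch.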
Model Support
- Added OmniVoice two-stage TTS serving support, bringing zero-shot multilingual speech generation into the vLLM-Omni serving stack. (#2463)
- Added and stabilized CosyVoice3 online serving through /v1/audio/speech, including stage config fixes and CI coverage. (#2431)
- Added LTX-2 distilled two-stage inference for both text-to-video and image-to-video production workflows. (#2260)
- Added Wan 2.1 VACE support for conditional video generation workflows, including multiple conditioning modes. (#1885)
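The CosyVoice3 online-serving path above is reached through the OpenAI-style /v1/audio/speech route. A minimal request-body sketch, assuming OpenAI-compatible field names (the model id and voice value are placeholders, not confirmed by these notes):

```python
import json

# Hypothetical request body for the OpenAI-compatible /v1/audio/speech route.
# "model" and "voice" values below are placeholders; check the model card and
# the vLLM-Omni serving docs for the identifiers your deployment expects.
payload = {
    "model": "CosyVoice3",            # placeholder model id
    "input": "Hello from vLLM-Omni.",
    "voice": "default",               # placeholder voice name
    "response_format": "wav",
}
body = json.dumps(payload)
```

With a server running, a body like this could be POSTed to `http://<host>:<port>/v1/audio/speech`; the exact set of accepted fields is determined by the deployment, not by this sketch.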
Audio, Speech & Omni Production Optimization
- Improved Qwen3-TTS repeated custom-voice serving by introducing an in-memory voice cache for reference-audio artifacts, reducing warm-request latency for repeated voices. (#2108)
- Fixed a Fish Speech structured voice-clone conditioning regression so cloned voice quality is restored in the prefill path. (#2446)
- Fixed Qwen3-TTS chunk-boundary handling, case-insensitive speaker lookup, and demo-serving issues to make TTS behavior more reliable in real deployments. (#2378, #2358, #2372)
- Added better benchmark support for Qwen3-TTS Base and VoiceDesign models so serving and HF benchmark paths correctly reflect task-specific request formats. (#2411)
Diffusion, Image & Video Generation
- Improved Wan2.2 runtime efficiency by optimizing rotary embedding behavior and skipping unnecessary cross-attention Ulysses SP paths where appropriate. (#2393, #2459)
- Strengthened Bagel production behavior with earlier KV-ready forwarding, fixes for delayed decoding in AR/DiT workflows, proper single-stage img2img routing, and a dedicated single-stage config. (#2398, #2422, #2397, #2381)
- Added Bagel thinking mode in multi-stage serving, expanding interactive and reasoning-style generation workflows. (#2447)
- Fixed FLUX.2-dev guidance handling so guidance scale is applied correctly during generation. (#2433)
- Added a synchronous /v1/videos/sync endpoint for latency-sensitive benchmarking and direct-response video generation workflows. (#2049)
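The synchronous endpoint returns the generated video directly in the response instead of going through a job/polling flow, which is what makes it useful for benchmarking. A sketch of a request body, assuming illustrative parameter names (`prompt`, `size`, and `num_frames` are assumptions; only the /v1/videos/sync path comes from these notes):

```python
import json

# Sketch of a direct-response request to the synchronous video endpoint
# added in #2049. The field names below are illustrative assumptions;
# consult the endpoint's actual request schema before using them.
request_body = {
    "prompt": "a paper boat drifting down a rain gutter",
    "size": "832x480",    # assumed parameter name
    "num_frames": 49,     # assumed parameter name
}
encoded = json.dumps(request_body).encode("utf-8")
```

Because the call blocks until the video is ready, client timeouts for this route should be sized to worst-case generation latency rather than typical API latencies.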
Quantization & Memory Efficiency
- Added offline AutoRound W4A16 support for diffusion models, improving deployability for memory-constrained setups. (#1777)
- Fixed layer-wise offload incompatibility with HSDP, improving compatibility between memory-saving and distributed execution paths. (#2021)
Platforms, Distributed Execution & Hardware Coverage
- Added MUSA platform support for Moore Threads GPUs, expanding vLLM-Omni beyond the existing CUDA/ROCm/NPU/XPU coverage. (#2337)
- Improved XPU readiness for speech serving by removing CUDA-only assumptions in Voxtral TTS components and adding an XPU stage config. (#2428)
- Expanded distributed diffusion support with HSDP for Qwen-Image-series, Z-Image, and GLM-Image, and added CFG parallel support for HunyuanImage3.0. (#2029, #1751)
- Fixed distributed gather behavior for non-contiguous tensors, improving correctness in CFG-parallel and related distributed paths. (#2367)
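The gather fix matters because collectives that copy flat buffers assume contiguous memory, so a strided view gathers elements in the wrong order. `GroupCoordinator.all_gather` itself is vLLM internals; a NumPy analogue is enough to illustrate the layout pitfall and the ensure-contiguous remedy:

```python
import numpy as np

# A transposed view shares storage with its parent and is not C-contiguous,
# so reading its flat buffer yields the parent's element order, not the
# view's logical order. Making a contiguous copy first (as the #2367 fix
# does before all_gather) restores the expected layout.
x = np.arange(6).reshape(2, 3)
view = x.T                           # strided view, not C-contiguous
fixed = np.ascontiguousarray(view)   # contiguous copy, same logical values

assert not view.flags["C_CONTIGUOUS"]
assert fixed.flags["C_CONTIGUOUS"]
```

The same invariant applies to any framework whose communication primitives operate on raw buffers: normalize layout before handing a tensor to the collective.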
CI, Benchmarks & Documentation
- Refreshed the diffusion documentation structure around feature compatibility, parallelism, cache acceleration, quantization, and serving examples, making the diffusion stack easier to navigate and adopt.
- Expanded CI and E2E coverage for speech, diffusion, and video-serving scenarios, especially around CosyVoice3, Qwen3-TTS benchmarking, and Wan-family validation. (#2431, #2411, #2262)
Note
- v0.19.0rc1 is a release candidate focused on validating the upstream rebase, the refreshed runtime architecture, and the expanded speech/diffusion/platform support before the final v0.19.0 release.
- Some low-signal CI and documentation maintenance changes were intentionally merged into broader themes instead of being listed one-by-one, following the project's recent release-note style.
What's Changed
- [Bugfix][HunyuanImage3.0] Fix default guidance_scale from 1.0 to 4.0 and port GPU MoE ForwardContext fix from NPU by @nussejzz in #2142
- [Feat] support quantization for Flux Kontext by @RuixiangMa in #2184
- [Tests][Qwen3-Omni] Add performance test cases by @amy-why-3459 in #2011
- [Docs] Modify the documentation description for streaming output by @amy-why-3459 in #2300
- Fix: Enable /v1/models endpoint for pure diffusion mode by @majiayu000 in #805
- [skip ci] [Docs]: add CI Failures troubleshooting guide for contributors by @lishunyang12 in #1259
- [Qwen3-Omni][Bugfix] Replace vLLM fused layers with HF-compatible numerics in code predictor by @LJH-LBJ in #2291
- [Feature] [HunyuanImage3] Add TeaCache support for inference acceleration by @nussejzz in #1927
- [Misc] Make gradio an optional dependency and upgrade to >=6.7.0 by @Lidang-Jiang in #2221
- [ROCm] [CI] Migrate to use amd docker hub for ci by @tjtanaa in #2303
- [Feat] add helios fp8 quantization by @lengrongfu in #1916
- [Bugfix] fix: handle Qwen-Image-Layered layered RGBA output for jpeg edits by @david6666666 in #2297
- [Doc] Add transformers version requirement in GLM-Image example doc by @chickeyton in #2265
- [Bugfix] Fix Qwen3TTSConfig init order to be compatible with newer Transformers (5.x) by @RuixiangMa in #2306
- [Test] Add Qwen-tts test cases and unify the style of existing test cases by @yenuo26 in #2195
- [skip ci][Doc] Refine the Diffusion Features User Guide by @wtomin in #1928
- [Bugfix] fix: return 400 for unsupported multi-image edits such as Qwen-Image-Layered by @david6666666 in #2298
- [Bugfix] fix: validate layered image layers range by @david6666666 in #2334
- [skip ci][Docs] reorganize multiple L4 test guidelines by @fhfuih in #2119
- [Diffusion] Refactor CFG parallel for extensibility and performance by @TKONIY in #2063
- Fix Qwen3-TTS Base on NPU running failed by @OrangePure in #2353
- [Test] Fix 4 broken Qwen3-TTS async chunk unit tests by @linyueqian in #2351
- [Test] Add qwen3-omni tests for audio_in_video and one word prompt by @yenuo26 in #2097
- [CI] fix test: use minimum supported layered output count by @david6666666 in #2350
- [CI]test: add wan22 i2v video similarity e2e by @david6666666 in #2262
- [Bugfix] Fix case-sensitivity in Qwen3 TTS speaker name lookup by @reidliu41 in #2358
- Fix Qwen3-TTS gradio demo by @noobHappylife in #2372
- [skip ci] update release 0.18.0 by @hsliuustc0106 in #2380
- [Bugfix] Update Whisper model loading to support multi-GPU configurations and optimize CUDA memory management by @yenuo26 in #2354
- [release] Add nightly wheel release index by @khluu in #2345
- [BugFix] Add BAGEL single-stage diffusion config and fix multiple <im_start><im_end> bug by @princepride in #2381
- [Bugfix] Fix layer-wise offload incompatibility with HSDP by @RuixiangMa in #2021
- [BugFix] qwen3_tts chunk boundary handling logic in initial chunk (IC) by @Fattysand in #2378
- [Feat][Benchmark] Add synchronous video generation endpoint POST /v1/videos/sync for benchmark test by @SamitHuang in #2049
- [Docs] Update WeChat QR code for community support by @david6666666 in #2402
- [CI] [skip ci]Nightly Report Optim by @congw729 in #2406
- [Feature][HunyuanImage3.0] Add cfgP to HunyuanImage3.0 by @nussejzz in #1751
- Fix: ensure input tensor is contiguous in GroupCoordinator.all_gather by @daixinning in #2367
- [Perf] Bagel KV-ready early forwarding and time step consistency for /v1/chat/completions by @natureofnature in #2398
- [Feat] Support step-boundary abort in diffusion by @asukaqaq-s in #1769
- [BugFix]: Fix bagel single-stage img2img fallback to text2img bug by @princepride in #2397
- [Feat] Add MUSA platform support for Moore Threads GPUs by @yeahdongcn in #2337
- Add new committers to governance page by @ywang96 in #2419
- [CI] Tune GPU resources for test by @tjtanaa in #2401
- [Feat] support HSDP for Qwen-image series, Z-Image, GLM-Image by @RuixiangMa in #2029
- [Bugfix] Fix delayed decoding bug for Bagel AR/DIT workflow (L3 test_bagel_img2img error) by @natureofnature in #2422
- [skip ci][Doc] Update RFC template doc by @yuanheng-zhao in #2141
- [Test] Add voice or language test case for Qwen3-omni and Qwen-tts by @yenuo26 in #1844
- [skip ci][Doc] Small fix of Doc by @wtomin in #2400
- [Feat] Add benchmarks for Qwen3-TTS Base/VoiceDesign Model by @JasonJ2021 in #2411
- [CI] [skip ci] Rename & reset timeout mins for nightly L4 tests. by @congw729 in #2251
- [AutoRound] Add offline quantized W4A16 model support by @yiliu30 in #1777
- [Perf] Optimize Wan2.2 rotary embedding by @gcanlin in #2393
- Add VACE support for WAN 2.1 conditional video generation by @tangbinh in #1885
- [skip ci][Bugfix] clean useless log by @R2-Y in #2450
- [Test] Skip tests/e2e/online_serving/test_zimage_expansion.py due to issue #2435 by @zhumingjue138 in #2454
- [Feature] add session based audio streaming input by @Shirley125 in #2208
- Update MRoPE config fallback logic by @vraiti in #2278
- [Docs] Update docs to use vllm-ascend v0.18.0rc1 by @gcanlin in #2453
- [BAGEL] [Feature]: Add thinking mode in Bagel multi-stage serving by @princepride in #2447
- [BugFix][FishSpeech] Fix structured voice clone prefill conditioning by @Sy0307 in #2446
- Refactor StageDiffusionClient and StageEngineCoreClient by @chickeyton in #2006
- [Perf] Skip Wan2.2 cross attn Ulysses SP by @gcanlin in #2459
- [Model] Add two stages inference for model LTX-2 distilled. by @Songrui625 in #2260
- [Cleanup] Replace bare print() with logger and use specific exception types by @Lidang-Jiang in #2228
- [Bugfix] Fix Flux2 Dev Guidance by @alex-jw-brooks in #2433
- [OmniVoice] Add two-stage TTS serving support by @linyueqian in #2463
- [Qwen3TTS] [TTS] [Feat] Refactor voice cache manager by @JuanPZuluaga in #2108
- [CosyVoice3] Add online serving support, fix stage config, and add CI tests by @linyueqian in #2431
- [Rebase] Rebase to vllm v0.19.0 by @tzhouam in #2475
- Voxtral TTS: drop hardcoded CUDA in audio tokenizer; add XPU stage config by @Joshna-Medisetty in #2428
New Contributors
- @chickeyton made their first contribution in #2265
- @TKONIY made their first contribution in #2063
- @OrangePure made their first contribution in #2353
- @noobHappylife made their first contribution in #2372
- @Fattysand made their first contribution in #2378
- @daixinning made their first contribution in #2367
- @yeahdongcn made their first contribution in #2337
- @JasonJ2021 made their first contribution in #2411
- @yiliu30 made their first contribution in #1777
- @tangbinh made their first contribution in #1885
- @vraiti made their first contribution in #2278
- @Songrui625 made their first contribution in #2260
Full Changelog: v0.18.0...v0.19.0rc1