github vllm-project/vllm-omni v0.17.0rc1


Highlights

This release features approximately 70 commits across 72 pull requests from 30+ contributors, including 12 new contributors.

Expanded Model Support

This release significantly expands the supported multimodal model ecosystem:

  • Added support for Helios models and Helios-Mid / Distilled variants (#1604, #1648).
  • Added Hunyuan Image3 AR generation support (#759).
  • Added LTX-2 text-to-video and image-to-video support (#841).
  • Added support for MammothModa2 (#336) and CosyVoice3-0.5B (#498).
  • Improved compatibility and fixes for Qwen3-Omni and LongCat models (#1602, #1485, #1631).

Performance Improvements

Multiple optimizations improve startup time, streaming latency, and runtime efficiency:

  • Accelerated diffusion model startup with multi-threaded weight loading (#1504).
  • Reduced inter-packet latency in async chunking for Qwen3-Omni streaming (#1656).
  • Reduced TTFA (time-to-first-audio) for Qwen3-TTS via flexible initial phases (#1583).
  • Optimized TTS code predictor execution by removing GPU synchronization bottlenecks (#1614).
  • Enabled torch.compile + CUDA Graph for TTS pipelines (#1617).
  • Reduced IPC overhead in single-stage diffusion serving for Wan2.2 (#1715).
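The multi-threaded weight-loading change above (#1504) follows a common pattern: checkpoint shards are independent files, so reading them concurrently lets disk I/O overlap. The sketch below is a minimal illustration of that idea, not the vllm-omni implementation; the shard names and `load_shard` helper are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard list; a real checkpoint maps tensor names to files.
SHARDS = [f"shard_{i}.safetensors" for i in range(8)]

def load_shard(path):
    # Stand-in for reading one weight file from disk (hypothetical).
    return {path: b"weights"}

def load_weights_parallel(shards, workers=4):
    """Load checkpoint shards concurrently so file reads overlap
    instead of running back-to-back on a single thread."""
    state = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves shard order while reads run in parallel.
        for part in pool.map(load_shard, shards):
            state.update(part)
    return state
```

Because weight loading is I/O-bound, threads (rather than processes) are enough to get the overlap.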

Inference Infrastructure & Parallelism

These infrastructure changes improve scalability and flexibility for multimodal serving:

  • Added CFG KV-cache transfer support for multi-stage pipelines (#1422).
  • Added CFG parallel mode for Bagel diffusion models (#1578, #1695).
  • Refactored tile/patch parallelism to simplify support for additional models (#1366).
  • Added VAE patch parallel CLI option for online diffusion serving (#1716).
  • Enabled async chunking for offline inference and configurable chunk parameters (#1415, #1423).
  • Added collective RPC API entrypoint and custom I/O support for RL workloads (#1646).
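Classifier-free guidance (CFG), referenced in the KV-cache transfer and parallel-mode items above, requires two forward passes per denoising step: one conditioned on the prompt and one unconditional. A CFG parallel mode can run the two branches concurrently and combine them with the guidance scale. The sketch below shows the general technique under that assumption; `denoise` is a hypothetical stand-in for a diffusion forward pass, not a vllm-omni API.

```python
from concurrent.futures import ThreadPoolExecutor

def denoise(latents, prompt):
    """Stand-in for one denoising forward pass (hypothetical)."""
    bias = 1.0 if prompt else 0.0
    return [x + bias for x in latents]

def cfg_step(latents, prompt, scale=7.5):
    """Run the conditional and unconditional branches concurrently,
    then extrapolate from the unconditional prediction toward the
    conditional one by the guidance scale."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        cond_f = pool.submit(denoise, latents, prompt)
        uncond_f = pool.submit(denoise, latents, None)
        cond, uncond = cond_f.result(), uncond_f.result()
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]
```

In a real deployment the two branches would run on separate workers or devices, which is where the KV-cache transfer between pipeline stages comes in.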

Text-to-Speech Improvements

Major improvements to the stability and flexibility of the TTS pipeline:

  • Added voice upload API for Qwen3-TTS (#1201).
  • Added flexible task_type configuration for Qwen3-TTS models (#1197).
  • Added non-async chunk mode and improved offline batching support (#1678, #1417).
  • Fixed several stability issues including predictor crashes, all-silence output, and Transformers 5.x compatibility (#1619, #1664, #1536).
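The async chunking items above (streaming latency, configurable chunk parameters, non-async mode) revolve around one idea: emit fixed-size audio chunks as they become available so playback can start before the full utterance is synthesized. A minimal async-generator sketch of that pattern, with a hypothetical `chunk_size` parameter standing in for the configurable chunk settings:

```python
import asyncio

async def stream_audio_chunks(samples, chunk_size=4):
    """Yield fixed-size audio chunks as they become ready, so the
    client can begin playback before synthesis finishes."""
    for i in range(0, len(samples), chunk_size):
        # Hand control back to the event loop between chunks so other
        # requests are not blocked while this stream drains.
        await asyncio.sleep(0)
        yield samples[i:i + chunk_size]
```

A non-async chunk mode would iterate the same slices synchronously, which suits offline batching where streaming latency does not matter.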

Quantization & Hardware Support

  • Added FP8 quantization support for Flux transformers (#1640).
  • Improved NPU support, including MindIE-SD AdaLN compatibility (#1537).
  • Improved device abstraction by replacing hard-coded CUDA generators with platform-aware detection (#1677).
  • Updated XPU container configuration (#1545).
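Per-tensor FP8 quantization, as in the Flux transformer support above, typically picks a scale so the largest weight magnitude maps onto the FP8 dynamic range (448 for the e4m3 format), then stores scaled values. The sketch below illustrates only the scaling and clamping step; real FP8 also rounds the mantissa to the format's precision, and the function names here are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def quantize_fp8(values):
    """Per-tensor quantization sketch: choose a scale so the largest
    magnitude lands at the edge of the FP8 range, then clamp."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]
    return q, scale

def dequantize_fp8(q, scale):
    """Recover approximate original values by undoing the scale."""
    return [v * scale for v in q]
```

The per-tensor scale is stored alongside the quantized weights and applied during the matmul, which is what lets FP8 kernels keep accuracy close to the higher-precision baseline.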

Reliability, Tooling & Developer Experience

  • Added progress bar support for diffusion models (#1652).
  • Introduced benchmark collection and reporting scripts in CI (#1307).
  • Added TTS developer guide and testing documentation (#1693, #1376).
  • Improved API robustness with better error handling and request validation (#1641, #1687).
  • Numerous bug fixes across models, kernels, and configuration handling (#1391, #1566, #1609, #1661).

Full Changelog: v0.16.0...v0.17.0rc1
