Highlights
This release features approximately 70 commits across 72 pull requests from 30+ contributors, including 12 new contributors.
Expanded Model Support
This release significantly expands the supported multimodal model ecosystem:
- Added support for Helios models and Helios-Mid / Distilled variants (#1604, #1648).
- Added Hunyuan Image3 AR generation support (#759).
- Added LTX-2 text-to-video and image-to-video support (#841).
- Added support for MammothModa2 (#336) and CosyVoice3-0.5B (#498).
- Improved compatibility and fixes for Qwen3-Omni and LongCat models (#1602, #1485, #1631).
Performance Improvements
Multiple optimizations improve startup time, streaming latency, and runtime efficiency:
- Accelerated diffusion model startup with multi-threaded weight loading (#1504).
- Reduced inter-packet latency in async chunking for Qwen3-Omni streaming (#1656).
- Reduced TTFA (time-to-first-audio) for Qwen3-TTS via flexible initial phases (#1583).
- Optimized TTS code predictor execution by removing GPU synchronization bottlenecks (#1614).
- Enabled torch.compile + CUDA Graph for TTS pipelines (#1617).
- Reduced IPC overhead in single-stage diffusion serving for Wan2.2 (#1715).
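The code-predictor optimization (#1614) follows a common pattern worth illustrating: calling `.item()` once per element forces a host-device synchronization on every call, while a single batched `.tolist()` transfers all results at once. A minimal sketch of the two shapes (function names are illustrative, not the project's actual code):

```python
import torch

def decode_slow(logits: torch.Tensor) -> list[int]:
    # Anti-pattern: every .item() call is a separate device-to-host
    # transfer, and on GPU each one is a synchronization point.
    ids = logits.argmax(dim=-1)
    return [ids[i].item() for i in range(ids.shape[0])]

def decode_fast(logits: torch.Tensor) -> list[int]:
    # Equivalent result, but a single .tolist() moves the whole
    # tensor to the host in one transfer.
    return logits.argmax(dim=-1).tolist()
```

Both return the same token ids; the difference only shows up as latency when the tensor lives on an accelerator.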
Inference Infrastructure & Parallelism
Infrastructure work improves scalability and flexibility for multimodal serving:
- Added CFG KV-cache transfer support for multi-stage pipelines (#1422).
- Added CFG parallel mode for Bagel diffusion models (#1578, #1695).
- Refactored tile/patch parallelism to simplify support for additional models (#1366).
- Added VAE patch parallel CLI option for online diffusion serving (#1716).
- Enabled async chunking for offline inference and configurable chunk parameters (#1415, #1423).
- Added collective RPC API entrypoint and custom I/O support for RL workloads (#1646).
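The CFG parallel mode (#1578, #1695) runs the conditional and unconditional denoising branches on separate ranks. The standard classifier-free guidance combination those branches feed into can be sketched as follows (this is the general CFG formula, not project-specific code):

```python
def cfg_combine(cond: float, uncond: float, guidance_scale: float) -> float:
    # Classifier-free guidance: start from the unconditional prediction
    # and push it toward (and past) the conditional one by the scale.
    return uncond + guidance_scale * (cond - uncond)
```

With the two branches computed in parallel, only this cheap combination step needs both results, which is why eliminating the broadcast in the denoising loop (#1695) pays off.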
Text-to-Speech Improvements
Major improvements to the stability and flexibility of the TTS pipeline:
- Added voice upload API for Qwen3-TTS (#1201).
- Added flexible task_type configuration for Qwen3-TTS models (#1197).
- Added non-async chunk mode and improved offline batching support (#1678, #1417).
- Fixed several stability issues including predictor crashes, all-silence output, and Transformers 5.x compatibility (#1619, #1664, #1536).
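The all-silence fix (#1664) moves the speech tokenizer decoder to float32. A minimal illustration of why: small intermediate values that survive in float32 can underflow to zero in half precision, and a decoder whose activations underflow produces silent audio. (The example below demonstrates the underflow mechanism only; it is not the decoder's actual computation.)

```python
import torch

# float16's smallest subnormal is about 6e-8, so squaring 1e-4
# underflows to exactly zero in half precision but not in float32.
small = 1e-4
squared_fp16 = torch.tensor([small], dtype=torch.float16) ** 2
squared_fp32 = torch.tensor([small], dtype=torch.float32) ** 2
```

Running the decoder in float32 avoids this class of silent numerical failure at a modest memory cost.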
Quantization & Hardware Support
- Added FP8 quantization support for Flux transformers (#1640).
- Improved NPU support, including MindIE-SD AdaLN compatibility (#1537).
- Improved device abstraction by replacing hard-coded CUDA generators with platform-aware detection (#1677).
- Updated XPU container configuration (#1545).
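FP8 quantization (#1640) typically relies on per-tensor scaling into the narrow e4m3 range, whose maximum representable value is 448. A simplified sketch of the scale-and-clip step (it omits the mantissa rounding a real e4m3 cast performs, and the function names are illustrative):

```python
def quantize_fp8_e4m3(values: list[float], fp8_max: float = 448.0):
    # Per-tensor scale so the largest magnitude maps to the e4m3 max.
    amax = max(abs(v) for v in values)
    scale = amax / fp8_max if amax > 0 else 1.0
    # Scale and clip into representable range (real FP8 also rounds).
    quantized = [max(-fp8_max, min(fp8_max, v / scale)) for v in values]
    return quantized, scale

def dequantize(quantized: list[float], scale: float) -> list[float]:
    return [v * scale for v in quantized]
```

The scale is stored alongside the tensor so matmul outputs can be rescaled back to higher precision.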
Reliability, Tooling & Developer Experience
- Added progress bar support for diffusion models (#1652).
- Introduced benchmark collection and reporting scripts in CI (#1307).
- Added TTS developer guide and testing documentation (#1693, #1376).
- Improved API robustness with better error handling and request validation (#1641, #1687).
- Numerous bug fixes across models, kernels, and configuration handling (#1391, #1566, #1609, #1661).
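The request-validation hardening (#1641) rejects malformed TTS requests before they reach the engine, so a bad payload returns an error instead of crashing a worker. A hypothetical sketch of the pattern (field names and bounds are illustrative assumptions, not the project's actual schema):

```python
def validate_tts_request(req: dict) -> list[str]:
    """Collect problems up front instead of letting bad input crash the engine."""
    errors = []
    text = req.get("text")
    if not isinstance(text, str) or not text.strip():
        errors.append("'text' must be a non-empty string")
    speed = req.get("speed", 1.0)
    if not isinstance(speed, (int, float)) or not 0.25 <= speed <= 4.0:
        errors.append("'speed' must be a number in [0.25, 4.0]")
    return errors
```

An empty error list means the request is safe to enqueue; otherwise the server can return an HTTP 400 with the collected messages, matching the proper-status-code fix in #1687.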
What's Changed
- 0.16.0 release by @ywang96 in #1576
- [Refactor]: Phase1 for rebasing_additional_info by @divyanshsinghvi in #1394
- [Feature]: Support cfg kv-cache transfer in multi-stage by @princepride in #1422
- [BugFix] Fix load_weights error when loading HunyuanImage3.0 by @Semmer2 in #1598
- [Bugfix] fix kernel error for qwen3-omni by @R2-Y in #1602
- [bugfix] Fix unexpected argument 'is_finished' in function llm2code2wav_async_chunk of mimo-audio by @qibaoyuan in #1570
- [Bugfix] Import InputPreprocessor into Renderer by @lengrongfu in #1566
- [Feature][Wan2.2] Speed up diffusion model startup by multi-thread weight loading by @SamitHuang in #1504
- [Bugfix][Model] Fix LongCat Image Config Handling / Layer Creation by @alex-jw-brooks in #1485
- [Bugfix] Fix Qwen3-TTS code predictor crash due to missing vLLM config context by @ZhanqiuHu in #1619
- [Debug] Enable curl retry aligned with openai by @tzhouam in #1539
- [Doc] Fix links in the configuration doc by @yuanheng-zhao in #1615
- [CI] Add scripts for benchmark collection and email distribution. by @congw729 in #1307
- [FEATURE] Tile/Patch parallelism refactor for easily support other models by @Bounty-hunter in #1366
- [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation by @yuanheng-zhao in #1609
- Make chunk_size and left_context_size configurable via YAML for async chunking by @LJH-LBJ in #1423
- [Bugfix] Fix transformers 5.x compat issues in online TTS serving by @linyueqian in #1536
- [Refactor] lora: reuse load_weights packed mapping by @dongbo910220 in #991
- [Model]: support Helios from ByteDance by @princepride in #1604
- [chore] add _repeated_blocks for regional compilation support by @RuixiangMa in #1642
- [Bugfix] Add TTS request validation to prevent engine crashes by @linyueqian in #1641
- [CI] Fix ASCII codes. by @congw729 in #1647
- [Misc] update wechat by @david6666666 in #1649
- docs: Announce vllm-omni-skills community project by @hsliuustc0106 in #1651
- [Model] Add Hunyuan Image3 AR Support by @usberkeley in #759
- [Test][Qwen3-Omni]Modify Qwen3-Omni benchmark test cases by @amy-why-3459 in #1628
- [Bugfix] Fix Dtype Parsing by @alex-jw-brooks in #1391
- [XPU] fix UMD version in docker file by @yma11 in #1545
- add support for MammothModa2 model by @HonestDeng in #336
- [Model] Fun cosy voice3-0.5-b-2512 by @divyanshsinghvi in #498
- [Bugfix] Enable torch.compile for low noise model (transformer_2) by @lishunyang12 in #1541
- [NPU] [Features] [Bugfix] Support mindiesd adaln by @jiangmengyu18 in #1537
- [FP8 Quantization] Add FP8 quantization support for Flux transformer by @zzhuoxin1508 in #1640
- Replace hard-coded cuda generator with current_omni_platform.device_type by @pi314ever in #1677
- [BugFix] Fix LongCat Sequence Parallelism / Small Cleanup by @alex-jw-brooks in #1631
- [Misc] remove logits_processor_pattern this field, because vllm have … by @lengrongfu in #1675
- [CI] Remove high concurrency tests before issue #1374 fixed. by @congw729 in #1683
- [Optimize][Qwen3-Omni] Reduce inter-packet latency in async chunk by @ZeldaHuang in #1656
- [Feat][Qwen3TTS] reduce TTFA with flexible initial phase by @JuanPZuluaga in #1583
- [Model] support LTX-2 text-to-video image-to-video by @david6666666 in #841
- [BugFix] Return proper HTTP status for ErrorResponse in create_speech by @Lidang-Jiang in #1687
- [Doc] Add the test guide document. [skip ci] by @yenuo26 in #1376
- [UX] Add progress bar for diffusion models by @gcanlin in #1652
- [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder by @ZhanqiuHu in #1664
- [Feature] Support flexible task_type configuration for Qwen3-TTS models by @JackLeeHal in #1197
- [Cleanup] Move cosyvoice3 tests to model subdirectory by @linyueqian in #1666
- [Feature][Bagel] Add CFG parallel mode by @nussejzz in #1578
- perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor by @dubin555 in #1614
- [Refactor][Perf] Qwen3-TTS: re-prefill Code Predictor with torch.compile + enable Code2Wav decoder CUDA Graph by @Sy0307 in #1617
- [MiMo-Audio] Bugfix tp lg than 1 by @qibaoyuan in #1688
- Add non-async chunk support for Qwen3-TTS by @linyueqian in #1678
- [1/N][Refactor] Clean up dead code in output processor by @gcanlin in #1579
- [feature]: support flux2.klein cache_dit by @nuclearwu in #1209
- [skip CI][Docs] Add TTS model developer guide by @linyueqian in #1693
- [Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline by @erfgss in #668
- [Feature]: Add vae-patch-parallel CLI argument in online serving by @wtomin in #1716
- Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)" by @gcanlin in #1724
- [CI] Add release-pipeline.yaml. by @congw729 in #1726
- [NPU] Support Helios-Mid / Distilled by @gcanlin in #1648
- [skip ci] Update slides link by @hsliuustc0106 in #1730
- [Bugfix] (qwen3_tts): enable batched offline inference by fixing tens… by @RomanKoshkin in #1417
- [Bugfix] Use upstream MediaConnector for ref_audio resolution by @linyueqian in #1661
- [RL] Support collective rpc api to entrypoint && Support custom input output by @knlnguyen1802 in #1646
- Pre-download Qwen3-TTS model in CI to avoid intermittent download timeouts by @linyueqian in #1727
- [1/N] fix CP for Helios by @SHYuanBest in #1729
- feat(tts): add voice upload API for Qwen3-TTS by @zhaotyer in #1201
- [Bagel] Eliminate broadcast in CFG parallel denoising loop by @nussejzz in #1695
- [Feat]: Offline inference supports async_chunk by @Sy0307 in #1415
- [Bugfix] Allow to enable HSDP alone by @gcanlin in #1567
- Disable mm processor cache in CI stage configs by @linyueqian in #1739
- Dev/rebase v0170 by @tzhouam in #1639
- [Perf] Reduce IPC overhead for single-stage diffusion serving for Wan2.2 by @SamitHuang in #1715
New Contributors
- @lengrongfu made their first contribution in #1566
- @ZhanqiuHu made their first contribution in #1619
- @usberkeley made their first contribution in #759
- @HonestDeng made their first contribution in #336
- @jiangmengyu18 made their first contribution in #1537
- @pi314ever made their first contribution in #1677
- @Lidang-Jiang made their first contribution in #1687
- @JackLeeHal made their first contribution in #1197
- @dubin555 made their first contribution in #1614
- @RomanKoshkin made their first contribution in #1417
- @SHYuanBest made their first contribution in #1729
- @zhaotyer made their first contribution in #1201
Full Changelog: v0.16.0...v0.17.0rc1