github vllm-project/vllm-omni v0.17.0rc1


Highlights

This release features approximately 70 commits across 72 pull requests from 30+ contributors, including 12 new contributors.

Expanded Model Support

This release significantly expands the supported multimodal model ecosystem:

  • Added support for Helios models and Helios-Mid / Distilled variants (#1604, #1648).
  • Added Hunyuan Image3 AR generation support (#759).
  • Added LTX-2 text-to-video and image-to-video support (#841).
  • Added support for MammothModa2 (#336) and CosyVoice3-0.5B (#498).
  • Improved compatibility and fixes for Qwen3-Omni and LongCat models (#1602, #1485, #1631).

Performance Improvements

Multiple optimizations improve startup time, streaming latency, and runtime efficiency:

  • Accelerated diffusion model startup with multi-threaded weight loading (#1504).
  • Reduced inter-packet latency in async chunking for Qwen3-Omni streaming (#1656).
  • Reduced TTFA (time-to-first-audio) for Qwen3-TTS via flexible initial phases (#1583).
  • Optimized TTS code predictor execution by removing GPU synchronization bottlenecks (#1614).
  • Enabled torch.compile + CUDA Graph for TTS pipelines (#1617).
  • Reduced IPC overhead in single-stage diffusion serving for Wan2.2 (#1715).
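The multi-threaded weight-loading change above (#1504) follows a common pattern: checkpoint shards are independent files, so reading them concurrently lets disk I/O overlap. The sketch below is a minimal illustration of that idea, not the vllm-omni implementation; the shard names and `load_shard` helper are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard list; a real checkpoint maps tensor names to files.
SHARDS = [f"shard_{i}.safetensors" for i in range(8)]

def load_shard(path):
    # Stand-in for reading one weight file from disk (hypothetical).
    return {path: b"weights"}

def load_weights_parallel(shards, workers=4):
    """Load checkpoint shards concurrently so file reads overlap
    instead of running back-to-back on a single thread."""
    state = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves shard order while reads run in parallel.
        for part in pool.map(load_shard, shards):
            state.update(part)
    return state
```

Because weight loading is I/O-bound, threads (rather than processes) are enough to get the overlap.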

Inference Infrastructure & Parallelism

These infrastructure changes improve scalability and flexibility for multimodal serving:

  • Added CFG KV-cache transfer support for multi-stage pipelines (#1422).
  • Added CFG parallel mode for Bagel diffusion models (#1578, #1695).
  • Refactored tile/patch parallelism to simplify support for additional models (#1366).
  • Added VAE patch parallel CLI option for online diffusion serving (#1716).
  • Enabled async chunking for offline inference and configurable chunk parameters (#1415, #1423).
  • Added collective RPC API entrypoint and custom I/O support for RL workloads (#1646).
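Classifier-free guidance (CFG), referenced in the KV-cache transfer and parallel-mode items above, requires two forward passes per denoising step: one conditioned on the prompt and one unconditional. A CFG parallel mode can run the two branches concurrently and combine them with the guidance scale. The sketch below shows the general technique under that assumption; `denoise` is a hypothetical stand-in for a diffusion forward pass, not a vllm-omni API.

```python
from concurrent.futures import ThreadPoolExecutor

def denoise(latents, prompt):
    """Stand-in for one denoising forward pass (hypothetical)."""
    bias = 1.0 if prompt else 0.0
    return [x + bias for x in latents]

def cfg_step(latents, prompt, scale=7.5):
    """Run the conditional and unconditional branches concurrently,
    then extrapolate from the unconditional prediction toward the
    conditional one by the guidance scale."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        cond_f = pool.submit(denoise, latents, prompt)
        uncond_f = pool.submit(denoise, latents, None)
        cond, uncond = cond_f.result(), uncond_f.result()
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]
```

In a real deployment the two branches would run on separate workers or devices, which is where the KV-cache transfer between pipeline stages comes in.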

Text-to-Speech Improvements

Major improvements to the stability and flexibility of the TTS pipeline:

  • Added voice upload API for Qwen3-TTS (#1201).
  • Added flexible task_type configuration for Qwen3-TTS models (#1197).
  • Added non-async chunk mode and improved offline batching support (#1678, #1417).
  • Fixed several stability issues including predictor crashes, all-silence output, and Transformers 5.x compatibility (#1619, #1664, #1536).
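The async chunking items above (streaming latency, configurable chunk parameters, non-async mode) revolve around one idea: emit fixed-size audio chunks as they become available so playback can start before the full utterance is synthesized. A minimal async-generator sketch of that pattern, with a hypothetical `chunk_size` parameter standing in for the configurable chunk settings:

```python
import asyncio

async def stream_audio_chunks(samples, chunk_size=4):
    """Yield fixed-size audio chunks as they become ready, so the
    client can begin playback before synthesis finishes."""
    for i in range(0, len(samples), chunk_size):
        # Hand control back to the event loop between chunks so other
        # requests are not blocked while this stream drains.
        await asyncio.sleep(0)
        yield samples[i:i + chunk_size]
```

A non-async chunk mode would iterate the same slices synchronously, which suits offline batching where streaming latency does not matter.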

Quantization & Hardware Support

  • Added FP8 quantization support for Flux transformers (#1640).
  • Improved NPU support, including MindIE-SD AdaLN compatibility (#1537).
  • Improved device abstraction by replacing hard-coded CUDA generators with platform-aware detection (#1677).
  • Updated XPU container configuration (#1545).
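Per-tensor FP8 quantization, as in the Flux transformer support above, typically picks a scale so the largest weight magnitude maps onto the FP8 dynamic range (448 for the e4m3 format), then stores scaled values. The sketch below illustrates only the scaling and clamping step; real FP8 also rounds the mantissa to the format's precision, and the function names here are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def quantize_fp8(values):
    """Per-tensor quantization sketch: choose a scale so the largest
    magnitude lands at the edge of the FP8 range, then clamp."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]
    return q, scale

def dequantize_fp8(q, scale):
    """Recover approximate original values by undoing the scale."""
    return [v * scale for v in q]
```

The per-tensor scale is stored alongside the quantized weights and applied during the matmul, which is what lets FP8 kernels keep accuracy close to the higher-precision baseline.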

Reliability, Tooling & Developer Experience

  • Added progress bar support for diffusion models (#1652).
  • Introduced benchmark collection and reporting scripts in CI (#1307).
  • Added TTS developer guide and testing documentation (#1693, #1376).
  • Improved API robustness with better error handling and request validation (#1641, #1687).
  • Numerous bug fixes across models, kernels, and configuration handling (#1391, #1566, #1609, #1661).

Full Changelog: v0.16.0...v0.17.0rc1
