github vllm-project/vllm-omni v0.18.0

Highlights

This release features 324 commits from 83 contributors, including 38 new contributors.

vLLM-Omni v0.18.0 is a major rebase and systems release that aligns the project with upstream vLLM v0.18.0, strengthens the core runtime through a large entrypoint refactor and scheduler/runtime cleanups, expands unified quantization and diffusion execution, broadens multimodal model coverage, and improves production readiness across audio, omni, image, video, RL, and multi-platform deployments.

Key Improvements

  • Rebased to upstream vLLM v0.18.0, with follow-up updates to docs and dockerfiles, plus cleanup of patches that were no longer needed after the rebase. (#2037, #2038, #2062, #2271)
  • Refactored the serving entrypoint architecture, making the stack cleaner and easier to extend, while also laying groundwork for PD disaggregation, multimodal output decoupling, coordinator-based orchestration, and pipeline config cleanup. (#1908, #1863, #1816, #1465, #1115)
  • Strengthened audio, speech, and omni production serving, especially for Qwen3-TTS, Qwen3-Omni, MiMo-Audio, Fish Speech S2 Pro, and Voxtral TTS, with lower latency, better concurrency, more robust streaming, and improved online serving stability. (#1583, #1617, #1797, #1913, #1985, #1852, #1656, #1963, #2009, #2019, #2239, #1688, #1752, #1964, #2225, #1859, #2145, #2151, #2156, #2158)
  • Delivered substantial diffusion optimization, with scheduler/executor refactoring, faster startup, better cache-dit / TeaCache integration, broader TP/SP/HSDP support, and multiple correctness fixes for online and offline serving. (#1625, #1504, #1715, #1834, #1848, #1234, #2163, #1979, #2101, #2176)
  • Expanded model support across omni, speech, image, and video, including Helios, Helios-Mid / Distilled, MammothModa2, Fun CosyVoice3-0.5B-2512, FLUX.2-dev, FLUX.1-Kontext-dev, Hunyuan Image3 AR, Fish Speech S2 Pro, Voxtral TTS, DreamID-Omni, LTX-2, and HunyuanVideo-1.5. (#1604, #1648, #336, #498, #1629, #561, #759, #1798, #1803, #1855, #841, #1516)
  • Introduced a unified quantization framework and expanded quantization support across diffusion and image workloads, including INT8, FP8, and GGUF-related enablement. (#1764, #1470, #1640, #1755, #1473, #2180)
  • Improved RL and custom pipeline readiness in close collaboration with verl, helping enable Qwen-Image end-to-end RL / Flow-GRPO training. This includes collective RPC support at the entrypoint, custom input/output support, async batching for Qwen-Image, and dedicated E2E coverage for custom RL pipelines. (#1646, #1593, #2005, #2217)

Core Architecture & Runtime

  • Reworked the core serving architecture through the vLLM-Omni Entrypoint Refactoring, while also adding PD disaggregation scaffolding, coordinator support, multimodal output decoupling foundations, and cleaner model/pipeline configuration handling. (#1908, #1863, #1465, #1816, #1115, #1958, #2105)
  • Continued cleanup of runtime internals with stage/step pipeline refactors, dead-code cleanup, and improvements to async engine robustness and scheduler state handling. (#1368, #1579, #2153, #2028, #1893)

Model Support

  • Omni / speech / audio models: added or expanded support for MammothModa2, Fun CosyVoice3-0.5B-2512, Fish Speech S2 Pro, and Voxtral TTS. (#336, #498, #1798, #1803)
  • Image / diffusion models: added or expanded support for Hunyuan Image-3.0, FLUX.2-dev, FLUX.1-Kontext-dev, and continued improvements for Qwen-Image, Qwen-Image-Edit, Qwen-Image-Layered, LongCat-Image, GLM-Image, Bagel, and OmniGen2. (#759, #1629, #561, #1682, #2085, #1970, #2035, #1918, #1578, #1669, #1903, #1711, #1934)
  • Video models: added or expanded support for Helios, Helios-Mid / Distilled, DreamID-Omni, LTX-2, HunyuanVideo-1.5, and updated supported video-generation coverage for Wan2.1-T2V. (#1604, #1648, #1855, #841, #1516, #1920)

Audio, Speech & Omni Production Optimization

  • Qwen3-TTS received major optimization work, including lower TTFA (time to first audio), better high-concurrency throughput, improved Code Predictor / Code2Wav execution, websocket streaming audio output, async scheduling by default, voice upload support, optional ref_text, and long ref_audio handling fixes. (#1583, #1617, #1797, #1913, #1985, #1852, #1719, #1853, #1201, #1879, #2046, #2104)
  • Qwen3-Omni gained lower inter-packet latency, speaker-switching support, decode-alignment fixes, and multiple correctness fixes for answer quality and online serving stability. (#1656, #1963, #2009, #2019, #2239)
  • MiMo-Audio improved compatibility and production robustness with TP fixes, broader attention backend support, configurable chunk sizing, and documentation to prevent noise-only outputs under unsupported attention setups. (#1688, #1752, #1964, #2225, #2205)
  • Fish Speech S2 Pro and Voxtral TTS were productionized further with online serving, voice cloning, better TTFP / inference performance, multilingual demo support, lighter flow matching, and voice-embedding fixes. (#1798, #1859, #2145, #1803, #2045, #2056, #2067, #2151, #2156, #2158, #2023)
  • Added or improved speech-serving interfaces, including speech batch entrypoint, speaker embedding support for speech and voices APIs, proper HTTP status handling, and streaming wav response support. (#1701, #1227, #1687, #1819)
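
As a rough illustration of the streaming speech interfaces described above, the sketch below assembles an OpenAI-style /v1/audio/speech request body. The endpoint path, field names, and model id are assumptions for illustration, not vLLM-Omni's confirmed schema.

```python
# Hypothetical sketch of a TTS request body for an OpenAI-compatible
# /v1/audio/speech endpoint; all field names and values below are
# assumptions, not vLLM-Omni's documented API.
import json


def build_speech_request(text: str, voice: str = "default",
                         stream: bool = True) -> dict:
    """Assemble a speech-synthesis request payload (hypothetical fields)."""
    return {
        "model": "Qwen3-TTS",      # assumed model id
        "input": text,             # text to synthesize
        "voice": voice,            # speaker / voice preset
        "stream": stream,          # request streaming output
        "response_format": "wav",  # pairs with the streaming wav support above
    }


payload = build_speech_request("Hello from vLLM-Omni")
body = json.dumps(payload)
```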

Diffusion, Image & Video Generation

  • Runtime refactor & benchmarking: Refactored the diffusion runtime with cleaner scheduler/executor boundaries, better request-state flow, unified profiling, and stronger benchmarking infrastructure. (#1625, #2099, #1757, #1917, #1995)
  • Performance & startup gains: Improved diffusion performance through multi-threaded weight loading for Wan2.2, reduced IPC overhead for single-stage serving, cache-dit upgrades, TeaCache support, and nightly performance improvements for Qwen-Image. (#1504, #1715, #1834, #1234, #1314, #1805, #2111)
  • Distributed scaling: Expanded distributed diffusion execution with broader TP/SP/HSDP support across Flux, GLM-Image, Hunyuan, and Bagel. (#1250, #1900, #1918, #2163, #1903)
  • Serving UX & API ergonomics: Improved serving usability with a progress bar for diffusion models, richer image-edit parameters such as layers and resolution, and extra request-body support for video APIs. (#1652, #2053, #1955)
  • Correctness & stability fixes: Fixed a wide range of diffusion correctness issues, including config misalignment between offline and online inference, TP/no-seed broken-image issues, GLM-Image stage/device bugs, and TeaCache incompatibilities. (#1979, #2176, #2137, #2101, #1894, #2025)
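
To make the richer image-edit parameters concrete, here is a minimal sketch of a request body carrying the layers and resolution options mentioned above; the overall schema and model id are assumptions for illustration, not the documented API.

```python
# Hypothetical image-edit request sketch; the "layers" and "resolution"
# parameter names come from the release notes, but the schema around them
# is an assumption, not vLLM-Omni's confirmed API.
def build_image_edit_request(prompt: str, image_b64: str,
                             layers: int = 1,
                             resolution: str = "1024x1024") -> dict:
    """Assemble an image-edit payload (hypothetical field names)."""
    width, height = (int(v) for v in resolution.split("x"))
    return {
        "model": "Qwen-Image-Edit",  # assumed model id
        "prompt": prompt,
        "image": image_b64,          # base64-encoded source image
        "layers": layers,            # layered-edit count (assumed semantics)
        "width": width,
        "height": height,
    }


req = build_image_edit_request("add a red hat", "<base64>", layers=2,
                               resolution="768x1024")
```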

Quantization & Memory Efficiency

  • Added the Unified Quantization Framework as a core infrastructure upgrade for more consistent quantized execution across model families. (#1764)
  • Expanded quantization support for diffusion/image workloads, including INT8 for DiT (Z-Image and Qwen-Image), FP8 for Flux transformers, and GGUF adapter support for Qwen-Image. (#1470, #1640, #1755)
  • Improved compatibility between quantization and runtime features such as CPU offload, tensor parallelism, and Flux-family execution. (#1473, #1723, #1978, #2180)
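
As a hedged sketch of how the expanded quantization options might be selected at serve time, assuming a vLLM-style CLI (the flag spelling follows upstream vLLM's --quantization convention; the exact vllm-omni flags and model ids are assumptions):

```shell
# Hypothetical launch commands; --quantization follows upstream vLLM's
# convention and is an assumption for vllm-omni's diffusion workloads.

# FP8 for a Flux-family transformer (assumed model id):
vllm serve black-forest-labs/FLUX.1-dev --quantization fp8

# INT8 DiT for Qwen-Image (assumed flag value):
vllm serve Qwen/Qwen-Image --quantization int8
```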

RL, Serving & Integrations

  • verl collaboration & Qwen-Image E2E RL: Expanded RL-oriented serving in close collaboration with verl, helping enable Qwen-Image end-to-end RL / Flow-GRPO training with collective RPC support, custom input/output, async batching for Qwen-Image, and dedicated E2E CI coverage for custom RL pipelines. (#1646, #1593, #2005, #2217)
  • Rollout scaling for visual RL: Added rollout building blocks referenced by verl’s Qwen-Image integration plan, including async batching for Qwen-Image plus tensor-parallel and data-parallel support for diffusion serving. (#1593, #1713, #1706)
  • Deployment & ecosystem integrations: Improved deployment and ecosystem integration with a Helm chart for Kubernetes, ComfyUI video & LoRA support, and a rewritten async video API lifecycle. (#1337, #1596, #1665)
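
For the Kubernetes deployment path, a minimal Helm invocation might look like the following; the chart location, release name, and values keys are assumptions, since the notes only state that a Helm chart was added (#1337).

```shell
# Hypothetical: install the vLLM-Omni Helm chart from a local checkout.
# Chart path and values keys are assumptions for illustration.
helm install vllm-omni ./helm/vllm-omni \
  --set image.tag=v0.18.0 \
  --namespace vllm-omni --create-namespace
```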

Platforms, Distributed Execution & Hardware Coverage

  • Continued improving portability across CUDA, ROCm, NPU, and XPU/Intel GPU environments, including rebase follow-ups, ROCm CI setup, Intel CI dispatch, Intel GPU docs, and NPU docker/docs refreshes. (#2017, #1984, #1721, #2154, #2271, #2091)
  • Expanded distributed execution coverage with T5 tensor parallelism, more model-level TP/SP/HSDP support, and better handling of visible GPUs and stage-device initialization. (#1881, #1250, #1900, #1918, #2163, #2025)

CI, Benchmarks & Documentation

  • Strengthened release engineering and CI with a release pipeline, richer nightly benchmark/report generation, L3/L4/L5 test layering, expanded model E2E coverage, and stronger diffusion test coverage. (#1726, #1831, #1995, #1514, #1799, #2086, #1869, #2085, #2087, #2132, #2129, #2023)
  • Improved benchmarking with Qwen3-TTS benchmark scripts, nightly Qwen3-TTS and Qwen-Image performance tracking, diffusion timing, random benchmark datasets, and T2I/I2I accuracy benchmark integration. (#1573, #1700, #1805, #2111, #1757, #1657, #1917)
  • Refreshed project docs across installation, omni/TTS docs, diffusion serving parameters, UAA documentation, developer guides, and governance. (#1762, #1693, #2051, #2130, #2148, #1889)

Note

  • GLM-Image requires manually upgrading the transformers package to version >= 5.0.
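
The note above translates to a one-line upgrade, e.g.:

```shell
# Upgrade transformers to satisfy GLM-Image's requirement (>= 5.0).
pip install --upgrade "transformers>=5.0"
```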

What's Changed

New Contributors

Full Changelog: v0.16.0...v0.18.0
