github vllm-project/vllm-omni v0.22.0

Highlights

This release features 339 commits from 124 contributors, including 52 new contributors.

vLLM-Omni v0.22.0 is a major world-model, diffusion, and omni-serving release aligned with the vLLM 0.22 release line. It adds Cosmos3 and DreamZero world-model support, broadens speech and multimodal model coverage, and improves production serving across multistage runtime, OpenPI robot serving, diffusion acceleration, quantization, and hardware backends.

Key Improvements

  • Added world-model support, covering Cosmos3 (base model, sound generation, and action modality) and DreamZero integration with CFG parallel and OpenPI serving. (#3454, #4073, #4102, #2162, #3673)
  • Aligned with the vLLM 0.22 release line, including the vLLM 0.21 and 0.22 rebases, dependency compatibility updates, release image builds, and PyPI upload support. (#3530, #3891, #4022, #3428, #3667)
  • Expanded speech and omni model coverage, adding MiniCPM-o 4.5, Lance, MOSS-TTS, GLM-TTS, Higgs Audio v2, Covo-Audio-Chat, and new recipes/deploy profiles for several speech and omni models. (#3642, #4067, #3710, #3420, #3141, #3762, #2293)
  • Improved diffusion acceleration and parallel execution, including Wan2.2 pipeline parallelism, HunyuanImage3 VAE parallelism, HunyuanVideo 1.5 USP/VAE patch parallel, LTX-2.3 CFG parallel, step-wise LoRA, MagCache, CacheDiT coverage, and prompt-embedding caching. (#2322, #3091, #3979, #3905, #3639, #1287, #3265, #3906, #2962)
  • Made audio and TTS serving more production-ready, with Qwen3-TTS, Qwen3-Omni, VoxCPM2, Fish Speech S2 Pro, OmniVoice, async audio input, custom voices, ref-context cache, and high-concurrency improvements. (#3662, #3492, #3322, #3592, #4054, #3882, #3773, #3336, #3614)
  • Expanded quantization and hardware coverage, including Blackwell diffusion attention backends, W4A16, FP8/INT8, MXFP4, MXFP8, ModelOpt mixed FP8/NVFP4, batched ModelOpt FP8, ROCm AITER, Intel XPU, and Ascend NPU updates. (#3353, #3059, #3700, #3902, #3578, #3570, #3782, #3943, #4155, #3079, #3015, #3419, #3511, #2325)

Core Architecture & Runtime

  • Integrated OmniCoordinator into the stage engine pipeline and continued the communication-layer refactor across non-async omni paths, improving multistage orchestration, request routing, and model-runner reuse. (#3569, #2677, #3719, #3476)
  • Hardened stage and diffusion lifecycle behavior with worker dead detection, cleanup fixes, safer subprocess shutdown, SIGINT cleanup for NCCL/ZMQ resources, master-port selection fixes, and diffusion prefetch protection for newer transformers shard-resolution behavior. (#3214, #3494, #3751, #3872, #3803, #4076)
  • Improved request and scheduler correctness through unified diffusion request identity, prefix-cache and token-history fixes, streaming finish reasons, Qwen3-Omni sampling alignment with transformers, and deterministic media-path handling in mixed-modality examples. (#3744, #3665, #3681, #3374, #4137, #3355)
  • Added TrackingArgumentParser and refreshed configuration behavior around recursive engine-arg merging, deploy-config field allowlisting, concrete entrypoint typing, and single-stage/multistage test coverage. (#3369, #3009, #3483, #3139)
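
The recursive engine-arg merging and deploy-config field allowlisting mentioned above can be pictured with a small sketch. This is not the vLLM-Omni implementation: the function names `merge_engine_args` and `filter_allowed`, the config keys, and the error message are all hypothetical and only illustrate the general technique of merging nested overrides and rejecting unknown top-level fields.

```python
from typing import Any, Mapping


def merge_engine_args(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
    """Recursively merge `override` into `base` without mutating either input.

    Hypothetical helper: nested mappings are merged key by key, while any
    other value in `override` simply replaces the value in `base`.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = merge_engine_args(merged[key], value)
        else:
            merged[key] = value
    return merged


def filter_allowed(config: Mapping[str, Any], allowed: set[str]) -> dict[str, Any]:
    """Reject top-level fields that are not on the allowlist (hypothetical)."""
    unknown = sorted(set(config) - allowed)
    if unknown:
        raise ValueError(f"unsupported deploy-config fields: {unknown}")
    return dict(config)


# Illustrative keys only; they are not taken from a real deploy config.
defaults = {"engine_args": {"max_num_seqs": 8, "compilation": {"level": 0, "cache": True}}}
stage_override = {"engine_args": {"compilation": {"level": 3}}}

merged = merge_engine_args(defaults, stage_override)
checked = filter_allowed(merged, allowed={"engine_args"})
```

With a shallow merge, the override's `compilation` mapping would replace the default one and drop `cache`; the recursive version keeps sibling keys that the override does not mention.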

Model Support

  • Added Cosmos3 support across model execution, recipes, tests, and accuracy coverage, including base model support, sound generation, and action modality support. (#3454, #4073, #4102)
  • Added DreamZero world-model integration with CFG parallel, OpenPI serving, deployment configs, online examples, OpenPI client helpers, and source-parity tests. (#2162, #3673)
  • Added or expanded omni and multimodal model support for MiniCPM-o 4.5, Lance, MOSS-TTS, GLM-TTS, Higgs Audio v2, Covo-Audio-Chat, HiDream-I1-Full, Ming-flash-omni-2.0 image generation, SenseNova U1, and Qwen3-Omni Thinker LoRA for RL training. (#3642, #4067, #3710, #3420, #3141, #3762, #2293, #2572, #2875, #3319, #3915)
  • Improved model-family behavior across Qwen-Image, Qwen-Image-Edit, BAGEL, HunyuanImage3, HunyuanVideo 1.5, FLUX.2-dev, LTX-2/LTX-2.3, DreamID-Omni, Helios, Ovis image, MiMo-Audio, and Ming-flash-omni. (#3608, #3219, #3933, #3728, #3857, #3979, #3244, #3621, #3905, #3265, #3470, #3876, #3686, #4080)

Audio, Speech & Omni Production Optimization

  • Optimized Qwen3-TTS for high-concurrency serving with precomputed custom voices, ref-context cache, restored cross-request Code2Wav batching, persistent prompt-embedding helpers, reduced CUDA Graph buckets, and compatibility fixes for newer transformers versions. (#3662, #3492, #3322, #3992, #3932, #3880)
  • Improved Qwen3-Omni performance and correctness with TTFP optimization, sampling alignment with transformers, prefix-cache correctness, long-output correctness tests, torch.compile accuracy fixes, and streaming-input fixes after the v0.22 rebase. (#4054, #4137, #3665, #3539, #3885, #4085)
  • Improved Fish Speech S2 Pro, VoxCPM2, OmniVoice, GLM-TTS, Higgs Audio v2, and MOSS-TTS serving paths through high-concurrency decode work, Triton/CUDA Graph acceleration, voice clone serving, reproducible seeds, nonverbal tags, and broader offline/online examples. (#3773, #3882, #3336, #3668, #3968, #3141, #3762, #3420)
  • Added audio SLO metrics, cross-stage transfer metric families, audio streaming continuity metrics, and per-stage/per-replica metric wrapping for upstream vllm:* metrics. (#3576, #3618)

Diffusion, Image & Video Generation

  • Added and expanded diffusion parallel execution with Wan2.2 pipeline parallelism, HunyuanImage3 VAE parallelism, HunyuanVideo 1.5 USP plus VAE patch parallel, LTX-2.3 CFG parallel, BAGEL VAE parallel, and HunyuanVideo/HunyuanImage3 NPU performance work. (#2322, #3091, #3979, #3905, #3982, #3178)
  • Expanded diffusion acceleration with CacheDiT for Helios, DreamID-Omni, SenseNova U1, and LTX-2; prompt-embedding caching; MagCache; step-wise LoRA; and CacheDiT-related correctness fixes. (#3470, #3265, #3906, #3621, #2962, #1287, #3639, #3219)
  • Improved image and video generation correctness and serving behavior across HunyuanImage3, Qwen-Image, Qwen-Image-Edit, Flux2 Klein, GLM-Image, SD3, SenseNova U1, and /v1/videos, including long-prompt/device fixes and safer bf16 video frame conversion before NumPy output. (#4145, #3933, #4074, #3711, #3717, #3451, #3949, #4114)
  • Improved diffusion serving and benchmark behavior with endpoint routing for image edits, benchmark endpoint naming, output comparison tooling, performance quality gates, and stage-level benchmark statistics. (#3693, #3137, #3175, #3851, #3628)
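
As a rough illustration of the prompt-embedding caching listed above, the sketch below memoizes text-encoder output per prompt with a small LRU policy. It is an assumption-laden toy: `PromptEmbeddingCache`, `fake_text_encoder`, and the eviction policy are hypothetical, and the actual caching in vLLM-Omni may key, store, and evict entries differently.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List


class PromptEmbeddingCache:
    """Hypothetical LRU cache that avoids re-encoding repeated prompts."""

    def __init__(self, encode: Callable[[str], List[float]], max_entries: int = 128):
        self._encode = encode
        self._max_entries = max_entries
        self._entries: "OrderedDict[str, List[float]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, prompt: str) -> List[float]:
        # Hash the prompt so the cache key has a fixed size.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        embedding = self._encode(prompt)
        self._entries[key] = embedding
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return embedding


def fake_text_encoder(prompt: str) -> List[float]:
    """Stand-in for a real text encoder; returns one number per word."""
    return [float(len(word)) for word in prompt.split()]


cache = PromptEmbeddingCache(fake_text_encoder, max_entries=2)
cache.get("a red fox")  # miss: the encoder runs
cache.get("a red fox")  # hit: the stored embedding is reused
```

The benefit in a diffusion pipeline comes from repeated prompts, for example when the same prompt is rendered with several seeds, since the text encoder then runs once per distinct prompt.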

Quantization & Memory Efficiency

  • Added broader diffusion quantization support, including Wan2.2 W4A16, GLM-Image W4A16, LTX-2 online FP8/INT8, DreamID-Omni online FP8/INT8, NPU MXFP4 online/offline quantization, XPU MXFP8, ModelOpt mixed FP8/NVFP4, and batched ModelOpt FP8 serving support. (#3353, #3059, #3700, #3902, #3578, #3782, #3570, #3943, #4155)
  • Added quantization quality and trajectory comparison tooling for diffusion outputs, improved quantization benchmark handling for omni outputs, and expanded quality-gate coverage for FP8 Z-Image and related diffusion tests. (#3175, #3653, #3929)
  • Improved memory and cache behavior through Qwen-Image text encoder cleanup, prompt-embedding cache support, custom pipeline sleep memory release fixes, global CUDA graph pool reuse, BAGEL per-step sync removal, and AR prefix hidden-state CPU staging deduplication. (#3608, #2962, #3818, #3361, #3987, #3734)
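
For readers unfamiliar with the W4A16 label used above, it conventionally means 4-bit weights with 16-bit activations. The sketch below shows one generic way to do weight-only quantization: symmetric rounding of a weight group to signed 4-bit codes with a single scale. It is an assumption for illustration, not the scheme used by the Wan2.2 or GLM-Image changes, which may use different group sizes, zero points, or packing. The helper names are hypothetical.

```python
from typing import List, Tuple

# Signed 4-bit integers cover the range [-8, 7].
INT4_MIN, INT4_MAX = -8, 7


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization of one weight group to int4 codes plus a scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    codes = [max(INT4_MIN, min(INT4_MAX, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Recover approximate weights; activations would stay in 16-bit floats."""
    return [code * scale for code in codes]


weights = [0.70, -0.32, 0.11, 0.0]
codes, scale = quantize_group(weights)
restored = dequantize_group(codes, scale)

# Largest reconstruction error in this group; bounded by half the scale.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in 4 bits plus a shared scale per group, which is where the memory saving over 16-bit weights comes from; the cost is the rounding error captured in `max_error`.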

RL, Serving & Integrations

  • Added DreamZero/OpenPI serving and a realtime OpenPI robot serving API, including online DreamZero examples, OpenPI client helpers, connection tests, and serving tests. (#2162, #3673)
  • Added Qwen3-Omni Thinker LoRA support for RL training and improved custom pipeline argument handling, sleep/wakeup memory behavior, and multistage deployment coverage. (#3915, #2973, #3818, #3610)
  • Improved OpenAI-compatible serving behavior for image edits, speech generation, realtime audio, chat/multistage generation, invalid parameter handling, stream finish reasons, and frontend audio engine errors. (#3693, #2849, #3614, #3374, #3652, #3316)
  • Added Yuanrong TransferEngine connector support for NPU and improved connector/runtime infrastructure around chunk transfer, memory pools, local-rank handling, distributed KV flow, and multi-replica GPU device mapping. (#3180, #3569, #3740, #4132)
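
To make the stream finish-reason item above concrete, here is a minimal client-side sketch that reads OpenAI-style chat-completion stream lines and reports the final `finish_reason`. The sample lines are hand-written in the usual `data: {...}` / `data: [DONE]` shape rather than captured from a vLLM-Omni server, and `collect_stream` is a hypothetical helper name.

```python
import json
from typing import Iterable, Optional, Tuple


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Concatenate streamed text deltas and return the last finish reason seen."""
    text_parts = []
    finish_reason = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            text_parts.append(delta.get("content") or "")
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return "".join(text_parts), finish_reason


# Hand-written sample stream, assumed to follow the OpenAI chunk layout.
sample_stream = [
    'data: {"choices": [{"index": 0, "delta": {"content": "Hel"}, "finish_reason": null}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "lo"}, "finish_reason": null}]}',
    'data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}',
    "data: [DONE]",
]

text, reason = collect_stream(sample_stream)
```

A client that sees the stream end with `reason` still set to `None` can treat the response as truncated, which is why a correct finish reason on the last chunk matters.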

Platforms, Distributed Execution & Hardware Coverage

  • Expanded Blackwell diffusion support with CUDNN attention, FlashInfer attention auto-routing, and SageAttention3 backend support for GB200/B200/RTX 5090/PRO 6000/DGX Spark class systems. (#3079, #3015)
  • Improved ROCm coverage with AITER GroupNorm, AITER backend support for ring attention, and v0.22-era ROCm CI fixes. (#3419, #3511, #3946)
  • Improved Intel XPU coverage with CosyVoice3 support, MXFP8 support through the vLLM main-repo method, diffusion attention defaults, Docker/CI updates, v0.22 rebase fixes, and Wan2.2 S2V RoPE/cache_dit optimization. (#2325, #3782, #3525, #3675, #4059, #4062)
  • Improved Ascend NPU coverage with Wan2.2 MXFP4 quantization, HunyuanImage3 FA-FP8, GLM-Image stage configs and HCCL runtime environment fixes, Yuanrong connector support, sampler/runtime fixes, and v0.22 ModelRunner updates. (#3578, #3540, #3235, #3180, #3517, #4130)

CI, Benchmarks & Documentation

  • Unified the release pipeline around a NIGHTLY=1 option, added x86_64/aarch64 image builds, enabled twine upload to PyPI, refreshed Docker bases, and updated CUDA/ROCm/XPU installation docs for the current release line. (#3428, #3667, #3859, #4059)
  • Added or improved reliability, invalid-parameter, nightly parity, accuracy, and performance coverage for Cosmos3, DreamZero, HunyuanImage3, HunyuanVideo 1.5, GLM-Image, BAGEL, VoxCPM2, Qwen3-Omni, Wan2.2, MOSS-TTS, and multistage deployment. (#3454, #2162, #3790, #3852, #3451, #2175, #4055, #3729, #4097, #3610)
  • Improved benchmarking and observability infrastructure with audio SLOs, cross-stage transfer metrics, modality metrics, Prometheus/stat-logger tests, audio-streaming continuity metrics, diffusion benchmark endpoint routing, optional baseline assertions, and repo-wide benchmark documentation. (#3576, #3618, #3693, #3695, #1939)
  • Refreshed docs and recipes for quantization, diffusion performance, CosyVoice3 online serving, GLM-Image, Helios, Qwen Image Edit, VACE, MiniCPM-o 4.5, Lance, MOSS-TTS, VoxCPM2, Cosmos3, and CUDA image commands. (#3764, #3851, #3748, #2950, #3114, #3684, #3584, #4067, #3710, #3420, #3850, #3454, #3836)

Full Changelog: v0.20.0...v0.22.0
