vllm-project/vllm-omni v0.20.0


Highlights

This release features 466 commits from 125 contributors since v0.18.0.

vLLM-Omni v0.20.0 is a major production release aligned with upstream vLLM v0.20.0. It refreshes the serving/runtime stack for large-scale omni workloads, expands model coverage across speech, omni, image, audio, and video generation, and improves performance, quantization, and hardware readiness across CUDA, ROCm, MUSA, NPU, and XPU backends.

Key Improvements

  • Rebased to upstream vLLM v0.20.0, with CUDA 13.0 and PyTorch 2.11 alignment, Transformers 5.x compatibility fixes, removal of the old vLLM entrypoint hijack, and runtime changes needed for the 0.20.0 integration path. (#3232, #3082, #3352, #3393, #2306)
  • Large-scale serving for Qwen3-Omni, with Qwen3-Omni performance optimization, CUDA graph support for the Code2Wav decoder, async/sync autoregressive scheduling, multi-stage deployment support, and expanded long-audio/video and performance validation (a minimal client sketch follows this list). (#3203, #2376, #3306, #2396, #2598, #2600)
  • CLI and configuration refactor, including the stage CLI refactor, forwarding CLI tokenizer settings into per-stage engine configs, removal of legacy Omni CLI helpers, cleaner deploy/pipeline config migration, and updated CLI documentation. (#2020, #3120, #3144, #2383, #2978)
  • Expanded quantization coverage, including AutoRound W4A16 support for Qwen Omni, offline W4A16 quantized model support, OmniGen2 FP8, Z-Image text-encoder FP8 online quantization, HunyuanImage3 NPU quantization, GLM-Image quantization, and fixes for pre-quantized checkpoints. (#2670, #1777, #2441, #3279, #2979, #2292, #2702, #2795)
  • TTS model speedups and production fixes, improving VoxCPM2, Qwen3-TTS/Qwen-TTS, MiMo Audio, Fish Speech, and Voxtral TTS through CUDA graph reuse, native decoder construction, global speaker/reference-audio caches, streaming VAE optimization, memory-pool sharing, and deterministic sampling fixes. (#2758, #2803, #2341, #2630, #2657, #2520, #2386, #3350)
  • Hardware plugin and platform optimization, expanding MUSA flash attention and torch.accelerator support, aligning NPU with the v0.20.0/GPU model-runner path, restoring ROCm/AMD CI signal, and refreshing XPU Docker/CI readiness for the PyTorch 2.11 stack. (#2451, #3101, #3325, #3343, #3083, #3393)
  • More SOTA model support, including Ming-flash-omni-2.0, XiaomiMiMo/MiMo-V2.5-ASR, MOSS-TTS-Nano, VoxCPM2 native AR TTS, HunyuanImage-3.0 IT2I, ERNIE image T2I, AudioX, Wan2.2-S2V, DreamID-Omni HSDP, LTX-2.3, and FastGen Wan 2.1 pipelines. (#2890, #3089, #2753, #2658, #3107, #2861, #2077, #2751, #3138, #2893, #2749)
  • Diffusion dynamic step-level batching, adding async batch inference in the DiffusionEngine and strengthening step-level/diffusion serving paths with pipeline-declared offload modules, CFG/HSDP improvements, VAE tiling, and performance validation. (#2729, #2707, #2427, #2423, #2368, #2899, #2982)
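
To ground the Qwen3-Omni serving item above, here is a minimal client sketch against an OpenAI-compatible vLLM-Omni server. The base URL, model ID, and audio-output fields are illustrative assumptions, not a verified contract; see the updated CLI documentation (#2978) for the actual serve command and supported parameters.

```python
from openai import OpenAI

# Assumes a vLLM-Omni server is already running at this address and
# serving a Qwen3-Omni checkpoint; both values are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Omni",                      # placeholder model ID
    modalities=["text", "audio"],                 # request speech alongside text, if supported
    audio={"voice": "default", "format": "wav"},  # OpenAI-style audio params; server support is an assumption
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
)
print(resp.choices[0].message)
```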

Core Architecture & Runtime

  • Rebased vLLM-Omni to upstream vLLM v0.20.0 and removed the legacy vLLM entrypoint hijack used before the 0.20.0 integration path. (#3232, #3082)
  • Refreshed the runtime and stage lifecycle with the stage CLI refactor, omni sleep mode and acknowledgement protocol, coordinator reconnect/race/heartbeat fixes, stage launch-lock handling, multi-stage deployment support, and user-configurable stage initialization timeout behavior. (#2020, #2022, #1899, #2717, #2396, #2519)
  • Improved request/runtime configuration behavior by preserving media access arguments, forwarding CLI tokenizer values to stage configs, removing invalid LLM-only diffusion stage args, centralizing stage sampling parameter resolution, and guarding silent config failures when trust_remote_code is unset (see the sketch after this list). (#2956, #3120, #2622, #3153, #3241)
  • Added async/sync autoregressive scheduling and lower-level runtime fixes for inline health checks, GPU-buffer accessors, stage port allocation, engine failure reporting, and async scheduler transfer behavior. (#3306, #3052, #2068, #3333, #2426, #3318)
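
On the trust_remote_code guard above: omni checkpoints often ship custom modeling code, which requires an explicit opt-in, and a missing flag should now fail loudly instead of silently producing a wrong config. A minimal sketch at the Hugging Face layer the engine wraps (the model ID is a placeholder):

```python
from transformers import AutoConfig

# Custom-code checkpoints need an explicit trust_remote_code opt-in;
# without it, loading should fail loudly rather than fall back to a
# mismatched built-in config.
config = AutoConfig.from_pretrained(
    "example-org/example-omni-model",  # placeholder model ID
    trust_remote_code=True,
)
```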

Model Support

  • Added or expanded omni and speech model coverage for Ming-flash-omni-2.0, Dynin-omni, InternVLA-A1, MagiHuman, XiaomiMiMo/MiMo-V2.5-ASR, MOSS-TTS-Nano, VoxCPM2 native AR TTS, and OmniVoice voice cloning. (#2890, #1759, #2737, #2301, #3089, #2753, #2658, #2676)
  • Expanded image and video model support with HunyuanImage-3.0 IT2I, ERNIE image T2I, AudioX, Wan2.2-S2V, Wan2.2-I2V-A14B, FastGen DMD2-distilled Wan 2.1, LTX-2.3, DreamID-Omni HSDP, LTX-2 HSDP, and Stable-Audio-Open HSDP. (#3107, #2861, #2077, #2751, #2134, #2749, #2893, #3138, #2899, #2982)
  • Improved GLM-Image, HunyuanImage3, FLUX, Z-Image, and Ming/Ming-TTS behavior with deploy/pipeline config migration, benchmark and accuracy fixes, multi-stage serving fixes, image-to-image support, and quantization support. (#2977, #3024, #3084, #3373, #3243, #1580, #2292, #3154)

Audio, Speech & Omni Production Optimization

  • Improved Qwen3-Omni and Qwen3-TTS serving with Qwen3-Omni performance optimization, Code2Wav CUDA graph support, native Code2Wav decoder construction, streaming input fixes, speaker embedding validation, deterministic Fast AR seed propagation, and max-token mapping fixes. (#3203, #2376, #2341, #3396, #3191, #3350, #3217)
  • Reduced TTS and audio memory/latency overhead by freeing unused Qwen3-TTS and Fish Speech decoder/codec components, enabling Fish Speech CUDA graph capture and reference-audio caching, sharing CUDA graph memory pools, and optimizing VoxCPM2 streaming VAE/compile and manual CUDA graph paths. (#2429, #2430, #2520, #2609, #2386, #2758, #2803)
  • Expanded voice and speech API behavior with speaker as a voice alias, chat-completions support for both voice and speaker, raw-audio VoxCPM2 voice cloning, OmniVoice voice cloning, global speaker cache management, and speaker validation/case-insensitive lookup (a TTS client sketch follows this list). (#2424, #3248, #2720, #2676, #2630, #2407)
  • Migrated VoxCPM2, CosyVoice3, MiMo Audio, Voxtral TTS, Fish Speech S2 Pro, Ming, and Ming-TTS to the Pipeline + Deploy schema, added a universal TTS benchmark, and consolidated per-model TTS docs into a TTS hub. (#2958, #3154, #2835, #3234, #3358)
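
As referenced in the voice/speaker item above, a minimal TTS client sketch against an OpenAI-compatible speech endpoint. The base URL, model ID, and voice name are placeholders; the speaker-alias behavior in the comment reflects this release's notes, not a verified API surface.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# /v1/audio/speech-style request. Per this release, "speaker" is also
# accepted as an alias for "voice", with case-insensitive lookup.
with client.audio.speech.with_streaming_response.create(
    model="openbmb/VoxCPM2",       # placeholder TTS model ID
    voice="alloy",                 # placeholder voice/speaker name
    input="Hello from vLLM-Omni.",
) as resp:
    resp.stream_to_file("out.wav")
```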

Diffusion, Image & Video Generation

  • Added dynamic step-level batching with DiffusionEngine async batch inference and continued step-level execution/performance validation for Qwen-Image and diffusion pipelines (see the client sketch after this list). (#2729, #2707)
  • Expanded generation capabilities with Z-Image image-to-image, FLUX.1/FLUX.2 TeaCache, FLUX.2-dev CFG parallel, HunyuanImage3 TeaCache/IT2I, generalized diffusers adapter backend support, and profiler/progress tooling for diffusion pipelines. (#1580, #2774, #1871, #2010, #1927, #3107, #2724, #2489)
  • Improved video generation with Wan2.2 BF16 VAE conversion, fused RMSNorm/AdaLayerNorm and NPU fused RMSNorm paths, LightX2V offline conversion, reduced duplicate preprocessing, MP4 encoding latency optimization, Wan2.2-S2V support, and FastGen Wan 2.1 pipeline support. (#2391, #2583, #2585, #3067, #2134, #2963, #2735, #2751, #2749)
  • Strengthened distributed and memory-efficient diffusion with VAE tiling parallel encode, unified CFG parallel support, 3/4-branch CFG dispatch, pipeline-declared offloadable modules, layerwise offload for additional diffusion models, HSDP coverage, and inline execution for single-stage diffusion. (#2368, #2160, #2423, #2427, #2339, #2734, #2899, #2982, #2736)
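
To illustrate the dynamic step-level batching item at the top of this list: concurrent requests can now be batched per denoising step rather than serialized as whole runs, so a client benefits simply by issuing requests in parallel. A minimal sketch against an OpenAI-compatible images endpoint; the base URL and model ID are placeholders.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def main() -> None:
    prompts = ["a red fox in snow", "a lighthouse at dusk", "a paper crane"]
    # Parallel requests give the DiffusionEngine the chance to batch
    # them step-by-step instead of running each denoising loop alone.
    results = await asyncio.gather(
        *(client.images.generate(model="Qwen/Qwen-Image", prompt=p) for p in prompts)
    )
    for r in results:
        print(len(r.data), "image(s) returned")

asyncio.run(main())
```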

Quantization & Memory Efficiency

  • Added or improved quantization coverage for Qwen Omni W4A16 via AutoRound, offline W4A16 quantized models, OmniGen2 FP8, Z-Image text-encoder FP8, HunyuanImage3 NPU quantization, GLM-Image quantization, Flux Kontext quantization, and Helios FP8 (an offline quantization sketch follows this list). (#2670, #1777, #2441, #3279, #2979, #2292, #2184, #1916)
  • Fixed pre-quantized checkpoint behavior by avoiding FP8 quant configs on vision/audio encoders and repairing broken FP8 quantization on Z-Image-Turbo, Qwen-Image, and FLUX.1-dev. (#2702, #2795)
  • Improved memory efficiency across TTS and diffusion by combining CUDA graph reuse, codec/decoder cleanup, multi-block and layerwise CPU offloading, TeaCache/offload compatibility fixes, transformer offload, and pipeline-declared offloadable modules. (#2386, #2429, #2430, #1486, #2339, #2689, #3224, #2427)
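
For the W4A16 items above, a rough offline quantization sketch with AutoRound. The model ID, group size, and export format are assumptions, and the AutoRound API surface drifts between auto-round releases, so treat this as a shape rather than a recipe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "Qwen/Qwen3-Omni"  # placeholder; quantize the LLM stage of an omni model
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# W4A16: 4-bit weights, 16-bit activations. Argument names follow
# recent auto-round releases and may differ in yours.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("./qwen3-omni-w4a16", format="auto_round")
```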

RL, Serving & Integrations

  • Improved BAGEL serving and RL flows with LoRA adapter injection, end-to-end LoRA support, text2text/img2text think mode, single-stage think mode, fused gate_proj/up_proj, trajectory recording, RDMA flow updates, TP/CFG transfer-engine support, and rollout trajectory fixes. (#2490, #2494, #2503, #2650, #2546, #2483, #2000, #2705, #2731, #3258)
  • Added serving controls and API reliability improvements including least-queue-length and round-robin load balancers, OpenAI-compatible request cancellation for image generation, streaming delta messages, graceful multi-stage shutdown, guarded app-state access during shutdown, response body fixes, and multi-stage deployment support (a cancellation sketch follows this list). (#2448, #2621, #2911, #3001, #2587, #3094, #2396)
  • Improved diffusion and multimodal serving observability with diffusion metrics surfaced in chat completions, corrected metric keys, profiler output fixes, Nsight Systems support for serving, PyTorch profiler ops/memory recording, high-load Qwen3-TTS perf CI, and multimodal benchmark token accounting fixes. (#2932, #2692, #2647, #1098, #2472, #3238, #2549)
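
On the request-cancellation item above: with OpenAI-compatible cancellation, dropping the client connection is enough to stop an in-flight image generation. A minimal sketch, assuming the server aborts on disconnect; the base URL and model ID are placeholders.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def main() -> None:
    task = asyncio.create_task(
        client.images.generate(model="Qwen/Qwen-Image", prompt="a long render")
    )
    await asyncio.sleep(2.0)  # change our mind mid-generation
    task.cancel()             # dropping the request lets the server abort the job
    try:
        await task
    except asyncio.CancelledError:
        print("request cancelled client-side")

asyncio.run(main())
```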

Platforms, Distributed Execution & Hardware Coverage

  • HunyuanImage-3.0 on NPU now supports DiT, AR-only, and AR+DiT workflows, adds IT2I image editing, improves DiT performance coverage, and enables offline msModelSlim/vLLM-Ascend quantized inference. (#2713, #3107, #2495, #2979, #2590, #1927, #1751)
  • Wan2.2 on NPU is now production-ready with major I2V performance optimizations, including MindIE-SD fused RoPE/AdaLayerNorm/RMSNorm kernels, VAE BF16 and parallelism fixes, and HSDP/USP deployment recipes, delivering a roughly 50-60% performance improvement in tested workloads. (#2919, #2393, #2459, #2391, #2585, #2583, #2571, #3067, #2969, #2852, #3063, #2262, #2817)
  • Qwen3-TTS and Qwen3-Omni NPU speech generation gains shared Code Predictor infrastructure, NPU graph/fusion-attention support, NPU runner alignment, and stronger benchmark coverage, together improving RTF by about 50%. (#2375, #2695, #3325, #2353, #3203, #3238, #2835)
  • Added and stabilized MUSA/Moore Threads GPU support, including platform discovery, torch.accelerator alignment, torch.compile/Inductor support, MATE Flash Attention for diffusion, device capability/version APIs, and installation docs (see the torch.accelerator sketch after this list). (#2337, #2359, #2451, #2766, #3101, #3179, #3132)
  • Improved ROCm reliability by migrating AMD CI images, restoring CI failure signals, syncing ROCm coverage with CUDA tests, fixing Qwen2.5/Qwen3 Omni CI cases, and selecting a safer default AR attention backend for Omni workloads. (#2303, #2340, #2708, #3225, #3343)
  • Improved Intel XPU support with Voxtral TTS stage configs, removal of hardcoded CUDA paths in audio tokenization, torch Inductor enablement, Qwen2.5 CI fixes, and updated XPU Docker support for the vLLM 0.20 / PyTorch 2.11 stack. (#2428, #3113, #3083, #3393)
  • Added distributed and parallel execution improvements including HSDP for Qwen-Image/Z-Image/GLM-Image, DreamID-Omni, LTX-2, and Stable-Audio-Open, plus BAGEL TP/CFG transfer-engine support and CFG parallel dispatch improvements. (#2029, #3138, #2899, #2982, #2705, #2731, #2423)
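
For the torch.accelerator alignment in the MUSA item above, device-agnostic code can pick up whichever backend the build provides (CUDA, MUSA, XPU, ...) through one API. A minimal sketch, assuming PyTorch 2.6+ where torch.accelerator landed and a backend that registers with it.

```python
import torch

# torch.accelerator abstracts over CUDA/ROCm, XPU, MUSA, and other
# backends; on a Moore Threads build this resolves to the MUSA device.
if torch.accelerator.is_available():
    device = torch.accelerator.current_accelerator()
else:
    device = torch.device("cpu")

x = torch.randn(4, 4, device=device)
print(device, x.sum().item())
```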

CI, Benchmarks & Documentation

  • Expanded reliability and performance coverage with Qwen3-Omni/Qwen-TTS/Qwen-Image/Wan2.2 stability tests, Qwen3-TTS high-load daily performance CI, HunyuanImage3 DiT benchmark tests, universal TTS benchmarks, and model accuracy benchmark coverage. (#2817, #3238, #2495, #2835, #2558)
  • Updated documentation for CLI usage, Qwen3-Omni recipes, diffusion quantization, diffusion attention backends, TTS hubs, Fish Speech deployment, LTX-2 recipes, and hardware-specific deployment notes. (#2978, #3109, #3200, #3011, #3234, #3193, #3294, #2919)
  • Improved CI readiness after the v0.20.0 rebase with CUDA 13.0/Qwen-Image performance fixes, NPU alignment, ROCm fixes, Intel XPU CI fixes, PyTorch 2.11 XPU Docker updates, and nightly/ready test scheduling updates. (#3352, #3325, #3343, #3083, #3393, #2945)

Full Changelog: v0.18.0...v0.20.0
