vllm-project/vllm-omni v0.20.0rc1

Pre-release

Highlights

This release features 327 commits from 99 contributors, including 36 new contributors.

vLLM-Omni v0.20.0rc1 is a release candidate aligned with upstream vLLM v0.20.0. It expands the model catalog across speech, omni, image, and video workloads; improves production readiness for TTS, diffusion, BAGEL, and multi-stage serving; and broadens hardware coverage across CUDA, ROCm, MUSA, NPU, XPU, and AMD validation paths. This release candidate is intended to validate the upstream rebase, the refreshed serving/runtime behavior, and the expanded model and platform matrix before the final v0.20.0 release.

Key Improvements

  • Rebased to upstream vLLM v0.20.0, with entrypoint cleanup, stage CLI refactoring, sleep-mode support, coordinator reliability fixes, and removed vLLM entrypoint hijacking for the 0.20.0 integration path. (#3232, #3082, #2020, #2022, #1899)
  • Expanded model support across omni, TTS, image, and video workloads, including MagiHuman, Dynin-omni, InternVLA-A1, Ming-flash-omni-2.0, XiaomiMiMo/MiMo-V2.5-ASR, MOSS-TTS-Nano, VoxCPM2 native AR TTS, LTX-2.3, and FastGen Wan 2.1 pipelines. (#2301, #1759, #2737, #2890, #3089, #2753, #2658, #2893, #2749)
  • Improved TTS and audio production behavior, with lower VRAM usage, CUDA graph reuse, voice-cloning fixes, deterministic Fish Speech generation, a unified TTS pipeline/deploy schema, and a new universal TTS benchmark. (#2480, #2429, #2430, #2520, #2609, #2624, #2676, #2690, #2835, #2958, #3253)
  • Strengthened diffusion, image, and video generation, adding Z-Image image-to-image, FLUX.1/FLUX.2 TeaCache and CFG-parallel paths, Wan/LTX model support, VAE tiling, diffusion profiler/progress tooling, MP4 latency optimization, and HSDP coverage for LTX-2 and Stable-Audio-Open. (#1580, #1871, #2010, #2134, #2160, #2368, #2489, #2735, #2774, #2899, #2982)
  • Made BAGEL more capable in serving and RL workflows, with LoRA support, think mode, fused projections, RDMA/connector work, TP/CFG parallel flow, layerwise offload, diffusion metrics, and rollout fixes. (#2490, #2494, #2503, #2546, #2650, #2705, #2731, #2734, #2932, #3258)
  • Broadened quantization, memory, and hardware coverage, including OmniGen2 FP8, Qwen Omni W4A16, HunyuanImage3 NPU quantization, safer pre-quantized checkpoint handling, MUSA flash attention and torch.accelerator support, NPU graph/fused-op improvements, ROCm/AMD CI fixes, and XPU torch inductor support. (#2441, #2670, #2702, #2795, #2979, #2451, #2695, #2766, #3067, #3101, #3113, #3225)

Core Architecture & Runtime

  • Rebased vLLM-Omni to upstream vLLM v0.20.0 and removed the old vLLM entrypoint hijack used before the 0.20.0 integration path. (#3232, #3082)
  • Refreshed the runtime and stage lifecycle with a stage CLI refactor, omni sleep mode and acknowledgement protocol, coordinator reconnect/race/heartbeat fixes, stage launch-lock handling, and restored user-configurable stage initialization timeout behavior. (#2020, #2022, #1899, #2717, #2519)
  • Improved request/runtime configuration behavior by preserving media access arguments, removing invalid LLM-only diffusion stage args, centralizing stage sampling parameter resolution, and guarding silent config failures when trust_remote_code is unset. (#2956, #2622, #3153, #3241)
  • Added lower-level runtime support such as inline-client health checks, optional MROPE kwargs for HunyuanImage3, GPU-buffer accessor fixes, and clearer runtime failure reporting. (#3052, #2654, #2068, #2426)
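The centralized stage sampling-parameter resolution mentioned above (#3153) can be illustrated with a minimal precedence sketch. The function name and field set here are hypothetical, not the actual vLLM-Omni API; the point is the request > stage > global override order:

```python
def resolve_sampling_params(request: dict, stage_defaults: dict,
                            global_defaults: dict) -> dict:
    """Merge sampling parameters with request > stage > global precedence.

    Hypothetical sketch of centralized resolution: instead of each stage
    merging parameters itself, one function produces the final mapping.
    """
    resolved = dict(global_defaults)
    resolved.update(stage_defaults)
    # Only keys the caller actually set should override stage defaults.
    resolved.update({k: v for k, v in request.items() if v is not None})
    return resolved


params = resolve_sampling_params(
    request={"temperature": 0.2, "top_p": None},
    global_defaults={"temperature": 1.0, "top_p": 1.0, "max_tokens": 256},
    stage_defaults={"top_p": 0.9},
)
# temperature comes from the request, top_p from the stage, max_tokens global
```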

Model Support

  • Added or expanded omni and speech model coverage for MagiHuman, Dynin-omni, InternVLA-A1, Ming-flash-omni-2.0, XiaomiMiMo/MiMo-V2.5-ASR, MOSS-TTS-Nano, VoxCPM2 native AR TTS, and OmniVoice voice cloning. (#2301, #2542, #2554, #1759, #2737, #2890, #3089, #2753, #2658, #2676)
  • Expanded video and diffusion model support with Wan2.2-I2V-A14B, FastGen DMD2-distilled Wan 2.1 text-to-video and image-to-video pipelines, LTX-2.3, LTX-2 HSDP, and Stable-Audio-Open HSDP. (#2134, #2749, #2893, #2899, #2982)
  • Improved GLM-Image, HunyuanImage3, FLUX, and Z-Image behavior with config migration, benchmark/bug fixes, multi-stage serving fixes, system prompt alignment, text encoder fixes, image-to-image support, and quantization support for GLM-Image. (#2977, #3024, #3084, #2270, #2760, #1580, #2292)

Audio, Speech & Omni Production Optimization

  • Improved Qwen3-TTS serving quality and stability by fixing streaming chunk-boundary artifacts, aligning code predictor dtypes, handling missing ref_text, correcting Code2Wav length and eager-mode behavior, supporting local-path reference audio, and using float32 code prediction on fp16-only GPUs. (#2480, #2470, #2203, #2508, #2868, #2984, #3253)
  • Reduced TTS and audio memory/latency overhead by freeing unused Qwen3-TTS and Fish Speech decoder/codec components, enabling Fish Speech CUDA graph capture and reference-audio caching, sharing CUDA graph memory pools, and optimizing VoxCPM2 streaming VAE/compile and manual CUDA graph paths. (#2429, #2430, #2520, #2609, #2386, #2758, #2803)
  • Expanded voice and speech API behavior with speaker as a voice alias, deterministic Fish Speech seed support, raw-audio VoxCPM2 voice cloning, OmniVoice voice cloning, and speaker validation/case-insensitive lookup. (#2424, #2624, #2720, #2676, #2407)
  • Migrated VoxCPM2, CosyVoice3, MiMo Audio, Voxtral TTS, and Fish Speech S2 Pro to the Pipeline + Deploy schema, added the universal TTS benchmark, and offloaded blocking TTS/speech work to avoid event-loop stalls. (#2958, #2835, #2511)
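The voice-API behavior above (`speaker` as a `voice` alias plus case-insensitive speaker validation, #2424, #2407) can be sketched as a small normalization step. This is illustrative only; the field names and helper are assumptions, not the shipped implementation:

```python
def resolve_voice(request: dict, known_speakers: list[str]) -> str:
    """Treat `speaker` as an alias for `voice` and validate the result
    case-insensitively against the registered speakers.

    Hypothetical sketch of the alias/validation behavior, not the
    actual vLLM-Omni request handler.
    """
    name = request.get("voice") or request.get("speaker")
    if name is None:
        raise ValueError("request must set `voice` or `speaker`")
    # Case-insensitive lookup that returns the canonical speaker name.
    by_lower = {s.lower(): s for s in known_speakers}
    try:
        return by_lower[name.lower()]
    except KeyError:
        raise ValueError(f"unknown speaker: {name!r}") from None


# `speaker` works where `voice` is expected, regardless of casing.
canonical = resolve_voice({"speaker": "Vivian"},
                          known_speakers=["vivian", "ethan"])
```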

Diffusion, Image & Video Generation

  • Expanded generation capabilities with Z-Image image-to-image, FLUX.1/FLUX.2 TeaCache, FLUX.2-dev CFG parallel, generalized diffusers adapter backend support, and profiler/progress tooling for diffusion pipelines. (#1580, #2774, #1871, #2010, #2724, #2489)
  • Improved video generation with Wan2.2 BF16 VAE conversion, fused RMSNorm/AdaLayerNorm and NPU fused RMSNorm paths, LightX2V offline conversion, reduced duplicate preprocessing, MP4 encoding latency optimization, and FastGen Wan 2.1 pipeline support. (#2391, #2583, #2585, #3067, #2134, #2963, #2735, #2749)
  • Strengthened distributed and memory-efficient diffusion with VAE tiling parallel encode, unified CFG parallel support for LTX2, 3/4-branch CFG dispatch, per-pipeline offloadable-module declarations, layerwise offload for additional diffusion models and BAGEL, HSDP support, and inline execution for single-stage diffusion. (#2368, #2160, #2423, #2427, #2339, #2734, #2899, #2982, #2736)
  • Improved online image/video serving correctness with request cancellation for /v1/images/generations, max-generated-image-size enforcement, default video sampling fixes, ComfyUI image-to-image DALL-E endpoint fixes, media/default sampling preservation, and pure-diffusion offline example fixes. (#2621, #2599, #3049, #2980, #2780, #3181)
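The max-generated-image-size enforcement above (#2599) amounts to rejecting requests whose output would exceed a pixel budget before any GPU work starts. A minimal sketch, with an assumed limit and a hypothetical function name:

```python
def check_image_request(width: int, height: int,
                        max_pixels: int = 4096 * 4096) -> None:
    """Reject image-generation requests whose output would exceed a
    configured pixel budget. Limit and signature are illustrative."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width * height > max_pixels:
        raise ValueError(
            f"{width}x{height} exceeds the maximum generated image "
            f"size of {max_pixels} pixels")


check_image_request(1024, 1024)      # within budget, no error
try:
    check_image_request(65536, 65536)  # over budget
    rejected = False
except ValueError:
    rejected = True
```

Validating this up front lets the server return a clean 4xx instead of failing mid-generation.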

Quantization & Memory Efficiency

  • Added or improved quantization coverage for OmniGen2 FP8, Qwen Omni W4A16 via AutoRound, HunyuanImage3 offline quantization on the NPU diffusion path, and GLM-Image quantization, alongside updated quantization documentation. (#2441, #2670, #2979, #2292, #3200)
  • Fixed pre-quantized checkpoint behavior by avoiding FP8 quant configs on vision/audio encoders and repairing broken FP8 quantization on Z-Image-Turbo, Qwen-Image, and FLUX.1-dev. (#2702, #2795)
  • Improved memory efficiency across TTS and diffusion by combining CUDA graph reuse, codec/decoder cleanup, multi-block and layerwise CPU offloading, TeaCache/offload compatibility fixes, and pipeline-declared offloadable modules. (#2386, #2429, #2430, #1486, #2339, #2689, #2427)

RL, Serving & Integrations

  • Improved BAGEL serving and RL flows with LoRA adapter injection, end-to-end LoRA support, text2text/img2text think mode, single-stage think mode, fused gate_proj/up_proj, trajectory recording, RDMA flow updates, TP/CFG transfer-engine support, and rollout trajectory fixes. (#2490, #2494, #2503, #2650, #2546, #2483, #2000, #2705, #2731, #3258)
  • Added serving controls and API reliability improvements including least-queue-length and round-robin load balancers, OpenAI-compatible request cancellation for image generation, streaming delta messages, graceful multi-stage shutdown, guarded app-state access during shutdown, and response body fixes. (#2448, #2621, #2911, #3001, #2587, #3094)
  • Improved diffusion and multimodal serving observability with diffusion metrics surfaced in chat completions, corrected metric keys, profiler output fixes, Nsight Systems support for serving, PyTorch profiler ops/memory recording, and multimodal benchmark token accounting fixes. (#2932, #2692, #2647, #1098, #2472, #2549)
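The two load-balancing policies added above (#2448) can be sketched in a few lines. These classes illustrate the round-robin and least-queue-length ideas only; they are not the vLLM-Omni implementation:

```python
import itertools


class RoundRobinBalancer:
    """Cycle through replicas in a fixed order (illustrative sketch)."""

    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)

    def pick(self):
        return next(self._cycle)


class LeastQueueBalancer:
    """Route each request to the replica with the fewest queued requests."""

    def __init__(self, replicas):
        self.queue_len = {r: 0 for r in replicas}

    def pick(self):
        replica = min(self.queue_len, key=self.queue_len.get)
        self.queue_len[replica] += 1   # caller calls done() on completion
        return replica

    def done(self, replica):
        self.queue_len[replica] -= 1


rr = RoundRobinBalancer(["a", "b"])
order = [rr.pick() for _ in range(4)]       # alternates a, b, a, b

lq = LeastQueueBalancer(["a", "b"])
first, second = lq.pick(), lq.pick()        # spreads across both replicas
```

Least-queue-length adapts to uneven request latencies, while round-robin is simpler and stateless per request.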

Platforms, Distributed Execution & Hardware Coverage

  • Expanded MUSA support with flash attention through MATE, torch.accelerator support, torchada updates, device capability/version APIs, and matching behavior with CUDA/ROCm flash attention paths. (#2451, #2766, #3101, #3132, #3179)
  • Improved NPU coverage with code predictor graph support, MindIE SD fused RoPE/cache paths, Wan2.2 fused ops, VAE parallel gather performance fixes, HunyuanImage3 quantization, and Ascend NPU documentation for Wan2.2 image-to-video. (#2695, #2571, #2583, #2585, #3067, #2969, #2979, #2919)
  • Strengthened ROCm/AMD and XPU readiness through ROCm CI signal restoration and environment fixes, AMD simple-unit-test fixes, XPU torch inductor support, and platform capability cleanup such as supports_float64() and flash-attention package detection. (#2340, #2708, #3225, #3113, #2488, #3068)
  • Improved distributed execution paths with Bagel TP/CFG transfer-engine support, diffusion TP-size propagation, non-contiguous gather fixes, and CFG companion/orchestrator cleanup. (#2731, #2867, #2367, #2623)

CI, Benchmarks & Documentation

  • Added a CUDA Dockerfile for NVIDIA GPU users, doc-only CI change detection, a Buildkite skip-CI upload pipeline, and reorganized Buildkite nightly/ready/merge coverage for Omni and Diffusion models. (#1439, #1284, #2582, #2620, #2945)
  • Expanded validation with stability and reliability tests for Wan2.2, Qwen3-Omni, Qwen3-TTS, Qwen-Image, Stable Audio TeaCache, Qwen image edit performance, L5 reliability, and selected previously skipped expansion tests. (#2377, #2216, #2817, #2972, #3211)
  • Refreshed documentation for MUSA installation, multi-thread weight loading, expert parallelism, LTX-2 online serving, CLI usage, diffusion attention backends, profiling, quantization, and add-model skills for diffusion and TTS. (#2359, #2445, #2471, #1971, #2978, #3011, #3196, #3200, #2806)
  • Replaced model-specific TTS benchmark folders with a more general TTS benchmark flow covering Qwen3-TTS and VoxCPM2 voice-clone/default/design tasks. (#2835)

Note

  • v0.20.0rc1 is a release candidate. Use it to validate the upstream vLLM 0.20.0 rebase, the refreshed runtime and stage configuration behavior, and the expanded model/platform matrix before the final release. Known remaining issues include #3268, #3266, #3264, #3257, #3256, #3255, and #2354.
  • Several changes in the generated What's Changed appendix were later reverted and are not part of this release: the Qwen3-Omni performance optimization, the VoxCPM2 instructions/cfg_value change, Z-Image text encoder FP8 online quantization, and the deploy override field refactor. (#3202, #3204, #3272, #3287)
  • Low-signal CI, documentation, typo, and release-script maintenance changes are folded into the broader themes above rather than listed individually.

What's Changed

Keep the existing GitHub-generated What's Changed appendix below this editorial section when updating the release body, or regenerate it from:

v0.19.0rc1...v0.20.0rc1

New Contributors

Keep the existing GitHub-generated New Contributors appendix below this editorial section. The current generated appendix lists 36 first-time contributors for v0.20.0rc1.


Full Changelog: v0.19.0rc1...v0.20.0rc1
