vllm-project/vllm-omni v0.21.0rc1

Pre-release

Highlights

This release features 94 commits from 72 contributors, including 10 new contributors.

vLLM-Omni v0.21.0rc1 is a release candidate aligned with upstream vLLM v0.21.0. It focuses on validating the next production cut by expanding diffusion, image/video, speech, and omni-model coverage; improving distributed execution and hardware backend readiness; and tightening serving stability across OpenAI-compatible APIs, deploy configs, and long-running workloads.

This release candidate is intended to validate the v0.21.0 integration, HunyuanImage-3.0 feature set, Qwen3-TTS stability, diffusion quantization, and NPU/ROCm platform coverage before the final release.

Key Improvements

  • Aligned vLLM-Omni with upstream vLLM v0.21.0, refreshing the base runtime for the v0.21 release line. (#3530)
  • Expanded image, video, and diffusion generation capabilities, including HunyuanImage-3.0 AR + DiT KV reuse, online IT2I image editing, multi-image input, FLUX/Qwen image pipelines, DMD2 image generation, and FLUX.2-dev TP support. (#3346, #3410, #3444, #2974, #2465)
  • Improved diffusion parallelism and backend configurability, with HunyuanVideo 1.5 USP sequence parallelism, Bagel HSDP support, per-role attention backend selection, and additional diffusion worker configuration. (#2444, #3150, #2681, #3020)
  • Strengthened speech and omni serving, including Qwen3-TTS recipes and tests, Voxtral TTS FP8 quantization, Covo-Audio-Chat support, MiMo-Audio tokenizer decoding improvements, and Qwen2.5/Qwen3-Omni talker-stage cleanup. (#3130, #3036, #2293, #2183, #3296, #3425)
  • Expanded quantization and hardware coverage, including ModelOpt FP8 auto-detection for diffusion checkpoints, NPU FP8/FA support, Wan2.2 W8A8 MXFP8 on Ascend NPU, ROCm fixes, and new NPU nightly coverage. (#2913, #2640, #3140, #3463, #3480)
  • Improved production reliability, fixing seed handling through the OpenAI Python client, HunyuanImage-3.0 online/offline prompt and AR behavior mismatches, diffusion worker Ray shutdown failures, streaming audio splicing, deploy-config serving failures, and AR replica cache routing. (#3436, #3500, #3516, #3533, #3438, #3537, #3605)

Core Architecture & Runtime

  • Reworked runtime abstractions for broader backend compatibility, including replacing selected torch.cuda usage with torch.accelerator, renaming module-offload discovery interfaces, standardizing msgspec data-entry keys, and continuing the Qwen3-Omni communication-layer refactor. (#3365, #3354, #3149, #2677)
  • Improved diffusion engine extensibility for out-of-tree hardware backends and added more concrete entrypoint typing for serving integration work. (#3239, #3139)
  • Fixed runtime correctness issues around HSDP + torch.compile RMSNorm, OmniGen2 offload/dtype mismatch, Helios optimized-scale casting, and platform device-count detection. (#3460, #2560, #3529, #3636)
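The torch.cuda → torch.accelerator migration and the device-count fix above suggest a backend-agnostic detection pattern. A minimal sketch of that idea (the function name and fallback order are illustrative, not vLLM-Omni's actual API):

```python
def accelerator_device_count() -> int:
    """Backend-agnostic device count: prefer torch.accelerator
    (PyTorch >= 2.6), fall back to torch.cuda, and return 0 when
    no accelerator (or no torch install) is present."""
    try:
        import torch
    except ImportError:
        return 0
    # torch.accelerator covers CUDA, ROCm, XPU, etc. behind one API.
    acc = getattr(torch, "accelerator", None)
    if acc is not None and acc.is_available():
        return acc.device_count()
    # Legacy path for older torch builds.
    if torch.cuda.is_available():
        return torch.cuda.device_count()
    return 0
```

Returning 0 instead of raising on unrecognized platforms mirrors the behavior targeted by the `get_device_count` fix (#3636).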

Model Support

  • Added or expanded support for Sensenova U1, Tencent Covo-Audio-Chat, Qwen3-TTS recipes, HunyuanImage-3.0 deploy configs, FLUX.2-dev TP-aware MistralEncoder, and additional FLUX/Qwen image pipelines. (#3319, #2293, #3130, #3172, #2465, #2974)
  • Improved model-family coverage for Bagel, VoxCPM2, Wan2.2 I2V, HunyuanVideo 1.5, HunyuanImage-3.0, Qwen-Image, Qwen2.5-Omni, and Qwen3-Omni through targeted fixes and recipe updates. (#3150, #3424, #3271, #2444, #3346, #3450, #3425, #3296)

Audio, Speech & Omni Production Optimization

  • Added Qwen3-TTS model recipes and hardened Qwen3-TTS performance/nightly coverage, including Base voice-clone testing, additional concurrency cells, and ready-tag tests. (#3130, #3491, #3600, #3637)
  • Improved Qwen3-TTS latency and streaming behavior, including a latency-regression fix, custom-voice streaming client update, and streaming audio output splicing fix. (#3485, #3380, #3438)
  • Improved MiMo-Audio tokenizer decoding performance and added Voxtral TTS FP8 quantization support. (#2183, #3036)
  • Cleaned dead audio/visual components from Qwen2.5-Omni and Qwen3-Omni talker stages, and fixed omni processing tests for non-multimodal talker stages. (#3425, #3296, #3559)

Diffusion, Image & Video Generation

  • Added major HunyuanImage-3.0 capabilities, including AR + DiT KV reuse, online IT2I image editing, multi-image input, deploy configs, AR sampler batching, and stability tests for HunyuanImage-3-Instruct. (#3346, #3410, #3444, #3172, #3590, #3504)
  • Improved HunyuanImage-3.0 correctness across online/offline paths by aligning AR and DiT prompt formatting, fixing AR encode differences, adding think_recaption bot-task support, and fixing KV reuse compatibility under sequence parallelism. (#3516, #3500, #3551, #3546)
  • Expanded diffusion and video execution with USP support for HunyuanVideo 1.5, Bagel HSDP support, DMD2 image generation, FLUX/Qwen image pipelines, and Wan2.2 I2V recipe updates. (#2444, #3150, #2974, #3271)
  • Improved diffusion serving robustness with fixes for SD3 dtype crashes, OmniGen2 offload/dtype mismatch, TeaCache refresh behavior, diffusion worker Ray SIGKILL, diffusers backend input handling, shared-memory connector issues, and diffusion KV-cache dtype isolation. (#2526, #2560, #2240, #3533, #3644, #3583, #3596)

Quantization & Memory Efficiency

  • Added ModelOpt FP8 auto-detection for diffusion checkpoints and bumped the minimum diffusers dependency to >=0.38.0. (#2913, #3349)
  • Added Voxtral TTS FP8 quantization and expanded NPU quantization with online FP8 for FA plus W8A8 MXFP8 online/offline quantization for Wan2.2 T2V/I2V/TI2V on Ascend NPU. (#3036, #2640, #3140)
  • Fixed diffusion KV-cache dtype isolation so diffusion cache behavior is not incorrectly coupled to vLLM --kv-cache-dtype. (#3596)

RL, Serving & Integrations

  • Fixed OpenAI Python client seed handling and ensured extra_params are merged correctly into diffusion speech sampling parameters. (#3436, #3320)
  • Improved deploy-config based online serving and added additional_config support for diffusion workers. (#3537, #3020)
  • Fixed multimodal cache routing for AR replicas, thinker preemption shape mismatches, and async diffusion race conditions. (#3605, #3147, #3379)
  • Improved VoxCPM2 first-request latency through startup warmup and fixed the default stage config path. (#3424, #3447)

Platforms, Distributed Execution & Hardware Coverage

  • Extended diffusion engine plugin extensibility for out-of-tree hardware backends and improved NPU support with code-predictor device mismatch fixes, AR prefix-cache key flattening, and NPU nightly tests. (#3239, #3453, #3568, #3480)
  • Added NPU quantization coverage for Wan2.2 and FA, and improved ROCm parity with CUDA CI skip logic plus Wan2.2 ROCm fixes. (#3140, #2640, #3482, #3463)
  • Improved distributed execution support with HunyuanVideo 1.5 USP, Bagel HSDP, FLUX.2-dev TP, and HSDP + torch.compile correctness fixes. (#2444, #3150, #2465, #3460)

CI, Benchmarks & Documentation

  • Unified L2/L3 test layout, Buildkite steps, and test helpers; refined nightly pytest execution; and improved e2e latency/startup logging. (#2556, #3459, #3246)
  • Expanded nightly and stability coverage for HunyuanImage-3, Diffusion X2I/X2V, Qwen3-TTS, Qwen3-Omni, and NPU backends. (#3455, #3504, #3625, #3600, #3501, #3480)
  • Refactored attention-backend documentation and skill content, updated release/readme materials, refreshed community QR code, and updated CODEOWNERS feature reviewers. (#3475, #3594, #3624, #3378)
  • Cleaned obsolete or noisy CI paths, including multi-replica Bagel CI removal, duplicate H100 testing reduction, weekly-test merge-condition updates, and environment-variable cleanup that avoided unnecessary GPU detection. (#3407, #3459, #3197, #3446)

Note

  • attention_config has been renamed to diffusion_attention_config. Users with custom configs, deploy configs, or scripts that reference the old key should update them accordingly. (#3489)
  • The minimum supported diffusers version is now >=0.38.0. (#3349)
  • This is a release candidate. Please validate model-specific serving paths, hardware backends, quantization settings, and deployment configs before using it as a production baseline.
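For the attention_config rename (#3489), the change to a deploy config is a one-key rename. The sketch below is illustrative only: the nesting and the backend value are assumptions, so check your actual config for the enclosing structure.

```yaml
# before (v0.20.x)
attention_config:
  backend: flash_attn

# after (v0.21.0rc1 and later)
diffusion_attention_config:
  backend: flash_attn
```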

What's Changed

  • [chore] Update command to download dataset from huggingface-cli to hf by @Gaohan123 in #3403
  • [Refactor] Replace and ban a few torch.cuda functions in favor of torch.accelerator replacements. by @NickCao in #3365
  • [Clean] Remove multi-replica Bagel CI and related docs/configs by @fake0fan in #3407
  • Update CODEOWNERS feature reviewers by @david6666666 in #3378
  • [Test] Unify L2/L3 test layout, Buildkite steps, and test helpers by @yenuo26 in #2556
  • [Hardware] Extend diffusion engine plugin extensibility for out-of-tree hardware backends by @yuchenjiangyj in #3239
  • [Feat] support hsdp for Bagel by @lsyyysky in #3150
  • [Bugfix] Fix the issue where the seed parameter does not take effect when using the OpenAI Python client by @Phi-C in #3436
  • [Bugfix] Fix Dtype Crashes in SD3 by @alex-jw-brooks in #2526
  • [Feature][Hunyuan image 3.0] AR + DIT with kv reuse. by @Bounty-hunter in #3346
  • [Test][HunyuanImage3] Add e2e offline I2T smoke test by @TaffyOfficial in #3332
  • [BugFix]Fix default stage config path in voxcpm2 by @sphinxkkkbc in #3447
  • [Feat] Add Sequence Parallelism (USP) support for HunyuanVideo 1.5 transformer by @daixinning in #2444
  • [Feature] online HunyuanImage-3.0 IT2I (image editing) support by @skf-1999 in #3410
  • enhancement: extend to dmd2 to image generation + add flux, qwen image pipelines by @ayushag-nv in #2974
  • [Refactor] Rename SupportsModuleOffload to SupportsComponentDiscovery by @NickCao in #3354
  • Add Qwen3 TTS Model recipe by @chzhang2021 in #3130
  • [Bugfix][StableAudio] Pass model_class_name to Omni() and declare audio class attrs by @linyueqian in #3405
  • [Bugfix] Qwen-Image use teachche serve will crash by @lengrongfu in #3450
  • [Perf] Optimize VoxCPM2 first-request latency via startup warmup by @Dan250124 in #3424
  • [Bugfix] fix OmniGen2 offload and dtype mismatch by @RuixiangMa in #2560
  • [Feature] Add FP8 quantization for Voxtral TTS by @akshatvishu in #3036
  • Fix NPU code predictor device mismatch in concurrent mode by @Wallbreazzz in #3453
  • [Test] Restore tts mark and omni_runner_function fixture for Voxtral TTS by @linyueqian in #3462
  • [CI] Update merge condition to skip L3 merges during weekly test and update doc by @zhumingjue138 in #3197
  • [CI] Refine nightly pytest command in Omni · Function Test with H100 to avoid duplicate testing. by @yenuo26 in #3459
  • (Phase 1)Add ModelOpt FP8 auto-detect support for diffusion checkpoints #2709 by @baonudesifeizhai in #2913
  • [CI][Nightly] Shard nightly Diffusion X2I H100 lanes and centralize shard definitions by @wuhang2014 in #3455
  • [CI] Remove VLLM_TEST_CLEAN_GPU_MEMORY to avoid environment variable pollution that causes unnecessary GPU detection, thereby slowing down test case execution. by @yenuo26 in #3446
  • [Diffusion][Attention] Support per-role attention backend via CLI by @gcanlin in #2681
  • [Feature] hunyuanimage support flash attn by @Bounty-hunter in #2981
  • [Perf] Fix Qwen3-TTS latency regression by @Sy0307 in #3485
  • [ROCm] [CI] Add the same skip ci logic as CUDA CI by @tjtanaa in #3482
  • [Docs] Refactor the attention backend docs/skill by @gcanlin in #3475
  • [Performance] Improve MiMo-Audio tokenizer decoding performance by @qibaoyuan in #2183
  • [BugFix] Rename attention_config to diffusion_attention_config by @gcanlin in #3489
  • [Bug][Hunyuanimage 3.0] fix different AR encode behavior between online and offline by @Bounty-hunter in #3500
  • [Misc] Clean logs for image gen task by @wuhang2014 in #3414
  • [CI] skip failing diffusion and accuracy cases (#3432, #3256, #3257, #3488) by @yenuo26 in #3507
  • [New Model]: Add sensenova u1 support by @princepride in #3319
  • [Config] Add HunyuanImage3 deploy configs by @Fishermanykx in #3172
  • [Fix] Fix RMSNorm inductor KeyError under HSDP + torch.compile by @LJH-LBJ in #3460
  • [Perf] Remove dead audio_tower and visual from Qwen3-Omni talker stage by @NickCao in #3296
  • [bugfix][ci] avoid Whisper transcript deduplication in realtime audio test by @Shirley125 in #3417
  • [Chore] explicit .float() conversion in Helios's optimized_scale function by @RuixiangMa in #3529
  • [CI][Bugfix] Improve e2e latency logging, update response classes to include detailed latency documentation and add startup time logging by @yenuo26 in #3246
  • [Recipes]update Wan2.2-I2V gpu part by @bjf-frz in #3271
  • [BugFix] Modify the splicing method of streaming audio output. by @amy-why-3459 in #3438
  • [Bugfix] Align the AR and DiT prompt formatting across both online and offline modes. by @Bounty-hunter in #3516
  • [FIX] Ensure extra_params are correctly merged into sampling params in _create_diffusion_speech() by @saadaltohamy in #3320
  • [Nightly CI] Remove TP case by @NumberWan in #3534
  • [Refactor] msgspec standardisation for data entry key names and improved type checks by @divyanshsinghvi in #3149
  • [New Model] Add support for tencent/Covo-Audio-Chat by @Dnoob in #2293
  • [bugfix, rl] Fix race condition bug on async running for diffusion model by @knlnguyen1802 in #3379
  • [CI] update daily omni min accuracy by @R2-Y in #3536
  • [Perf] Remove dead audio_tower and visual from Qwen2.5-Omni talker stage by @NickCao in #3425
  • [Bugfix] Fix the issue where the qwen3-omni model long-term stability test sometimes gets stuck without sending requests. by @zhumingjue138 in #3468
  • [Bugfix] Fix omni processing test for non-multimodal talker stage by @NickCao in #3559
  • Bump diffusers minimum version to >=0.38.0 by @oglok in #3349
  • support online FP8 quantization for FA on NPU #2236 by @lyj-jjj in #2640
  • [CI][Test] Add NPU nightly tests by @gcanlin in #3480
  • [CI][Bugfix] skip fp8 Z-Image quality gate (#3531) and add torchdiffeq dev extra by @yenuo26 in #3563
  • [Bugfix, rl] Diffusion worker SIGKILL under Ray actor (exitcode -9) by @knlnguyen1802 in #3533
  • Fix: NPU AR model runner prefix cache key flattening by @weizhoublue in #3568
  • [NPU][Quant] Add W8A8 MXFP8 online/offline quantization support for Wan2.2 T2V / I2V / TI2V inference on Ascend NPU by @hxhhhlalala in #3140
  • [skip ci][Tests] Splitting Qwen3-omni's performance test cases by @amy-why-3459 in #3501
  • [ROCm] Bugfix wan22 by @tjtanaa in #3463
  • [Bugfix] Add bot_task option of think_recaption for hunyuanimage3 it2i by @zengchuang-hw in #3551
  • [Feat][Config] Support additional_config for diffusion worker by @Fishermanykx in #3020
  • [Bugfix][HunyuanImage3.0] Fix KV reuse compatibility in SP scenarios by @Bounty-hunter in #3546
  • [Model] Add TP-aware MistralEncoder for FLUX.2-dev TP by @vraiti in #2465
  • [BugFix] Refresh TeaCache when num_inference_steps=None by @alex-jw-brooks in #2240
  • [Test] Add stability tests for HunyuanImage-3-Instruct by @zhumingjue138 in #3504
  • [Bugfix]: Fix online serving failure when using deploy config by @Fishermanykx in #3537
  • [Entrypoint][Refactor] Make field type hint more concrete by @wuhang2014 in #3139
  • [CI] Harden Qwen3-TTS perf nightly: enable Base voice_clone, add c=64/128, 2-GPU split by @linyueqian in #3491
  • [Feature] HunyuanImage-3.0 IT2I: multi-image input + prompt API cleanup by @TaffyOfficial in #3444
  • update v0.20.0 readme by @hsliuustc0106 in #3594
  • [Bugfix]Allow HunyuanImage3 AR sampler batching by @bjf-frz in #3590
  • [BugFix] fix shm connector by @Bounty-hunter in #3583
  • [CI] Add Qwen3-TTS tests for ready tag by @gcanlin in #3600
  • Update WeChat group QR code by @david6666666 in #3624
  • [BugFix] fix(omni): isolate diffusion KV-cache dtype from vLLM --kv-cache-dtype #3585 by @lyj-jjj in #3596
  • Update streaming_speech_client.py to solve Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice voice problem by @keeper-jie in #3380
  • [CI] add cuda marker to Diffusion X2V function pytest by @yenuo26 in #3625
  • [Bugfix] UnspecifiedOmniPlatform.get_device_count returns 0 by @princepride in #3636
  • [2/5] [core]refactor communication layer: PR 2 of 5 Qwen3 Omni non async by @natureofnature in #2677
  • [Bugfix]Fix multimodal cache routing for AR replicas by @bjf-frz in #3605
  • [BugFix] Fix the issue of thinker requests being preempted, causing shape mismatch. by @amy-why-3459 in #3147
  • [Bugfix] fix compatibility of _hunyuan_image3_unpack_packed_topk between vllm / vllm ascend by @Fishermanykx in #3640
  • [bugfix] Fix diffusers backend input bug after #2913 by @fhfuih in #3644
  • [BugFix] fix ci by @amy-why-3459 in #3650
  • [CI] Replace c=128 perf cell with c=16; loosen new-cell baselines by @linyueqian in #3637
  • [Rebase] Rebase to vllm v0.21.0 by @tzhouam in #3530

New Contributors

Full Changelog: v0.20.0...v0.21.0rc1
