Highlights
This release features 94 commits from 72 contributors, including 10 new contributors.
vLLM-Omni v0.21.0rc1 is a release candidate aligned with upstream vLLM v0.21.0. It focuses on validating the next production cut by expanding diffusion, image/video, speech, and omni-model coverage; improving distributed execution and hardware backend readiness; and tightening serving stability across OpenAI-compatible APIs, deploy configs, and long-running workloads.
This release candidate is intended to validate the v0.21.0 integration, HunyuanImage-3.0 feature set, Qwen3-TTS stability, diffusion quantization, and NPU/ROCm platform coverage before the final release.
Key Improvements
- Aligned vLLM-Omni with upstream vLLM v0.21.0, refreshing the base runtime for the v0.21 release line. (#3530)
- Expanded image, video, and diffusion generation capabilities, including HunyuanImage-3.0 AR + DiT KV reuse, online IT2I image editing, multi-image input, FLUX/Qwen image pipelines, DMD2 image generation, and FLUX.2-dev TP support. (#3346, #3410, #3444, #2974, #2465)
- Improved diffusion parallelism and backend configurability, with HunyuanVideo 1.5 USP sequence parallelism, Bagel HSDP support, per-role attention backend selection, and additional diffusion worker configuration. (#2444, #3150, #2681, #3020)
- Strengthened speech and omni serving, including Qwen3-TTS recipes and tests, Voxtral TTS FP8 quantization, Covo-Audio-Chat support, MiMo-Audio tokenizer decoding improvements, and Qwen2.5/Qwen3-Omni talker-stage cleanup. (#3130, #3036, #2293, #2183, #3296, #3425)
- Expanded quantization and hardware coverage, including ModelOpt FP8 auto-detection for diffusion checkpoints, NPU FP8/FA support, Wan2.2 W8A8 MXFP8 on Ascend NPU, ROCm fixes, and new NPU nightly coverage. (#2913, #2640, #3140, #3463, #3480)
- Improved production reliability, fixing seed handling through the OpenAI Python client, HunyuanImage-3.0 online/offline prompt and AR behavior mismatches, diffusion worker Ray shutdown failures, streaming audio splicing, deploy-config serving failures, and AR replica cache routing. (#3436, #3500, #3516, #3533, #3438, #3537, #3605)
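The seed fix (#3436) matters for reproducible generation through the OpenAI-compatible API. A minimal stdlib sketch of the request body an OpenAI Python client would POST to `/v1/chat/completions`; the helper name and model name are placeholders for illustration:

```python
import json

def chat_request_body(prompt: str, seed: int,
                      model: str = "Qwen/Qwen3-Omni") -> str:
    # Hypothetical helper: builds the JSON body an OpenAI-compatible client
    # sends. The top-level "seed" field is the parameter that #3436 fixed
    # to actually take effect server-side.
    body = {
        "model": model,  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "seed": seed,
    }
    return json.dumps(body)
```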
Core Architecture & Runtime
- Reworked runtime abstractions for broader backend compatibility, including replacing selected `torch.cuda` usage with `torch.accelerator`, renaming module-offload discovery interfaces, standardizing msgspec data-entry keys, and continuing the Qwen3-Omni communication-layer refactor. (#3365, #3354, #3149, #2677)
- Improved diffusion engine extensibility for out-of-tree hardware backends and added more concrete entrypoint typing for serving integration work. (#3239, #3139)
- Fixed runtime correctness issues around HSDP + `torch.compile` RMSNorm, OmniGen2 offload/dtype mismatch, Helios optimized-scale casting, and platform device-count detection. (#3460, #2560, #3529, #3636)
Model Support
- Added or expanded support for Sensenova U1, Tencent Covo-Audio-Chat, Qwen3-TTS recipes, HunyuanImage-3.0 deploy configs, FLUX.2-dev TP-aware MistralEncoder, and additional FLUX/Qwen image pipelines. (#3319, #2293, #3130, #3172, #2465, #2974)
- Improved model-family coverage for Bagel, VoxCPM2, Wan2.2 I2V, HunyuanVideo 1.5, HunyuanImage-3.0, Qwen-Image, Qwen2.5-Omni, and Qwen3-Omni through targeted fixes and recipe updates. (#3150, #3424, #3271, #2444, #3346, #3450, #3425, #3296)
Audio, Speech & Omni Production Optimization
- Added Qwen3-TTS model recipes and hardened Qwen3-TTS performance/nightly coverage, including Base voice-clone testing, additional concurrency cells, and ready-tag tests. (#3130, #3491, #3600, #3637)
- Improved Qwen3-TTS latency and streaming behavior, including a latency-regression fix, custom-voice streaming client update, and streaming audio output splicing fix. (#3485, #3380, #3438)
- Improved MiMo-Audio tokenizer decoding performance and added Voxtral TTS FP8 quantization support. (#2183, #3036)
- Cleaned dead audio/visual components from Qwen2.5-Omni and Qwen3-Omni talker stages, and fixed omni processing tests for non-multimodal talker stages. (#3425, #3296, #3559)
Diffusion, Image & Video Generation
- Added major HunyuanImage-3.0 capabilities, including AR + DiT KV reuse, online IT2I image editing, multi-image input, deploy configs, AR sampler batching, and stability tests for HunyuanImage-3-Instruct. (#3346, #3410, #3444, #3172, #3590, #3504)
- Improved HunyuanImage-3.0 correctness across online/offline paths by aligning AR and DiT prompt formatting, fixing AR encode differences, adding `think_recaption` bot-task support, and fixing KV reuse compatibility under sequence parallelism. (#3516, #3500, #3551, #3546)
- Expanded diffusion and video execution with USP support for HunyuanVideo 1.5, Bagel HSDP support, DMD2 image generation, FLUX/Qwen image pipelines, and Wan2.2 I2V recipe updates. (#2444, #3150, #2974, #3271)
- Improved diffusion serving robustness with fixes for SD3 dtype crashes, OmniGen2 offload/dtype mismatch, TeaCache refresh behavior, diffusion worker Ray SIGKILL, diffusers backend input handling, shared-memory connector issues, and diffusion KV-cache dtype isolation. (#2526, #2560, #2240, #3533, #3644, #3583, #3596)
Quantization & Memory Efficiency
- Added ModelOpt FP8 auto-detection for diffusion checkpoints and bumped the minimum `diffusers` dependency to `>=0.38.0`. (#2913, #3349)
- Added Voxtral TTS FP8 quantization and expanded NPU quantization with online FP8 for FA plus W8A8 MXFP8 online/offline quantization for Wan2.2 T2V/I2V/TI2V on Ascend NPU. (#3036, #2640, #3140)
- Fixed diffusion KV-cache dtype isolation so diffusion cache behavior is not incorrectly coupled to vLLM's `--kv-cache-dtype`. (#3596)
RL, Serving & Integrations
- Fixed OpenAI Python client seed handling and ensured `extra_params` are merged correctly into diffusion speech sampling parameters. (#3436, #3320)
- Improved deploy-config based online serving and added `additional_config` support for diffusion workers. (#3537, #3020)
- Fixed multimodal cache routing for AR replicas, thinker preemption shape mismatches, and async diffusion race conditions. (#3605, #3147, #3379)
- Improved VoxCPM2 first-request latency through startup warmup and fixed the default stage config path. (#3424, #3447)
Platforms, Distributed Execution & Hardware Coverage
- Extended diffusion engine plugin extensibility for out-of-tree hardware backends and improved NPU support with code-predictor device mismatch fixes, AR prefix-cache key flattening, and NPU nightly tests. (#3239, #3453, #3568, #3480)
- Added NPU quantization coverage for Wan2.2 and FA, and improved ROCm parity with CUDA CI skip logic plus Wan2.2 ROCm fixes. (#3140, #2640, #3482, #3463)
- Improved distributed execution support with HunyuanVideo 1.5 USP, Bagel HSDP, FLUX.2-dev TP, and HSDP + `torch.compile` correctness fixes. (#2444, #3150, #2465, #3460)
CI, Benchmarks & Documentation
- Unified L2/L3 test layout, Buildkite steps, and test helpers; refined nightly pytest execution; and improved e2e latency/startup logging. (#2556, #3459, #3246)
- Expanded nightly and stability coverage for HunyuanImage-3, Diffusion X2I/X2V, Qwen3-TTS, Qwen3-Omni, and NPU backends. (#3455, #3504, #3625, #3600, #3501, #3480)
- Refactored attention-backend documentation and skill content, updated release/readme materials, refreshed community QR code, and updated CODEOWNERS feature reviewers. (#3475, #3594, #3624, #3378)
- Cleaned obsolete or noisy CI paths, including multi-replica Bagel CI removal, duplicate H100 testing reduction, weekly-test merge-condition updates, and environment-variable cleanup that avoided unnecessary GPU detection. (#3407, #3459, #3197, #3446)
Note
- `attention_config` has been renamed to `diffusion_attention_config`. Users with custom configs, deploy configs, or scripts that reference the old key should update them accordingly. (#3489)
- The minimum supported `diffusers` version is now `>=0.38.0`. (#3349)
- This is a release candidate. Please validate model-specific serving paths, hardware backends, quantization settings, and deployment configs before using it as a production baseline.
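For existing custom configs, the key rename is mechanical. A small hypothetical migration helper, assuming configs are handled as plain dicts:

```python
def migrate_attention_config_key(config: dict) -> dict:
    # Rename the deprecated attention_config key to
    # diffusion_attention_config (#3489). Returns a new dict; configs that
    # already use the new key pass through unchanged, and an existing new
    # key is never overwritten by the old one.
    migrated = {k: v for k, v in config.items() if k != "attention_config"}
    if "attention_config" in config and "diffusion_attention_config" not in config:
        migrated["diffusion_attention_config"] = config["attention_config"]
    return migrated
```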
What's Changed
- [chore] Update command to download dataset from huggingface-cli to hf by @Gaohan123 in #3403
- [Refactor] Replace and ban a few torch.cuda functions in favor of torch.accelerator replacements. by @NickCao in #3365
- [Clean] Remove multi-replica Bagel CI and related docs/configs by @fake0fan in #3407
- Update CODEOWNERS feature reviewers by @david6666666 in #3378
- [Test] Unify L2/L3 test layout, Buildkite steps, and test helpers by @yenuo26 in #2556
- [Hardware] Extend diffusion engine plugin extensibility for out-of-tree hardware backends by @yuchenjiangyj in #3239
- [Feat] support hsdp for Bagel by @lsyyysky in #3150
- [Bugfix] Fix the issue where the seed parameter does not take effect when using the OpenAI Python client by @Phi-C in #3436
- [Bugfix] Fix Dtype Crashes in SD3 by @alex-jw-brooks in #2526
- [Feature][Hunyuan image 3.0] AR + DIT with kv reuse. by @Bounty-hunter in #3346
- [Test][HunyuanImage3] Add e2e offline I2T smoke test by @TaffyOfficial in #3332
- [BugFix]Fix default stage config path in voxcpm2 by @sphinxkkkbc in #3447
- [Feat] Add Sequence Parallelism (USP) support for HunyuanVideo 1.5 transformer by @daixinning in #2444
- [Feature] online HunyuanImage-3.0 IT2I (image editing) support by @skf-1999 in #3410
- enhancement: extend to dmd2 to image generation + add flux, qwen image pipelines by @ayushag-nv in #2974
- [Refactor] Rename SupportsModuleOffload to SupportsComponentDiscovery by @NickCao in #3354
- Add Qwen3 TTS Model recipe by @chzhang2021 in #3130
- [Bugfix][StableAudio] Pass model_class_name to Omni() and declare audio class attrs by @linyueqian in #3405
- [Bugfix] Qwen-Image use teachche serve will crash by @lengrongfu in #3450
- [Perf] Optimize VoxCPM2 first-request latency via startup warmup by @Dan250124 in #3424
- [Bugfix] fix OmniGen2 offload and dtype mismatch by @RuixiangMa in #2560
- [Feature] Add FP8 quantization for Voxtral TTS by @akshatvishu in #3036
- Fix NPU code predictor device mismatch in concurrent mode by @Wallbreazzz in #3453
- [Test] Restore tts mark and omni_runner_function fixture for Voxtral TTS by @linyueqian in #3462
- [CI] Update merge condition to skip L3 merges during weekly test and update doc by @zhumingjue138 in #3197
- [CI] Refine nightly pytest command in Omni · Function Test with H100 to avoid duplicate testing. by @yenuo26 in #3459
- (Phase 1)Add ModelOpt FP8 auto-detect support for diffusion checkpoints #2709 by @baonudesifeizhai in #2913
- [CI][Nightly] Shard nightly Diffusion X2I H100 lanes and centralize shard definitions by @wuhang2014 in #3455
- [CI] Remove VLLM_TEST_CLEAN_GPU_MEMORY to avoid environment variable pollution that causes unnecessary GPU detection, thereby slowing down test case execution. by @yenuo26 in #3446
- [Diffusion][Attention] Support per-role attention backend via CLI by @gcanlin in #2681
- [Feature] hunyuanimage support flash attn by @Bounty-hunter in #2981
- [Perf] Fix Qwen3-TTS latency regression by @Sy0307 in #3485
- [ROCm] [CI] Add the same skip ci logic as CUDA CI by @tjtanaa in #3482
- [Docs] Refactor the attention backend docs/skill by @gcanlin in #3475
- [Performance] Improve MiMo-Audio tokenizer decoding performance by @qibaoyuan in #2183
- [BugFix] Rename attention_config to diffusion_attention_config by @gcanlin in #3489
- [Bug][Hunyuanimage 3.0] fix different AR encode behavior between online and offline by @Bounty-hunter in #3500
- [Misc] Clean logs for image gen task by @wuhang2014 in #3414
- [CI] skip failing diffusion and accuracy cases (#3432, #3256, #3257, #3488) by @yenuo26 in #3507
- [New Model]: Add sensenova u1 support by @princepride in #3319
- [Config] Add HunyuanImage3 deploy configs by @Fishermanykx in #3172
- [Fix] Fix RMSNorm inductor KeyError under HSDP + torch.compile by @LJH-LBJ in #3460
- [Perf] Remove dead audio_tower and visual from Qwen3-Omni talker stage by @NickCao in #3296
- [bugfix][ci] avoid Whisper transcript deduplication in realtime audio test by @Shirley125 in #3417
- [Chore] explicit .float() conversion in Helios's optimized_scale function by @RuixiangMa in #3529
- [CI][Bugfix] Improve e2e latency logging, update response classes to include detailed latency documentation and add startup time logging by @yenuo26 in #3246
- [Recipes]update Wan2.2-I2V gpu part by @bjf-frz in #3271
- [BugFix] Modify the splicing method of streaming audio output. by @amy-why-3459 in #3438
- [Bugfix] Align the AR and DiT prompt formatting across both online and offline modes. by @Bounty-hunter in #3516
- [FIX] Ensure `extra_params` are correctly merged into sampling params in `_create_diffusion_speech()` by @saadaltohamy in #3320
- [Nightly CI] Remove TP case by @NumberWan in #3534
- [Refactor] msgspec standardisation for data entry key names and improved type checks by @divyanshsinghvi in #3149
- [New Model] Add support for tencent/Covo-Audio-Chat by @Dnoob in #2293
- [bugfix, rl] Fix race condition bug on async running for diffusion model by @knlnguyen1802 in #3379
- [CI] update daily omni min accuracy by @R2-Y in #3536
- [Perf] Remove dead audio_tower and visual from Qwen2.5-Omni talker stage by @NickCao in #3425
- [Bugfix] Fix the issue where the qwen3-omni model long-term stability test sometimes gets stuck without sending requests. by @zhumingjue138 in #3468
- [Bugfix] Fix omni processing test for non-multimodal talker stage by @NickCao in #3559
- Bump diffusers minimum version to >=0.38.0 by @oglok in #3349
- support online FP8 quantization for FA on NPU #2236 by @lyj-jjj in #2640
- [CI][Test] Add NPU nightly tests by @gcanlin in #3480
- [CI][Bugfix] skip fp8 Z-Image quality gate (#3531) and add torchdiffeq dev extra by @yenuo26 in #3563
- [Bugfix, rl] Diffusion worker SIGKILL under Ray actor (exitcode -9) by @knlnguyen1802 in #3533
- Fix: NPU AR model runner prefix cache key flattening by @weizhoublue in #3568
- [NPU][Quant] Add W8A8 MXFP8 online/offline quantization support for Wan2.2 T2V / I2V / TI2V inference on Ascend NPU by @hxhhhlalala in #3140
- [skip ci][Tests] Splitting Qwen3-omni's performance test cases by @amy-why-3459 in #3501
- [ROCm] Bugfix wan22 by @tjtanaa in #3463
- [Bugfix] Add bot_task option of think_recaption for hunyuanimage3 it2i by @zengchuang-hw in #3551
- [Feat][Config] Support additional_config for diffusion worker by @Fishermanykx in #3020
- [Bugfix][HunyuanImage3.0] Fix KV reuse compatibility in SP scenarios by @Bounty-hunter in #3546
- [Model] Add TP-aware MistralEncoder for FLUX.2-dev TP by @vraiti in #2465
- [BugFix] Refresh TeaCache when num_inference_steps=None by @alex-jw-brooks in #2240
- [Test] Add stability tests for HunyuanImage-3-Instruct by @zhumingjue138 in #3504
- [Bugfix]: Fix online serving failure when using deploy config by @Fishermanykx in #3537
- [Entrypoint][Refactor] Make field type hint more concrete by @wuhang2014 in #3139
- [CI] Harden Qwen3-TTS perf nightly: enable Base voice_clone, add c=64/128, 2-GPU split by @linyueqian in #3491
- [Feature] HunyuanImage-3.0 IT2I: multi-image input + prompt API cleanup by @TaffyOfficial in #3444
- update v0.20.0 readme by @hsliuustc0106 in #3594
- [Bugfix]Allow HunyuanImage3 AR sampler batching by @bjf-frz in #3590
- [BugFix] fix shm connector by @Bounty-hunter in #3583
- [CI] Add Qwen3-TTS tests for ready tag by @gcanlin in #3600
- Update WeChat group QR code by @david6666666 in #3624
- [BugFix] fix(omni): isolate diffusion KV-cache dtype from vLLM --kv-cache-dtype #3585 by @lyj-jjj in #3596
- Update streaming_speech_client.py to solve Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice voice problem by @keeper-jie in #3380
- [CI] add cuda marker to Diffusion X2V function pytest by @yenuo26 in #3625
- [Bugfix] UnspecifiedOmniPlatform.get_device_count returns 0 by @princepride in #3636
- [2/5] [core]refactor communication layer: PR 2 of 5 Qwen3 Omni non async by @natureofnature in #2677
- [Bugfix]Fix multimodal cache routing for AR replicas by @bjf-frz in #3605
- [BugFix] Fix the issue of thinker requests being preempted, causing shape mismatch. by @amy-why-3459 in #3147
- [Bugfix] fix compatibility of _hunyuan_image3_unpack_packed_topk between vllm / vllm ascend by @Fishermanykx in #3640
- [bugfix] Fix diffusers backend input bug after #2913 by @fhfuih in #3644
- [BugFix] fix ci by @amy-why-3459 in #3650
- [CI] Replace c=128 perf cell with c=16; loosen new-cell baselines by @linyueqian in #3637
- [Rebase] Rebase to vllm v0.21.0 by @tzhouam in #3530
New Contributors
- @yuchenjiangyj made their first contribution in #3239
- @Phi-C made their first contribution in #3436
- @chzhang2021 made their first contribution in #3130
- @Wallbreazzz made their first contribution in #3453
- @baonudesifeizhai made their first contribution in #2913
- @saadaltohamy made their first contribution in #3320
- @weizhoublue made their first contribution in #3568
- @hxhhhlalala made their first contribution in #3140
- @zengchuang-hw made their first contribution in #3551
- @keeper-jie made their first contribution in #3380
Full Changelog: v0.20.0...v0.21.0rc1