Highlights
This release features 94 commits from 72 contributors, including 10 new contributors.
vLLM-Omni v0.21.0rc1 is a release candidate aligned with upstream vLLM v0.21.0. It focuses on validating the next production cut by expanding diffusion, image/video, speech, and omni-model coverage; improving distributed execution and hardware backend readiness; and tightening serving stability across OpenAI-compatible APIs, deploy configs, and long-running workloads.
This release candidate is intended to validate the v0.21.0 integration, HunyuanImage-3.0 feature set, Qwen3-TTS stability, diffusion quantization, and NPU/ROCm platform coverage before the final release.
Key Improvements
- Aligned vLLM-Omni with upstream vLLM v0.21.0, refreshing the base runtime for the v0.21 release line. (#3530)
- Expanded image, video, and diffusion generation capabilities, including HunyuanImage-3.0 AR + DiT KV reuse, online IT2I image editing, multi-image input, FLUX/Qwen image pipelines, DMD2 image generation, and FLUX.2-dev TP support. (#3346, #3410, #3444, #2974, #2465)
- Improved diffusion parallelism and backend configurability, with HunyuanVideo 1.5 USP sequence parallelism, Bagel HSDP support, per-role attention backend selection, and additional diffusion worker configuration. (#2444, #3150, #2681, #3020)
- Strengthened speech and omni serving, including Qwen3-TTS recipes and tests, Voxtral TTS FP8 quantization, Covo-Audio-Chat support, MiMo-Audio tokenizer decoding improvements, and Qwen2.5/Qwen3-Omni talker-stage cleanup. (#3130, #3036, #2293, #2183, #3296, #3425)
- Expanded quantization and hardware coverage, including ModelOpt FP8 auto-detection for diffusion checkpoints, NPU FP8/FA support, Wan2.2 W8A8 MXFP8 on Ascend NPU, ROCm fixes, and new NPU nightly coverage. (#2913, #2640, #3140, #3463, #3480)
- Improved production reliability, fixing seed handling through the OpenAI Python client, HunyuanImage-3.0 online/offline prompt and AR behavior mismatches, diffusion worker Ray shutdown failures, streaming audio splicing, deploy-config serving failures, and AR replica cache routing. (#3436, #3500, #3516, #3533, #3438, #3537, #3605)
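The seed fix (#3436) matters for reproducible generation through the OpenAI-compatible API. A minimal stdlib sketch of the request body an OpenAI Python client would POST to `/v1/chat/completions`; the helper name and model name are placeholders for illustration:

```python
import json

def chat_request_body(prompt: str, seed: int,
                      model: str = "Qwen/Qwen3-Omni") -> str:
    # Hypothetical helper: builds the JSON body an OpenAI-compatible client
    # sends. The top-level "seed" field is the parameter that #3436 fixed
    # to actually take effect server-side.
    body = {
        "model": model,  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "seed": seed,
    }
    return json.dumps(body)
```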
Core Architecture & Runtime
- Reworked runtime abstractions for broader backend compatibility, including replacing selected `torch.cuda` usage with `torch.accelerator`, renaming module-offload discovery interfaces, standardizing msgspec data-entry keys, and continuing the Qwen3-Omni communication-layer refactor. (#3365, #3354, #3149, #2677)
- Improved diffusion engine extensibility for out-of-tree hardware backends and added more concrete entrypoint typing for serving integration work. (#3239, #3139)
- Fixed runtime correctness issues around HSDP + `torch.compile` RMSNorm, OmniGen2 offload/dtype mismatch, Helios optimized-scale casting, and platform device-count detection. (#3460, #2560, #3529, #3636)
Model Support
- Added or expanded support for Sensenova U1, Tencent Covo-Audio-Chat, Qwen3-TTS recipes, HunyuanImage-3.0 deploy configs, FLUX.2-dev TP-aware MistralEncoder, and additional FLUX/Qwen image pipelines. (#3319, #2293, #3130, #3172, #2465, #2974)
- Improved model-family coverage for Bagel, VoxCPM2, Wan2.2 I2V, HunyuanVideo 1.5, HunyuanImage-3.0, Qwen-Image, Qwen2.5-Omni, and Qwen3-Omni through targeted fixes and recipe updates. (#3150, #3424, #3271, #2444, #3346, #3450, #3425, #3296)
Audio, Speech & Omni Production Optimization
- Added Qwen3-TTS model recipes and hardened Qwen3-TTS performance/nightly coverage, including Base voice-clone testing, additional concurrency cells, and ready-tag tests. (#3130, #3491, #3600, #3637)
- Improved Qwen3-TTS latency and streaming behavior, including a latency-regression fix, custom-voice streaming client update, and streaming audio output splicing fix. (#3485, #3380, #3438)
- Improved MiMo-Audio tokenizer decoding performance and added Voxtral TTS FP8 quantization support. (#2183, #3036)
- Cleaned dead audio/visual components from Qwen2.5-Omni and Qwen3-Omni talker stages, and fixed omni processing tests for non-multimodal talker stages. (#3425, #3296, #3559)
Diffusion, Image & Video Generation
- Added major HunyuanImage-3.0 capabilities, including AR + DiT KV reuse, online IT2I image editing, multi-image input, deploy configs, AR sampler batching, and stability tests for HunyuanImage-3-Instruct. (#3346, #3410, #3444, #3172, #3590, #3504)
- Improved HunyuanImage-3.0 correctness across online/offline paths by aligning AR and DiT prompt formatting, fixing AR encode differences, adding `think_recaption` bot-task support, and fixing KV reuse compatibility under sequence parallelism. (#3516, #3500, #3551, #3546)
- Expanded diffusion and video execution with USP support for HunyuanVideo 1.5, Bagel HSDP support, DMD2 image generation, FLUX/Qwen image pipelines, and Wan2.2 I2V recipe updates. (#2444, #3150, #2974, #3271)
- Improved diffusion serving robustness with fixes for SD3 dtype crashes, OmniGen2 offload/dtype mismatch, TeaCache refresh behavior, diffusion worker Ray SIGKILL, diffusers backend input handling, shared-memory connector issues, and diffusion KV-cache dtype isolation. (#2526, #2560, #2240, #3533, #3644, #3583, #3596)
Quantization & Memory Efficiency
- Added ModelOpt FP8 auto-detection for diffusion checkpoints and bumped the minimum `diffusers` dependency to `>=0.38.0`. (#2913, #3349)
- Added Voxtral TTS FP8 quantization and expanded NPU quantization with online FP8 for FA plus W8A8 MXFP8 online/offline quantization for Wan2.2 T2V/I2V/TI2V on Ascend NPU. (#3036, #2640, #3140)
- Fixed diffusion KV-cache dtype isolation so diffusion cache behavior is not incorrectly coupled to vLLM's `--kv-cache-dtype`. (#3596)
RL, Serving & Integrations
- Fixed OpenAI Python client seed handling and ensured `extra_params` are merged correctly into diffusion speech sampling parameters. (#3436, #3320)
- Improved deploy-config based online serving and added `additional_config` support for diffusion workers. (#3537, #3020)
- Fixed multimodal cache routing for AR replicas, thinker preemption shape mismatches, and async diffusion race conditions. (#3605, #3147, #3379)
- Improved VoxCPM2 first-request latency through startup warmup and fixed the default stage config path. (#3424, #3447)
Platforms, Distributed Execution & Hardware Coverage
- Extended diffusion engine plugin extensibility for out-of-tree hardware backends and improved NPU support with code-predictor device mismatch fixes, AR prefix-cache key flattening, and NPU nightly tests. (#3239, #3453, #3568, #3480)
- Added NPU quantization coverage for Wan2.2 and FA, and improved ROCm parity with CUDA CI skip logic plus Wan2.2 ROCm fixes. (#3140, #2640, #3482, #3463)
- Improved distributed execution support with HunyuanVideo 1.5 USP, Bagel HSDP, FLUX.2-dev TP, and HSDP + `torch.compile` correctness fixes. (#2444, #3150, #2465, #3460)
CI, Benchmarks & Documentation
- Unified L2/L3 test layout, Buildkite steps, and test helpers; refined nightly pytest execution; and improved e2e latency/startup logging. (#2556, #3459, #3246)
- Expanded nightly and stability coverage for HunyuanImage-3, Diffusion X2I/X2V, Qwen3-TTS, Qwen3-Omni, and NPU backends. (#3455, #3504, #3625, #3600, #3501, #3480)
- Refactored attention-backend documentation and skill content, updated release/readme materials, refreshed community QR code, and updated CODEOWNERS feature reviewers. (#3475, #3594, #3624, #3378)
- Cleaned obsolete or noisy CI paths, including multi-replica Bagel CI removal, duplicate H100 testing reduction, weekly-test merge-condition updates, and environment-variable cleanup that avoided unnecessary GPU detection. (#3407, #3459, #3197, #3446)
Note
- `attention_config` has been renamed to `diffusion_attention_config`. Users with custom configs, deploy configs, or scripts that reference the old key should update them accordingly. (#3489)
- The minimum supported `diffusers` version is now `>=0.38.0`. (#3349)
- This is a release candidate. Please validate model-specific serving paths, hardware backends, quantization settings, and deployment configs before using it as a production baseline.
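For existing custom configs, the key rename is mechanical. A small hypothetical migration helper, assuming configs are handled as plain dicts:

```python
def migrate_attention_config_key(config: dict) -> dict:
    # Rename the deprecated attention_config key to
    # diffusion_attention_config (#3489). Returns a new dict; configs that
    # already use the new key pass through unchanged, and an existing new
    # key is never overwritten by the old one.
    migrated = {k: v for k, v in config.items() if k != "attention_config"}
    if "attention_config" in config and "diffusion_attention_config" not in config:
        migrated["diffusion_attention_config"] = config["attention_config"]
    return migrated
```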
What's Changed
- [chore] Update command to download dataset from huggingface-cli to hf by @Gaohan123 in #3403
- [Refactor] Replace and ban a few torch.cuda functions in favor of torch.accelerator replacements. by @NickCao in #3365
- [Clean] Remove multi-replica Bagel CI and related docs/configs by @fake0fan in #3407
- Update CODEOWNERS feature reviewers by @david6666666 in #3378
- [Test] Unify L2/L3 test layout, Buildkite steps, and test helpers by @yenuo26 in #2556
- [Hardware] Extend diffusion engine plugin extensibility for out-of-tree hardware backends by @yuchenjiangyj in #3239
- [Feat] support hsdp for Bagel by @lsyyysky in #3150
- [Bugfix] Fix the issue where the seed parameter does not take effect when using the OpenAI Python client by @Phi-C in #3436
- [Bugfix] Fix Dtype Crashes in SD3 by @alex-jw-brooks in #2526
- [Feature][Hunyuan image 3.0] AR + DIT with kv reuse. by @Bounty-hunter in #3346
- [Test][HunyuanImage3] Add e2e offline I2T smoke test by @TaffyOfficial in #3332
- [BugFix]Fix default stage config path in voxcpm2 by @sphinxkkkbc in #3447
- [Feat] Add Sequence Parallelism (USP) support for HunyuanVideo 1.5 transformer by @daixinning in #2444
- [Feature] online HunyuanImage-3.0 IT2I (image editing) support by @skf-1999 in #3410
- enhancement: extend to dmd2 to image generation + add flux, qwen image pipelines by @ayushag-nv in #2974
- [Refactor] Rename SupportsModuleOffload to SupportsComponentDiscovery by @NickCao in #3354
- Add Qwen3 TTS Model recipe by @chzhang2021 in #3130
- [Bugfix][StableAudio] Pass model_class_name to Omni() and declare audio class attrs by @linyueqian in #3405
- [Bugfix] Qwen-Image use teachche serve will crash by @lengrongfu in #3450
- [Perf] Optimize VoxCPM2 first-request latency via startup warmup by @Dan250124 in #3424
- [Bugfix] fix OmniGen2 offload and dtype mismatch by @RuixiangMa in #2560
- [Feature] Add FP8 quantization for Voxtral TTS by @akshatvishu in #3036
- Fix NPU code predictor device mismatch in concurrent mode by @Wallbreazzz in #3453
- [Test] Restore tts mark and omni_runner_function fixture for Voxtral TTS by @linyueqian in #3462
- [CI] Update merge condition to skip L3 merges during weekly test and update doc by @zhumingjue138 in #3197
- [CI] Refine nightly pytest command in Omni · Function Test with H100 to avoid duplicate testing. by @yenuo26 in #3459
- (Phase 1)Add ModelOpt FP8 auto-detect support for diffusion checkpoints #2709 by @baonudesifeizhai in #2913
- [CI][Nightly] Shard nightly Diffusion X2I H100 lanes and centralize shard definitions by @wuhang2014 in #3455
- [CI] Remove VLLM_TEST_CLEAN_GPU_MEMORY to avoid environment variable pollution that causes unnecessary GPU detection, thereby slowing down test case execution. by @yenuo26 in #3446
- [Diffusion][Attention] Support per-role attention backend via CLI by @gcanlin in #2681
- [Feature] hunyuanimage support flash attn by @Bounty-hunter in #2981
- [Perf] Fix Qwen3-TTS latency regression by @Sy0307 in #3485
- [ROCm] [CI] Add the same skip ci logic as CUDA CI by @tjtanaa in #3482
- [Docs] Refactor the attention backend docs/skill by @gcanlin in #3475
- [Performance] Improve MiMo-Audio tokenizer decoding performance by @qibaoyuan in #2183
- [BugFix] Rename attention_config to diffusion_attention_config by @gcanlin in #3489
- [Bug][Hunyuanimage 3.0] fix different AR encode behavior between online and offline by @Bounty-hunter in #3500
- [Misc] Clean logs for image gen task by @wuhang2014 in #3414
- [CI] skip failing diffusion and accuracy cases (#3432, #3256, #3257, #3488) by @yenuo26 in #3507
- [New Model]: Add sensenova u1 support by @princepride in #3319
- [Config] Add HunyuanImage3 deploy configs by @Fishermanykx in #3172
- [Fix] Fix RMSNorm inductor KeyError under HSDP + torch.compile by @LJH-LBJ in #3460
- [Perf] Remove dead audio_tower and visual from Qwen3-Omni talker stage by @NickCao in #3296
- [bugfix][ci] avoid Whisper transcript deduplication in realtime audio test by @Shirley125 in #3417
- [Chore] explicit .float() conversion in Helios's optimized_scale function by @RuixiangMa in #3529
- [CI][Bugfix] Improve e2e latency logging, update response classes to include detailed latency documentation and add startup time logging by @yenuo26 in #3246
- [Recipes]update Wan2.2-I2V gpu part by @bjf-frz in #3271
- [BugFix] Modify the splicing method of streaming audio output. by @amy-why-3459 in #3438
- [Bugfix] Align the AR and DiT prompt formatting across both online and offline modes. by @Bounty-hunter in #3516
- [FIX] Ensure `extra_params` are correctly merged into sampling params in `_create_diffusion_speech()` by @saadaltohamy in #3320
- [Nightly CI] Remove TP case by @NumberWan in #3534
- [Refactor] msgspec standardisation for data entry key names and improved type checks by @divyanshsinghvi in #3149
- [New Model] Add support for tencent/Covo-Audio-Chat by @Dnoob in #2293
- [bugfix, rl] Fix race condition bug on async running for diffusion model by @knlnguyen1802 in #3379
- [CI] update daily omni min accuracy by @R2-Y in #3536
- [Perf] Remove dead audio_tower and visual from Qwen2.5-Omni talker stage by @NickCao in #3425
- [Bugfix] Fix the issue where the qwen3-omni model long-term stability test sometimes gets stuck without sending requests. by @zhumingjue138 in #3468
- [Bugfix] Fix omni processing test for non-multimodal talker stage by @NickCao in #3559
- Bump diffusers minimum version to >=0.38.0 by @oglok in #3349
- support online FP8 quantization for FA on NPU #2236 by @lyj-jjj in #2640
- [CI][Test] Add NPU nightly tests by @gcanlin in #3480
- [CI][Bugfix] skip fp8 Z-Image quality gate (#3531) and add torchdiffeq dev extra by @yenuo26 in #3563
- [Bugfix, rl] Diffusion worker SIGKILL under Ray actor (exitcode -9) by @knlnguyen1802 in #3533
- Fix: NPU AR model runner prefix cache key flattening by @weizhoublue in #3568
- [NPU][Quant] Add W8A8 MXFP8 online/offline quantization support for Wan2.2 T2V / I2V / TI2V inference on Ascend NPU by @hxhhhlalala in #3140
- [skip ci][Tests] Splitting Qwen3-omni's performance test cases by @amy-why-3459 in #3501
- [ROCm] Bugfix wan22 by @tjtanaa in #3463
- [Bugfix] Add bot_task option of think_recaption for hunyuanimage3 it2i by @zengchuang-hw in #3551
- [Feat][Config] Support additional_config for diffusion worker by @Fishermanykx in #3020
- [Bugfix][HunyuanImage3.0] Fix KV reuse compatibility in SP scenarios by @Bounty-hunter in #3546
- [Model] Add TP-aware MistralEncoder for FLUX.2-dev TP by @vraiti in #2465
- [BugFix] Refresh TeaCache when num_inference_steps=None by @alex-jw-brooks in #2240
- [Test] Add stability tests for HunyuanImage-3-Instruct by @zhumingjue138 in #3504
- [Bugfix]: Fix online serving failure when using deploy config by @Fishermanykx in #3537
- [Entrypoint][Refactor] Make field type hint more concrete by @wuhang2014 in #3139
- [CI] Harden Qwen3-TTS perf nightly: enable Base voice_clone, add c=64/128, 2-GPU split by @linyueqian in #3491
- [Feature] HunyuanImage-3.0 IT2I: multi-image input + prompt API cleanup by @TaffyOfficial in #3444
- update v0.20.0 readme by @hsliuustc0106 in #3594
- [Bugfix]Allow HunyuanImage3 AR sampler batching by @bjf-frz in #3590
- [BugFix] fix shm connector by @Bounty-hunter in #3583
- [CI] Add Qwen3-TTS tests for ready tag by @gcanlin in #3600
- Update WeChat group QR code by @david6666666 in #3624
- [BugFix] fix(omni): isolate diffusion KV-cache dtype from vLLM --kv-cache-dtype #3585 by @lyj-jjj in #3596
- Update streaming_speech_client.py to solve Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice voice problem by @keeper-jie in #3380
- [CI] add cuda marker to Diffusion X2V function pytest by @yenuo26 in #3625
- [Bugfix] UnspecifiedOmniPlatform.get_device_count returns 0 by @princepride in #3636
- [2/5] [core]refactor communication layer: PR 2 of 5 Qwen3 Omni non async by @natureofnature in #2677
- [Bugfix]Fix multimodal cache routing for AR replicas by @bjf-frz in #3605
- [BugFix] Fix the issue of thinker requests being preempted, causing shape mismatch. by @amy-why-3459 in #3147
- [Bugfix] fix compatibility of _hunyuan_image3_unpack_packed_topk between vllm / vllm ascend by @Fishermanykx in #3640
- [bugfix] Fix diffusers backend input bug after #2913 by @fhfuih in #3644
- [BugFix] fix ci by @amy-why-3459 in #3650
- [CI] Replace c=128 perf cell with c=16; loosen new-cell baselines by @linyueqian in #3637
- [Rebase] Rebase to vllm v0.21.0 by @tzhouam in #3530
New Contributors
- @yuchenjiangyj made their first contribution in #3239
- @Phi-C made their first contribution in #3436
- @chzhang2021 made their first contribution in #3130
- @Wallbreazzz made their first contribution in #3453
- @baonudesifeizhai made their first contribution in #2913
- @saadaltohamy made their first contribution in #3320
- @weizhoublue made their first contribution in #3568
- @hxhhhlalala made their first contribution in #3140
- @zengchuang-hw made their first contribution in #3551
- @keeper-jie made their first contribution in #3380
Full Changelog: v0.20.0...v0.21.0rc1