github vllm-project/vllm-omni v0.22.0rc1

pre-release, 6 hours ago

Highlights

This release candidate features 179 commits from 88 contributors, including 28 new contributors.

vLLM-Omni v0.22.0rc1 is a broad release candidate focused on aligning with the vLLM 0.22 release line, expanding speech and diffusion model coverage, and improving production serving for multistage omni workloads. It strengthens the runtime around stage orchestration, async audio and streaming paths, diffusion caching, quantization, and multi-backend deployment. This release candidate is intended to validate the vLLM 0.22 rebase and the new model/runtime coverage before the final cut.

Key Improvements

  • Aligned with the vLLM 0.22 release line, including the main rebase and release pipeline improvements for image builds and PyPI publishing. (#3891, #3428, #3667)
  • Expanded speech and omni model coverage, adding GLM-TTS, Higgs Audio v2, Qwen3-Omni Thinker LoRA support for RL training, voice-clone serving for OmniVoice, and new deployment recipes for Fish Speech S2 Pro and Qwen Image Edit. (#3141, #3762, #3915, #3668, #3323, #3684)
  • Broadened image and video generation support, with HiDream-I1-Full, Ming-flash-omni-2.0 image generation, improved Qwen/Hunyuan/GLM/BAGEL paths, and stronger support for HunyuanVideo 1.5. (#2572, #2875, #3933, #3728, #3979)
  • Improved diffusion acceleration and parallel execution, including Wan 2.2 pipeline parallelism, HunyuanImage3 VAE parallelism, step-wise LoRA, CacheDiT coverage, prompt-embedding cache, and MagCache. (#2322, #3091, #3639, #3470, #3265, #3906, #2962, #1287)
  • Made TTS serving more production-ready, with Qwen3-TTS high-concurrency optimization, precomputed custom voices, ref-context caching, Code2Wav batching and compatibility fixes, Fish Speech S2 Pro serving improvements, and OmniVoice CUDA Graph/Triton acceleration. (#3662, #3492, #3322, #3880, #3932, #3773, #3336)
  • Expanded quantization and hardware coverage, including W4A16, online FP8/INT8, MXFP4, MXFP8, ModelOpt mixed FP8/NVFP4, Blackwell diffusion attention backends, ROCm AITER support, Intel XPU coverage, and Ascend NPU improvements. (#3353, #3059, #3700, #3902, #3578, #3570, #3782, #3079, #3015, #3419, #3511, #2325)

Core Architecture & Runtime

  • Integrated OmniCoordinator into the stage engine pipeline and improved async audio/chunk request handling, including correct completion behavior without pad-token injection, audio streaming input for async chunks, and request-id aliasing fixes. (#3569, #3614, #3613, #3953)
  • Hardened diffusion and multistage lifecycle behavior with dead-worker detection, cleanup fixes, subprocess exit handling, SIGINT cleanup for NCCL/ZMQ resources, and safer master-port selection for parallel launches. (#3214, #3494, #3751, #3872, #3803)
  • Improved scheduling and cache correctness across Qwen3-Omni, prefix caching, token history, offline/online alignment, and distributed stage-0 multimodal cache routing. (#3681, #3665, #3506, #3740, #3885)
  • Refined configuration and stage startup behavior, including recursive engine-arg merging, deploy-config field allowlisting, and migration of Ming-flash-omni image-generation deploy configs. (#3009, #3483, #3975)
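
As a rough illustration of what recursive engine-arg merging and field allowlisting mean in general, the sketch below merges a nested override into a base config and rejects unknown top-level fields. It is not the code from #3009 or #3483: the function names, the `ALLOWED_FIELDS` set, and the example keys are all hypothetical.

```python
from copy import deepcopy
from typing import Any, Mapping

# Hypothetical allowlist; the real deploy-config fields are not listed in these notes.
ALLOWED_FIELDS = {"engine_args", "runtime"}


def merge_recursive(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
    """Return a new dict in which nested mappings from `override` are merged into `base`."""
    merged = deepcopy(dict(base))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            # Both sides are mappings: descend instead of replacing the whole subtree.
            merged[key] = merge_recursive(merged[key], value)
        else:
            # Scalars, lists, and type mismatches: the override wins outright.
            merged[key] = deepcopy(value)
    return merged


def check_allowlist(config: Mapping[str, Any], allowed: set[str] = ALLOWED_FIELDS) -> None:
    """Raise ValueError if `config` has a top-level field outside the allowlist."""
    unknown = sorted(set(config) - allowed)
    if unknown:
        raise ValueError(f"unknown deploy-config fields: {unknown}")
```

With a shallow merge, overriding one nested key would discard its siblings; the recursive form keeps them, which is the usual motivation for this kind of change.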

Model Support

  • Added GLM-TTS and Higgs Audio v2 support with offline/online serving examples, deploy configs, tests, and client/demo coverage. (#3141, #3762)
  • Added HiDream-I1-Full and Ming-flash-omni-2.0 image-generation support, plus recipes and deploy guidance for Qwen Image Edit, GLM-Image, Helios, Fish Speech S2 Pro, and Voxtral TTS. (#2572, #2875, #3684, #2950, #3114, #3323, #3498)
  • Added Qwen3-Omni Thinker LoRA support for RL training and improved long-output correctness, streaming helpers, and torch.compile accuracy behavior. (#3915, #3539, #3885)
  • Improved existing model paths across Qwen-Image, Qwen-Image-Edit, BAGEL, HunyuanImage3, Ovis image, SenseNova U1, LTX-2.3, MiMo-Audio, and Ming-flash-omni. (#3608, #3219, #3933, #3728, #3857, #3876, #3691, #3854, #3686, #3975)

Audio, Speech & Omni Production Optimization

  • Optimized Qwen3-TTS for high-concurrency serving with precomputed custom voices, ref-context cache, cross-request Code2Wav batching, persistent prompt-embedding helpers, reduced CUDA Graph buckets, and compatibility fixes for newer transformers versions. (#3662, #3492, #3322, #3992, #3932, #3880)
  • Improved speech serving correctness and streaming behavior, including speech-endpoint finish reasons, async chunk continuity metrics, uploaded-voice handling, short Code2Wav chunk handling, and prompt-length estimation for Qwen3-TTS reference codes. (#2849, #3618, #3523, #3687, #3940)
  • Improved Fish Speech S2 Pro and OmniVoice production paths with high-concurrency serving, Triton kernel fusion, CUDA Graph acceleration, voice clone support, reproducible seed support, and removal of hardcoded default voice assumptions in examples. (#3773, #3336, #3668, #3829)
  • Stabilized MiMo-Audio and shared TTS components, including voice instability fixes, batching follow-ups, common activation refactors, bf16/fp16 Triton fixes, and reusable talker/model runner paths across GPU and NPU. (#3686, #3817, #3886, #3472, #3476)

Diffusion, Image & Video Generation

  • Added and expanded diffusion parallel execution with Wan 2.2 pipeline parallelism, HunyuanImage3 VAE parallelism, LTX-2.3 CFG parallel support, and HunyuanVideo 1.5 USP plus VAE patch parallel support. (#2322, #3091, #3905, #3979)
  • Expanded diffusion acceleration with CacheDiT for Helios, DreamID-Omni, SenseNova U1, and LTX-2, prompt-embedding caching, MagCache, and step-wise LoRA support. (#3470, #3265, #3906, #3621, #2962, #1287, #3639)
  • Improved image/video generation correctness and performance across HunyuanImage3, HunyuanVideo, Qwen-Image, Qwen-Image-Edit, BAGEL, Flux2 Klein, GLM-Image, LTX-2.3, SenseNova U1, and Ovis. (#3630, #3694, #3768, #3857, #3844, #3219, #3933, #3680, #3711, #3717, #3059, #3854, #3691, #3876)
  • Improved diffusion serving and benchmark behavior by routing image edit workloads to the edits endpoint, renaming the diffusion benchmark backend to endpoint, adding output comparison tools, and strengthening diffusion performance optimization quality gates. (#3693, #3137, #3175, #3851)
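
The endpoint routing mentioned above can be pictured with a small sketch. The task labels `i2i` and `ti2i` and the `/v1/images/edits` path come from the changelog entry for #3693; the `t2i` label, the `/v1/images/generations` path (taken from the OpenAI Images API), and the function name are assumptions rather than the benchmark's actual code.

```python
# Tasks that carry an input image and therefore go to the edits endpoint.
EDIT_TASKS = {"i2i", "ti2i"}


def endpoint_for_task(task: str) -> str:
    """Pick an OpenAI-style image endpoint path for a benchmark task label."""
    normalized = task.lower()
    if normalized in EDIT_TASKS:
        return "/v1/images/edits"
    if normalized == "t2i":
        # Assumed default for text-to-image; not stated in the release notes.
        return "/v1/images/generations"
    raise ValueError(f"unsupported task: {task!r}")
```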

Quantization & Memory Efficiency

  • Added broader diffusion quantization support, including Wan2.2 W4A16, GLM-Image W4A16, LTX-2 online FP8/INT8, DreamID-Omni online FP8/INT8, NPU MXFP4 online/offline quantization, XPU MXFP8, and ModelOpt mixed FP8/NVFP4 for image generation. (#3353, #3059, #3700, #3902, #3578, #3782, #3570)
  • Added quantization quality and trajectory comparison tooling for diffusion outputs, improved quantization benchmark handling for omni outputs, and expanded quality-gate coverage for FP8 Z-Image and related diffusion tests. (#3175, #3653, #3929)
  • Improved memory behavior through Qwen-Image text encoder cleanup, prompt-embedding cache support, custom pipeline sleep memory release fixes, and CUDA graph pool reuse in VoxCPM2 and Ming-flash-omni paths. (#3608, #2962, #3818, #3361)
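
To make the idea behind prompt-embedding caching concrete, here is a generic least-recently-used cache that skips the text encoder when the same prompt is seen again. It is a sketch of the general technique only, not the implementation from #2962: the class name, the `encode` callable, and the eviction policy are all assumptions.

```python
from collections import OrderedDict
from typing import Callable, Sequence


class PromptEmbeddingCache:
    """LRU cache mapping a prompt string to its (hypothetical) embedding."""

    def __init__(self, encode: Callable[[str], Sequence[float]], max_entries: int = 128):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._encode = encode
        self._max_entries = max_entries
        self._entries: OrderedDict[str, Sequence[float]] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, prompt: str) -> Sequence[float]:
        if prompt in self._entries:
            # Mark as most recently used and reuse the stored embedding.
            self._entries.move_to_end(prompt)
            self.hits += 1
            return self._entries[prompt]
        self.misses += 1
        embedding = self._encode(prompt)
        self._entries[prompt] = embedding
        if len(self._entries) > self._max_entries:
            # Evict the least recently used prompt.
            self._entries.popitem(last=False)
        return embedding
```

A real cache would also need to key on anything else that changes the embedding, such as the model or negative prompt.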

RL, Serving & Integrations

  • Added Qwen3-Omni Thinker LoRA support for RL training and improved custom pipeline argument handling, sleep/wakeup behavior, and multistage serving tests. (#3915, #2973, #3818, #3610)
  • Improved OpenAI-compatible serving behavior for image edits, speech generation, realtime and chat paths, server-control reliability, invalid parameter handling, and frontend audio engine error handling. (#3693, #2849, #3316, #3652, #3680)
  • Added Yuanrong TransferEngine connector support for NPU and improved connector/runtime infrastructure for chunk transfer, memory pools, local-rank handling, and distributed KV flow. (#3180, #3569, #3740)

Platforms, Distributed Execution & Hardware Coverage

  • Expanded Blackwell diffusion support with CUDNN attention, FlashInfer attention auto-routing, and SageAttention3 backend support for GB200, B200, RTX 5090, PRO 6000, and DGX Spark-class systems. (#3079, #3015)
  • Improved ROCm coverage with AITER GroupNorm and AITER backend support for ring attention, plus ROCm CI/version updates. (#3419, #3511, #3659)
  • Improved Intel XPU coverage with CosyVoice3 support, MXFP8 support through the vLLM main-repo method, diffusion attention defaults, Docker/CI updates, and XPU-specific test fixes. (#2325, #3782, #3525, #3675, #3718, #3761, #3994)
  • Improved Ascend NPU coverage with Wan2.2 MXFP4 quantization, HunyuanImage3 FA-FP8, GLM-Image stage configs and HCCL runtime environment fixes, Yuanrong connector support, and sampler/runtime fixes. (#3578, #3540, #3235, #3180, #3517)

CI, Benchmarks & Documentation

  • Unified the release pipeline around a NIGHTLY=1 option, added x86_64/aarch64 image builds, enabled twine upload to PyPI, and refreshed Docker bases for the current release line. (#3428, #3667, #3859)
  • Added or improved reliability, invalid-parameter, nightly parity, accuracy, and performance coverage for Qwen-Image, Qwen-Image-Edit, HunyuanImage3, HunyuanVideo 1.5, BAGEL, VoxCPM2, Qwen3-Omni, Wan2.2, and multistage deployment. (#3502, #3652, #3670, #3795, #3852, #3849, #2175, #3864, #3729, #3610)
  • Improved benchmarking infrastructure with audio-streaming continuity metrics, diffusion benchmark endpoint routing, optional baseline assertions, perf JSON updates, and repo-wide benchmark documentation. (#3618, #3693, #3695, #1939)
  • Refreshed docs and recipes for quantization, diffusion performance, CosyVoice3 online serving, GLM-Image, Helios, Qwen Image Edit, VACE, and CUDA image commands. (#3764, #3851, #3748, #2950, #3114, #3684, #3584, #3836)

Note

  • The release includes compatibility work for newer dependency versions, including LTX-2 connector handling with diffusers==0.38.0 and Qwen3-TTS Code2Wav compatibility with transformers>=5.9.0. (#3661, #3880)

What's Changed

  • [BugFix] Finish async_chunk requests without pad-token injection by @NickCao in #3613
  • [Hunyuanimage 3.0] hunyuan accuracy test by @Bounty-hunter in #3655
  • [CI][Accuracy] Add Qwen-Image-2512 Qwen-Image-Edit-2511 pixel accuracy tests by @david6666666 in #3502
  • [Bugfix] Support diffusion worker dead detect when use inline engine by @wuhang2014 in #3214
  • [Bugfix]update process name for dit stage by @zengchuang-hw in #3602
  • [Feat] Add helios support cache dit by @lengrongfu in #3470
  • [ROCm] [CI] [Bugfix] Upgrade vllm version to v0.21.0 and ROCm 7.2.2 by @tjtanaa in #3659
  • [Refactor] Migrate and clean up TTS configs: CosyVoice3, OmniVoice, VoxCPM by @yuanheng-zhao in #3338
  • [Config Refactor] Support Recursive Merging for Engine Args by @alex-jw-brooks in #3009
  • [CI/Build] Unify release pipeline with NIGHTLY=1 option, add x86_64/aarch64 image builds by @khluu in #3428
  • [CI/Build] Enable twine upload to PyPI by @khluu in #3667
  • [Bugfix] Adapt LTX-2 connector arg with diffusers 0.38.0 by @yuanheng-zhao in #3661
  • [Frontend]Handle audio generate engine errors consistently by @reidliu41 in #3316
  • [BugFix][HunyuanImage3] Set MRoPE dynamic_arg_dims so graph mode can compile by @TaffyOfficial in #3630
  • Fix output finish reason issue for audio chunk in stream mode by @QiuMike in #2849
  • Fix reasoning_parser crash: reconstruct StructuredOutputsConfig from dict by @QiuMike in #2845
  • [Doc] Simplify template example subtitle by @hsliuustc0106 in #3669
  • [Doc] Reorganize available recipes into a table by @hsliuustc0106 in #3671
  • [SKILL]Add diffusion perf skill by @bjf-frz in #3461
  • [TTS][Perf] Optimize Qwen3-TTS high-concurrency serving by @Sy0307 in #3662
  • Fix diffusion engine cleanup lifecycle by @wuhang2014 in #3494
  • [XPU] update dockerfile and CI to 0.21.0 by @xuechendi in #3675
  • [Bugfix][TTS] Drop meaningless TTFT from speech-endpoint benchmarks by @linyueqian in #3674
  • [Bugfix] fix diffusion quantization benchmarking for Omni outputs by @RuixiangMa in #3653
  • [Bugfix] Fix SenseNova U1 broken import after SupportsModuleOffload by @nussejzz in #3691
  • [BugFix][CI]Fixing occasional CI failures by @amy-why-3459 in #3623
  • [HY-Imgae3.0] support hunyuan image3 dit fa-fp8 on npu by @lyj-jjj in #3540
  • [Bugfix][Qwen3-Omni] Handle short Code2Wav chunk outputs by @Sy0307 in #3687
  • [XPU] set flash_attn as default diffusion attn backend and fix k_len for cross_attn by @xuechendi in #3525
  • [Feature] Add support for Pipeline Parallel and integrate it into Wan 2.2 by @hadipash in #2322
  • Disable sampler kernel for XPU test by @pi314ever in #3718
  • [Bugfix] Fix hunyuanimage3 dit quant storageshape mismatch error by @fan2956 in #3694
  • [Refactor]Rename diffusion benchmark backend to endpoint by @bjf-frz in #3137
  • [Bugfix] Reject empty prompts in Flux2 Klein diffusion pipeline by @MmMaiIIi in #3711
  • Reject non-positive Flux2 Klein inference steps by @MmMaiIIi in #3717
  • [large-scale-serving] Integrate OmniCoordinator into stage engine pipeline by @chickeyton in #3569
  • [CI] invalid_param reliability suite and weekly http_invalid jobs by @yenuo26 in #3652
  • [CI] improve Buildkite testcase statistics reports by @yenuo26 in #3543
  • [Qwen-Image] Drop unused vision tower from text encoder by @lulugoodcoder in #3608
  • [Cleanup] Remove unused build_base_engine_args after #1115 by @bitborne in #3720
  • [Recipe] Qwen/Qwen-Image-Edit by @yixiaoer in #3684
  • [BugFix] fix mult cli timeout with get kv by @Bounty-hunter in #3741
  • [Quantization][tools] Add diffusion quantization output comparison tool by @david6666666 in #3175
  • [CI] optional --assert-baseline and update perf JSON baselines by @yenuo26 in #3695
  • [Feat] Enable VAE parallel in HunyuanImage3 by @Fishermanykx in #3091
  • [Bugfix][TTS] Only populate voice_name for uploaded voices without inline ref_audio by @NickCao in #3523
  • [XPU][CI] fix test_qwen2_5_omni_expansion.py::test_mix_to_audio by @xuechendi in #3761
  • [Perf][VoxCPM2][Ming-Flash-Omni] Use global CUDA graph pool by @NickCao in #3361
  • [Bench] Add audio-streaming continuity metric for TTS by @linyueqian in #3618
  • [Bugfix] Treat kv_cache_dtype=auto as unset for ring attention by @RuixiangMa in #3622
  • [NPU][Quant] Add W4A4 MXFP4 online & MXFP4 dual-scale online/offline quantization support for Wan2.2 T2V / I2V inference on Ascend NPU by @hxhhhlalala in #3578
  • Yuanrong TransferEngine Connector for NPU by @yangsonglin13 in #3180
  • [Doc][TTS] CosyVoice3 online docs + residual TTS yaml cleanup + remove VoxCPM v1 by @linyueqian in #3748
  • [Test] add run_nightly_jobs.sh for local nightly pytest parity by @yenuo26 in #3670
  • [Bugfix]Fix distributed stage0 multimodal cache routing by @bjf-frz in #3740
  • [Perf] Optimize sampler D2H sync for HY-Image by @gcanlin in #3617
  • [Docs] Complete quantization nav and online guide by @david6666666 in #3764
  • [Diffusion] Support LoRA in step-wise execution by @SamitHuang in #3639
  • [Bugfix] Fix qwen2_5_omni weight loading by @ksiyuan in #3598
  • [Benchmark] Route i2i/ti2i to POST /v1/images/edits in diffusion_benchmark_serving by @NumberWan in #3693
  • [AutoRound] Support WAN2.2 W4A16 quantization model by @lvliang-intel in #3353
  • [Feat] Support online quantization (fp8/int8) for LTX-2 by @yuanheng-zhao in #3700
  • Add new committers to governance page by @hsliuustc0106 in #3749
  • [Bugfix] Fix MiMo-Audio voice instability: stochastic local_sampler + codec streaming context by @Galleons2029 in #3686
  • Update WeChat group QR code by @david6666666 in #3806
  • [Bugfix] Fix Hunyuan worker device context by @fake0fan in #3768
  • (Phase 2)Add ModelOpt mixed FP8/NVFP4 support for image generation by @baonudesifeizhai in #3570
  • Fix OmniDiffusionConfig master_port selection for parallel launches by @SamitHuang in #3803
  • [Bugfix] Remove stale OmniStage import and type annotation by @qidaye in #3541
  • [BugFix] Fix prefer_model_sampler token history in async scheduling by @zengchuang-hw in #3681
  • [feature]: support Hidream-I1-Full model by @ANHDY in #2572
  • [Bugfix] Align Offline and Online Inference by @skf-1999 in #3506
  • [CI] Fix email bug & skip email distribution. by @congw729 in #3814
  • [Bugfix] Revert MiMo-Audio local_sampler to greedy to fix text truncation under concurrent batching (followup to #3686) by @Galleons2029 in #3817
  • [Bugfix] Set separate CFG flag in Helios for CacheDiT by @alex-jw-brooks in #3756
  • [Recipe] Add Fish Speech S2 Pro 2-GPU deploy profile by @linyueqian in #3323
  • [Perf] [OmniVoice] Triton kernel fusion + CUDA Graph acceleration by @univa-HARRY in #3336
  • [Bugfix][CI] Run Whisper validation on CPU for single-GPU runners by @linyueqian in #3822
  • [Feat] support cache-dit for DreamID-Omni by @fywc in #3265
  • [BugFix] code2wav supports disabling CUDA graph. by @amy-why-3459 in #3732
  • [Model] Add GLM-TTS text-to-speech model support by @BeatSeat in #3141
  • [Bugfix] Fix LTX2 CacheDiT Integration by @alex-jw-brooks in #3621
  • docs: fix CUDA pre-built image command by @akshatvishu in #3836
  • [BugFix][NPU] Honor prefer_model_sampler in NPU AR runner by @gcanlin in #3517
  • [Bugfix][Example][OmniVoice] Drop hardcoded "voice": "default" from speech_client.py by @nagisa-kunhah in #3829
  • Add hunyuan online accuracy test by @BLANKETusers in #3795
  • [CI] Increase timeout for Quantization Test in nightly build to 60 minutes by @zhumingjue138 in #3845
  • [Bugfix] Fix Qwen3-TTS Stage 0 prefix-caching correctness by @linyueqian in #3665
  • [Bugfix] fix when diffusion model not set sleeping_stages by @lengrongfu in #3023
  • [Higgs-Audio] bosonai/higgs-audio-v2-generation-3B-base TTS model support by @yuekaizhang in #3762
  • [UX] Rename default config to hunyuan_image_3_moe by @gcanlin in #3835
  • [Test] Qwen-Image Perf Test with High Concurrency by @wtomin in #2822
  • [BugFix]: CUDA device-side assert failures on single-stage BAGEL i2i requests by @NumberWan in #3680
  • [CI] Add nightly-ci for multi-stage deployment by @ZhengWG in #3610
  • [CI][Bugfix]Fix Wan2.2 I2V reference image upload by @bjf-frz in #3869
  • [HunyuanImage][End2End Performance CI] Add hunyuan end2end test by @Bounty-hunter in #3849
  • [BugFix] Fix LTX-2.3 audio latent padding for sequence parallelism by @mglyn in #3854
  • Update CUDA Docker base image to vLLM v0.21.0 by @hsliuustc0106 in #3859
  • [Docs] Strengthen diffusion perf optimization quality gate by @david6666666 in #3851
  • [bugfix] fix default deploy config in hunyuan_image offline example by @zengchuang-hw in #3879
  • [BugFix] Fix Qwen3-TTS Code2Wav compatibility with transformers >= 5.9.0 by @Dan250124 in #3880
  • glm-image: fix(npu)per-stage runtime env for HCCL ports + GLM-Image NPU stage config by @lyj-jjj in #3235
  • [Feat][HunyuanImage3] Stream AR text for IT2I image edits by @TaffyOfficial in #3723
  • [Doc][Benchmark] Rewrite benchmarks/README.md as repo-wide index by @Dnoob in #1939
  • [Bugfix] Fix Qwen-Image-Edit-2511 TeaCache zero_cond_t handling by @JasonJ2021 in #3219
  • [Perf] Trim HunyuanVideo encoder padding tokens by @david6666666 in #3844
  • [Feat] opt qwen image model load use ColumnParallelLinear replace ReplicatedLinear by @lengrongfu in #3875
  • [Bugfix]Fix Hunyuan Image3 denoise flow alignment by @bjf-frz in #3857
  • [ROCm] Add support for AITER GroupNorm by @avjves in #3419
  • [Feature] support SP for FLUX.2-dev by @nuclearwu in #3244
  • [Model] Add Ming-flash-omni-2.0 Image Generation (Diffusion) Stage by @ZhengWG in #2875
  • [BugFix] Fix diffusion parallel_config YAML override and add deploy config field allowlist by @xiaohajiayou in #3483
  • [TTS][Perf] Optimize Fish Speech S2 Pro high-concurrency serving by @Sy0307 in #3773
  • Fix Ovis image text encoder dtype by @akshatvishu in #3876
  • [Bugfix] Ensure stage and diffusion subprocesses exit when parent dies unexpectedly by @RuixiangMa in #3751
  • [Test] Add long text output correctness test for Qwen3-Omni by @ZeldaHuang in #3539
  • fix image edit docs about use error image url by @lengrongfu in #3873
  • [Perf] Bagel Performance Nightly CI test by @NumberWan in #2175
  • [Feat] Support online quantization (fp8/int8) for DreamID-Omni by @yuanheng-zhao in #3902
  • [MXFP8][XPU] enable mxfp8 using vLLM main repo method by @xuechendi in #3782
  • [Blackwell] Add CUDNN_ATTN and FLASHINFER_ATTN backends for diffusion (auto-route) by @lishunyang12 in #3079
  • [CI] Add HunyuanVideo 1.5 X2V accuracy tests by @david6666666 in #3852
  • [Feature] Add cfg-parallel for LTX-2.3 by @mglyn in #3905
  • [Refactor] Unify Snake/SnakeBeta and alias-free activation into common modules by @BeatSeat in #3886
  • [Perf][Bugfix] cache hot buffers in qwen3_tts talker; fall back on evicted state by @JuanPZuluaga in #3688
  • [3/5][core]refactor communication layer: PR 3 of 5, all other models in non async mode by @natureofnature in #3719
  • [Doc] Refine vace offline inference example README by @blondeCS in #3584
  • [Diffusion] Unify diffusion request identity on request_id by @yJader in #3744
  • [Bugfix] Remove duplicate ffmpeg options in random video generation by @JLiu4Coding in #3923
  • [AutoRound] Support GLM-Image W4A16 quantization model by @lvliang-intel in #3059
  • [Doc] Reduce browser memory usage for docs by @david6666666 in #3870
  • [Refactor][Qwen3-TTS] Construct speech tokenizer encoder natively by @NickCao in #3360
  • [CI][Bugfix] Add request id to LTX2.3 CFG parallel test by @mglyn in #3934
  • [Perf] Trim Code2Wav CUDA Graph buckets for Qwen3-TTS single-GPU deploy by @R2-Y in #3932
  • [CI] Rectify L2~L4 Qwen Image Edit series tests by @fhfuih in #3901
  • [Docs]Add recipe for GLM-Image on 2x A800 GPUs and 1x A800 GPU by @nainiu258 in #2950
  • [CI][BugFix] Fix and Validate FP8 Z-Image quality gate by @david6666666 in #3929
  • [Test] Add scenarios for L5 reliability test by @zhumingjue138 in #3729
  • [Blackwell][1/N] Add SageAttention3 diffusion backend on blackwell(GB200/B200/RTX5090/PRO6000/DGX Spark available) by @david6666666 in #3015
  • [bugfix, rl] Fix sleep do not release full memory in custom pipeline by @knlnguyen1802 in #3818
  • Fix Qwen3-omni accuracy degradation from deepstack inputs under torch.compile by @andakai in #3885
  • [Bugfix] Fix Triton SnakeBeta kernel for bf16/fp16 inputs by @wuli666 in #3472
  • [XPU] Add CosyVoice3Model support on Intel XPU by @Liangyx2 in #2325
  • [Docs] Add recipe for Helios by @JasonJ2021 in #3114
  • [ci][nightly] Voxcpm2 performance benchmark by @Shirley125 in #3864
  • [BugFix] Fix prefix-caching issue by @amy-why-3459 in #3726
  • [bugfix] Solve Nightly / CI failed - tests/e2e/online_serving/test_bagel_expansion.py #3918 by @natureofnature in #3936
  • [BugFix] Avoid Voxtral TTS loading error msg by @y123456y78 in #3951
  • [Bugfix/Feature] Remove Hardcoded Flash Attention in Bagel & Support GQA in SDPA Backend by @alex-jw-brooks in #3728
  • [feat] Support prompt embedding caching for diffusion model by @knlnguyen1802 in #2962
  • [Feat]Support voice clone for omnivoice in online serving & add seed parameter for reproducible by @sphinxkkkbc in #3668
  • [CI][Bugfix] Fix LTX audio-video warmup output typing by @david6666666 in #3964
  • [Bugfix] Fix IndexError in DistributedVaeExecutor when vae_patch_parallel_size < world_size by @QingZhou-YangHY in #3928
  • Temp skip TEST - Entrypoint Test with H100 by @congw729 in #3989
  • [Perf] Qwen3-Omni performance optimization by @amy-why-3459 in #3878
  • [ROCm] Enable AITER backend with ring attention by @avjves in #3511
  • [Feat] Support MagCache by @RuixiangMa in #1287
  • [Perf][Qwen3-TTS] Restore Code2Wav cross-request batching (RFC #3163 P0) by @ischencheng in #3322
  • [Bugfix][Model] Qwen3-TTS: don't collapse 2D ref_code list when estimating prompt length by @nperraud in #3940
  • [minor, fix] Allow passing class interface as custom pipeline argument by @knlnguyen1802 in #2973
  • [Feat] support cache-dit for SenseNova-U1 by @fywc in #3906
  • [CI][XPU]Fix sage_attn hard-code import for cuda by @xuechendi in #3994
  • [Diffusion] Support USP and VAE patch parallel for HunyuanVideo 1.5 by @david6666666 in #3979
  • [HunyuanImage][Perf] adapt to deploy config changes by @Bounty-hunter in #3996
  • [Refactor][Qwen3-TTS] Extract reusable prompt-embeds builder and make tts_pad_embed a persistent buffer by @vklimkov-nvidia in #3992
  • docs: update WeChat QR code by @david6666666 in #4003
  • [Config Refactor] Migrate Ming-flash-omni-2.0 Image-Gen deploy configs by @yuanheng-zhao in #3975
  • [Bugfix][Tests] Remove unnecessary device map in tests init by @wuhang2014 in #3958
  • [CI/Bugfix] Async Request ID Aliasing by @alex-jw-brooks in #3953
  • [CI] Temporarily skip failing Bagel connector tests by @david6666666 in #4005
  • [Bugfix] Fix DiffusionWorker crash on SIGINT: ensure NCCL/ZMQ cleanup on shutdown by @wuhang2014 in #3872
  • [Recipe] add mistralai voxtral tts recipe by @Dmaner in #3498
  • Fix hunyuan resolve stop token ids by @BLANKETusers in #3896
  • [Refactor] Unify _talker_mtp_forward across GPU and NPU model runners by @gcanlin in #3476
  • [BugFix]Qwen-Image performance regression by using omni RMSNorm(RMSNorm backend) by @NumberWan in #3933
  • [Feat]audio streaming input for async chunk by @Shirley125 in #3614
  • [model, omni] feat: Qwen3-Omni Thinker LoRA for RL training by @qinganrice in #3915
  • [Feature] Add precomputed custom voices and Qwen3-TTS ref-context cache by @Sy0307 in #3492
  • [Rebase] Rebase to vllm releases/v0.22.0 by @tzhouam in #3891

New Contributors

  • @MmMaiIIi made their first contribution in #3711
  • @lulugoodcoder made their first contribution in #3608
  • @bitborne made their first contribution in #3720
  • @yixiaoer made their first contribution in #3684
  • @ksiyuan made their first contribution in #3598
  • @Galleons2029 made their first contribution in #3686
  • @qidaye made their first contribution in #3541
  • @ANHDY made their first contribution in #2572
  • @univa-HARRY made their first contribution in #3336
  • @BeatSeat made their first contribution in #3141
  • @nagisa-kunhah made their first contribution in #3829
  • @BLANKETusers made their first contribution in #3795
  • @yuekaizhang made their first contribution in #3762
  • @mglyn made their first contribution in #3854
  • @avjves made their first contribution in #3419
  • @blondeCS made their first contribution in #3584
  • @JLiu4Coding made their first contribution in #3923
  • @nainiu258 made their first contribution in #2950
  • @andakai made their first contribution in #3885
  • @wuli666 made their first contribution in #3472
  • @Liangyx2 made their first contribution in #2325
  • @QingZhou-YangHY made their first contribution in #3928
  • @ischencheng made their first contribution in #3322
  • @nperraud made their first contribution in #3940
  • @vklimkov-nvidia made their first contribution in #3992
  • @Dmaner made their first contribution in #3498
  • @qinganrice made their first contribution in #3915

Full Changelog: v0.21.0rc1...v0.22.0rc1
