vllm-project/vllm-omni v0.20.0rc1

Pre-release

Highlights

This release features 327 commits from 99 contributors, including 36 new contributors.

vLLM-Omni v0.20.0rc1 is a release candidate aligned with upstream vLLM v0.20.0. It expands the model catalog across speech, omni, image, and video workloads; improves production readiness for TTS, diffusion, BAGEL, and multi-stage serving; and broadens hardware coverage across CUDA, ROCm, MUSA, NPU, XPU, and AMD validation paths. This release candidate is intended to validate the upstream rebase, the refreshed serving/runtime behavior, and the expanded model and platform matrix before the final v0.20.0 release.

Key Improvements

  • Rebased to upstream vLLM v0.20.0, with entrypoint cleanup, stage CLI refactoring, sleep-mode support, coordinator reliability fixes, and removed vLLM entrypoint hijacking for the 0.20.0 integration path. (#3232, #3082, #2020, #2022, #1899)
  • Expanded model support across omni, TTS, image, and video workloads, including MagiHuman, Dynin-omni, InternVLA-A1, Ming-flash-omni-2.0, XiaomiMiMo/MiMo-V2.5-ASR, MOSS-TTS-Nano, VoxCPM2 native AR TTS, LTX-2.3, and FastGen Wan 2.1 pipelines. (#2301, #1759, #2737, #2890, #3089, #2753, #2658, #2893, #2749)
  • Improved TTS and audio production behavior, with lower VRAM usage, CUDA graph reuse, voice-cloning fixes, deterministic Fish Speech generation, a unified TTS pipeline/deploy schema, and a new universal TTS benchmark. (#2480, #2429, #2430, #2520, #2609, #2624, #2676, #2690, #2835, #2958, #3253)
  • Strengthened diffusion, image, and video generation, adding Z-Image image-to-image, FLUX.1/FLUX.2 TeaCache and CFG-parallel paths, Wan/LTX model support, VAE tiling, diffusion profiler/progress tooling, MP4 latency optimization, and HSDP coverage for LTX-2 and Stable-Audio-Open. (#1580, #1871, #2010, #2134, #2160, #2368, #2489, #2735, #2774, #2899, #2982)
  • Made BAGEL more capable in serving and RL workflows, with LoRA support, think mode, fused projections, RDMA/connector work, TP/CFG parallel flow, layerwise offload, diffusion metrics, and rollout fixes. (#2490, #2494, #2503, #2546, #2650, #2705, #2731, #2734, #2932, #3258)
  • Broadened quantization, memory, and hardware coverage, including OmniGen2 FP8, Qwen Omni W4A16, HunyuanImage3 NPU quantization, safer pre-quantized checkpoint handling, MUSA flash attention and torch.accelerator support, NPU graph/fused-op improvements, ROCm/AMD CI fixes, and XPU torch inductor support. (#2441, #2670, #2702, #2795, #2979, #2451, #2695, #2766, #3067, #3101, #3113, #3225)

Core Architecture & Runtime

  • Rebased vLLM-Omni to upstream vLLM v0.20.0 and removed the old vLLM entrypoint hijack used before the 0.20.0 integration path. (#3232, #3082)
  • Refreshed the runtime and stage lifecycle with a stage CLI refactor, omni sleep mode and acknowledgement protocol, coordinator reconnect/race/heartbeat fixes, stage launch-lock handling, and restored user-configurable stage initialization timeout behavior. (#2020, #2022, #1899, #2717, #2519)
  • Improved request/runtime configuration behavior by preserving media access arguments, removing invalid LLM-only diffusion stage args, centralizing stage sampling parameter resolution, and guarding silent config failures when trust_remote_code is unset. (#2956, #2622, #3153, #3241)
  • Added lower-level runtime support such as inline-client health checks, optional MROPE kwargs for HunyuanImage3, GPU-buffer accessor fixes, and clearer runtime failure reporting. (#3052, #2654, #2068, #2426)
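The centralized stage sampling-parameter resolution mentioned above (#3153) can be illustrated with a minimal precedence sketch. The function name and field set here are hypothetical, not the actual vLLM-Omni API; the point is the request > stage > global override order:

```python
def resolve_sampling_params(request: dict, stage_defaults: dict,
                            global_defaults: dict) -> dict:
    """Merge sampling parameters with request > stage > global precedence.

    Hypothetical sketch of centralized resolution: instead of each stage
    merging parameters itself, one function produces the final mapping.
    """
    resolved = dict(global_defaults)
    resolved.update(stage_defaults)
    # Only keys the caller actually set should override stage defaults.
    resolved.update({k: v for k, v in request.items() if v is not None})
    return resolved


params = resolve_sampling_params(
    request={"temperature": 0.2, "top_p": None},
    global_defaults={"temperature": 1.0, "top_p": 1.0, "max_tokens": 256},
    stage_defaults={"top_p": 0.9},
)
# temperature comes from the request, top_p from the stage, max_tokens global
```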

Model Support

  • Added or expanded omni and speech model coverage for MagiHuman, Dynin-omni, InternVLA-A1, Ming-flash-omni-2.0, XiaomiMiMo/MiMo-V2.5-ASR, MOSS-TTS-Nano, VoxCPM2 native AR TTS, and OmniVoice voice cloning. (#2301, #2542, #2554, #1759, #2737, #2890, #3089, #2753, #2658, #2676)
  • Expanded video and diffusion model support with Wan2.2-I2V-A14B, FastGen DMD2-distilled Wan 2.1 text-to-video and image-to-video pipelines, LTX-2.3, LTX-2 HSDP, and Stable-Audio-Open HSDP. (#2134, #2749, #2893, #2899, #2982)
  • Improved GLM-Image, HunyuanImage3, FLUX, and Z-Image behavior with config migration, benchmark/bug fixes, multi-stage serving fixes, system prompt alignment, text encoder fixes, image-to-image support, and quantization support for GLM-Image. (#2977, #3024, #3084, #2270, #2760, #1580, #2292)

Audio, Speech & Omni Production Optimization

  • Improved Qwen3-TTS serving quality and stability by fixing streaming chunk-boundary artifacts, aligning code predictor dtypes, handling missing ref_text, correcting Code2Wav length and eager-mode behavior, supporting local-path reference audio, and using float32 code prediction on fp16-only GPUs. (#2480, #2470, #2203, #2508, #2868, #2984, #3253)
  • Reduced TTS and audio memory/latency overhead by freeing unused Qwen3-TTS and Fish Speech decoder/codec components, enabling Fish Speech CUDA graph capture and reference-audio caching, sharing CUDA graph memory pools, and optimizing VoxCPM2 streaming VAE/compile and manual CUDA graph paths. (#2429, #2430, #2520, #2609, #2386, #2758, #2803)
  • Expanded voice and speech API behavior with speaker as a voice alias, deterministic Fish Speech seed support, raw-audio VoxCPM2 voice cloning, OmniVoice voice cloning, and speaker validation/case-insensitive lookup. (#2424, #2624, #2720, #2676, #2407)
  • Migrated VoxCPM2, CosyVoice3, MiMo Audio, Voxtral TTS, and Fish Speech S2 Pro to the Pipeline + Deploy schema, added the universal TTS benchmark, and offloaded blocking TTS/speech work to avoid event-loop stalls. (#2958, #2835, #2511)
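The voice-API behavior above (`speaker` as a `voice` alias plus case-insensitive speaker validation, #2424, #2407) can be sketched as a small normalization step. This is illustrative only; the field names and helper are assumptions, not the shipped implementation:

```python
def resolve_voice(request: dict, known_speakers: list[str]) -> str:
    """Treat `speaker` as an alias for `voice` and validate the result
    case-insensitively against the registered speakers.

    Hypothetical sketch of the alias/validation behavior, not the
    actual vLLM-Omni request handler.
    """
    name = request.get("voice") or request.get("speaker")
    if name is None:
        raise ValueError("request must set `voice` or `speaker`")
    # Case-insensitive lookup that returns the canonical speaker name.
    by_lower = {s.lower(): s for s in known_speakers}
    try:
        return by_lower[name.lower()]
    except KeyError:
        raise ValueError(f"unknown speaker: {name!r}") from None


# `speaker` works where `voice` is expected, regardless of casing.
canonical = resolve_voice({"speaker": "Vivian"},
                          known_speakers=["vivian", "ethan"])
```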

Diffusion, Image & Video Generation

  • Expanded generation capabilities with Z-Image image-to-image, FLUX.1/FLUX.2 TeaCache, FLUX.2-dev CFG parallel, generalized diffusers adapter backend support, and profiler/progress tooling for diffusion pipelines. (#1580, #2774, #1871, #2010, #2724, #2489)
  • Improved video generation with Wan2.2 BF16 VAE conversion, fused RMSNorm/AdaLayerNorm and NPU fused RMSNorm paths, LightX2V offline conversion, reduced duplicate preprocessing, MP4 encoding latency optimization, and FastGen Wan 2.1 pipeline support. (#2391, #2583, #2585, #3067, #2134, #2963, #2735, #2749)
  • Strengthened distributed and memory-efficient diffusion with VAE tiling parallel encode, unified CFG parallel support for LTX2, 3/4-branch CFG dispatch, per-pipeline offloadable-module declarations, layerwise offload for additional diffusion models and BAGEL, HSDP support, and inline execution for single-stage diffusion. (#2368, #2160, #2423, #2427, #2339, #2734, #2899, #2982, #2736)
  • Improved online image/video serving correctness with request cancellation for /v1/images/generations, max-generated-image-size enforcement, default video sampling fixes, ComfyUI image-to-image DALL-E endpoint fixes, media/default sampling preservation, and pure-diffusion offline example fixes. (#2621, #2599, #3049, #2980, #2780, #3181)
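The max-generated-image-size enforcement above (#2599) amounts to rejecting requests whose output would exceed a pixel budget before any GPU work starts. A minimal sketch, with an assumed limit and a hypothetical function name:

```python
def check_image_request(width: int, height: int,
                        max_pixels: int = 4096 * 4096) -> None:
    """Reject image-generation requests whose output would exceed a
    configured pixel budget. Limit and signature are illustrative."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width * height > max_pixels:
        raise ValueError(
            f"{width}x{height} exceeds the maximum generated image "
            f"size of {max_pixels} pixels")


check_image_request(1024, 1024)      # within budget, no error
try:
    check_image_request(65536, 65536)  # over budget
    rejected = False
except ValueError:
    rejected = True
```

Validating this up front lets the server return a clean 4xx instead of failing mid-generation.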

Quantization & Memory Efficiency

  • Added or improved quantization coverage for OmniGen2 FP8, Qwen Omni W4A16 via AutoRound, HunyuanImage3 offline quantization on the NPU diffusion path, and GLM-Image quantization, alongside updated quantization documentation. (#2441, #2670, #2979, #2292, #3200)
  • Fixed pre-quantized checkpoint behavior by avoiding FP8 quant configs on vision/audio encoders and repairing broken FP8 quantization on Z-Image-Turbo, Qwen-Image, and FLUX.1-dev. (#2702, #2795)
  • Improved memory efficiency across TTS and diffusion by combining CUDA graph reuse, codec/decoder cleanup, multi-block and layerwise CPU offloading, TeaCache/offload compatibility fixes, and pipeline-declared offloadable modules. (#2386, #2429, #2430, #1486, #2339, #2689, #2427)

RL, Serving & Integrations

  • Improved BAGEL serving and RL flows with LoRA adapter injection, end-to-end LoRA support, text2text/img2text think mode, single-stage think mode, fused gate_proj/up_proj, trajectory recording, RDMA flow updates, TP/CFG transfer-engine support, and rollout trajectory fixes. (#2490, #2494, #2503, #2650, #2546, #2483, #2000, #2705, #2731, #3258)
  • Added serving controls and API reliability improvements including least-queue-length and round-robin load balancers, OpenAI-compatible request cancellation for image generation, streaming delta messages, graceful multi-stage shutdown, guarded app-state access during shutdown, and response body fixes. (#2448, #2621, #2911, #3001, #2587, #3094)
  • Improved diffusion and multimodal serving observability with diffusion metrics surfaced in chat completions, corrected metric keys, profiler output fixes, Nsight Systems support for serving, PyTorch profiler ops/memory recording, and multimodal benchmark token accounting fixes. (#2932, #2692, #2647, #1098, #2472, #2549)
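The two load-balancing policies added above (#2448) can be sketched in a few lines. These classes illustrate the round-robin and least-queue-length ideas only; they are not the vLLM-Omni implementation:

```python
import itertools


class RoundRobinBalancer:
    """Cycle through replicas in a fixed order (illustrative sketch)."""

    def __init__(self, replicas):
        self._cycle = itertools.cycle(replicas)

    def pick(self):
        return next(self._cycle)


class LeastQueueBalancer:
    """Route each request to the replica with the fewest queued requests."""

    def __init__(self, replicas):
        self.queue_len = {r: 0 for r in replicas}

    def pick(self):
        replica = min(self.queue_len, key=self.queue_len.get)
        self.queue_len[replica] += 1   # caller calls done() on completion
        return replica

    def done(self, replica):
        self.queue_len[replica] -= 1


rr = RoundRobinBalancer(["a", "b"])
order = [rr.pick() for _ in range(4)]       # alternates a, b, a, b

lq = LeastQueueBalancer(["a", "b"])
first, second = lq.pick(), lq.pick()        # spreads across both replicas
```

Least-queue-length adapts to uneven request latencies, while round-robin is simpler and stateless per request.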

Platforms, Distributed Execution & Hardware Coverage

  • Expanded MUSA support with flash attention through MATE, torch.accelerator support, torchada updates, device capability/version APIs, and matching behavior with CUDA/ROCm flash attention paths. (#2451, #2766, #3101, #3132, #3179)
  • Improved NPU coverage with code predictor graph support, MindIE SD fused RoPE/cache paths, Wan2.2 fused ops, VAE parallel gather performance fixes, HunyuanImage3 quantization, and Ascend NPU documentation for Wan2.2 image-to-video. (#2695, #2571, #2583, #2585, #3067, #2969, #2979, #2919)
  • Strengthened ROCm/AMD and XPU readiness through ROCm CI signal restoration and environment fixes, AMD simple-unit-test fixes, XPU torch inductor support, and platform capability cleanup such as supports_float64() and flash-attention package detection. (#2340, #2708, #3225, #3113, #2488, #3068)
  • Improved distributed execution paths with Bagel TP/CFG transfer-engine support, diffusion TP-size propagation, non-contiguous gather fixes, and CFG companion/orchestrator cleanup. (#2731, #2867, #2367, #2623)

CI, Benchmarks & Documentation

  • Added a CUDA Dockerfile for NVIDIA GPU users, doc-only CI change detection, a Buildkite skip-CI upload pipeline, and reorganized Buildkite nightly/ready/merge coverage for Omni and Diffusion models. (#1439, #1284, #2582, #2620, #2945)
  • Expanded validation with stability and reliability tests for Wan2.2, Qwen3-Omni, Qwen3-TTS, Qwen-Image, Stable Audio TeaCache, Qwen image edit performance, L5 reliability, and selected previously skipped expansion tests. (#2377, #2216, #2817, #2972, #3211)
  • Refreshed documentation for MUSA installation, multi-thread weight loading, expert parallelism, LTX-2 online serving, CLI usage, diffusion attention backends, profiling, quantization, and add-model skills for diffusion and TTS. (#2359, #2445, #2471, #1971, #2978, #3011, #3196, #3200, #2806)
  • Replaced model-specific TTS benchmark folders with a more general TTS benchmark flow covering Qwen3-TTS and VoxCPM2 voice-clone/default/design tasks. (#2835)

Note

  • v0.20.0rc1 is a release candidate. Use it to validate the upstream vLLM 0.20.0 rebase, the refreshed runtime and stage configuration behavior, and the expanded model/platform matrix before the final release. Known remaining issues include #3268, #3266, #3264, #3257, #3256, #3255, and #2354.
  • Several changes in the generated What's Changed appendix were later reverted and are not part of this release: the Qwen3-Omni performance optimization, the VoxCPM2 instructions/cfg_value change, Z-Image text encoder FP8 online quantization, and the deploy override field refactor. (#3202, #3204, #3272, #3287)
  • Low-signal CI, documentation, typo, and release-script maintenance changes are folded into the broader themes above rather than listed individually.

What's Changed

Keep the existing GitHub-generated What's Changed appendix below this editorial section when updating the release body, or regenerate it from:

v0.19.0rc1...v0.20.0rc1

New Contributors

Keep the existing GitHub-generated New Contributors appendix below this editorial section. The current generated appendix lists 36 first-time contributors for v0.20.0rc1.


Full Changelog: v0.19.0rc1...v0.20.0rc1
