vllm-project/vllm-omni v0.20.0


Highlights

This release features 466 commits from 125 contributors since v0.18.0.

vLLM-Omni v0.20.0 is a major production release aligned with upstream vLLM v0.20.0. It refreshes the serving/runtime stack for large-scale omni workloads, expands model coverage across speech, omni, image, audio, and video generation, and improves performance, quantization, and hardware readiness across CUDA, ROCm, MUSA, NPU, and XPU backends.

Key Improvements

  • Rebased to upstream vLLM v0.20.0, with CUDA 13.0 and PyTorch 2.11 alignment, Transformers 5.x compatibility fixes, removal of the old vLLM entrypoint hijack, and runtime changes needed for the 0.20.0 integration path. (#3232, #3082, #3352, #3393, #2306)
  • Large-scale serving for Qwen3-Omni, with Qwen3-Omni performance optimization, CUDA graph support for the Code2Wav decoder, async/sync autoregressive scheduling, multi-stage deployment support, and expanded long-audio/video and performance validation (a minimal client sketch follows this list). (#3203, #2376, #3306, #2396, #2598, #2600)
  • CLI and configuration refactor, including the stage CLI refactor, forwarding CLI tokenizer settings into per-stage engine configs, removal of legacy Omni CLI helpers, cleaner deploy/pipeline config migration, and updated CLI documentation. (#2020, #3120, #3144, #2383, #2978)
  • Expanded quantization coverage, including AutoRound W4A16 support for Qwen Omni, offline W4A16 quantized model support, OmniGen2 FP8, Z-Image text-encoder FP8 online quantization, HunyuanImage3 NPU quantization, GLM-Image quantization, and fixes for pre-quantized checkpoints. (#2670, #1777, #2441, #3279, #2979, #2292, #2702, #2795)
  • TTS model speedups and production fixes, improving VoxCPM2, Qwen3-TTS/Qwen-TTS, MiMo Audio, Fish Speech, and Voxtral TTS through CUDA graph reuse, native decoder construction, global speaker/reference-audio caches, streaming VAE optimization, memory-pool sharing, and deterministic sampling fixes. (#2758, #2803, #2341, #2630, #2657, #2520, #2386, #3350)
  • Hardware plugin and platform optimization, expanding MUSA flash attention and torch.accelerator support, aligning NPU with the v0.20.0/GPU model-runner path, restoring ROCm/AMD CI signal, and refreshing XPU Docker/CI readiness for the PyTorch 2.11 stack. (#2451, #3101, #3325, #3343, #3083, #3393)
  • More SOTA model support, including Ming-flash-omni-2.0, XiaomiMiMo/MiMo-V2.5-ASR, MOSS-TTS-Nano, VoxCPM2 native AR TTS, HunyuanImage-3.0 IT2I, ERNIE image T2I, AudioX, Wan2.2-S2V, DreamID-Omni HSDP, LTX-2.3, and FastGen Wan 2.1 pipelines. (#2890, #3089, #2753, #2658, #3107, #2861, #2077, #2751, #3138, #2893, #2749)
  • Diffusion dynamic step-level batching, adding async batch inference in the DiffusionEngine and strengthening step-level/diffusion serving paths with pipeline-declared offload modules, CFG/HSDP improvements, VAE tiling, and performance validation. (#2729, #2707, #2427, #2423, #2368, #2899, #2982)
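
To ground the Qwen3-Omni serving item above, here is a minimal client sketch against an OpenAI-compatible vLLM-Omni server. The base URL, model ID, and audio-output fields are illustrative assumptions, not a verified contract; see the updated CLI documentation (#2978) for the actual serve command and supported parameters.

```python
from openai import OpenAI

# Assumes a vLLM-Omni server is already running at this address and
# serving a Qwen3-Omni checkpoint; both values are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Omni",                      # placeholder model ID
    modalities=["text", "audio"],                 # request speech alongside text, if supported
    audio={"voice": "default", "format": "wav"},  # OpenAI-style audio params; server support is an assumption
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
)
print(resp.choices[0].message)
```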

Core Architecture & Runtime

  • Rebased vLLM-Omni to upstream vLLM v0.20.0 and removed the legacy vLLM entrypoint hijack used before the 0.20.0 integration path. (#3232, #3082)
  • Refreshed the runtime and stage lifecycle with the stage CLI refactor, omni sleep mode and acknowledgement protocol, coordinator reconnect/race/heartbeat fixes, stage launch-lock handling, multi-stage deployment support, and user-configurable stage initialization timeout behavior. (#2020, #2022, #1899, #2717, #2396, #2519)
  • Improved request/runtime configuration behavior by preserving media access arguments, forwarding CLI tokenizer values to stage configs, removing invalid LLM-only diffusion stage args, centralizing stage sampling parameter resolution, and guarding silent config failures when trust_remote_code is unset (see the sketch after this list). (#2956, #3120, #2622, #3153, #3241)
  • Added async/sync autoregressive scheduling and lower-level runtime fixes for inline health checks, GPU-buffer accessors, stage port allocation, engine failure reporting, and async scheduler transfer behavior. (#3306, #3052, #2068, #3333, #2426, #3318)
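
On the trust_remote_code guard above: omni checkpoints often ship custom modeling code, which requires an explicit opt-in, and a missing flag should now fail loudly instead of silently producing a wrong config. A minimal sketch at the Hugging Face layer the engine wraps (the model ID is a placeholder):

```python
from transformers import AutoConfig

# Custom-code checkpoints need an explicit trust_remote_code opt-in;
# without it, loading should fail loudly rather than fall back to a
# mismatched built-in config.
config = AutoConfig.from_pretrained(
    "example-org/example-omni-model",  # placeholder model ID
    trust_remote_code=True,
)
```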

Model Support

  • Added or expanded omni and speech model coverage for Ming-flash-omni-2.0, Dynin-omni, InternVLA-A1, MagiHuman, XiaomiMiMo/MiMo-V2.5-ASR, MOSS-TTS-Nano, VoxCPM2 native AR TTS, and OmniVoice voice cloning. (#2890, #1759, #2737, #2301, #3089, #2753, #2658, #2676)
  • Expanded image and video model support with HunyuanImage-3.0 IT2I, ERNIE image T2I, AudioX, Wan2.2-S2V, Wan2.2-I2V-A14B, FastGen DMD2-distilled Wan 2.1, LTX-2.3, DreamID-Omni HSDP, LTX-2 HSDP, and Stable-Audio-Open HSDP. (#3107, #2861, #2077, #2751, #2134, #2749, #2893, #3138, #2899, #2982)
  • Improved GLM-Image, HunyuanImage3, FLUX, Z-Image, and Ming/Ming-TTS behavior with deploy/pipeline config migration, benchmark and accuracy fixes, multi-stage serving fixes, image-to-image support, and quantization support. (#2977, #3024, #3084, #3373, #3243, #1580, #2292, #3154)

Audio, Speech & Omni Production Optimization

  • Improved Qwen3-Omni and Qwen3-TTS serving with Qwen3-Omni performance optimization, Code2Wav CUDA graph support, native Code2Wav decoder construction, streaming input fixes, speaker embedding validation, deterministic Fast AR seed propagation, and max-token mapping fixes. (#3203, #2376, #2341, #3396, #3191, #3350, #3217)
  • Reduced TTS and audio memory/latency overhead by freeing unused Qwen3-TTS and Fish Speech decoder/codec components, enabling Fish Speech CUDA graph capture and reference-audio caching, sharing CUDA graph memory pools, and optimizing VoxCPM2 streaming VAE/compile and manual CUDA graph paths. (#2429, #2430, #2520, #2609, #2386, #2758, #2803)
  • Expanded voice and speech API behavior with speaker as a voice alias, chat-completions support for both voice and speaker, raw-audio VoxCPM2 voice cloning, OmniVoice voice cloning, global speaker cache management, and speaker validation/case-insensitive lookup (a TTS client sketch follows this list). (#2424, #3248, #2720, #2676, #2630, #2407)
  • Migrated VoxCPM2, CosyVoice3, MiMo Audio, Voxtral TTS, Fish Speech S2 Pro, Ming, and Ming-TTS to the Pipeline + Deploy schema, added a universal TTS benchmark, and consolidated per-model TTS docs into a TTS hub. (#2958, #3154, #2835, #3234, #3358)
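
As referenced in the voice/speaker item above, a minimal TTS client sketch against an OpenAI-compatible speech endpoint. The base URL, model ID, and voice name are placeholders; the speaker-alias behavior in the comment reflects this release's notes, not a verified API surface.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# /v1/audio/speech-style request. Per this release, "speaker" is also
# accepted as an alias for "voice", with case-insensitive lookup.
with client.audio.speech.with_streaming_response.create(
    model="openbmb/VoxCPM2",       # placeholder TTS model ID
    voice="alloy",                 # placeholder voice/speaker name
    input="Hello from vLLM-Omni.",
) as resp:
    resp.stream_to_file("out.wav")
```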

Diffusion, Image & Video Generation

  • Added dynamic step-level batching with DiffusionEngine async batch inference and continued step-level execution/performance validation for Qwen-Image and diffusion pipelines (see the client sketch after this list). (#2729, #2707)
  • Expanded generation capabilities with Z-Image image-to-image, FLUX.1/FLUX.2 TeaCache, FLUX.2-dev CFG parallel, HunyuanImage3 TeaCache/IT2I, generalized diffusers adapter backend support, and profiler/progress tooling for diffusion pipelines. (#1580, #2774, #1871, #2010, #1927, #3107, #2724, #2489)
  • Improved video generation with Wan2.2 BF16 VAE conversion, fused RMSNorm/AdaLayerNorm and NPU fused RMSNorm paths, LightX2V offline conversion, reduced duplicate preprocessing, MP4 encoding latency optimization, Wan2.2-S2V support, and FastGen Wan 2.1 pipeline support. (#2391, #2583, #2585, #3067, #2134, #2963, #2735, #2751, #2749)
  • Strengthened distributed and memory-efficient diffusion with VAE tiling parallel encode, unified CFG parallel support, 3/4-branch CFG dispatch, pipeline-declared offloadable modules, layerwise offload for additional diffusion models, HSDP coverage, and inline execution for single-stage diffusion. (#2368, #2160, #2423, #2427, #2339, #2734, #2899, #2982, #2736)
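
To illustrate the dynamic step-level batching item at the top of this list: concurrent requests can now be batched per denoising step rather than serialized as whole runs, so a client benefits simply by issuing requests in parallel. A minimal sketch against an OpenAI-compatible images endpoint; the base URL and model ID are placeholders.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def main() -> None:
    prompts = ["a red fox in snow", "a lighthouse at dusk", "a paper crane"]
    # Parallel requests give the DiffusionEngine the chance to batch
    # them step-by-step instead of running each denoising loop alone.
    results = await asyncio.gather(
        *(client.images.generate(model="Qwen/Qwen-Image", prompt=p) for p in prompts)
    )
    for r in results:
        print(len(r.data), "image(s) returned")

asyncio.run(main())
```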

Quantization & Memory Efficiency

  • Added or improved quantization coverage for Qwen Omni W4A16 via AutoRound, offline W4A16 quantized models, OmniGen2 FP8, Z-Image text-encoder FP8, HunyuanImage3 NPU quantization, GLM-Image quantization, Flux Kontext quantization, and Helios FP8 (an offline quantization sketch follows this list). (#2670, #1777, #2441, #3279, #2979, #2292, #2184, #1916)
  • Fixed pre-quantized checkpoint behavior by avoiding FP8 quant configs on vision/audio encoders and repairing broken FP8 quantization on Z-Image-Turbo, Qwen-Image, and FLUX.1-dev. (#2702, #2795)
  • Improved memory efficiency across TTS and diffusion by combining CUDA graph reuse, codec/decoder cleanup, multi-block and layerwise CPU offloading, TeaCache/offload compatibility fixes, transformer offload, and pipeline-declared offloadable modules. (#2386, #2429, #2430, #1486, #2339, #2689, #3224, #2427)
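
For the W4A16 items above, a rough offline quantization sketch with AutoRound. The model ID, group size, and export format are assumptions, and the AutoRound API surface drifts between auto-round releases, so treat this as a shape rather than a recipe.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "Qwen/Qwen3-Omni"  # placeholder; quantize the LLM stage of an omni model
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# W4A16: 4-bit weights, 16-bit activations. Argument names follow
# recent auto-round releases and may differ in yours.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()
autoround.save_quantized("./qwen3-omni-w4a16", format="auto_round")
```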

RL, Serving & Integrations

  • Improved BAGEL serving and RL flows with LoRA adapter injection, end-to-end LoRA support, text2text/img2text think mode, single-stage think mode, fused gate_proj/up_proj, trajectory recording, RDMA flow updates, TP/CFG transfer-engine support, and rollout trajectory fixes. (#2490, #2494, #2503, #2650, #2546, #2483, #2000, #2705, #2731, #3258)
  • Added serving controls and API reliability improvements including least-queue-length and round-robin load balancers, OpenAI-compatible request cancellation for image generation, streaming delta messages, graceful multi-stage shutdown, guarded app-state access during shutdown, response body fixes, and multi-stage deployment support (a cancellation sketch follows this list). (#2448, #2621, #2911, #3001, #2587, #3094, #2396)
  • Improved diffusion and multimodal serving observability with diffusion metrics surfaced in chat completions, corrected metric keys, profiler output fixes, Nsight Systems support for serving, PyTorch profiler ops/memory recording, high-load Qwen3-TTS perf CI, and multimodal benchmark token accounting fixes. (#2932, #2692, #2647, #1098, #2472, #3238, #2549)
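
On the request-cancellation item above: with OpenAI-compatible cancellation, dropping the client connection is enough to stop an in-flight image generation. A minimal sketch, assuming the server aborts on disconnect; the base URL and model ID are placeholders.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def main() -> None:
    task = asyncio.create_task(
        client.images.generate(model="Qwen/Qwen-Image", prompt="a long render")
    )
    await asyncio.sleep(2.0)  # change our mind mid-generation
    task.cancel()             # dropping the request lets the server abort the job
    try:
        await task
    except asyncio.CancelledError:
        print("request cancelled client-side")

asyncio.run(main())
```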

Platforms, Distributed Execution & Hardware Coverage

  • HunyuanImage-3.0 on NPU now supports DiT, AR-only, and AR+DiT workflows, adds IT2I image editing, improves DiT performance coverage, and enables offline msModelSlim/vLLM-Ascend quantized inference. (#2713, #3107, #2495, #2979, #2590, #1927, #1751)
  • Wan2.2 on NPU is now production-ready with major I2V performance optimizations, including MindIE-SD fused RoPE/AdaLayerNorm/RMSNorm kernels, VAE BF16 and parallelism fixes, and HSDP/USP deployment recipes, delivering a roughly 50-60% performance improvement in tested workloads. (#2919, #2393, #2459, #2391, #2585, #2583, #2571, #3067, #2969, #2852, #3063, #2262, #2817)
  • Qwen3-TTS and Qwen3-Omni NPU speech generation gains shared Code Predictor infrastructure, NPU graph/fusion-attention support, NPU runner alignment, and stronger benchmark coverage, together improving RTF by about 50%. (#2375, #2695, #3325, #2353, #3203, #3238, #2835)
  • Added and stabilized MUSA/Moore Threads GPU support, including platform discovery, torch.accelerator alignment, torch.compile/Inductor support, MATE Flash Attention for diffusion, device capability/version APIs, and installation docs (see the torch.accelerator sketch after this list). (#2337, #2359, #2451, #2766, #3101, #3179, #3132)
  • Improved ROCm reliability by migrating AMD CI images, restoring CI failure signals, syncing ROCm coverage with CUDA tests, fixing Qwen2.5/Qwen3 Omni CI cases, and selecting a safer default AR attention backend for Omni workloads. (#2303, #2340, #2708, #3225, #3343)
  • Improved Intel XPU support with Voxtral TTS stage configs, removal of hardcoded CUDA paths in audio tokenization, torch Inductor enablement, Qwen2.5 CI fixes, and updated XPU Docker support for the vLLM 0.20 / PyTorch 2.11 stack. (#2428, #3113, #3083, #3393)
  • Added distributed and parallel execution improvements including HSDP for Qwen-Image/Z-Image/GLM-Image, DreamID-Omni, LTX-2, and Stable-Audio-Open, plus BAGEL TP/CFG transfer-engine support and CFG parallel dispatch improvements. (#2029, #3138, #2899, #2982, #2705, #2731, #2423)
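
For the torch.accelerator alignment in the MUSA item above, device-agnostic code can pick up whichever backend the build provides (CUDA, MUSA, XPU, ...) through one API. A minimal sketch, assuming PyTorch 2.6+ where torch.accelerator landed and a backend that registers with it.

```python
import torch

# torch.accelerator abstracts over CUDA/ROCm, XPU, MUSA, and other
# backends; on a Moore Threads build this resolves to the MUSA device.
if torch.accelerator.is_available():
    device = torch.accelerator.current_accelerator()
else:
    device = torch.device("cpu")

x = torch.randn(4, 4, device=device)
print(device, x.sum().item())
```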

CI, Benchmarks & Documentation

  • Expanded reliability and performance coverage with Qwen3-Omni/Qwen-TTS/Qwen-Image/Wan2.2 stability tests, Qwen3-TTS high-load daily performance CI, HunyuanImage3 DiT benchmark tests, universal TTS benchmarks, and model accuracy benchmark coverage. (#2817, #3238, #2495, #2835, #2558)
  • Updated documentation for CLI usage, Qwen3-Omni recipes, diffusion quantization, diffusion attention backends, TTS hubs, Fish Speech deployment, LTX-2 recipes, and hardware-specific deployment notes. (#2978, #3109, #3200, #3011, #3234, #3193, #3294, #2919)
  • Improved CI readiness after the v0.20.0 rebase with CUDA 13.0/Qwen-Image performance fixes, NPU alignment, ROCm fixes, Intel XPU CI fixes, PyTorch 2.11 XPU Docker updates, and nightly/ready test scheduling updates. (#3352, #3325, #3343, #3083, #3393, #2945)

Full Changelog: v0.18.0...v0.20.0
