Highlights
This release comprises 121 commits (merged PRs) from ~60 contributors, 24 of them contributing for the first time.
vLLM-Omni v0.16.0 is a major alignment and capability release. It rebases the project onto upstream vLLM v0.16.0 and significantly expands performance, distributed execution, and production readiness across Qwen3-Omni / Qwen3-TTS, Bagel, MiMo-Audio, GLM-Image, and the Diffusion (DiT) image/video stack, while also improving platform coverage (CUDA / ROCm / NPU / XPU), CI quality, and documentation.
Key Improvements
- Rebase to upstream vLLM v0.16.0: Tracks the latest vLLM runtime behavior and APIs while keeping Omni’s error handling aligned with upstream expectations. (#1357, #1122, plus follow-up fixes like #1401)
- Qwen3-Omni performance + correctness: Performance optimizations (CUDA Graph, async chunk, streaming output) reduce TTFP by 90% and bring RTF to 0.22~0.45, alongside precision and E2E metric correctness fixes. (#1378, #1352, #1288, #1018, #1292)
- MiMo-Audio production support: Performance optimizations (CUDA Graph, async chunk, streaming output) bring RTF to ~0.2, about 11x faster than baseline. (#750)
- Qwen3-TTS production upgrades: Disaggregated inference pipeline support, streaming output, batched Code2Wav decoding, and CUDA Graph support for speech tokenizer decoding, plus multiple robustness fixes across task-type handling and voice cloning. (#1161, #1438, #1426, #1205, #1317, #1554)
- Bagel acceleration & scalability: Adds TP support, introduces CFG capabilities, and accelerates multi-branch CFG by merging branches into a single batch; includes KV transfer stability fixes. (#1293, #1310, #1429, #1437)
- Diffusion distributed execution expansion: Adds/extends TP/SP/HSDP and reduces redundant communication overhead; improves pipeline parallelism options (e.g., VAE patch parallel) and correctness across multiple diffusion families. (#964, #1275, #1339, #756, #1428)
- Quantization for DiT: Introduces FP8 quantization support and native GGUF quantization support for diffusion transformers, with code-path cleanups. (#1034, #1285, #1533)
- Broader model coverage (audio + image): Adds MiMo-Audio-7B-Instruct support and performance improvements for GLM-Image pipelines. (#750, #920)
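As an illustration of the new streaming TTS path (#1438), the sketch below builds a request against the OpenAI-compatible /v1/audio/speech endpoint and writes audio chunks as they arrive. The base URL, model name, and voice value are placeholder assumptions for illustration, not values confirmed by this release.

```python
# Minimal streaming TTS client sketch; server URL, model name, and
# voice below are placeholder assumptions, not confirmed defaults.
import json
import urllib.request


def build_speech_request(base_url: str, model: str, text: str,
                         voice: str = "default", stream: bool = True):
    """Assemble an OpenAI-style speech request as (url, JSON payload)."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "stream": stream,  # streaming output landed in #1438
    }
    return f"{base_url}/v1/audio/speech", payload


def synthesize(base_url: str, model: str, text: str, out_path: str) -> None:
    """POST the request and stream the audio bytes to a file."""
    url, payload = build_speech_request(base_url, model, text)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        for chunk in iter(lambda: resp.read(8192), b""):
            f.write(chunk)  # write audio as it streams in
```

A call like `synthesize("http://localhost:8000", "qwen3-tts", "Hello", "out.wav")` would then exercise the streaming path end to end.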
Diffusion, Image & Video Generation
- New/expanded model coverage
- Distributed & parallel execution
- Performance & memory efficiency
- Correctness & stability
Audio, Speech & Omni (Qwen3-TTS / MiMo-Audio)
- Qwen3-TTS feature set maturation
- Stability & quality
Multimodal Model Improvements
- Bagel
- GLM-Image
Serving, APIs & Integrations
- OpenAI-compatible video serving
- Online serving robustness & usability
- Ecosystem integration
  - ComfyUI integration for improved workflow adoption. (#1113)
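The video-serving work (#1073, listed under What's Changed) exposes Wan2.2 T2V/I2V through an OpenAI-style /v1/videos endpoint. A hedged client sketch follows; the payload fields (prompt, size, seconds) and the model name mirror OpenAI's video API and are assumptions, not a schema confirmed by these notes.

```python
# Sketch of a /v1/videos client; payload field names and the model
# identifier are assumptions modeled on OpenAI's video API.
import json
import urllib.request


def build_video_request(base_url: str, model: str, prompt: str,
                        size: str = "1280x720", seconds: int = 5):
    """Assemble a text-to-video request as (url, JSON payload)."""
    payload = {
        "model": model,
        "prompt": prompt,
        "size": size,        # assumed WxH string, as in OpenAI's API
        "seconds": seconds,  # assumed clip duration field
    }
    return f"{base_url}/v1/videos", payload


def generate_video(base_url: str, model: str, prompt: str) -> bytes:
    """POST the request and return the raw response body."""
    url, payload = build_video_request(base_url, model, prompt)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```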
Performance, Scheduling & Memory Accounting
- Async chunk enhancements
- Metrics & benchmarking
- Memory accounting
Platform, Hardware Backends & Deployment
- XPU / NPU / ROCm coverage improvements
- Deployment & connectivity
CI, Testing, Docs & Developer Experience
- CI quality + coverage
- Docs & tutorials
- Tooling
  - Online profiling support and other developer ergonomics improvements. (#1136)
Stability & Bug Fixes (Across the Stack)
This release includes broad correctness and robustness fixes spanning:
- Diffusion pipelines (dtype/shape, init crashes, model detection, seed and config handling)
- Image edit / generation endpoints (format validation, RoPE crash, argument typing, seed handling)
- Distributed execution (process group mapping accuracy, scheduler race conditions, kv transfer correctness)
- General runtime hygiene (removing unnecessary ZMQ init, CLI naming normalization, upstream-aligned error handling)
What's Changed
- [TeaCache]: Add Coefficient Estimation by @princepride in #940
- [CI]: Bagel E2E Smoked Test by @princepride in #1074
- [Misc] Bump version to 0.14.0 by @ywang96 in #1128
- [Doc] First stable release of vLLM-Omni by @ywang96 in #1129
- [Misc] Align error handling with upstream vLLM v0.14.0 by @ceanna93 in #1122
- [Feature] add Tensor Parallelism to LongCat-Image(-Edit) by @hadipash in #926
- [CI] Temporarily remove slow tests. by @congw729 in #1143
- [CI] Refactor test_sequence_parallel.py and add a warmup run for more accurate performance stat by @mxuax in #1165
- Dev/rebase v0.15.0 by @tzhouam in #1159
- Docs update paper link by @hsliuustc0106 in #1169
- [Debug] Clear Dockerfile.ci to accelerate build image by @tzhouam in #1172
- [Debug] Correct Unreasonable Long Timeout by @tzhouam in #1175
- [Doc]Fix - Align with repo. by @congw729 in #1176
- [Bugfix][Qwen-Image-Edit] Add a warning log for none negative_prompt by @gcanlin in #1170
- [Bugfix] fix qwen image oom by @ZJY0516 in #1168
- [Hardware] Disable compile of diffusion on XPU by @zhenwei-intel in #1148
- [Doc] Fix vLLM version in user docs by @yuanheng-zhao in #1179
- [Refactor] Refactor async chunk and fix the shape mismatch issue by @amy-why-3459 in #1151
- bugfix: /images/edits endpoint fails pipeline data format check by @fhfuih in #1141
- [Perf] Resolve prolonged cudaStreamSynchronize execution in Z-Image processing by @erfgss in #1105
- [Bugfix] modify RTF use audio_e2e/audio_duration by @yenuo26 in #1157
- [Doc] Highlight paper & slides. by @congw729 in #1186
- [chore] Remove zmq context initialize by @xiedeyantu in #1187
- [NPU] Update Dockerfile and docs for v0.14.0 by @gcanlin in #671
- [Bugfix] E2E metric incorrect qwen3-omni with async chunk feature by @LJH-LBJ in #1018
- [Doc] opt doc by @david6666666 in #1118
- [Bugfix] Fix tp+sp accuracy, incorrect process group mapping by @david6666666 in #1178
- [Feature] Enable use_audio_in_video for Qwen 3 Omni Online by @tzhouam in #1198
- [Bugfix] async_chunk rebase v0.15.0 by @amy-why-3459 in #1195
- [feature]: support flux cache_dit by @nuclearwu in #1145
- [CI] Add CI branch coverage calculation, fix statement coverage results and add log before test for buildkite log group by @yenuo26 in #1120
- [Wan 2.2][Diffusion] Add TP Support by @Pr0Wh1teGivee in #964
- [Hardware] [Feat] Setup platform dependent package installation by @tjtanaa in #1046
- [XPU] Fix XPU UTs for basic coverage by @yma11 in #1164
- [Test] Add BuildKite test-full script for full CI. by @yenuo26 in #867
- [Refactor] Reuse upstream Qwen3MoeSparseMoeBlock by @gcanlin in #1202
- [Bugfix] Fix wan2.2 ti2v by @mxuax in #1221
- [Bugfix] Fix '--max-generated-image-size' cli args type by @ApsarasX in #1249
- [Bugfix] Ensure seed=0 is correctly handled in image edit by @ApsarasX in #1248
- [Docs] Add example image download step to Image-To-Video examples by @lishunyang12 in #1258
- [Bugfix] Fix padding bug in 12Hz tokenizer ConvTranspose1d decode by @linyueqian in #1241
- [bugfix] Fix multimodal_output property to check completion outputs where audio data is attached by @linyueqian in #1203
- [Doc] Update QA relevant to quantization by @lishunyang12 in #1257
- [Bugfix] Fix Doc Link Error by @lishunyang12 in #1263
- Process-Scoped GPU Memory Accounting by @divyanshsinghvi in #1204
- [ComfyUI]: ComfyUI integration by @fhfuih in #1113
- fix: add diffusion offload args to OmniConfig group instead of serve_parser by @fake0fan in #1271
- [Doc] Adding models/pipelines/features Tutorial by @wtomin in #1196
- [CI] Add env variable check for nightly CI by @congw729 in #1281
- [CI] Add pytest markers to current tests and update the doc. by @congw729 in #577
- [Diffusion][Perf] Remove Redundant Communication Cost by Refining SP Hook Design by @mxuax in #1275
- [Feature] Opt metrics structure by @LJH-LBJ in #891
- [Test] Add example test cases for omni online by @yenuo26 in #1086
- [CI] Reduce the time for Diffusion Sequence Parallelism Test by @congw729 in #1283
- [Model] Support HunyuanImage3 Diffusion Model in vllm-omni by @ElleElleWu in #1085
- [Chore] Update copyright year. by @lishunyang12 in #1256
- [feature]: support Flux.1-dev CFG-Parallel by @nuclearwu in #1269
- [Bugfix] Fix 'NoneType' AttributeError in stable-diffusion model detect by @yma11 in #1254
- [Doc] Update Qwen3-TTS docs for consistency with Omni examples by @linyueqian in #1226
- [Fix]Ensure HuggingFace downloads complete before initialization. by @zzhuoxin1508 in #1213
- [BugFix] Fixed the issue where ignore_eos was not working. by @amy-why-3459 in #1286
- [Test] Add e2e tests for Qwen3-TTS speech endpoint by @linyueqian in #1206
- [Feat]: support VAE patch parallelism by @dongbo910220 in #756
- [CI] Disable Qwen3-TTS E2E Test in pipeline.yml by @Gaohan123 in #1306
- [Misc] Add per-request generator_device to online image gen and edit by @gcanlin in #1183
- [Bagel]: Support TP by @princepride in #1293
- [Bugfix] Fix image edit RoPE crash when explicit height/width are provided by @lishunyang12 in #1265
- [Doc] Sync by @congw729 in #1216
- [Bugfix] fix precision issues of qwen3-omni when enable async_chunk without system prompt by @R2-Y in #1288
- [Debug] Add trigger to concurrent stage init by @tzhouam in #1274
- [Bugfix][Qwen3-TTS] Fix task type by @ekagra-ranjan in #1317
- Unifying CLI Argument Naming Style by @wtomin in #1309
- [Bugfix][Qwen3-TTS] Preserve original model ID in omni_snapshot_download by @linyueqian in #1318
- [CI] Run nightly tests. by @congw729 in #1333
- [Feature]: FP8 Quantization Support for DiT by @lishunyang12 in #1034
- Fix yield token metrics and opt metrics record stats by @LJH-LBJ in #1292
- [Test] L2 & L3 Test Case Stratification Design for Omni Model by @yenuo26 in #1272
- [Perf] Support Qwen3 Omni code2wav batch inference with async chunk by @ZeldaHuang in #1246
- update qwen3-omni & qwen2.5-omni openai client by @R2-Y in #1304
- [Feature] Support Wan2.2 T2V and I2V Online Serving with OpenAI /v1/videos API by @SamitHuang in #1073
- [Feature] add Tensor Parallelism to SD_3.5 by @GG-li in #1336
- [Feature]async scheduling to overlap chunk IO and compute by @Shirley125 in #951
- [Bugfix] reused metrics to modify the API Server token statistics in Stream Response by @kechengliu97 in #1301
- Refactor CPU Offloading Backend Pattern by @yuanheng-zhao in #1223
- [DOC] Doc for CI test - Details about five-level structure and some other files. by @congw729 in #1167
- [Bugfix] remove Tongyi-MAI/Z-Image-Turbo related test from L2 ci by @Bounty-hunter in #1348
- [Misc] wechat image update by @david6666666 in #1354
- [Misc] Support WorkerWrapperBase and CustomPipeline for Diffusion Worker by @knlnguyen1802 in #764
- [Feature][Bugfix] Add CFG feature to Bagel by @nussejzz in #1310
- [Feature]: Diffusion sleep to use process level memory calculation by @divyanshsinghvi in #1276
- change qwen3-omni open cudagraph by default by @R2-Y in #1352
- [XPU] Update Bagel's flash_attn_varlen_func to fa utils by @zhenwei-intel in #1295
- [Test] Add Omni Model Performance Benchmark Test by @yenuo26 in #1321
- [BugFix]: Revert utils change by @princepride in #1369
- [Rebase] Rebase to vllm v0.16.0 by @tzhouam in #1357
- [Test] Fix expansion and example test case for qwen3-omni by @yenuo26 in #1358
- [v0.16.0][BUG FIX]Fix hunyuan MOE after update to 0.16.0 by @xuechendi in #1401
- [0.16.0] remove cuda hard-code for Hunyuan Image3 by @xuechendi in #1402
- [XPU] Add XPU Dockerfile and related docs by @yma11 in #1162
- [Bugfix] Fix Hardcoded Datatypes in Z-image by @alex-jw-brooks in #1393
- [Feature] : Support disaggregated inference pipeline for Qwen3_TTS by @Sy0307 in #1161
- [Feature] Add automated PR reviewer bot with GLM integration by @hsliuustc0106 in #1424
- [Misc] Add Qwen2.5-Omni-3B model support to Gradio demo by @UsamaKenway in #1382
- [misc] Feature/pr reviewer auto trigger&update model by @hsliuustc0106 in #1431
- Revert "[misc] Feature/pr reviewer auto trigger&update model" by @hsliuustc0106 in #1432
- [Doc] Update GPU installation commands by @tzhouam in #1434
- [ROCM] [CI] fix dockerfile.rocm to support nightly build and also fix amd ci v0.16.0rc1 by @tjtanaa in #1380
- [Feature][BAGEL] Combine multi-branch cfg into a single batch to accelerate inference. by @nussejzz in #1429
- [Feat]: add ASCII art logo for vLLM-Omni by @zzhuoxin1508 in #1430
- [Bug] [Bagel] Fix kv transfer bug by @nussejzz in #1437
- [CI] Set L2 & L3 tests running conditions. by @congw729 in #1344
- [Feature] vLLM-Omni RDMA connector by @natureofnature in #1019
- [Minor][Refactor] Pass seq_token_counts explicitly by @gcanlin in #1425
- [Misc] Extend Diffusion Benchmark script to other backends by @NickLucche in #875
- [Feature] Support Stage Based Deployment CLI by @wuhang2014 in #939
- [Doc] Optimize vLLM-Omni metrics documentation by @LJH-LBJ in #1311
- [Bugfix] Forward all vllm-omni serve command parameters to model by @LJH-LBJ in #985
- [Doc]: Add bagel single/multi node usage with mooncake document by @princepride in #1450
- [Qwen3TTS][Feat] Code2Wav batched decoding by @JuanPZuluaga in #1426
- [CI] Remove overwhelming debug log by @tzhouam in #1463
- [Misc] update wechat image by @david6666666 in #1464
- [Doc] Refine Diffusion Tutorial Documents by @wtomin in #1305
- [Bugfix] Robust Audio Data Handling in _create_audio_choice by @LJH-LBJ in #1222
- [Bugfix]: Fix merging updated additional information to ensure dict type by @Dovis01 in #1296
- [Model]Add new nextstep_1(Diffusion) model(only T2I) by @sniper35 in #612
- [Bugfix] Add TTS configuration options by @YanickSchraner in #1177
- [Debug] Multi-Request for Qwen 3 Omni use_audio_in_video by @tzhouam in #1433
- [Bugfix] Fix case-sensitive task_type matching in Qwen3TTSModelForGeneration by @upskyy in #1455
- [BugFix] process request.num_cached_tokens if it equals to the initial value by @LJH-LBJ in #1468
- [Bugfix] Fix SDPA attention mask dtype and shape (Fix #857) by @yJader in #1349
- [Test] Reduce Perf test case and fix modify stage config by @yenuo26 in #1449
- [NPU] Upgrade to v0.16.0 by @gcanlin in #1375
- [CI] Update Dockerfile for vllm-omni CI image and remove obsolete dep… by @tzhouam in #1491
- [Fix][Chore] Qwen3-TTS Modeling Minor Code Sanity Improvements by @yuanheng-zhao in #1482
- [Bugfix] Fix tuple/list KV cache extraction crash by @junuxyz in #1405
- [Doc] format lora related docs for the user's end by @AndyZhou952 in #1009
- [Feature] Support Wan2.2 output with irregular shapes by @gcanlin in #1279
- [Misc] Migrate L1 tests to use pytest-mock by @yuanheng-zhao in #1315
- [Bugfix] Fix LoRA Scaling on Active Adapters by @alex-jw-brooks in #1421
- [Bugfix] fix record audio generated frame in offline infer by @LJH-LBJ in #1312
- [Model] Support OmniGen2 by @legitnull in #513
- [Bugfix][Qwen3TTS] by @JuanPZuluaga in #1289
- Use pull through cache image for H100 pool by @khluu in #1518
- [ROCm] [CI] [Docker] Point to use the latest vLLM v0.16.0 stable version by @tjtanaa in #1500
- [Bugfix] fix offline text_to_image error from #1009 by @david6666666 in #1515
- [XPU] Enable FLASH_ATTN on XPU by @yma11 in #1332
- Revert gpu_1 job to use regular image by @khluu in #1521
- [Chore] remove unused logger in omni_diffusion (#531) by @fhfuih in #1509
- [Qwen3TTS][Feat] Streaming output by @JuanPZuluaga in #1438
- [Bugfix] Race condition in MultiprocExecutor on concurrent access to Scheduler by @knlnguyen1802 in #1448
- [Doc][Test][Misc] ComfyUI test, more screenshot, and code cleaning by @fhfuih in #1435
- [Performance]Qwen3-Omni performance optimization by @amy-why-3459 in #1378
- [Feature] Support HSDP for diffusion models by @gcanlin in #1339
- [CI] fixed CI timeout by @zhumingjue138 in #1460
- [Bugfix] Use uds for zmq address if not set --stage-id by @wuhang2014 in #1522
- [BugFix] Restore talker's config by @amy-why-3459 in #1524
- [XPU] fix qwen_omni after rebase to v0.16.0 by @xuechendi in #1416
- [Platform] Enable layerwise offload on all hardware by @gcanlin in #1492
- diffusion: enable VAE patch parallel for SD3.5 by @dongbo910220 in #1428
- [Perf] GLM Image by @JaredforReal in #920
- [skip ci][Doc] add design docs for async chunk in qwen3-omni by @R2-Y in #962
- feat(qwen3-tts): Add CUDA Graph support for speech tokenizer decoder by @xulusjb in #1205
- [New Model]: XiaomiMiMo/MiMo-Audio-7B-Instruct support by @qibaoyuan in #750
- [Feature]: Native GGUF Quantization Support for DiT by @david6666666 in #1285
- Add benchmark for v1/audio/speech non-streaming by @ekagra-ranjan in #1408
- [Version] Auto generate version using setuptools_scm by @tjtanaa in #1224
- [Feat]: Support Async chunk cleanup by @Sy0307 in #1087
- [Profiler] Support online profiling by @gcanlin in #1136
- [Bugfix] Fix redundant finished req status updating on OmniGenerationScheduler by @Dovis01 in #1510
- [XPU][NPU][ROCM] enable cpu_offloading flag for non_cuda by @xuechendi in #1488
- [Chore] Cleanup dead code in GGUF DiT code path by @Isotr0py in #1533
- [Doc] Update installation instructions for vllm 0.16.0 by @tzhouam in #1505
- [Doc] [skip ci]Sync. by @congw729 in #1363
- [CI][skip ci]Update H100 image link based on #1518 by @congw729 in #1538
- Fix no embed text spk tokens by @LJH-LBJ in #1540
- [Debug] Merge vllm pull 35368 by @tzhouam in #1534
- [Docs] update async chunk docs diagram [skip ci] by @R2-Y in #1530
- fix(qwen3-tts): fix Base ICL voice clone producing corrupted audio by @linyueqian in #1554
- [NPU][Bugfix] Align GPU side and recover qwen3-tts by @gcanlin in #1564
- [BugFix] Fix unexpected crash when init OmniDiffusion by @Semmer2 in #1562
- [CI] Modify some CI test cases to run on L4 environment to reduce H100 resource usage. by @yenuo26 in #1543
- [BugFix]: fix a lot of bugs by @princepride in #1565
New Contributors
- @ceanna93 made their first contribution in #1122
- @hadipash made their first contribution in #926
- @zhenwei-intel made their first contribution in #1148
- @erfgss made their first contribution in #1105
- @xiedeyantu made their first contribution in #1187
- @Pr0Wh1teGivee made their first contribution in #964
- @yma11 made their first contribution in #1164
- @ElleElleWu made their first contribution in #1085
- @ekagra-ranjan made their first contribution in #1317
- @Shirley125 made their first contribution in #951
- @xuechendi made their first contribution in #1401
- @alex-jw-brooks made their first contribution in #1393
- @Sy0307 made their first contribution in #1161
- @UsamaKenway made their first contribution in #1382
- @Dovis01 made their first contribution in #1296
- @YanickSchraner made their first contribution in #1177
- @upskyy made their first contribution in #1455
- @yJader made their first contribution in #1349
- @junuxyz made their first contribution in #1405
- @legitnull made their first contribution in #513
- @khluu made their first contribution in #1518
- @zhumingjue138 made their first contribution in #1460
- @xulusjb made their first contribution in #1205
- @Semmer2 made their first contribution in #1562
Full Changelog: v0.14.0...v0.16.0