Highlights
This release comprises 121 commits (merged PRs) from ~60 contributors, 24 of them contributing for the first time.
vLLM-Omni v0.16.0 is a major alignment and capability release. It rebases the project onto upstream vLLM v0.16.0 and significantly expands performance, distributed execution, and production readiness across Qwen3-Omni / Qwen3-TTS, Bagel, MiMo-Audio, GLM-Image, and the Diffusion (DiT) image/video stack, while also improving platform coverage (CUDA / ROCm / NPU / XPU), CI quality, and documentation.
Key Improvements
- Rebase to upstream vLLM v0.16.0: Tracks the latest vLLM runtime behavior and APIs while keeping Omni’s error handling aligned with upstream expectations. (#1357, #1122, plus follow-up fixes like #1401)
- Qwen3-Omni performance + correctness: Performance optimizations (CUDA Graph, async chunk, streaming output) reduce TTFP by 90% and bring RTF to 0.22~0.45, alongside precision and E2E metric correctness fixes. (#1378, #1352, #1288, #1018, #1292)
- MiMo-Audio production support: Performance optimizations (CUDA Graph, async chunk, streaming output) bring RTF to ~0.2, about 11x faster than baseline. (#750)
- Qwen3-TTS production upgrades: Disaggregated inference pipeline support, streaming output, batched Code2Wav decoding, and CUDA Graph support for speech tokenizer decoding, plus multiple robustness fixes across task-type handling and voice cloning. (#1161, #1438, #1426, #1205, #1317, #1554)
- Bagel acceleration & scalability: Adds TP support, introduces CFG capabilities, and accelerates multi-branch CFG by merging branches into a single batch; includes KV transfer stability fixes. (#1293, #1310, #1429, #1437)
- Diffusion distributed execution expansion: Adds/extends TP/SP/HSDP and reduces redundant communication overhead; improves pipeline parallelism options (e.g., VAE patch parallel) and correctness across multiple diffusion families. (#964, #1275, #1339, #756, #1428)
- Quantization for DiT: Introduces FP8 quantization support and native GGUF quantization support for diffusion transformers, with code-path cleanups. (#1034, #1285, #1533)
- Broader model coverage (audio + image): Adds MiMo-Audio-7B-Instruct support and performance improvements for GLM-Image pipelines. (#750, #920)
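As an illustration of the new streaming TTS path (#1438), the sketch below builds a request against the OpenAI-compatible /v1/audio/speech endpoint and writes audio chunks as they arrive. The base URL, model name, and voice value are placeholder assumptions for illustration, not values confirmed by this release.

```python
# Minimal streaming TTS client sketch; server URL, model name, and
# voice below are placeholder assumptions, not confirmed defaults.
import json
import urllib.request


def build_speech_request(base_url: str, model: str, text: str,
                         voice: str = "default", stream: bool = True):
    """Assemble an OpenAI-style speech request as (url, JSON payload)."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "stream": stream,  # streaming output landed in #1438
    }
    return f"{base_url}/v1/audio/speech", payload


def synthesize(base_url: str, model: str, text: str, out_path: str) -> None:
    """POST the request and stream the audio bytes to a file."""
    url, payload = build_speech_request(base_url, model, text)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        for chunk in iter(lambda: resp.read(8192), b""):
            f.write(chunk)  # write audio as it streams in
```

A call like `synthesize("http://localhost:8000", "qwen3-tts", "Hello", "out.wav")` would then exercise the streaming path end to end.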
Diffusion, Image & Video Generation
- New/expanded model coverage
- Distributed & parallel execution
- Performance & memory efficiency
- Correctness & stability
Audio, Speech & Omni (Qwen3-TTS / MiMo-Audio)
- Qwen3-TTS feature set maturation
- Stability & quality
Multimodal Model Improvements
- Bagel
- GLM-Image
Serving, APIs & Integrations
- OpenAI-compatible video serving
- Online serving robustness & usability
- Ecosystem integration
  - ComfyUI integration for improved workflow adoption. (#1113)
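The video-serving work (#1073, listed under What's Changed) exposes Wan2.2 T2V/I2V through an OpenAI-style /v1/videos endpoint. A hedged client sketch follows; the payload fields (prompt, size, seconds) and the model name mirror OpenAI's video API and are assumptions, not a schema confirmed by these notes.

```python
# Sketch of a /v1/videos client; payload field names and the model
# identifier are assumptions modeled on OpenAI's video API.
import json
import urllib.request


def build_video_request(base_url: str, model: str, prompt: str,
                        size: str = "1280x720", seconds: int = 5):
    """Assemble a text-to-video request as (url, JSON payload)."""
    payload = {
        "model": model,
        "prompt": prompt,
        "size": size,        # assumed WxH string, as in OpenAI's API
        "seconds": seconds,  # assumed clip duration field
    }
    return f"{base_url}/v1/videos", payload


def generate_video(base_url: str, model: str, prompt: str) -> bytes:
    """POST the request and return the raw response body."""
    url, payload = build_video_request(base_url, model, prompt)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```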
Performance, Scheduling & Memory Accounting
- Async chunk enhancements
- Metrics & benchmarking
- Memory accounting
Platform, Hardware Backends & Deployment
- XPU / NPU / ROCm coverage improvements
- Deployment & connectivity
CI, Testing, Docs & Developer Experience
- CI quality + coverage
- Docs & tutorials
- Tooling
  - Online profiling support and other developer ergonomics improvements. (#1136)
Stability & Bug Fixes (Across the Stack)
This release includes broad correctness and robustness fixes spanning:
- Diffusion pipelines (dtype/shape, init crashes, model detection, seed and config handling)
- Image edit / generation endpoints (format validation, RoPE crash, argument typing, seed handling)
- Distributed execution (process group mapping accuracy, scheduler race conditions, kv transfer correctness)
- General runtime hygiene (removing unnecessary ZMQ init, CLI naming normalization, upstream-aligned error handling)
What's Changed
- [TeaCache]: Add Coefficient Estimation by @princepride in #940
- [CI]: Bagel E2E Smoked Test by @princepride in #1074
- [Misc] Bump version to 0.14.0 by @ywang96 in #1128
- [Doc] First stable release of vLLM-Omni by @ywang96 in #1129
- [Misc] Align error handling with upstream vLLM v0.14.0 by @ceanna93 in #1122
- [Feature] add Tensor Parallelism to LongCat-Image(-Edit) by @hadipash in #926
- [CI] Temporarily remove slow tests. by @congw729 in #1143
- [CI] Refactor test_sequence_parallel.py and add a warmup run for more accurate performance stat by @mxuax in #1165
- Dev/rebase v0.15.0 by @tzhouam in #1159
- Docs update paper link by @hsliuustc0106 in #1169
- [Debug] Clear Dockerfile.ci to accelerate build image by @tzhouam in #1172
- [Debug] Correct Unreasonable Long Timeout by @tzhouam in #1175
- [Doc]Fix - Align with repo. by @congw729 in #1176
- [Bugfix][Qwen-Image-Edit] Add a warning log for none negative_prompt by @gcanlin in #1170
- [Bugfix] fix qwen image oom by @ZJY0516 in #1168
- [Hardware] Disable compile of diffusion on XPU by @zhenwei-intel in #1148
- [Doc] Fix vLLM version in user docs by @yuanheng-zhao in #1179
- [Refactor] Refactor async chunk and fix the shape mismatch issue by @amy-why-3459 in #1151
- bugfix: /images/edits endpoint fails pipeline data format check by @fhfuih in #1141
- [Perf] Resolve prolonged cudaStreamSynchronize execution in Z-Image processing by @erfgss in #1105
- [Bugfix] modify RTF use audio_e2e/audio_duration by @yenuo26 in #1157
- [Doc] Highlight paper & slides. by @congw729 in #1186
- [chore] Remove zmq context initialize by @xiedeyantu in #1187
- [NPU] Update Dockerfile and docs for v0.14.0 by @gcanlin in #671
- [Bugfix] E2E metric incorrect qwen3-omni with async chunk feature by @LJH-LBJ in #1018
- [Doc] opt doc by @david6666666 in #1118
- [Bugfix] Fix tp+sp accuracy, incorrect process group mapping by @david6666666 in #1178
- [Feature] Enable use_audio_in_video for Qwen 3 Omni Online by @tzhouam in #1198
- [Bugfix] async_chunk rebase v0.15.0 by @amy-why-3459 in #1195
- [feature]: support flux cache_dit by @nuclearwu in #1145
- [CI] Add CI branch coverage calculation, fix statement coverage results and add log before test for buildkite log group by @yenuo26 in #1120
- [Wan 2.2][Diffusion] Add TP Support by @Pr0Wh1teGivee in #964
- [Hardware] [Feat] Setup platform dependent package installation by @tjtanaa in #1046
- [XPU] Fix XPU UTs for basic coverage by @yma11 in #1164
- [Test] Add BuildKite test-full script for full CI. by @yenuo26 in #867
- [Refactor] Reuse upstream Qwen3MoeSparseMoeBlock by @gcanlin in #1202
- [Bugfix] Fix wan2.2 ti2v by @mxuax in #1221
- [Bugfix] Fix '--max-generated-image-size' cli args type by @ApsarasX in #1249
- [Bugfix] Ensure seed=0 is correctly handled in image edit by @ApsarasX in #1248
- [Docs] Add example image download step to Image-To-Video examples by @lishunyang12 in #1258
- [Bugfix] Fix padding bug in 12Hz tokenizer ConvTranspose1d decode by @linyueqian in #1241
- [bugfix] Fix multimodal_output property to check completion outputs where audio data is attached by @linyueqian in #1203
- [Doc] Update QA relevant to quantization by @lishunyang12 in #1257
- [Bugfix] Fix Doc Link Error by @lishunyang12 in #1263
- Process-Scoped GPU Memory Accounting by @divyanshsinghvi in #1204
- [ComfyUI]: ComfyUI integration by @fhfuih in #1113
- fix: add diffusion offload args to OmniConfig group instead of serve_parser by @fake0fan in #1271
- [Doc] Adding models/pipelines/features Tutorial by @wtomin in #1196
- [CI] Add env variable check for nightly CI by @congw729 in #1281
- [CI] Add pytest markers to current tests and update the doc. by @congw729 in #577
- [Diffusion][Perf] Remove Redundant Communication Cost by Refining SP Hook Design by @mxuax in #1275
- [Feature] Opt metrics structure by @LJH-LBJ in #891
- [Test] Add example test cases for omni online by @yenuo26 in #1086
- [CI] Reduce the time for Diffusion Sequence Parallelism Test by @congw729 in #1283
- [Model] Support HunyuanImage3 Diffusion Model in vllm-omni by @ElleElleWu in #1085
- [Chore] Update copyright year. by @lishunyang12 in #1256
- [feature]: support Flux.1-dev CFG-Parallel by @nuclearwu in #1269
- [Bugfix] Fix 'NoneType' AttributeError in stable-diffusion model detect by @yma11 in #1254
- [Doc] Update Qwen3-TTS docs for consistency with Omni examples by @linyueqian in #1226
- [Fix]Ensure HuggingFace downloads complete before initialization. by @zzhuoxin1508 in #1213
- [BugFix] Fixed the issue where ignore_eos was not working. by @amy-why-3459 in #1286
- [Test] Add e2e tests for Qwen3-TTS speech endpoint by @linyueqian in #1206
- [Feat]: support VAE patch parallelism by @dongbo910220 in #756
- [CI] Disable Qwen3-TTS E2E Test in pipeline.yml by @Gaohan123 in #1306
- [Misc] Add per-request generator_device to online image gen and edit by @gcanlin in #1183
- [Bagel]: Support TP by @princepride in #1293
- [Bugfix] Fix image edit RoPE crash when explicit height/width are provided by @lishunyang12 in #1265
- [Doc] Sync by @congw729 in #1216
- [Bugfix] fix precision issues of qwen3-omni when enable async_chunk without system prompt by @R2-Y in #1288
- [Debug] Add trigger to concurrent stage init by @tzhouam in #1274
- [Bugfix][Qwen3-TTS] Fix task type by @ekagra-ranjan in #1317
- Unifying CLI Argument Naming Style by @wtomin in #1309
- [Bugfix][Qwen3-TTS] Preserve original model ID in omni_snapshot_download by @linyueqian in #1318
- [CI] Run nightly tests. by @congw729 in #1333
- [Feature]: FP8 Quantization Support for DiT by @lishunyang12 in #1034
- Fix yield token metrics and opt metrics record stats by @LJH-LBJ in #1292
- [Test] L2 & L3 Test Case Stratification Design for Omni Model by @yenuo26 in #1272
- [Perf] Support Qwen3 Omni code2wav batch inference with async chunk by @ZeldaHuang in #1246
- update qwen3-omni & qwen2.5-omni openai client by @R2-Y in #1304
- [Feature] Support Wan2.2 T2V and I2V Online Serving with OpenAI /v1/videos API by @SamitHuang in #1073
- [Feature] add Tensor Parallelism to SD_3.5 by @GG-li in #1336
- [Feature]async scheduling to overlap chunk IO and compute by @Shirley125 in #951
- [Bugfix] reused metrics to modify the API Server token statistics in Stream Response by @kechengliu97 in #1301
- Refactor CPU Offloading Backend Pattern by @yuanheng-zhao in #1223
- [DOC] Doc for CI test - Details about five-level structure and some other files. by @congw729 in #1167
- [Bugfix] remove Tongyi-MAI/Z-Image-Turbo related test from L2 ci by @Bounty-hunter in #1348
- [Misc] wechat image update by @david6666666 in #1354
- [Misc] Support WorkerWrapperBase and CustomPipeline for Diffusion Worker by @knlnguyen1802 in #764
- [Feature][Bugfix] Add CFG feature to Bagel by @nussejzz in #1310
- [Feature]: Diffusion sleep to use process level memory calculation by @divyanshsinghvi in #1276
- change qwen3-omni open cudagraph by default by @R2-Y in #1352
- [XPU] Update Bagel's flash_attn_varlen_func to fa utils by @zhenwei-intel in #1295
- [Test] Add Omni Model Performance Benchmark Test by @yenuo26 in #1321
- [BugFix]: Revert utils change by @princepride in #1369
- [Rebase] Rebase to vllm v0.16.0 by @tzhouam in #1357
- [Test] Fix expansion and example test case for qwen3-omni by @yenuo26 in #1358
- [v0.16.0][BUG FIX]Fix hunyuan MOE after update to 0.16.0 by @xuechendi in #1401
- [0.16.0] remove cuda hard-code for Hunyuan Image3 by @xuechendi in #1402
- [XPU] Add XPU Dockerfile and related docs by @yma11 in #1162
- [Bugfix] Fix Hardcoded Datatypes in Z-image by @alex-jw-brooks in #1393
- [Feature] : Support disaggregated inference pipeline for Qwen3_TTS by @Sy0307 in #1161
- [Feature] Add automated PR reviewer bot with GLM integration by @hsliuustc0106 in #1424
- [Misc] Add Qwen2.5-Omni-3B model support to Gradio demo by @UsamaKenway in #1382
- [misc] Feature/pr reviewer auto trigger&update model by @hsliuustc0106 in #1431
- Revert "[misc] Feature/pr reviewer auto trigger&update model" by @hsliuustc0106 in #1432
- [Doc] Update GPU installation commands by @tzhouam in #1434
- [ROCM] [CI] fix dockerfile.rocm to support nightly build and also fix amd ci v0.16.0rc1 by @tjtanaa in #1380
- [Feature][BAGEL] Combine multi-branch cfg into a single batch to accelerate inference. by @nussejzz in #1429
- [Feat]: add ASCII art logo for vLLM-Omni by @zzhuoxin1508 in #1430
- [Bug] [Bagel] Fix kv transfer bug by @nussejzz in #1437
- [CI] Set L2 & L3 tests running conditions. by @congw729 in #1344
- [Feature] vLLM-Omni RDMA connector by @natureofnature in #1019
- [Minor][Refactor] Pass seq_token_counts explicitly by @gcanlin in #1425
- [Misc] Extend Diffusion Benchmark script to other backends by @NickLucche in #875
- [Feature] Support Stage Based Deployment CLI by @wuhang2014 in #939
- [Doc] Optimize vLLM-Omni metrics documentation by @LJH-LBJ in #1311
- [Bugfix] Forward all vllm-omni serve command parameters to model by @LJH-LBJ in #985
- [Doc]: Add bagel single/multi node usage with mooncake document by @princepride in #1450
- [Qwen3TTS][Feat] Code2Wav batched decoding by @JuanPZuluaga in #1426
- [CI] Remove overwhelming debug log by @tzhouam in #1463
- [Misc] update wechat image by @david6666666 in #1464
- [Doc] Refine Diffusion Tutorial Documents by @wtomin in #1305
- [Bugfix] Robust Audio Data Handling in _create_audio_choice by @LJH-LBJ in #1222
- [Bugfix]: Fix merging updated additional information to ensure dict type by @Dovis01 in #1296
- [Model]Add new nextstep_1(Diffusion) model(only T2I) by @sniper35 in #612
- [Bugfix] Add TTS configuration options by @YanickSchraner in #1177
- [Debug] Multi-Request for Qwen 3 Omni use_audio_in_video by @tzhouam in #1433
- [Bugfix] Fix case-sensitive task_type matching in Qwen3TTSModelForGeneration by @upskyy in #1455
- [BugFix] process request.num_cached_tokens if it equals to the initial value by @LJH-LBJ in #1468
- [Bugfix] Fix SDPA attention mask dtype and shape (Fix #857) by @yJader in #1349
- [Test] Reduce Perf test case and fix modify stage config by @yenuo26 in #1449
- [NPU] Upgrade to v0.16.0 by @gcanlin in #1375
- [CI] Update Dockerfile for vllm-omni CI image and remove obsolete dep… by @tzhouam in #1491
- [Fix][Chore] Qwen3-TTS Modeling Minor Code Sanity Improvements by @yuanheng-zhao in #1482
- [Bugfix] Fix tuple/list KV cache extraction crash by @junuxyz in #1405
- [Doc] format lora related docs for the user's end by @AndyZhou952 in #1009
- [Feature] Support Wan2.2 output with irregular shapes by @gcanlin in #1279
- [Misc] Migrate L1 tests to use pytest-mock by @yuanheng-zhao in #1315
- [Bugfix] Fix LoRA Scaling on Active Adapters by @alex-jw-brooks in #1421
- [Bugfix] fix record audio generated frame in offline infer by @LJH-LBJ in #1312
- [Model] Support OmniGen2 by @legitnull in #513
- [Bugfix][Qwen3TTS] by @JuanPZuluaga in #1289
- Use pull through cache image for H100 pool by @khluu in #1518
- [ROCm] [CI] [Docker] Point to use the latest vLLM v0.16.0 stable version by @tjtanaa in #1500
- [Bugfix] fix offline text_to_image error from #1009 by @david6666666 in #1515
- [XPU] Enable FLASH_ATTN on XPU by @yma11 in #1332
- Revert gpu_1 job to use regular image by @khluu in #1521
- [Chore] remove unused logger in omni_diffusion (#531) by @fhfuih in #1509
- [Qwen3TTS][Feat] Streaming output by @JuanPZuluaga in #1438
- [Bugfix] Race condition in MultiprocExecutor on concurrent access to Scheduler by @knlnguyen1802 in #1448
- [Doc][Test][Misc] ComfyUI test, more screenshot, and code cleaning by @fhfuih in #1435
- [Performance]Qwen3-Omni performance optimization by @amy-why-3459 in #1378
- [Feature] Support HSDP for diffusion models by @gcanlin in #1339
- [CI] fixed CI timeout by @zhumingjue138 in #1460
- [Bugfix] Use uds for zmq address if not set --stage-id by @wuhang2014 in #1522
- [BugFix] Restore talker's config by @amy-why-3459 in #1524
- [XPU] fix qwen_omni after rebase to v0.16.0 by @xuechendi in #1416
- [Platform] Enable layerwise offload on all hardware by @gcanlin in #1492
- diffusion: enable VAE patch parallel for SD3.5 by @dongbo910220 in #1428
- [Perf] GLM Image by @JaredforReal in #920
- [skip ci][Doc] add design docs for async chunk in qwen3-omni by @R2-Y in #962
- feat(qwen3-tts): Add CUDA Graph support for speech tokenizer decoder by @xulusjb in #1205
- [New Model]: XiaomiMiMo/MiMo-Audio-7B-Instruct support by @qibaoyuan in #750
- [Feature]: Native GGUF Quantization Support for DiT by @david6666666 in #1285
- Add benchmark for v1/audio/speech non-streaming by @ekagra-ranjan in #1408
- [Version] Auto generate version using setuptools_scm by @tjtanaa in #1224
- [Feat]: Support Async chunk cleanup by @Sy0307 in #1087
- [Profiler] Support online profiling by @gcanlin in #1136
- [Bugfix] Fix redundant finished req status updating on OmniGenerationScheduler by @Dovis01 in #1510
- [XPU][NPU][ROCM] enable cpu_offloading flag for non_cuda by @xuechendi in #1488
- [Chore] Cleanup dead code in GGUF DiT code path by @Isotr0py in #1533
- [Doc] Update installation instructions for vllm 0.16.0 by @tzhouam in #1505
- [Doc] [skip ci]Sync. by @congw729 in #1363
- [CI][skip ci]Update H100 image link based on #1518 by @congw729 in #1538
- Fix no embed text spk tokens by @LJH-LBJ in #1540
- [Debug] Merge vllm pull 35368 by @tzhouam in #1534
- [Docs] update async chunk docs diagram [skip ci] by @R2-Y in #1530
- fix(qwen3-tts): fix Base ICL voice clone producing corrupted audio by @linyueqian in #1554
- [NPU][Bugfix] Align GPU side and recover qwen3-tts by @gcanlin in #1564
- [BugFix] Fix unexpected crash when init OmniDiffusion by @Semmer2 in #1562
- [CI] Modify some CI test cases to run on L4 environment to reduce H100 resource usage. by @yenuo26 in #1543
- [BugFix]: fix a lot of bugs by @princepride in #1565
New Contributors
- @ceanna93 made their first contribution in #1122
- @hadipash made their first contribution in #926
- @zhenwei-intel made their first contribution in #1148
- @erfgss made their first contribution in #1105
- @xiedeyantu made their first contribution in #1187
- @Pr0Wh1teGivee made their first contribution in #964
- @yma11 made their first contribution in #1164
- @ElleElleWu made their first contribution in #1085
- @ekagra-ranjan made their first contribution in #1317
- @Shirley125 made their first contribution in #951
- @xuechendi made their first contribution in #1401
- @alex-jw-brooks made their first contribution in #1393
- @Sy0307 made their first contribution in #1161
- @UsamaKenway made their first contribution in #1382
- @Dovis01 made their first contribution in #1296
- @YanickSchraner made their first contribution in #1177
- @upskyy made their first contribution in #1455
- @yJader made their first contribution in #1349
- @junuxyz made their first contribution in #1405
- @legitnull made their first contribution in #513
- @khluu made their first contribution in #1518
- @zhumingjue138 made their first contribution in #1460
- @xulusjb made their first contribution in #1205
- @Semmer2 made their first contribution in #1562
Full Changelog: v0.14.0...v0.16.0