This pre-release aligns vLLM-Omni with upstream vLLM v0.16.0.
Highlights
- Rebase to Upstream vLLM v0.16.0: vLLM-Omni is now fully aligned with the latest vLLM v0.16.0 core, bringing in all the latest upstream features, bug fixes, and performance improvements (#1357).
- Tensor Parallelism for Bagel & SD 3.5: Added Tensor Parallelism (TP) support for the Bagel model and Stable Diffusion 3.5, improving inference scalability for these diffusion workloads (#1293, #1336).
- CFG Parallel Expansion: Extended Classifier-Free Guidance (CFG) parallel support to Bagel and FLUX.1-dev models, enabling faster guided generation (#1310, #1269).
- Async Scheduling for Chunk IO Overlap: Introduced async scheduling to overlap chunk IO and computation across stages, reducing idle time and improving end-to-end throughput (#951).
- Diffusion Sequence Parallelism Optimization: Removed redundant communication cost by refining the SP hook design, improving diffusion parallelism efficiency (#1275).
- ComfyUI Integration: Added a full ComfyUI integration (ComfyUI-vLLM-Omni) as an official app, supporting image generation, multimodal comprehension, and TTS workflows via vLLM-Omni's online serving API (multiple files under apps/ComfyUI-vLLM-Omni/) (#1113).
- Qwen3-Omni Cudagraph by Default: Enabled cudagraph for Qwen3-Omni by default for improved inference performance (#1352).
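For readers unfamiliar with Classifier-Free Guidance (CFG): it runs the denoiser twice, once on the text-conditioned input and once on an unconditional (null-prompt) input, then blends the two predictions. A minimal illustrative sketch of that blend step in plain Python (not the vLLM-Omni API; in CFG-parallel execution the two forward passes can run on different ranks and the blend happens after a gather):

```python
# Classifier-free guidance: guided = uncond + scale * (cond - uncond),
# applied element-wise to the two denoiser predictions.
def cfg_combine(uncond, cond, guidance_scale):
    """Blend unconditional and conditional predictions with a guidance scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

# Toy 3-element "predictions" standing in for denoiser outputs.
uncond = [0.1, 0.2, 0.3]  # prediction from the empty/null prompt
cond = [0.5, 0.4, 0.9]    # prediction from the text prompt
print(cfg_combine(uncond, cond, 2.0))
```

A guidance scale of 1.0 reproduces the conditional prediction exactly; larger scales push the output further toward the prompt, which is why CFG roughly doubles the per-step compute and benefits from running the two passes in parallel.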
What's Changed
Features & Optimizations
- [Misc] Support WorkerWrapperBase and CustomPipeline for Diffusion Worker by @knlnguyen1802 in #764
- Refactor CPU Offloading Backend Pattern by @yuanheng-zhao in #1223
Alignment & Integration
- Unifying CLI Argument Naming Style by @wtomin in #1309
- fix: add diffusion offload args to OmniConfig group instead of serve_parser by @fake0fan in #1271
- [Debug] Add trigger to concurrent stage init by @tzhouam in #1274
Bug Fixes
- [Bugfix][Qwen3-TTS] Fix task type by @ekagra-ranjan in #1317
- [Bugfix][Qwen3-TTS] Preserve original model ID in omni_snapshot_download by @linyueqian in #1318
- [Bugfix] fix precision issues of qwen3-omni when enable async_chunk without system prompt by @R2-Y in #1288
- [BugFix] Fixed the issue where ignore_eos was not working. by @amy-why-3459 in #1286
- [Bugfix] Fix image edit RoPE crash when explicit height/width are provided by @lishunyang12 in #1265
- [Bugfix] reused metrics to modify the API Server token statistics in Stream Response by @kechengliu97 in #1301
- Fix yield token metrics and opt metrics record stats by @LJH-LBJ in #1292
- [XPU] Update Bagel's flash_attn_varlen_func to fa utils by @zhenwei-intel in #1295
Infrastructure (CI/CD) & Documentation
- [CI] Run nightly tests. by @congw729 in #1333
- [CI] Add env variable check for nightly CI by @congw729 in #1281
- [CI] Reduce the time for Diffusion Sequence Parallelism Test by @congw729 in #1283
- [CI] Add CI branch coverage calculation, fix statement coverage results by @yenuo26 in #1120
- [Test] Add BuildKite test-full script for full CI. by @yenuo26 in #867
- [Test] Add example test cases for omni online by @yenuo26 in #1086
- [Test] L2 & L3 Test Case Stratification Design for Omni Model by @yenuo26 in #1272
- [Test] Add Omni Model Performance Benchmark Test by @yenuo26 in #1321
- [Bugfix] remove Tongyi-MAI/Z-Image-Turbo related test from L2 ci by @Bounty-hunter in #1348
- [DOC] Doc for CI test - Details about five level structure and some other files. by @congw729 in #1167
- [Bugfix] Fix Doc link Error by @lishunyang12 in #1263
- update qwen3-omni & qwen2.5-omni openai client by @R2-Y in #1304
Remaining notes
- nvidia-cublas-cu12 is pinned to 12.9.1.4 via force-reinstall in Dockerfile.ci, pending updates from the vLLM main repo and PyTorch (pytorch/pytorch#174949).
- Qwen2.5-omni with mixed_modalities input only uses the first frame of a video; this originates in the vLLM main repo: vllm-project/vllm#34506
New Contributors
- @ekagra-ranjan made their first contribution in #1317
- @zhenwei-intel made their first contribution in #1295
- @Shirley125 made their first contribution in #951
Full Changelog: v0.15.0rc1...v0.16.0rc1