This pre-release aligns vLLM-Omni with upstream vLLM v0.16.0.
Highlights
- Rebase to Upstream vLLM v0.16.0: vLLM-Omni is now fully aligned with the latest vLLM v0.16.0 core, bringing in all the latest upstream features, bug fixes, and performance improvements (#1357).
- Tensor Parallelism for Bagel & SD 3.5: Added Tensor Parallelism (TP) support for the Bagel model and Stable Diffusion 3.5, improving inference scalability for these diffusion workloads (#1293, #1336).
- CFG Parallel Expansion: Extended Classifier-Free Guidance (CFG) parallel support to Bagel and FLUX.1-dev models, enabling faster guided generation (#1310, #1269).
- Async Scheduling for Chunk IO Overlap: Introduced async scheduling to overlap chunk IO and computation across stages, reducing idle time and improving end-to-end throughput (#951).
- Diffusion Sequence Parallelism Optimization: Removed redundant communication cost by refining the SP hook design, improving diffusion parallelism efficiency (#1275).
- ComfyUI Integration: Added a full ComfyUI integration (ComfyUI-vLLM-Omni) as an official app, supporting image generation, multimodal comprehension, and TTS workflows via vLLM-Omni's online serving API (multiple files under apps/ComfyUI-vLLM-Omni/) (#1113).
- Qwen3-Omni Cudagraph by Default: Enabled cudagraph for Qwen3-Omni by default for improved inference performance (#1352).
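For readers unfamiliar with Classifier-Free Guidance (CFG): it runs the denoiser twice, once on the text-conditioned input and once on an unconditional (null-prompt) input, then blends the two predictions. A minimal illustrative sketch of that blend step in plain Python (not the vLLM-Omni API; in CFG-parallel execution the two forward passes can run on different ranks and the blend happens after a gather):

```python
# Classifier-free guidance: guided = uncond + scale * (cond - uncond),
# applied element-wise to the two denoiser predictions.
def cfg_combine(uncond, cond, guidance_scale):
    """Blend unconditional and conditional predictions with a guidance scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

# Toy 3-element "predictions" standing in for denoiser outputs.
uncond = [0.1, 0.2, 0.3]  # prediction from the empty/null prompt
cond = [0.5, 0.4, 0.9]    # prediction from the text prompt
print(cfg_combine(uncond, cond, 2.0))
```

A guidance scale of 1.0 reproduces the conditional prediction exactly; larger scales push the output further toward the prompt, which is why CFG roughly doubles the per-step compute and benefits from running the two passes in parallel.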
What's Changed
Features & Optimizations
- [Misc] Support WorkerWrapperBase and CustomPipeline for Diffusion Worker by @knlnguyen1802 in #764
- Refactor CPU Offloading Backend Pattern by @yuanheng-zhao in #1223
Alignment & Integration
- Unifying CLI Argument Naming Style by @wtomin in #1309
- fix: add diffusion offload args to OmniConfig group instead of serve_parser by @fake0fan in #1271
- [Debug] Add trigger to concurrent stage init by @tzhouam in #1274
Bug Fixes
- [Bugfix][Qwen3-TTS] Fix task type by @ekagra-ranjan in #1317
- [Bugfix][Qwen3-TTS] Preserve original model ID in omni_snapshot_download by @linyueqian in #1318
- [Bugfix] fix precision issues of qwen3-omni when enable async_chunk without system prompt by @R2-Y in #1288
- [BugFix] Fixed the issue where ignore_eos was not working. by @amy-why-3459 in #1286
- [Bugfix] Fix image edit RoPE crash when explicit height/width are provided by @lishunyang12 in #1265
- [Bugfix] reused metrics to modify the API Server token statistics in Stream Response by @kechengliu97 in #1301
- Fix yield token metrics and opt metrics record stats by @LJH-LBJ in #1292
- [XPU] Update Bagel's flash_attn_varlen_func to fa utils by @zhenwei-intel in #1295
Infrastructure (CI/CD) & Documentation
- [CI] Run nightly tests. by @congw729 in #1333
- [CI] Add env variable check for nightly CI by @congw729 in #1281
- [CI] Reduce the time for Diffusion Sequence Parallelism Test by @congw729 in #1283
- [CI] Add CI branch coverage calculation, fix statement coverage results by @yenuo26 in #1120
- [Test] Add BuildKite test-full script for full CI. by @yenuo26 in #867
- [Test] Add example test cases for omni online by @yenuo26 in #1086
- [Test] L2 & L3 Test Case Stratification Design for Omni Model by @yenuo26 in #1272
- [Test] Add Omni Model Performance Benchmark Test by @yenuo26 in #1321
- [Bugfix] remove Tongyi-MAI/Z-Image-Turbo related test from L2 ci by @Bounty-hunter in #1348
- [DOC] Doc for CI test - Details about five level structure and some other files. by @congw729 in #1167
- [Bugfix] Fix Doc link Error by @lishunyang12 in #1263
- update qwen3-omni & qwen2.5-omni openai client by @R2-Y in #1304
Remaining notes
- nvidia-cublas-cu12 is pinned to 12.9.1.4 via force-reinstall in Dockerfile.ci, pending updates from the vLLM main repo and PyTorch (pytorch/pytorch#174949).
- Qwen2.5-omni with mixed_modalities input only uses the first frame of a video; this originates in the vLLM main repo: vllm-project/vllm#34506
New Contributors
- @ekagra-ranjan made their first contribution in #1317
- @zhenwei-intel made their first contribution in #1295
- @Shirley125 made their first contribution in #951
Full Changelog: v0.15.0rc1...v0.16.0rc1