github vllm-project/vllm-omni v0.16.0rc1

Pre-release · one month ago

This pre-release aligns vLLM-Omni with upstream vLLM v0.16.0.

Highlights

  • Rebase to Upstream vLLM v0.16.0: vLLM-Omni is now fully aligned with the latest vLLM v0.16.0 core, bringing in all the latest upstream features, bug fixes, and performance improvements (#1357).
  • Tensor Parallelism for Bagel & SD 3.5: Added Tensor Parallelism (TP) support for the Bagel model and Stable Diffusion 3.5, improving inference scalability for these diffusion workloads (#1293, #1336).
  • CFG Parallel Expansion: Extended Classifier-Free Guidance (CFG) parallel support to Bagel and FLUX.1-dev models, enabling faster guided generation (#1310, #1269).
  • Async Scheduling for Chunk IO Overlap: Introduced async scheduling to overlap chunk IO and computation across stages, reducing idle time and improving end-to-end throughput (#951).
  • Diffusion Sequence Parallelism Optimization: Removed redundant communication cost by refining the SP hook design, improving diffusion parallelism efficiency (#1275).
  • ComfyUI Integration: Added a full ComfyUI integration (ComfyUI-vLLM-Omni) as an official app, supporting image generation, multimodal comprehension, and TTS workflows via vLLM-Omni's online serving API (see apps/ComfyUI-vLLM-Omni/) (#1113).
  • Qwen3-Omni CUDA Graph by Default: Enabled CUDA graph capture for Qwen3-Omni by default for improved inference performance (#1352).
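
For readers unfamiliar with the technique behind the CFG parallel highlight, the sketch below illustrates classifier-free guidance itself. The function name and plain-list tensors are hypothetical, not vLLM-Omni APIs: CFG blends an unconditional and a conditional denoising prediction, and because the two branches are independent until this combine step, they can run on separate devices, which is what CFG parallelism exploits.

```python
# Illustrative sketch of classifier-free guidance (CFG); not vLLM-Omni code.
# The conditional and unconditional branches are computed independently
# (hence parallelizable) and only meet in this element-wise combine.

def apply_cfg(uncond_pred, cond_pred, guidance_scale):
    """Blend unconditional and conditional predictions:
    out = uncond + scale * (cond - uncond)."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(uncond_pred, cond_pred)]

# With guidance_scale=1.0 the output reduces to the conditional branch.
print(apply_cfg([0.0, 1.0], [1.0, 3.0], 1.0))  # [1.0, 3.0]
```

Higher guidance scales push the output further toward the conditional prediction, which is why scales above 1.0 strengthen prompt adherence in diffusion models such as FLUX.1-dev.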

What's Changed

Features & Optimizations

Alignment & Integration

  • Unifying CLI Argument Naming Style by @wtomin in #1309
  • fix: add diffusion offload args to OmniConfig group instead of serve_parser by @fake0fan in #1271
  • [Debug] Add trigger to concurrent stage init by @tzhouam in #1274

Bug Fixes

  • [Bugfix][Qwen3-TTS] Fix task type by @ekagra-ranjan in #1317
  • [Bugfix][Qwen3-TTS] Preserve original model ID in omni_snapshot_download by @linyueqian in #1318
  • [Bugfix] fix precision issues of qwen3-omni when enable async_chunk without system prompt by @R2-Y in #1288
  • [BugFix] Fixed the issue where ignore_eos was not working. by @amy-why-3459 in #1286
  • [Bugfix] Fix image edit RoPE crash when explicit height/width are provided by @lishunyang12 in #1265
  • [Bugfix] reused metrics to modify the API Server token statistics in Stream Response by @kechengliu97 in #1301
  • Fix yield token metrics and opt metrics record stats by @LJH-LBJ in #1292
  • [XPU] Update Bagel's flash_attn_varlen_func to fa utils by @zhenwei-intel in #1295

Infrastructure (CI/CD) & Documentation

  • [CI] Run nightly tests. by @congw729 in #1333
  • [CI] Add env variable check for nightly CI by @congw729 in #1281
  • [CI] Reduce the time for Diffusion Sequence Parallelism Test by @congw729 in #1283
  • [CI] Add CI branch coverage calculation, fix statement coverage results by @yenuo26 in #1120
  • [Test] Add BuildKite test-full script for full CI. by @yenuo26 in #867
  • [Test] Add example test cases for omni online by @yenuo26 in #1086
  • [Test] L2 & L3 Test Case Stratification Design for Omni Model by @yenuo26 in #1272
  • [Test] Add Omni Model Performance Benchmark Test by @yenuo26 in #1321
  • [Bugfix] remove Tongyi-MAI/Z-Image-Turbo related test from L2 ci by @Bounty-hunter in #1348
  • [DOC] Doc for CI test - Details about five level structure and some other files. by @congw729 in #1167
  • [Bugfix] Fix Doc link Error by @lishunyang12 in #1263
  • update qwen3-omni & qwen2.5-omni openai client by @R2-Y in #1304

Known Issues

  • nvidia-cublas-cu12 is pinned to 12.9.1.4 via force-reinstall in Dockerfile.ci, pending updates from the vLLM main repo and PyTorch (pytorch/pytorch#174949).
  • Qwen2.5-Omni with mixed_modalities input uses only the first frame of a video; this issue originates in the vLLM main repo (vllm-project/vllm#34506).

New Contributors

Full Changelog: v0.15.0rc1...v0.16.0rc1
