github vllm-project/vllm-omni v0.18.0

Highlights

This release features 324 commits from 83 contributors, including 38 new contributors.

vLLM-Omni v0.18.0 is a major rebase and systems release that aligns the project with upstream vLLM v0.18.0, strengthens the core runtime through a large entrypoint refactor and scheduler/runtime cleanups, expands unified quantization and diffusion execution, broadens multimodal model coverage, and improves production readiness across audio, omni, image, video, RL, and multi-platform deployments.

Key Improvements

  • Rebased to upstream vLLM v0.18.0, with follow-up updates to docs and dockerfiles, plus cleanup of patches that were no longer needed after the rebase. (#2037, #2038, #2062, #2271)
  • Refactored the serving entrypoint architecture, making the stack cleaner and easier to extend, while also laying groundwork for PD disaggregation, multimodal output decoupling, coordinator-based orchestration, and pipeline config cleanup. (#1908, #1863, #1816, #1465, #1115)
  • Strengthened audio, speech, and omni production serving, especially for Qwen3-TTS, Qwen3-Omni, MiMo-Audio, Fish Speech S2 Pro, and Voxtral TTS, with lower latency, better concurrency, more robust streaming, and improved online serving stability. (#1583, #1617, #1797, #1913, #1985, #1852, #1656, #1963, #2009, #2019, #2239, #1688, #1752, #1964, #2225, #1859, #2145, #2151, #2156, #2158)
  • Delivered substantial diffusion optimization, with scheduler/executor refactoring, faster startup, better cache-dit / TeaCache integration, broader TP/SP/HSDP support, and multiple correctness fixes for online and offline serving. (#1625, #1504, #1715, #1834, #1848, #1234, #2163, #1979, #2101, #2176)
  • Expanded model support across omni, speech, image, and video, including Helios, Helios-Mid / Distilled, MammothModa2, Fun CosyVoice3-0.5B-2512, FLUX.2-dev, FLUX.1-Kontext-dev, Hunyuan Image3 AR, Fish Speech S2 Pro, Voxtral TTS, DreamID-Omni, LTX-2, and HunyuanVideo-1.5. (#1604, #1648, #336, #498, #1629, #561, #759, #1798, #1803, #1855, #841, #1516)
  • Introduced a unified quantization framework and expanded quantization support across diffusion and image workloads, including INT8, FP8, and GGUF-related enablement. (#1764, #1470, #1640, #1755, #1473, #2180)
  • Improved RL and custom pipeline readiness in close collaboration with verl, helping enable Qwen-Image end-to-end RL / Flow-GRPO training. This includes collective RPC support at the entrypoint, custom input/output support, async batching for Qwen-Image, and dedicated E2E coverage for custom RL pipelines. (#1646, #1593, #2005, #2217)

Core Architecture & Runtime

  • Reworked the core serving architecture through the vLLM-Omni Entrypoint Refactoring, while also adding PD disaggregation scaffolding, coordinator support, multimodal output decoupling foundations, and cleaner model/pipeline configuration handling. (#1908, #1863, #1465, #1816, #1115, #1958, #2105)
  • Continued cleanup of runtime internals with stage/step pipeline refactors, dead-code cleanup, and improvements to async engine robustness and scheduler state handling. (#1368, #1579, #2153, #2028, #1893)

Model Support

  • Omni / speech / audio models: added or expanded support for MammothModa2, Fun CosyVoice3-0.5B-2512, Fish Speech S2 Pro, and Voxtral TTS. (#336, #498, #1798, #1803)
  • Image / diffusion models: added or expanded support for Hunyuan Image-3.0, FLUX.2-dev, FLUX.1-Kontext-dev, and continued improvements for Qwen-Image, Qwen-Image-Edit, Qwen-Image-Layered, LongCat-Image, GLM-Image, Bagel, and OmniGen2. (#759, #1629, #561, #1682, #2085, #1970, #2035, #1918, #1578, #1669, #1903, #1711, #1934)
  • Video models: added or expanded support for Helios, Helios-Mid / Distilled, DreamID-Omni, LTX-2, HunyuanVideo-1.5, and updated supported video-generation coverage for Wan2.1-T2V. (#1604, #1648, #1855, #841, #1516, #1920)

Audio, Speech & Omni Production Optimization

  • Qwen3-TTS received major optimization work, including lower TTFA (time to first audio), better high-concurrency throughput, improved Code Predictor / Code2Wav execution, websocket streaming audio output, async scheduling by default, voice upload support, optional ref_text, and long ref_audio handling fixes. (#1583, #1617, #1797, #1913, #1985, #1852, #1719, #1853, #1201, #1879, #2046, #2104)
  • Qwen3-Omni gained lower inter-packet latency, speaker-switching support, decode-alignment fixes, and multiple correctness fixes for answer quality and online serving stability. (#1656, #1963, #2009, #2019, #2239)
  • MiMo-Audio improved compatibility and production robustness with TP fixes, broader attention backend support, configurable chunk sizing, and documentation to prevent noise-only outputs under unsupported attention setups. (#1688, #1752, #1964, #2225, #2205)
  • Fish Speech S2 Pro and Voxtral TTS were productionized further with online serving, voice cloning, better TTFP / inference performance, multilingual demo support, lighter flow matching, and voice-embedding fixes. (#1798, #1859, #2145, #1803, #2045, #2056, #2067, #2151, #2156, #2158, #2023)
  • Added or improved speech-serving interfaces, including speech batch entrypoint, speaker embedding support for speech and voices APIs, proper HTTP status handling, and streaming wav response support. (#1701, #1227, #1687, #1819)
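
As a rough illustration of the streaming speech interfaces described above, the sketch below assembles an OpenAI-style /v1/audio/speech request body. The endpoint path, field names, and model id are assumptions for illustration, not vLLM-Omni's confirmed schema.

```python
# Hypothetical sketch of a TTS request body for an OpenAI-compatible
# /v1/audio/speech endpoint; all field names and values below are
# assumptions, not vLLM-Omni's documented API.
import json


def build_speech_request(text: str, voice: str = "default",
                         stream: bool = True) -> dict:
    """Assemble a speech-synthesis request payload (hypothetical fields)."""
    return {
        "model": "Qwen3-TTS",      # assumed model id
        "input": text,             # text to synthesize
        "voice": voice,            # speaker / voice preset
        "stream": stream,          # request streaming output
        "response_format": "wav",  # pairs with the streaming wav support above
    }


payload = build_speech_request("Hello from vLLM-Omni")
body = json.dumps(payload)
```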

Diffusion, Image & Video Generation

  • Runtime refactor & benchmarking: Refactored the diffusion runtime with cleaner scheduler/executor boundaries, better request-state flow, unified profiling, and stronger benchmarking infrastructure. (#1625, #2099, #1757, #1917, #1995)
  • Performance & startup gains: Improved diffusion performance through multi-threaded weight loading for Wan2.2, reduced IPC overhead for single-stage serving, cache-dit upgrades, TeaCache support, and nightly performance improvements for Qwen-Image. (#1504, #1715, #1834, #1234, #1314, #1805, #2111)
  • Distributed scaling: Expanded distributed diffusion execution with broader TP/SP/HSDP support across Flux, GLM-Image, Hunyuan, and Bagel. (#1250, #1900, #1918, #2163, #1903)
  • Serving UX & API ergonomics: Improved serving usability with a progress bar for diffusion models, richer image-edit parameters such as layers and resolution, and extra request-body support for video APIs. (#1652, #2053, #1955)
  • Correctness & stability fixes: Fixed a wide range of diffusion correctness issues, including config misalignment between offline and online inference, TP/no-seed broken-image issues, GLM-Image stage/device bugs, and TeaCache incompatibilities. (#1979, #2176, #2137, #2101, #1894, #2025)
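
To make the richer image-edit parameters concrete, here is a minimal sketch of a request body carrying the layers and resolution options mentioned above; the overall schema and model id are assumptions for illustration, not the documented API.

```python
# Hypothetical image-edit request sketch; the "layers" and "resolution"
# parameter names come from the release notes, but the schema around them
# is an assumption, not vLLM-Omni's confirmed API.
def build_image_edit_request(prompt: str, image_b64: str,
                             layers: int = 1,
                             resolution: str = "1024x1024") -> dict:
    """Assemble an image-edit payload (hypothetical field names)."""
    width, height = (int(v) for v in resolution.split("x"))
    return {
        "model": "Qwen-Image-Edit",  # assumed model id
        "prompt": prompt,
        "image": image_b64,          # base64-encoded source image
        "layers": layers,            # layered-edit count (assumed semantics)
        "width": width,
        "height": height,
    }


req = build_image_edit_request("add a red hat", "<base64>", layers=2,
                               resolution="768x1024")
```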

Quantization & Memory Efficiency

  • Added the Unified Quantization Framework as a core infrastructure upgrade for more consistent quantized execution across model families. (#1764)
  • Expanded quantization support for diffusion/image workloads, including INT8 for DiT (Z-Image and Qwen-Image), FP8 for Flux transformers, and GGUF adapter support for Qwen-Image. (#1470, #1640, #1755)
  • Improved compatibility between quantization and runtime features such as CPU offload, tensor parallelism, and Flux-family execution. (#1473, #1723, #1978, #2180)
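
As a hedged sketch of how the expanded quantization options might be selected at serve time, assuming a vLLM-style CLI (the flag spelling follows upstream vLLM's --quantization convention; the exact vllm-omni flags and model ids are assumptions):

```shell
# Hypothetical launch commands; --quantization follows upstream vLLM's
# convention and is an assumption for vllm-omni's diffusion workloads.

# FP8 for a Flux-family transformer (assumed model id):
vllm serve black-forest-labs/FLUX.1-dev --quantization fp8

# INT8 DiT for Qwen-Image (assumed flag value):
vllm serve Qwen/Qwen-Image --quantization int8
```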

RL, Serving & Integrations

  • verl collaboration & Qwen-Image E2E RL: Expanded RL-oriented serving in close collaboration with verl, helping enable Qwen-Image end-to-end RL / Flow-GRPO training with collective RPC support, custom input/output, async batching for Qwen-Image, and dedicated E2E CI coverage for custom RL pipelines. (#1646, #1593, #2005, #2217)
  • Rollout scaling for visual RL: Added rollout building blocks referenced by verl’s Qwen-Image integration plan, including async batching for Qwen-Image plus tensor-parallel and data-parallel support for diffusion serving. (#1593, #1713, #1706)
  • Deployment & ecosystem integrations: Improved deployment and ecosystem integration with a Helm chart for Kubernetes, ComfyUI video & LoRA support, and a rewritten async video API lifecycle. (#1337, #1596, #1665)
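
For the Kubernetes deployment path, a minimal Helm invocation might look like the following; the chart location, release name, and values keys are assumptions, since the notes only state that a Helm chart was added (#1337).

```shell
# Hypothetical: install the vLLM-Omni Helm chart from a local checkout.
# Chart path and values keys are assumptions for illustration.
helm install vllm-omni ./helm/vllm-omni \
  --set image.tag=v0.18.0 \
  --namespace vllm-omni --create-namespace
```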

Platforms, Distributed Execution & Hardware Coverage

  • Continued improving portability across CUDA, ROCm, NPU, and XPU/Intel GPU environments, including rebase follow-ups, ROCm CI setup, Intel CI dispatch, Intel GPU docs, and NPU docker/docs refreshes. (#2017, #1984, #1721, #2154, #2271, #2091)
  • Expanded distributed execution coverage with T5 tensor parallelism, more model-level TP/SP/HSDP support, and better handling of visible GPUs and stage-device initialization. (#1881, #1250, #1900, #1918, #2163, #2025)

CI, Benchmarks & Documentation

  • Strengthened release engineering and CI with a release pipeline, richer nightly benchmark/report generation, L3/L4/L5 test layering, expanded model E2E coverage, and stronger diffusion test coverage. (#1726, #1831, #1995, #1514, #1799, #2086, #1869, #2085, #2087, #2132, #2129, #2023)
  • Improved benchmarking with Qwen3-TTS benchmark scripts, nightly Qwen3-TTS and Qwen-Image performance tracking, diffusion timing, random benchmark datasets, and T2I/I2I accuracy benchmark integration. (#1573, #1700, #1805, #2111, #1757, #1657, #1917)
  • Refreshed project docs across installation, omni/TTS docs, diffusion serving parameters, UAA documentation, developer guides, and governance. (#1762, #1693, #2051, #2130, #2148, #1889)

Note

  • GLM-Image requires manually upgrading the transformers package to version >= 5.0.
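
The note above translates to a one-line upgrade, e.g.:

```shell
# Upgrade transformers to satisfy GLM-Image's requirement (>= 5.0).
pip install --upgrade "transformers>=5.0"
```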

What's Changed

New Contributors

Full Changelog: v0.16.0...v0.18.0
