Highlights
This release features 324 commits from 83 contributors, including 38 new contributors.
vLLM-Omni v0.18.0 is a major rebase and systems release. It aligns the project with upstream vLLM v0.18.0, strengthens the core runtime through a large entrypoint refactor and scheduler/runtime cleanups, expands unified quantization and diffusion execution, broadens multimodal model coverage, and improves production readiness across audio, omni, image, video, RL, and multi-platform deployments.
Key Improvements
- Rebased to upstream vLLM v0.18.0, with follow-up updates to docs and dockerfiles, plus cleanup of patches that were no longer needed after the rebase. (#2037, #2038, #2062, #2271)
- Refactored the serving entrypoint architecture, making the stack cleaner and easier to extend, while also laying groundwork for PD disaggregation, multimodal output decoupling, coordinator-based orchestration, and pipeline config cleanup. (#1908, #1863, #1816, #1465, #1115)
- Strengthened audio, speech, and omni production serving, especially for Qwen3-TTS, Qwen3-Omni, MiMo-Audio, Fish Speech S2 Pro, and Voxtral TTS, with lower latency, better concurrency, more robust streaming, and improved online serving stability. (#1583, #1617, #1797, #1913, #1985, #1852, #1656, #1963, #2009, #2019, #2239, #1688, #1752, #1964, #2225, #1859, #2145, #2151, #2156, #2158)
- Delivered substantial diffusion optimization, with scheduler/executor refactoring, faster startup, better cache-dit / TeaCache integration, broader TP/SP/HSDP support, and multiple correctness fixes for online and offline serving. (#1625, #1504, #1715, #1834, #1848, #1234, #2163, #1979, #2101, #2176)
- Expanded model support across omni, speech, image, and video, including Helios, Helios-Mid / Distilled, MammothModa2, Fun CosyVoice3-0.5B-2512, FLUX.2-dev, FLUX.1-Kontext-dev, Hunyuan Image3 AR, Fish Speech S2 Pro, Voxtral TTS, DreamID-Omni, LTX-2, and HunyuanVideo-1.5. (#1604, #1648, #336, #498, #1629, #561, #759, #1798, #1803, #1855, #841, #1516)
- Introduced a unified quantization framework and expanded quantization support across diffusion and image workloads, including INT8, FP8, and GGUF-related enablement. (#1764, #1470, #1640, #1755, #1473, #2180)
- Improved RL and custom pipeline readiness through close collaboration with verl, helping enable Qwen-Image end-to-end RL / Flow-GRPO training. This includes collective RPC support at the entrypoint, custom input/output support, async batching for Qwen-Image, and dedicated E2E coverage for custom RL pipelines. (#1646, #1593, #2005, #2217)
Core Architecture & Runtime
- Reworked the core serving architecture through the vLLM-Omni Entrypoint Refactoring, while also adding PD disaggregation scaffolding, coordinator support, multimodal output decoupling foundations, and cleaner model/pipeline configuration handling. (#1908, #1863, #1465, #1816, #1115, #1958, #2105)
- Continued cleanup of runtime internals with stage/step pipeline refactors, dead-code cleanup, and improvements to async engine robustness and scheduler state handling. (#1368, #1579, #2153, #2028, #1893)
Model Support
- Omni / speech / audio models: added or expanded support for MammothModa2, Fun CosyVoice3-0.5B-2512, Fish Speech S2 Pro, and Voxtral TTS. (#336, #498, #1798, #1803)
- Image / diffusion models: added or expanded support for Hunyuan Image-3.0, FLUX.2-dev, FLUX.1-Kontext-dev, and continued improvements for Qwen-Image, Qwen-Image-Edit, Qwen-Image-Layered, LongCat-Image, GLM-Image, Bagel, and OmniGen2. (#759, #1629, #561, #1682, #2085, #1970, #2035, #1918, #1578, #1669, #1903, #1711, #1934)
- Video models: added or expanded support for Helios, Helios-Mid / Distilled, DreamID-Omni, LTX-2, HunyuanVideo-1.5, and updated supported video-generation coverage for Wan2.1-T2V. (#1604, #1648, #1855, #841, #1516, #1920)
Audio, Speech & Omni Production Optimization
- Qwen3-TTS received major optimization work, including lower TTFA, better high-concurrency throughput, improved Code Predictor / Code2Wav execution, websocket streaming audio output, async scheduling by default, voice upload support, optional `ref_text`, and long `ref_audio` handling fixes. (#1583, #1617, #1797, #1913, #1985, #1852, #1719, #1853, #1201, #1879, #2046, #2104)
- Qwen3-Omni gained lower inter-packet latency, speaker-switching support, decode-alignment fixes, and multiple correctness fixes for answer quality and online serving stability. (#1656, #1963, #2009, #2019, #2239)
- MiMo-Audio improved compatibility and production robustness with TP fixes, broader attention backend support, configurable chunk sizing, and documentation to prevent noise-only outputs under unsupported attention setups. (#1688, #1752, #1964, #2225, #2205)
- Fish Speech S2 Pro and Voxtral TTS were productionized further with online serving, voice cloning, better TTFP / inference performance, multilingual demo support, lighter flow matching, and voice-embedding fixes. (#1798, #1859, #2145, #1803, #2045, #2056, #2067, #2151, #2156, #2158, #2023)
- Added or improved speech-serving interfaces, including a speech batch entrypoint, speaker embedding support for the speech and voices APIs, proper HTTP status handling, and streaming `wav` response support. (#1701, #1227, #1687, #1819)
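As a rough sketch of the streaming `wav` path (#1819), a request against the OpenAI-style `/v1/audio/speech` endpoint might look like the following. The host, port, and model name are placeholders, and the exact request-body fields beyond `response_format` and `stream` are assumptions; check the serving docs for the authoritative schema.

```shell
# Hypothetical streaming TTS request; server address and model name are placeholders.
curl -N http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-TTS",
        "input": "Hello from vLLM-Omni.",
        "response_format": "wav",
        "stream": true
      }' \
  --output out.wav
```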
Diffusion, Image & Video Generation
- Runtime refactor & benchmarking: Refactored the diffusion runtime with cleaner scheduler/executor boundaries, better request-state flow, unified profiling, and stronger benchmarking infrastructure. (#1625, #2099, #1757, #1917, #1995)
- Performance & startup gains: Improved diffusion performance through multi-threaded weight loading for Wan2.2, reduced IPC overhead for single-stage serving, cache-dit upgrades, TeaCache support, and nightly performance improvements for Qwen-Image. (#1504, #1715, #1834, #1234, #1314, #1805, #2111)
- Distributed scaling: Expanded distributed diffusion execution with broader TP/SP/HSDP support across Flux, GLM-Image, Hunyuan, and Bagel. (#1250, #1900, #1918, #2163, #1903)
- Serving UX & API ergonomics: Improved serving usability with a progress bar for diffusion models, richer image-edit parameters such as layers and resolution, and extra request-body support for video APIs. (#1652, #2053, #1955)
- Correctness & stability fixes: Fixed a wide range of diffusion correctness issues, including config misalignment between offline and online inference, TP/no-seed broken-image issues, GLM-Image stage/device bugs, and TeaCache incompatibilities. (#1979, #2176, #2137, #2101, #1894, #2025)
Quantization & Memory Efficiency
- Added the Unified Quantization Framework as a core infrastructure upgrade for more consistent quantized execution across model families. (#1764)
- Expanded quantization support for diffusion/image workloads, including INT8 for DiT (Z-Image and Qwen-Image), FP8 for Flux transformers, and GGUF adapter support for Qwen-Image. (#1470, #1640, #1755)
- Improved compatibility between quantization and runtime features such as CPU offload, tensor parallelism, and Flux-family execution. (#1473, #1723, #1978, #2180)
RL, Serving & Integrations
- verl collaboration & Qwen-Image E2E RL: Expanded RL-oriented serving in close collaboration with verl, helping enable Qwen-Image end-to-end RL / Flow-GRPO training with collective RPC support, custom input/output, async batching for Qwen-Image, and dedicated E2E CI coverage for custom RL pipelines. (#1646, #1593, #2005, #2217)
- Rollout scaling for visual RL: Added rollout building blocks referenced by verl’s Qwen-Image integration plan, including async batching for Qwen-Image plus tensor-parallel and data-parallel support for diffusion serving. (#1593, #1713, #1706)
- Deployment & ecosystem integrations: Improved deployment and ecosystem integration with a Helm chart for Kubernetes, ComfyUI video & LoRA support, and a rewritten async video API lifecycle. (#1337, #1596, #1665)
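With the Helm chart from #1337, a Kubernetes deployment can be sketched as below. The chart location, release name, and values keys shown here are illustrative assumptions, not the chart's documented interface; consult the chart's own README for the real values.

```shell
# Illustrative only: chart path and values keys are placeholders.
helm install my-vllm-omni ./helm/vllm-omni --set image.tag=v0.18.0
```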
Platforms, Distributed Execution & Hardware Coverage
- Continued improving portability across CUDA, ROCm, NPU, and XPU/Intel GPU environments, including rebase follow-ups, ROCm CI setup, Intel CI dispatch, Intel GPU docs, and NPU docker/docs refreshes. (#2017, #1984, #1721, #2154, #2271, #2091)
- Expanded distributed execution coverage with T5 tensor parallelism, more model-level TP/SP/HSDP support, and better handling of visible GPUs and stage-device initialization. (#1881, #1250, #1900, #1918, #2163, #2025)
CI, Benchmarks & Documentation
- Strengthened release engineering and CI with a release pipeline, richer nightly benchmark/report generation, L3/L4/L5 test layering, expanded model E2E coverage, and stronger diffusion test coverage. (#1726, #1831, #1995, #1514, #1799, #2086, #1869, #2085, #2087, #2132, #2129, #2023)
- Improved benchmarking with Qwen3-TTS benchmark scripts, nightly Qwen3-TTS and Qwen-Image performance tracking, diffusion timing, random benchmark datasets, and T2I/I2I accuracy benchmark integration. (#1573, #1700, #1805, #2111, #1757, #1657, #1917)
- Refreshed project docs across installation, omni/TTS docs, diffusion serving parameters, UAA documentation, developer guides, and governance. (#1762, #1693, #2051, #2130, #2148, #1889)
Note
- GLM-Image requires manually upgrading the `transformers` version to >= 5.0.
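For the GLM-Image requirement above, the upgrade is a one-line pip command (no upper bound is specified in these notes, so none is pinned here):

```shell
# Upgrade transformers to 5.0 or newer, as required by GLM-Image
pip install --upgrade "transformers>=5.0"
```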
What's Changed
- 0.16.0 release by @ywang96 in #1576
- [Refactor]: Phase1 for rebasing_additional_info by @divyanshsinghvi in #1394
- [Feature]: Support cfg kv-cache transfer in multi-stage by @princepride in #1422
- [BugFix] Fix load_weights error when loading HunyuanImage3.0 by @Semmer2 in #1598
- [Bugfix] fix kernel error for qwen3-omni by @R2-Y in #1602
- [bugfix] Fix unexpected argument 'is_finished' in function llm2code2wav_async_chunk of mimo-audio by @qibaoyuan in #1570
- [Bugfix] Import InputPreprocessor into Renderer by @lengrongfu in #1566
- [Feature][Wan2.2] Speed up diffusion model startup by multi-thread weight loading by @SamitHuang in #1504
- [Bugfix][Model] Fix LongCat Image Config Handling / Layer Creation by @alex-jw-brooks in #1485
- [Bugfix] Fix Qwen3-TTS code predictor crash due to missing vLLM config context by @ZhanqiuHu in #1619
- [Debug] Enable curl retry aligned with openai by @tzhouam in #1539
- [Doc] Fix links in the configuration doc by @yuanheng-zhao in #1615
- [CI] Add scripts for benchmark collection and email distribution. by @congw729 in #1307
- [FEATURE] Tile/Patch parallelism refactor for easily support other models by @Bounty-hunter in #1366
- [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation by @yuanheng-zhao in #1609
- Make chunk_size and left_context_size configurable via YAML for async chunking by @LJH-LBJ in #1423
- [Bugfix] Fix transformers 5.x compat issues in online TTS serving by @linyueqian in #1536
- [Refactor] lora: reuse load_weights packed mapping by @dongbo910220 in #991
- [Model]: Support Helios from ByteDance by @princepride in #1604
- [chore] add _repeated_blocks for regional compilation support by @RuixiangMa in #1642
- [Bugfix] Add TTS request validation to prevent engine crashes by @linyueqian in #1641
- [CI] Fix ASCII codes. by @congw729 in #1647
- [Misc] update wechat by @david6666666 in #1649
- docs: Announce vllm-omni-skills community project by @hsliuustc0106 in #1651
- [Model] Add Hunyuan Image3 AR Support by @usberkeley in #759
- [Test][Qwen3-Omni]Modify Qwen3-Omni benchmark test cases by @amy-why-3459 in #1628
- [Bugfix] Fix Dtype Parsing by @alex-jw-brooks in #1391
- [XPU] fix UMD version in docker file by @yma11 in #1545
- add support for MammothModa2 model by @HonestDeng in #336
- [Model] Fun cosy voice3-0.5-b-2512 by @divyanshsinghvi in #498
- [Bugfix] Enable torch.compile for low noise model (transformer_2) by @lishunyang12 in #1541
- [NPU] [Features] [Bugfix] Support mindiesd adaln by @jiangmengyu18 in #1537
- [FP8 Quantization] Add FP8 quantization support for Flux transformer by @zzhuoxin1508 in #1640
- Replace hard-coded cuda generator with current_omni_platform.device_type by @pi314ever in #1677
- [BugFix] Fix LongCat Sequence Parallelism / Small Cleanup by @alex-jw-brooks in #1631
- [Misc] remove logits_processor_pattern this field, because vllm have … by @lengrongfu in #1675
- [CI] Remove high concurrency tests before issue #1374 fixed. by @congw729 in #1683
- [Optimize][Qwen3-Omni] Reduce inter-packet latency in async chunk by @ZeldaHuang in #1656
- [Feat][Qwen3TTS] reduce TTFA with flexible initial phase by @JuanPZuluaga in #1583
- [Model] support LTX-2 text-to-video image-to-video by @david6666666 in #841
- [BugFix] Return proper HTTP status for ErrorResponse in create_speech by @Lidang-Jiang in #1687
- [Doc] Add the test guide document. [skip ci] by @yenuo26 in #1376
- [UX] Add progress bar for diffusion models by @gcanlin in #1652
- [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder by @ZhanqiuHu in #1664
- [Feature] Support flexible task_type configuration for Qwen3-TTS models by @JackLeeHal in #1197
- [Cleanup] Move cosyvoice3 tests to model subdirectory by @linyueqian in #1666
- [Feature][Bagel] Add CFG parallel mode by @nussejzz in #1578
- perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor by @dubin555 in #1614
- [Refactor][Perf] Qwen3-TTS: re-prefill Code Predictor with torch.compile + enable Code2Wav decoder CUDA Graph by @Sy0307 in #1617
- [MiMo-Audio] Bugfix tp lg than 1 by @qibaoyuan in #1688
- Add non-async chunk support for Qwen3-TTS by @linyueqian in #1678
- [1/N][Refactor] Clean up dead code in output processor by @gcanlin in #1579
- [feature]: support flux2.klein cache_dit by @nuclearwu in #1209
- [skip CI][Docs] Add TTS model developer guide by @linyueqian in #1693
- [Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline by @erfgss in #668
- [Feature]: Add vae-patch-parallel CLI argument in online serving by @wtomin in #1716
- Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)" by @gcanlin in #1724
- [CI] Add release-pipeline.yaml. by @congw729 in #1726
- [NPU] Support Helios-Mid / Distilled by @gcanlin in #1648
- [skip ci] Update slides link by @hsliuustc0106 in #1730
- [Bugfix] (qwen3_tts): enable batched offline inference by fixing tens… by @RomanKoshkin in #1417
- [Bugfix] Use upstream MediaConnector for ref_audio resolution by @linyueqian in #1661
- [RL] Support collective rpc api to entrypoint && Support custom input output by @knlnguyen1802 in #1646
- Pre-download Qwen3-TTS model in CI to avoid intermittent download timeouts by @linyueqian in #1727
- [1/N] fix CP for Helios by @SHYuanBest in #1729
- feat(tts): add voice upload API for Qwen3-TTS by @zhaotyer in #1201
- [Bagel] Eliminate broadcast in CFG parallel denoising loop by @nussejzz in #1695
- [Feat]: Offline inference supports async_chunk by @Sy0307 in #1415
- [Bugfix] Allow to enable HSDP alone by @gcanlin in #1567
- Disable mm processor cache in CI stage configs by @linyueqian in #1739
- Dev/rebase v0170 by @tzhouam in #1639
- [Perf] Reduce IPC overhead for single-stage diffusion serving for Wan2.2 by @SamitHuang in #1715
- [Test] Solving the Issue of Whisper Model's GPU Memory Not Being Successfully Cleared and the Occasional Accuracy Problem of the Qwen3-omni Model Test by @yenuo26 in #1744
- [Bagel]: Support multistage img2img by @princepride in #1669
- [BugFix] Enable CPU offloading and Cache-DiT together on Diffusion Model by @yuanheng-zhao in #1723
- [Doc] CLI Args Naming Style Correction by @wtomin in #1750
- [Feature] Add Helm Chart to deploy vLLM-Omni on Kubernetes by @oglok in #1337
- [Fix][Qwen3-TTS] Preserve ref_code decoder context for Base ICL by @Sy0307 in #1731
- Add online serving to Stable Audio Diffusion and introduce `v1/audio/generate` endpoint by @ekagra-ranjan in #1255
- [Enhancement][pytest] Check for process running during start server by @pi314ever in #1559
- [CI]: Add core_model and cpu markers for L1 use case. by @zhumingjue138 in #1709
- [Doc][skip-ci] Update installation instructions by @tzhouam in #1762
- Revert "Add online serving to Stable Audio Diffusion and introduce `v1/audio/generate` endpoint" by @hsliuustc0106 in #1789
- [BUGFIX] Add compatibility for mimo-audio with vLLM 0.17.0 by @qibaoyuan in #1752
- [feat][Qwen3TTS] Simple dynamic TTFA based on Code2Wav load by @JuanPZuluaga in #1714
- [Refactor][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips by @LJH-LBJ in #1758
- [Feat][Qwen3-tts]: Add Gradio demo for online serving by @lishunyang12 in #1231
- [Docs] update async chunk performance diagram by @R2-Y in #1741
- [Feat] Enable expert parallel for diffusion MoE layers by @Semmer2 in #1323
- [Bugfix]: SP attention not enabling when _sp_plan hooks are not applied by @wtomin in #1704
- [skip ci] [Docs] Update WeChat QR code for community support by @david6666666 in #1802
- update GpuMemoryMonitor to DeviceMemoryMonitor for all HW by @xuechendi in #1526
- Add coordinator module and corresponding unit test by @NumberWan in #1465
- [Model]: add FLUX.2-dev model by @nuclearwu in #1629
- [skip ci][Docs] doc fix for example snippets by @SamitHuang in #1811
- [Test] L4 complete diffusion feature test for Qwen-Image-Edit models by @fhfuih in #1682
- [Frontend] ComfyUI video & LoRA support by @fhfuih in #1596
- [Bugfix] Adjust Z-Image Tensor Parallelism Diff Threshold by @wtomin in #1808
- [Bugfix] Expose base_model_paths property in _DiffusionServingModels by @RuixiangMa in #1771
- [Bugfix] Report supported tasks for omni models to skip unnecessary chat init by @linyueqian in #1645
- [Test] Add Qwen3-TTS nightly performance benchmark by @linyueqian in #1700
- Add Qwen3-TTS benchmark scripts by @linyueqian in #1573
- [Test] Skip the qwen3-omni relevant validation for a known issue 1367. by @yenuo26 in #1812
- Fix duplicate get_supported_tasks definition in async_omni.py by @linyueqian in #1825
- [Enhancement] Patch OmniStage.try_collect() with _proc alive checks by @pi314ever in #1560
- [Doc][skip ci] Update readme with Video link for vLLM HK First Meetup by @congw729 in #1833
- [Feat][Qwen3-TTS] Support streaming audio output for websocket by @Sy0307 in #1719
- [Test] Nightly Buildkite Pytest Test Case Statistics And Send HTML Report By Email by @yenuo26 in #1674
- [Enhancement] Patch OmniStage.try_collect() with ray alive checks by @pi314ever in #1561
- [Feat][Diffusion]: Implement Component-Level VRAM Quota and Resource Domain Isolation by @Flink-ddd in #1582
- [Feature]: Enable directly use OmniLLM init AR model by @princepride in #1821
- [Enhancement] Upgrade cache-dit from 1.2.0 to 1.3.0 by @SamitHuang in #1834
- [Bugfix] Modify _resolve_pytest_target to support glob patterns and return multiple paths by @yenuo26 in #1843
- [Feat] add wav response_format when stream is true in /v1/audio/speec… by @lengrongfu in #1819
- [BugFix]: Revert #1582 by @princepride in #1842
- [Feature]: support Flux.2-dev cache_dit by @nuclearwu in #1814
- [skip ci] update readme slides link by @hsliuustc0106 in #1850
- [Model] Extend NPU support for HunyuanImage3 Diffusion Model by @ElleElleWu in #1689
- [Config Refactor][1/2] Model Pipeline Configuration System by @lishunyang12 in #1115
- [Test] Reduce SP & Offloading test cases for L2 by @fhfuih in #1839
- [bugfix] Add Interleaved 2D Rotary Embedding for HunyuanImage3 by @usberkeley in #1784
- [Bugfix] Fix Helios text_encoder embed_tokens all-zeros due to untied weights by @dubin555 in #1728
- Enable async_scheduling by default for Qwen3-TTS by @linyueqian in #1853
- [CI failure] Comment out test_zimage_vae_patch_parallel_tp2 by @Gaohan123 in #1856
- Add Fish Speech S2 Pro support with online serving and voice cloning by @linyueqian in #1798
- [skip CI][Docs] add connector design document by @natureofnature in #1737
- [BugFix] Readme and example runner file for cosyvoice3 missed in refactoring by @divyanshsinghvi in #1685
- [Refactor] Use SP Plan for LongCat Sequence Parallelism by @alex-jw-brooks in #1772
- [CI failed] Disable test for zimage tensor parallelism by @Gaohan123 in #1870
- [Bugfix] Fix SD3.5-medium attn2 uninitialized weights by @lishunyang12 in #1659
- [Bugfix] fix layer-wise offload incompatible with cache-dit by @RuixiangMa in #1786
- [CI failed] Disable Diffusion Tensor Parallelism Test by @Gaohan123 in #1876
- [BugFix]: Fix bagel online inference bug by @princepride in #1804
- [Frontend] Rewrite video API for async job lifecycle by @ieaves in #1665
- [Diffusion] [Model] Dreamid-Omni from bytedance by @Bounty-hunter in #1855
- [Bugfix] Restore voice upload API and profiler endpoints reverted by #1719 by @linyueqian in #1879
- [BugFix] Fix Max Rank Handling in LoRA by @alex-jw-brooks in #1397
- Buildkite hardware ci xpu test by @pi314ever in #1340
- [CI] add multimodal processing correctness tests for Omni models by @zzhuoxin1508 in #1445
- fix: propagate parallel_config through create_default_diffusion by @lishunyang12 in #1878
- [CI pipeline] Re-enable Diffusion Tensor Parallelism Test in pipeline by @Gaohan123 in #1892
- [skip CI][Docs][Benchmark]: clarify vbench parameter behavior and add t2v example by @asukaqaq-s in #1497
- [Bugfix] Fix cpu offload and quantization compatibility by @RuixiangMa in #1473
- [Feat] support SP for FLUX.2-klein by @RuixiangMa in #1250
- [CI]: Add/Fix bagel e2e online/offline test by @princepride in #1895
- [Feat] support HSDP for Flux family by @RuixiangMa in #1900
- Add `Governance` section by @ywang96 in #1889
- Update latest news section in README.md by @ywang96 in #1909
- [Feature] Split #1303 Part 1: PD disaggregation scaffolding by @ahengljh in #1863
- [NPU] Upgrade to v0.17.0 by @gcanlin in #1890
- [Misc] removed qwen3_tts.py as it is out-dated by @lengrongfu in #1926
- [Bug][Qwen3TTS][Streaming] remove dynamic initial chunk and only compute on initial request by @JuanPZuluaga in #1930
- Fix Base voice clone streaming quality and stop-token crash by @linyueqian in #1945
- [Docs] Update WeChat QR code for community support by @david6666666 in #1974
- [skip ci][Docs] Update WeChat QR code (fix filename case) by @david6666666 in #1976
- [Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring by @fake0fan in #1908
- [Bugfix] Set PREEMPTED status when moving requests from running to waiting queue by @gcanlin in #1893
- [Feature] Add cache-dit support for HunyuanImage3 by @Fishermanykx in #1848
- [Feature]: Remove some useless `hf_overrides` in yaml by @princepride in #1898
- [CI] Nightly Benchmark - Add an HTML generator, Update the EXCEL generator. by @congw729 in #1831
- [Bug]: fix CUDA OOM during diffusion post-processing by @lishunyang12 in #1670
- [Optim][Qwen3TTS] big boost model throughput+latency high concurrency by @JuanPZuluaga in #1852
- [CI] [ROCm] Bugfix device environment issue by @tjtanaa in #1984
- [CI]init intel ci dispatch in buildkite folder by @xuechendi in #1721
- Fix OmniGen2 transformer config loading for HF models by @Joshna-Medisetty in #1934
- [Test] L4 complete diffusion feature test for Bagel models by @NumberWan in #1938
- [Performance] diffusion timing by @Bounty-hunter in #1757
- [Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls by @DomBrown in #1985
- [CI] Split BAGEL tests into dummy/real weight tiers (L2/L3) by @princepride in #1998
- [Bugfix] Fix config misalignment between offline and online diffusion inference (Wan2.2, Qwen-Image series) by @SamitHuang in #1979
- Add HF token to H100 jobs by @khluu in #2008
- [Bugfix] Fix Ovis Image crash when guidance_scale is set without negative_prompt by @Dnoob in #1956
- [Bugfix] fix helios video generate use cpu device by @lengrongfu in #1915
- [XPU] update bagel modeling to remove cuda hardcode, add xpu stage_config by @xuechendi in #1931
- [Fix] Fix slow hasattr in CUDAGraphWrapper.getattr by @ZeldaHuang in #1982
- [Bugfix] revert PR#1758 which introduced the accuracy problem of qwen3-omni by @R2-Y in #2009
- [Bugfix]Fix bug of online server can not return mutli images by @Hu1Lcode in #2007
- [CI] [ROCm] Setup `test-ready.yml` and `test-merge.yml` by @tjtanaa in #2017
- Int8 Quantization Support for DiT (Z-Image & Qwen-Image) by @yjb767868009 in #1470
- [Model] Add Voxtral TTS model by @y123456y78 in #1803
- [Feat] Support T5 Tensor Parallelism by @yuanheng-zhao in #1881
- [Feat][Qwen3TTS][Code2wav] triton SnakeBeta and Cuda Graph by @JuanPZuluaga in #1797
- [Optim][Qwen3TTS][CodePredictor] support torch.compile with reduce-overhead and dynamic False by @JuanPZuluaga in #1913
- [CI] Change Bagel online test environment variable `VLLM_TEST_CLEAN_GPU_MEMORY` to `0` by @princepride in #2032
- [BugFix][Doc]Update voxtral_tts end2end.py & README.md by @y123456y78 in #2026
- [Docs] Add Wan2.1-T2V as supported video generation models by @SamitHuang in #1920
- [Bugfix] Remove duplicated config keyword max batch size by @tzhouam in #1851
- [Test] Implement mock HTTP request handling in benchmark CLI tests by @yenuo26 in #2014
- [CI] Fix test. by @congw729 in #2031
- Refactor pipeline stage/step pipeline by @asukaqaq-s in #1368
- [Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips by @LJH-LBJ in #2012
- [Benchmark] [Diffusion] [Enhancement] Random dataset by @Bounty-hunter in #1657
- [Bugfix] Z-Image CFG threshold should be > 0 instead of > 1 by @RuixiangMa in #1634
- [Voxtral TTS] Remove redundant yaml by @y123456y78 in #2056
- [Bugfix]: fixed ServerDisconnectedError in benchmark test (reapply #1683, fixes #1374) by @NumberWan in #1841
- [Perf] Improve Fish Speech S2 Pro inference performance by @Sy0307 in #1859
- [Voxtral] Improve example by @patrickvonplaten in #2045
- [CI] Uncomment condition for nightly build in YAML by @Gaohan123 in #2057
- [bugfix] /chat/completion doesn't read extra_body for diffusion model by @fhfuih in #2042
- [BugFix][Qwen3TTS] CodePredictor CudaGraph Pool by @JuanPZuluaga in #2059
- [Rebase] Rebase to vllm v0.18.0 by @tzhouam in #2037
- [Doc] Update docs and dockerfiles for rebase of vllm v0.18.0 by @tzhouam in #2038
- [Model] Add HunyuanVideo-1.5 T2V and I2V support by @lishunyang12 in #1516
- [Bugfix] Fix Fish Speech and CosyVoice3 online serving - missing is_comprehension and broken model detection by @linyueqian in #2058
- Remove mm_prefix_lm patch because vllm==0.18.0 already support by @princepride in #2062
- [Bugfix] Fix HunyuanVideo-1.5 CI failures by @lishunyang12 in #2066
- [Voxtral] Fix Voxtral TTS end2end.py by @y123456y78 in #2067
- [FP8] enable hunyuan-image-3 diffusion model with fp8 online quant by @xuechendi in #1935
- [CI] Add Flux2 Klein Tests by @alex-jw-brooks in #2027
- [Bugfix] Restore chunk-waiting requests on OmniNewRequestData rewrap failure by @dubin555 in #1691
- [Fix] Fix non-unique request IDs in /v1/images/edits endpoint by @zJuuu in #2050
- [Bugfix] Fix cache-dit for single-transformer Wan2.2 models(eg. Wan2.2-TI2V-5B) by @RuixiangMa in #1392
- [Core] Simplify OmniModelConfig Initialization by @alex-jw-brooks in #1768
- Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in #2070
- [BugFix]: Fix OmniGen2 Model Loading by @legitnull in #1711
- [Feat] support TeaCache for Flux2 klein by @RuixiangMa in #1234
- [Feature] add Tensor Parallelism to Omnigen2 by @zzhuoxin1508 in #2065
- [Bugfix] fix gguf TypeError: GGUFConfig.get_name() missing 1 required positional argument: 'self', test: add diffusion gguf unit coverage by @david6666666 in #1865
- [Docs][CI] doc update & L4 example test for text-to-image page by @fhfuih in #1910
- [Bugfix] Fix NPU Hunyuan fused MoE forward context after rebase to 0.18.0 by @Fishermanykx in #2091
- [Feature][RL] Support batching for QwenImage in async mode by @knlnguyen1802 in #1593
- [Test] L5 Long-Term Stability Test and GPU Memory Monitoring Main L5 last by @zhumingjue138 in #1799
- [CI] Update Diffusion Model Test Configuration for Nightly Builds by @yenuo26 in #2086
- [Refactor] Refactor Diffusion Scheduler/Executor Boundaries and Request State Flow by @yJader in #1625
- [Quantization] feat: add qwen-image gguf adapter by @david6666666 in #1755
- [Bagel]: Support SP by @princepride in #1903
- [Feat] Phase 1 foundation types for multimodal output decoupling by @meghaagr13 in #1816
- [Unit Test] add unit tests for AsyncOmni and Omni by @yinpeiqi in #2034
- [Perf] Qwen-Image Performance Nightly CI test by @wtomin in #1805
- [Bugfix] Fix Qwen-Image SP and TeaCache incompatibility by @wtomin in #2101
- [Bugfix][Chunk Transfer Adapter] deque mutated fix by @JuanPZuluaga in #2102
- [model] support FLUX.1-Kontext-dev by @RuixiangMa in #561
- [Doc] Sync and fix. by @congw729 in #2110
- [Feat] support TP for GLM-Image by @RuixiangMa in #1918
- [Feat][Qwen3-TTS] Better Qwen3-TTS online serving demo by @linyueqian in #1857
- Add tool to configure gpu_memory_utilization for multi-stage pipelines by @linyueqian in #1958
- [BugFIX] enable Hunyuan image3 with stage selection among text_to_image/image_to_text by @xuechendi in #1826
- [feature] stable_audio_open_1 teacache support by @akshatvishu in #1314
- [Feature] Add a extra body param in create video api by @lengrongfu in #1955
- [Bugfix] Support base64 input for --ref-audio in Qwen3-TTS client by @lolyhop in #1389
- [Feature] support to change the speaker of qwen3-omni by @R2-Y in #1963
- [skip ci] Keep the latest version. by @congw729 in #2112
- [Enhancement] Add force_refresh support for GLM-Image for cache-dit 1.3.0 upgrade by @SamitHuang in #1858
- [Bugfix] fix offload and hsdp incompatibility by @RuixiangMa in #1888
- [Feature]: add Ulysses advanced_uaa mode by @dongbo910220 in #1379
- [Test] L4 complete diffusion feature test for Qwen-Image-Layered models by @kechengliu97 in #2085
- [Refactor] Unify torch profiler for omni and diffusion models by @gcanlin in #2099
- [API] Add layers and resolution parameters to /v1/images/edits endpoint by @gcanlin in #2053
- [CI] [RL]: Add e2e test for custom pipeline by @knlnguyen1802 in #2005
- [Perf] Qwen-Image Nightly Performance CI Improvement by @wtomin in #2111
- [CI] Add conditions for L3 (tests after merging) and L4 (tests for nightly). by @congw729 in #1514
- [Enhancement] Custom chunk_size for mimo-audio model by @qibaoyuan in #1964
- [CI] Trigger nightly diffusion benchmark collects and html generates. by @congw729 in #1995
- [Core] Unified quantization framework by @lishunyang12 in #1764
- [Fix CI] Reduce num gpus to prevent ci failure by @wtomin in #2131
- [Feat] Support scalar types in AdditionalInformationEntry by @NickCao in #2105
- [Docs][skip ci] Fix omni and tts docs by @gcanlin in #2130
- [Bugfix] Fix high TTFP for Base task in Gradio TTS demo by @linyueqian in #2116
- [Feature] Speech batch entrypoint by @divyanshsinghvi in #1701
- [Bugfix] Fix memory leak: missing chunk_transfer_adapter.cleanup() in OmniARScheduler by @dubin555 in #2028
- [Fix] Qwen3 TTS audio handling for long ref_audio by @Sy0307 in #2104
- [CI/Build] Fix Doc 404s by @alex-jw-brooks in #2155
- [Voxtral TTS] Add multilingual support in gradio demo by @y123456y78 in #2151
- Add TTS Text Preprocessing to Gradio Demo by @rohinarora73 in #2152
- [Docs] Update WeChat QR code for community support by @david6666666 in #2165
- [Voxtral TTS] Use 8 step flow matching instead of 16 by @y123456y78 in #2158
- [CI] Add online e2e test for qwen2.5 omni by @LJH-LBJ in #1668
- [Test] Add L4 diffusion feature test for LongCat-Image by @lcukyfuture in #1970
- [DOC] intel GPU model support list by @xuechendi in #2154
- [Test] L4 complete diffusion feature test for LongCat Image Edit models by @NumberWan in #2035
- [CI] Fix examples tests error by @zhumingjue138 in #2138
- [Fixbug] increase qwen2 5 online test timeout limit by @LJH-LBJ in #2171
- [Docs] refine UAA documentation by @dongbo910220 in #2148
- [CI] Add Stable Diffusion 3.5 Tests by @spencerr221 in #2120
- [Cleanup] Remove stray test file from engine directory by @linyueqian in #2161
- [Bug-Fix]fix bug of empty prompt input by @Hu1Lcode in #2041
- [BugFix] Make Stage Device Initialization Respect Visible GPUs by @alex-jw-brooks in #2025
- [Test] Add Qwen-tts test cases and unify the style of existing test cases by @yenuo26 in #1911
- [Perf] [TTS] Improve Fish Speech S2 Pro voice cloning TTFP by @Sy0307 in #2145
- Revert "[Test] Add Qwen-tts test cases and unify the style of existing test cases" by @linyueqian in #2192
- [CI] Skip test_sd3_expansion due to CI failure 5148 by @Gaohan123 in #2191
- [Frontend] Speaker embedding support for speech and voices APIs by @marksverdhei in #1227
- [Bugfix] add inject model_arch to hf_overrides by @lengrongfu in #2178
- [CI] Add nightly-test label trigger. by @congw729 in #2172
- [Bugfix] resolve stage config for GLM-Image with diffusers format by @RuixiangMa in #1894
- [Bugfix] Maintain model-level CPU offload in a blocking way by @yuanheng-zhao in #1978
- [skip ci][Docs] Add FlashAttention requirement for audio generation to prevent noise-only outputs in mimo-audio model by @qibaoyuan in #2205
- [Test] Add FLUX.2-dev online serving expansion test by @yangjianjuan in #2174
- [Bugfix] Fix qwen3-omni async thinker to talker decode alignment for #1758 by @Sy0307 in #2019
- [Fix] [skip ci] Fix path. by @congw729 in #2204
- [Bugfix] remove default sampling parameters by @R2-Y in #2173
- [BugFix] Fix KeyError: num_processed_tokens_delta by @amy-why-3459 in #2213
- [Enhancement] Patch AsyncOmniEngine try_get_output[_async] hanging issues by @pi314ever in #2153
- [Accuracy Benchmark] feat: add accuracy benchmark integrations for t2i and i2i by @david6666666 in #1917
- [Test] L4 complete diffusion feature test for Wan2.2 models by @bjf-frz in #2087
- [Bug Fix] GLM-Image stage device isolation and t2i prompt preprocessing in Omni runtime by @JaredforReal in #2137
- [CI] qwen2.5-omni model cannot recognize the synthetic video by @LJH-LBJ in #2211
- [Bugfix] Fix Voxtral TTS voice embeddings not loading by @linyueqian in #2156
- [CI] fix Wan22 timeout and i2i accuracy threshold by @david6666666 in #2235
- [Qwen3TTS][ServingSpeech] Bugfix/voice upload and add optional ref_text by @JuanPZuluaga in #2046
- [Doc] Improve diffusion generation parameter docs for online serving by @SamitHuang in #2051
- [Bugfix] Fix diffusion benchmark issues #1873 by @Dnoob in #1897
- [Compatibility] Add Multiple Attention Backends Support in MIMO-Audio Tokenizer by @qibaoyuan in #2225
- [Bug Fix] Resolve broken image issue when TP is enabled and no seed is provided. by @zhtmike in #2176
- [Test] L4 complete diffusion feature test for Z-Image by @yinpeiqi in #2132
- [CI] Increase diffusion initialization timeout from 600 to 700 seconds in online serving tests by @yenuo26 in #2230
- [CI] Add Voxtral TTS e2e test by @y123456y78 in #2023
- [Bugfix] Fix tp and Quantization incompatible for Flux by @RuixiangMa in #2180
- [CI] Skip tests due to L3 CI failure by @Gaohan123 in #2245
- [Frontend] Support --dtype in qwen3_omni offline e2e script by @reidliu41 in #2246
- [BugFix][Qwen3-Omni] Fixed the issue of incorrect answers for single words. by @amy-why-3459 in #2239
- [Test] L4 complete diffusion feature test for Qwen-Image models by @SamitHuang in #1869
- [Bugfix] Fix dynamic function call on collective_rpc of DiffusionWorker by @knlnguyen1802 in #2217
- [Bugfix] Fix test_bagel_online by @bjf-frz in #2237
- [CI] Add sd3 for test by @spencerr221 in #2219
- [CI] Add online e2e test for MIMO-Audio by @LJH-LBJ in #2129
- [CI] remove benchmark/testing comparison w/ other frameworks by @fhfuih in #2179
- [Feature] support sp for hunyuan by @Bounty-hunter in #2163
- [Bugfix] Modify conftest.py to set unspecified parameters by @bjf-frz in #2263
- [Release] Upgrade NPU dockerfile & docs for v0.18.0 by @gcanlin in #2271
- [CI] Update pytest command to exclude specific test in nightly build by @yenuo26 in #2272
- [bugfix] Remove duplicate yaml entry by @pi314ever in #2279
- [Bugfix] Fix Fish Speech S2 Pro prompt handling for truncated audio & emotion tag by @Sy0307 in #2268
- [Misc] Clean up unused diffusion timing args in examples by @yuanheng-zhao in #2266
- [Qwen3TTS][Bugfix] Replace vLLM fused layers with HF-compatible numerics in code predictor by @linyueqian in #2277
New Contributors
- @lengrongfu made their first contribution in #1566
- @ZhanqiuHu made their first contribution in #1619
- @usberkeley made their first contribution in #759
- @HonestDeng made their first contribution in #336
- @jiangmengyu18 made their first contribution in #1537
- @pi314ever made their first contribution in #1677
- @Lidang-Jiang made their first contribution in #1687
- @JackLeeHal made their first contribution in #1197
- @dubin555 made their first contribution in #1614
- @RomanKoshkin made their first contribution in #1417
- @SHYuanBest made their first contribution in #1729
- @zhaotyer made their first contribution in #1201
- @oglok made their first contribution in #1337
- @NumberWan made their first contribution in #1465
- @Flink-ddd made their first contribution in #1582
- @ieaves made their first contribution in #1665
- @ahengljh made their first contribution in #1863
- @Fishermanykx made their first contribution in #1848
- @Joshna-Medisetty made their first contribution in #1934
- @DomBrown made their first contribution in #1985
- @Dnoob made their first contribution in #1956
- @Hu1Lcode made their first contribution in #2007
- @yjb767868009 made their first contribution in #1470
- @y123456y78 made their first contribution in #1803
- @patrickvonplaten made their first contribution in #2045
- @zJuuu made their first contribution in #2050
- @salmanmkc made their first contribution in #2070
- @meghaagr13 made their first contribution in #1816
- @akshatvishu made their first contribution in #1314
- @lolyhop made their first contribution in #1389
- @NickCao made their first contribution in #2105
- @rohinarora73 made their first contribution in #2152
- @lcukyfuture made their first contribution in #1970
- @spencerr221 made their first contribution in #2120
- @yangjianjuan made their first contribution in #2174
- @bjf-frz made their first contribution in #2087
- @zhtmike made their first contribution in #2176
- @reidliu41 made their first contribution in #2246
Full Changelog: v0.16.0...v0.18.0