NVIDIA/TensorRT-LLM v1.3.0rc18 on GitHub

Known Issues
- DeepSeek-V3.2 (DSV3.2) will crash with an illegal memory access (IMA) in various long-running perf tests on GB200/GB300 when the CuteDSL MoE backend is used. To work around this issue, use a different MoE backend.
Model Support
- Support Nemotron-H NVFP4 checkpoint on Hopper (#14775)
- Add Qwen image support (#13449)
- Support Step-3.7-Flash model (#14711)
- Add Cosmos3-Nano and Cosmos3-Super support (#14824)
- Add AFMoE Trinity support (#13148)
API
- Add logprobs_simple_format option to return logprobs as a flat list[float] (#13972)
- trtllm-serve, trtllm-eval, trtllm-bench: Make CLI flags take precedence over --config / --extra_llm_api_options YAML (#14812)
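
The precedence rule in #14812 can be pictured as a layered merge. The following is a minimal sketch, not the actual trtllm-serve implementation: `resolve_options` is a hypothetical helper, the option names are only illustrative, and it assumes that a CLI flag the user did not pass is represented as `None`.

```python
from typing import Any, Dict, Optional


def resolve_options(
    defaults: Dict[str, Any],
    yaml_options: Dict[str, Any],
    cli_flags: Dict[str, Optional[Any]],
) -> Dict[str, Any]:
    """Merge three layers: defaults < YAML file < explicitly passed CLI flags."""
    resolved = dict(defaults)
    resolved.update(yaml_options)
    # Assumption: None marks a flag that was not given on the command line,
    # so only explicitly passed flags override the YAML values.
    resolved.update({k: v for k, v in cli_flags.items() if v is not None})
    return resolved


# Illustrative values only.
defaults = {"max_batch_size": 8, "tp_size": 1}
yaml_options = {"max_batch_size": 64, "tp_size": 4}
cli_flags = {"max_batch_size": 16, "tp_size": None}  # only max_batch_size was passed

resolved = resolve_options(defaults, yaml_options, cli_flags)
print(resolved)
```

In this sketch the explicitly passed `max_batch_size` wins over the YAML value, while `tp_size` keeps the YAML value because no CLI flag was given for it.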
Feature
- Upgrade NIXL to v1.0.1 and UCX to 1.21 (#14436)
- Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL (#14453)
- Enable FlashInfer GDN decoding kernel for Qwen3.5 (#13645)
- Add per-expert LoRA support with Cutlass backend (#14801)
- Reduce OpenAI stream postprocess overhead (#14708)
- Add encoder CUDA graph support to llm.encode() (#14326)
- Use a Triton kernel for C++ mamba hybrid state update (#14869)
- Fuse masked gather + finalize-scale into one Triton kernel in DeepGemmFusedMoE (#14592)
- Support KVCacheManagerV2 adjust() in single GPU + agg PyExecutor loop (#14578)
- Add disk cache config for KVCacheManagerV2 (#14845)
- Add Wan I2V generation example (#14981)
- Add LTX-2 visual generation example (#14976)
- Update flashinfer-python from 0.6.12rc2 to 0.6.12 (#14805)
Fix
- Fix mamba-out-of-block error with ADP + BS=1 + disagg (#14853)
- Fix XQA IMA for invalid pages with sliding window (#14459)
- Propagate event loop errors to await_responses callers (#12735)
- Fix Mamba replay mode accuracy issues (#14509)
- Fix PyExecutor hang in disagg TP prefill (#14020)
- Fix stale runtime metadata issues during MLA fallback transitions (#14049)
- Fix KVCacheManagerV2 block counting correctness issues (#14725)
- Canonicalize multimodal cache-key serialization to prevent hash collisions (#14800)
- Fix LTX-2 audio PE padding issues (#14818)
- Release KVCacheManagerV1 blocks on MAX_UTILIZATION pause (#14723)
- Fix config sharing issue for Qwen3-VL (#14766)
- Enforce request and buffer index lifecycle integrity (#14768)
- Add nemotron-v3 as the proper nemotron-h reasoning parser (#14900)
- Clamp KV pool window sizes to max_seq_len (#14905)
- Fix mamba block calculation (#14524)
- Add trust_remote_code=True to the LLM(...) constructor to fix various model loading issues (#14892)
- Fix deep EP partial warp sync for GPT-OSS shapes (#14977)
- Add warmup for trtllm-gen fmha JIT kernels (#14851)
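
As a generic illustration of why canonical serialization matters for cache keys (the multimodal fix in #14800), the sketch below contrasts a naive key built by concatenating fields with a key built from a canonical JSON encoding. It is not the code from that PR; `naive_cache_key`, `canonical_cache_key`, and the field names are hypothetical.

```python
import hashlib
import json
from typing import Any, List, Mapping


def naive_cache_key(parts: List[str]) -> str:
    """Concatenate fields without delimiters: field boundaries are lost."""
    return hashlib.sha256("".join(parts).encode("utf-8")).hexdigest()


def canonical_cache_key(fields: Mapping[str, Any]) -> str:
    """Hash a canonical JSON encoding of the fields.

    sort_keys makes the encoding independent of dict insertion order, and
    JSON quoting keeps field boundaries explicit.
    """
    payload = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two different inputs collide under the naive scheme ...
naive_collides = naive_cache_key(["ab", "c"]) == naive_cache_key(["a", "bc"])

# ... but stay distinct under the canonical scheme.
canonical_distinct = (
    canonical_cache_key({"parts": ["ab", "c"]})
    != canonical_cache_key({"parts": ["a", "bc"]})
)

# The same logical fields in a different order map to the same key.
order_independent = (
    canonical_cache_key({"modality": "image", "size": 224})
    == canonical_cache_key({"size": 224, "modality": "image"})
)

print(naive_collides, canonical_distinct, order_independent)
```

The design point is that a cache key should have exactly one byte representation per logical input, so equal inputs always hit the same entry and different inputs cannot alias each other through ambiguous encoding.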
Documentation
- Add VisualGen API walkthrough example and docs page (#14685)
- Add Nemotron 3 Ultra doc (#14964, #15113)
Test & Infra
- Pipe stderr separately in subprocess calls to improve error reporting in Allure (#14750)
- Remove obsolete tests (#14995, #14660, #14992, #14952, #14749)
- Parallelize post stages: Rerun Report, Test Coverage, and AI Failure Analysis (#14528)
- Relocate tests to right-sized stages (#14684)
- Move non-default-feature tests to post merge (#15038)

What's Changed

[None][test] Update datasets path by @JennyLiu-nv in #14671
[None][infra] Update new .test_durations by @EmmaQiaoCh in #14661
[TRTLLM-13015][feat] drop complex visual_gen CLI example scripts by @zhenhuaw-me in #14632
[https://nvbugs/6117811][fix] Fix XQA IMA for invalid pages with sliding window by @pengbowang-nv in #14459
[None][feat] Tune mamba config by env variables by @Wanli-Jiang in #14730
[None][test] Update moe backend for ctx and acceptance length env by @fredricz-20070104 in #14803
[None][test] Update precision of previous device step time by @fredricz-20070104 in #14809
[None][infra] Waive 12 failed cases for main in post-merge 2749 by @ZhanruiSunCh in #14802
[TRTLLM-12971][infra] Fix parse classname logic in timeout result by @yiqingy0 in #14559
[https://nvbugs/6038228][fix] Propagate event loop errors to await_responses callers by @JunyiXu-nv in #12735
[TRTLLM-12288][feat] Support Nemotron-H nvfp4 ckpt on Hopper by @JadoTu in #14775
[TRTLLM-12596][feat] Support simple logprob format by @tongyuantongyu in #13972
[None][fix] Stabilize Mamba replay state update by @sunnyqgg in #14509
[None][feat] Upgrade NIXL to v1.0.1 and UCX to 1.21 by @chuangz0 in #14436
[None][feat] Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL composite VA by @tianyuz-nv in #14453
[TRTLLM-10947][perf] eagle3: use cudaMemcpy2DAsync custom op for hidden-state capture by @pcicotti in #14479
[None][fix] PyExecutor Hang in Disagg TP Prefill by @jthomson04 in #14020
[https://nvbugs/6240561][fix] Autodeploy fix the deepseek accuracy drop by @nvchenghaoz in #14774
[#12702][feat] Autodeploy deprecate the legacy triton attention by @nvchenghaoz in #14194
[None][test] Waive 5 failed cases for main in QA CI by @tensorrt-cicd in #14789
[None][test] Waive 7 failed cases for main in QA CI by @tensorrt-cicd in #14791
[https://nvbugs/6240561][fix] Fix AutoDeploy DeepSeek-R1 accuracy drop by @taylor-yb-lee in #14793
[#14588][fix] [AutoDeploy] Fix OOM of DeepSeek-R1 NVFP4 for tp=4 by @taylor-yb-lee in #14477
[https://nvbugs/6179761][fix] Save LTX-2 BF16 weights to speed up perf by @yibinl-nvidia in #14639
[TRTLLM-13028][doc] Add VisualGen API walkthrough example and docs page by @zhenhuaw-me in #14685
[None][chore] Update flashinfer-python from 0.6.12rc2 to 0.6.12 by @yihwang-nv in #14805
[None][fix] AutoDeploy: Unwaive llmc standalone tests by @bmarimuthu-nv in #14700
[TRTLLM-35882][feat] Add cute dsl gvr top-k decode kernel by @limin2021 in #14602
[https://nvbugs/6222480][test] fix stress test issue on H100 by @xinhe-nv in #14721
[None][test] Waive 6 failed cases for main in QA CI by @tensorrt-cicd in #14787
[None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #14783
[None][fix] synchronize MLA cache reuse fallback metadata by @DhineshPonnarasan in #14049
[None][feat] Add KV cache prefetch by @lowsfer in #14748
[https://nvbugs/6191524][fix] In MLA.forward_context, also call the warmup when has_cached_kv_for_mla_context by @tensorrt-cicd in #14536
[None][test] Waive 2 failed cases for main in QA CI by @tensorrt-cicd in #14839
[None][fix] Cherry-pick kv_cache_manager_v2 fixes to main by @lowsfer in #14725
[None][test] Waive 11 failed cases for main in post-merge by @tensorrt-cicd in #14854
[None][feat] Enable FlashInfer GDN decoding kernel for Qwen3.5 by @nv-guomingz in #13645
[https://nvbugs/5940460][fix] Harden FP8 quant fusion matching after PyTorch 26.02 update by @pcicotti in #14697
[https://nvbugs/6221450][fix] AutoDeploy: Qwen3.5 400B NVFP4 accuracy regression fix by @taylor-yb-lee in #14667
[TRTLLM-12648][test] implement disagg cancel stress metrics_thread by @chienchunhung in #14807
[None][chore] Update AD model list by @tcherckez-nvidia in #14686
[https://nvbugs/6226933][fix] canonicalize multimodal cache-key serialization to prevent hash collisions by @venkywonka in #14800
[https://nvbugs/6240561][fix] Unwaive DeepSeek R1 accuracy test by @taylor-yb-lee in #14870
[None][feat] Add Qwen image support by @pst2154 in #13449
[TRTLLM-12507][feat] Per-expert lora support with Cutlass backend by @brb-nv in #14801
[None][chore] Make submit.py able to run single-GPU tests and accept a customized config file by @HuiGao-NV in #14630
[None][test] Waive 9 failed cases for main in QA CI by @tensorrt-cicd in #14792
[None][test] Update DSV32 32k4k config to avoid timeout issue by @chenfeiz0326 in #14856
[None][chore] Bump version to 1.3.0rc18 by @yuanjingx87 in #14872
[None][infra] Waive 5 failed cases for main in post-merge 2755 by @ZhanruiSunCh in #14883
[None][fix] LTX-2 audio PE pad: use token-axis seq_dim=1 for token-major rope by @luyiyun1021 in #14818
[#13082][fix] Fix multimodal embedding mismatch by @aashirvad08 in #13240
[None][fix] Pipe stderr separately in subprocess calls to improve error reporting in Allure by @yufeiwu-nv in #14750
[None][fix] Use renamed get_param_count_and_checkpoint_size in hybrid configs by @yufeiwu-nv in #14855
[None][test] Remove duplicate test cases in llm_perf_core file by @yufeiwu-nv in #14749
[None][test] Remove 28 closed-bug waive entries for main by @tensorrt-cicd in #14545
[TRTLLM-13022][test] remove deprecated models from tests by @xinhe-nv in #14660
[None][feat] Reserve one more slot for attention_dp in mixed mamba cache manager by @Wanli-Jiang in #14853
[https://nvbugs/6195110][fix] Restore DeepSeek shared-weights vanilla MTP path by @zhaoyangwang-nvidia in #14457
[#12359][feat] AutoDeploy: Support SSM replay kernel for MTP with FlashInfer by @galagam in #13725
[None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #14857
[None][chore] add attention module owner for VisualGen by @zhenhuaw-me in #14814
[None][fix] release v1 KV blocks on MAX_UTILIZATION pause by @eopXD in #14723
[None][perf] Reduce OpenAI stream postprocess overhead by @2ez4bz in #14708
[None][fix] propagate chat prompt token ids (#14420) by @reasonsolo in #14859
[https://nvbugs/6211193][fix] etcd listen all interfaces by @reasonsolo in #14863
[#5247][fix] auto-detect local cnn_dailymail dataset by directory layout by @guan404ming in #13722
[https://nvbugs/6248987][fix] Made the slow-tokenizer swap lazy and idempotent. __init__ now just sets `_slo by @tensorrt-cicd in #14846
[None][chore] redact internal NVIDIA URLs from exec-slurm-compile skill by @ssam18 in #14538
[None][feat] VisualGen: Attention2D + Ulysses & Multi-GPU LPIPS Evals by @juney-nvidia in #13944
[TRTLLM-13077][feat] Decompose post_load_weights() by @chienchunhung in #14770
[None][fix] Fix config sharing issue for Qwen3-VL by @2ez4bz in #14766
[https://nvbugs/6104831][fix] Enforce request and buffer index lifecycle integrity by @chienchunhung in #14768
[None][feat] Add encoder CUDA graph support to llm.encode() by @tingyangk in #14326
[None][feat] Support Step-3.7-Flash model by @kaiyux in #14711
[https://nvbugs/6050489][chore] unwaive tests by @bo-nv in #14866
[None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #14896
[None][infra] Source code and container vulnerability fix by @yuanjingx87 in #14025
[None][infra] Waive 11 failed cases for main in post-merge 2757 by @ZhanruiSunCh in #14925
[https://nvbugs/5821415][test] update rtx6k test list by @xinhe-nv in #14929
[None][fix] Update dataset identifier for cnn_dailymail to use namespaced repo in quantization scripts by @yufeiwu-nv in #14930
[https://nvbugs/5979673][fix] Unwaive test_agent_multi_backends.py::test_run_with_different_env by @Shixiaowei02 in #14939
[None][fix] Add nemotron-v3 as the proper nemotron-h reasoning parser by @Wanli-Jiang in #14900
[https://nvbugs/6193836][test] Use EP=8 + attention DP for minimax_m2.5 8-GPU perf by @ruodil in #14613
[TRTLLM-8236][infra] fix platform tag for public wheel by @niukuo in #14616
[None][test] update bug ids in waives by @xinhe-nv in #14946
[https://nvbugs/6244474][fix] AutoDeploy: skip explicit shape-prop after MLIR elementwise fusion by @tensorrt-cicd in #14795
[None][infra] fix cbts json decode by @crazydemo in #14928
[https://nvbugs/6222480][fix] Fix stress by @xinhe-nv in #14949
[None][test] Decrease P1 models number and merge sanity test list into core by @yufeiwu-nv in #14952
[None][perf] Use a Triton kernel for Cpp mamba hybrid state update by @VALLIS-NERIA in #14869
[None][chore] Autodeploy unwaive 5888827, 6200112 by @galagam in #14894
[NVBUG-6248780][fix] Add --decoupled flag to benchmark_core_model in multi-instance test by @karljang in #14888
[TRTLLM-12870][feat] Support num_images_per_prompt for FLUX pipelines by @karljang in #14890
[TRTLLM-12214][perf] DeepGemmFusedMoE: fuse masked gather + finalize-scale into one Triton kernel by @xwang233 in #14592
[None][fix] Fix AutoDeploy accuracy tests by @bmarimuthu-nv in #13925
[TRTLLMINF-69][infra] Migrate A100X-FMHA-Post-Merge-1 and A100X-Triton-Post-Merge-[1,2] to SLURM by @mlefeb01 in #14921
[TRTLLM-11508][refactor] Merge Eagle3 and MTP-eagle one-model workers by @zhaoyangwang-nvidia in #12353
[https://nvbugs/6143787][fix] Add kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.6) to TestQwen3 by @tensorrt-cicd in #13852
[https://nvbugs/6248764][fix] Normalize non-sliding KV windows to full attention in AutoDeploy by @eopXD in #14906
[https://nvbugs/6240420][fix] Clamp KV pool window sizes to max_seq_len by @eopXD in #14905
[None][test] Narrow the overly large scope of the CI disagg perf local submit test to avoid "HF model not found" errors by @fredricz-20070104 in #14989
[None][test] remove outdated model in perf test by @ruodil in #14992
[TRTLLM-12648][test] implement disagg cancellation injector thread by @chienchunhung in #14920
[None][feat] Add AutoDeploy support for StepFun Step-3.7-Flash by @bmarimuthu-nv in #14759
[None] [waive] Waive the failed step3p7 test case due to ckpt update by @kaiyux in #14997
[https://nvbugs/6210714][fix] Fix mamba block calculation by @VALLIS-NERIA in #14524
[https://nvbugs/5546507][test] Remove obsolet… by @xinhe-nv in #14995
[None][fix] AutoDeploy: Move hf_id_to_local_model_dir to function for GLM4.7 Flash test by @bmarimuthu-nv in #14999
[TRTLLM-12893][infra] Parallelize post stages: Rerun Report, Test Coverage, and AI Failure Analysis by @ZhanruiSunCh in #14528
[None][infra] Waive 11 failed cases for main in post-merge 2760 by @ZhanruiSunCh in #15003
[None][fix] Uncomment Qwen3.5 and DSR1 from model registry so that they can run f… by @taylor-yb-lee in #15001
[TRTLLM-11410][feat] Cosmos3 Support by @NVShreyas in #14824
[https://nvbugs/5859886][fix] Remove the waiver by @ziyixiong-nv in #14948
[https://nvbugs/6248744][fix] Added trust_remote_code=True to the LLM(...) constructor and removed the… by @tensorrt-cicd in #14892
[https://nvbugs/6160629][fix] AutoDeploy: Fix manual seed setting for standalone tests by @galagam in #14954
[TRTLLM-12714][feat] KVCacheManagerV2: wire PyExecutor rebalance hook (single GPU, aggregated for now) by @thorjohnsen in #14578
[None][feat] add Wan I2V generation example by @o-stoner in #14981
[TRTLLM-12527][feat] Parallelize multi-shard visual-gen checkpoint loading and pre-fetch checking by @yibinl-nvidia in #14021
[https://nvbugs/6250866][fix] Fix deep ep partial warp sync for gptoss shapes by @dongfengy in #14977
[https://nvbugs/6272668][infra] Unwaive DSR1 and Qwen3.5 again by @taylor-yb-lee in #15010
[None][feat] Afmoe trinity support by @alyosha-swamy in #13148
[TRTLLM-13027][ci] Relocate under-using tests to right-sized stages by @QiJune in #14684
[None][feat] Add LTX-2 visual generation example by @yibinl-nvidia in #14976
[None][feat] AutoDeploy: Fix hardcoded configs by @taylor-yb-lee in #14943
[#13718][feat] AutoDeploy MoE all-to-all: cache + runtime max-tokens by @greg-kwasniewski1 in #13723
[TRTLLM-13177][doc] Add Nemotron 3 Ultra doc by @nv-guomingz in #14964
[#10710][feat] Make explicit CLI flags take precedence over --config / --extra_llm_api_options YAML by @marinayanov in #14812
[https://nvbugs/6260907][fix] unwaive test by @bo-nv in #15058
[None][chore] Increase GB200-4_GPUs-PyTorch shards by @tburt-nv in #14836
[TRTLLM-12648][test] implement disagg cancellation canary thread by @chienchunhung in #15015
[TRTLLM-12507][feat] Cudagraph support for routed-expert MoE LoRA with Cutlass backend - Part 1 by @brb-nv in #14923
[https://nvbugs/6245317][test] set Harmony tiktoken env for GPT-OSS disagg by @dongfengy in #14935
[https://nvbugs/6153955][test] unwaive GPT-OSS w4 DP4 CUTLASS by @dongfengy in #14884
[None][perf] kv_cache_manager_v2: batch block-key SHA-256 hashing by @lancelly in #14994
[TRTLLM-13259][ci] Merge DGX_H100 DeepSeek and GptOss stages by @QiJune in #15035
[None][infra] Waive 11 failed cases for main in post-merge 2765 by @ZhanruiSunCh in #15080
[None][infra] Waive 3 failed cases for main in post-merge 2765 by @ZhanruiSunCh in #15082
[None][test] waive weekly qa ci failure cases by @crazydemo in #15077
[None][feat] AutoDeploy: propagate layer_type hint across pattern-matcher rewrites by @greg-kwasniewski1 in #14835
[None][test] Waive 15 failed cases for main in QA CI by @tensorrt-cicd in #15056
[None][infra] Waive 1 failed cases for main in pre-merge 41894 by @ZhanruiSunCh in #15089
[TRTLLM-13262][ci] Move non-default-feature tests to post merge by @QiJune in #15038
[None][feat] Enable disk cache config for KV cache v2 by @reasonsolo in #14845
[https://nvbugs/6185446][fix] Add warmup for trtllm-gen fmha JIT kernels by @pengbowang-nv in #14851
[https://nvbugs/6162940][chore] Unwaive fixed test by @longlee0622 in #15078
[None][perf] Support Gemma RMSNorm + interleaved mRoPE in fused_qk_no… by @nv-guomingz in #14898
[None][test] Halve K25 Agg Multi Round to Solve Timeout Issue by @chenfeiz0326 in #15083
[None][infra] Reduce Docker image layer count in release stage by @tburt-nv in #14972
[#14828][feat] AutoDeploy: support multi KV cache memory pool in trtllm attention by @MrGeva in #14911
[None][doc] Refine Nemotron Ultra doc by @nv-guomingz in #15113

New Contributors

@DhineshPonnarasan made their first contribution in #14049
@pst2154 made their first contribution in #13449
@aashirvad08 made their first contribution in #13240
@guan404ming made their first contribution in #13722
@alyosha-swamy made their first contribution in #13148

Full Changelog: v1.3.0rc17...v1.3.0rc18