NVIDIA/TensorRT-LLM v1.3.0rc20 on GitHub

This RC is the last version to support the TensorRT backend. The TensorRT backend will be removed in the next version!

Known Issues
- DeepSeek V3/V3.2 can crash with an illegal memory access or hang during warm up.
- Autotuning for Qwen3-family models can crash with "Assertion failed: Failed to initialize cutlass TMA WS grouped gemm."
API
- Add API to configure TeaCache coefficients (#13170)
- BREAKING CHANGE: Make request chat_template opt-in (#14646)
Feature
- Add DeepSeek V4 preparation (#15378, #15379, #15381, #15394, #15402, #15222)
- Add MXFP8 weight format plus CUTLASS W8A8 Linear and MoE (#14962)
- Add Marlin NVFP4 backend for MoE and Linear on Hopper (#13476)
- Add CUDA graph wrapper for multimodal encoders (#14829)
- Support cross-attention with FlashInfer TRT-LLM Gen kernels on Blackwell (#15429)
- Support post-norm and per-aux fc_norm for Eagle3 draft models (Eagle 3.1) (#14988)
- Add EPLB support for Qwen3.5 (#15543)
- Optimize CuTe DSL NVFP4 MoE grouped/SwiGLU GEMM accumulation pipeline (#15258)
- Add CuTe DSL GVR-TopK load-balance optimization (#15304)
- Enable split-KV heuristic for low-occupancy cross-attention in LTX-2 FA4 (#15399)
- Fuse MLP up-GEMM + bias + GELU(tanh) + NVFP4-quant into the CuTe DSL epilogue for LTX-2 and WAN (#15299)
- Add async mp4 encode and configurable noise latent via env vars in VisualGen (#15229)
Fix
- Harden disagg cache transceiver teardown (#15422)
- Fix encoder-decoder beam search corruption via per-slot fragmentPointerDevice (#15461)
- Fix overallocation of draft KV cache (#15017)
- Disable NCCL window buffers on GB10 (#15559)
- Fix wrong NCCL fallback in nemotron-h (#15294)
- Fix CuTe DSL NVFP4 EPLB weight layout (#15538)
- Enable CuTe DSL BF16 kernels for SM100 PP (#14993)
- Fix Gemma4 multimodal vision TP and xgrammar startup crashes (#15566)
- Add necessary methods for guided decoding in Kimi K2.5 (#15180)
- Re-enable Ulysses for LTX-2 v2a cross-attention (#15303)
- Fix passing scaled timestep to time_embedder in Cosmos3 (#15545)
- Clarify and align trtllm-bench runtime logging (#15254)
Documentation
- Add deploy guide for MiniMax M3 (#15587)
- Add Qwen Image visual generation examples (#15235)
Benchmark
- Add Qwen-Image-Bench evaluator (#14837)
- Add modularized perf tests for attention and MoE (discrete/continuous) (#15541)
- Add Qwen3.5-397B-A17B-NVFP4 B200 aggregated perf-sanity tests (#15650)
- Add DeepSeek R1 0528 FP4 performance test to llm_perf_core.yml (#15453)
Test & Infra
- Move more test cases to post-merge (#15568)
- Stabilize perf-sanity tests (#15440)
- Avoid type checking failures due to pip dependency resolution (#15517)
- Gate GPT-OSS TRT-LLM Gen MoE tests to SM100/SM103 (#15128)
- Add GPT-OSS disagg test for transceiver v2 (#15301)
- Fix Cosmos3 tests after VisualGen config split (#15170)
- Fix visual gen test leak issue (#15236)
- Fix Qwen3-Next bf16 4gpu test (#15206)
- Clean up Nemotron test cases (#15586)
- Fix and unwaive step3p7 test cases (#15583)
- Add test coverage for MiniMax model with multi-node and M2.5 checkpoint eval (#15361)
- Add GLM NVFP4 stress test (#15437)
- Remove unreferenced accuracy tests and orphaned entries (#15593)
- Update .gitattributes (#15606)

What's Changed

[None][fix] AutoDeploy: Fixed wrong dist_backend AUTO detection when using trtllm-llmapi-launch by @MrGeva in #15423
[None][test] Waive 2 failed cases for main in QA CI by @tensorrt-cicd in #15341
[TRTLLMINF-81][feat] Avoid failed runners on infra retry by @dpitman-nvda in #15237
[https://nvbugs/6179661][fix] Harden disagg cache transceiver teardown by @chienchunhung in #15422
[https://nvbugs/6273846][test] gate GPT-OSS TRTLLM Gen MoE tests to SM100/SM103 by @dongfengy in #15128
[None][fix] avoid type checking failures due to pip dependency resolution by @ixlmar in #15517
[None][feat] VisualGen: async mp4 encode + fixed noise latent via env vars by @wu6u3tw in #15229
[https://nvbugs/6337235][test] Fix MX/GMS model loader fixtures by @chienchunhung in #15471
[None][test] Un-waive K2.5 Thinking FP4 disagg-NIXL e2e/gen_only tests by @chenfeiz0326 in #15443
[None][test] Waive 3 failed cases for main in QA CI by @tensorrt-cicd in #15509
[None][test] Waive 11 failed cases for main in QA CI by @tensorrt-cicd in #15506
[None][test] Waive 4 failed cases for main in QA CI by @tensorrt-cicd in #15505
[TRTLLM-13550][feat] WideEP FT: add MPI signal handler replacement (1d.0) by @chienchunhung in #14160
[None][test] Remove 60 closed-bug waive entries for main by @tensorrt-cicd in #15511
[#3237][fix] Support negative numbers in MajorityVote digit validation by @nikJ13 in #12294
[None][test] Waive 10 failed cases for main in post-merge by @tensorrt-cicd in #15535
[None][test] Waive 9 failed cases for main in QA CI by @tensorrt-cicd in #15504
[None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #15499
[None][test] Waive 4 failed cases for main in QA CI by @tensorrt-cicd in #15510
[None][fix] AutoDeploy: handle torch dist all_gather in multi_stream MLA transform by @MrGeva in #15456
[None][feat] Add Gemma-4 NVFP4 quantized models to AutoDeploy registry by @marinayanov in #15382
[None][fix] Fix encoder-decoder beam search corruption via per-slot fragmentPointerDevice by @achartier in #15461
[https://nvbugs/6306936][test] Re-enable AutoDeploy disagg tests by @govind-ramnarayan in #15325
[None][infra] split single-node perf sanity GB200 by @tburt-nv in #15548
[None][chore] Bump version to 1.3.0rc20 by @yuanjingx87 in #15551
[#10710][fix] clarify and align trtllm-bench runtime logging by @marinayanov in #15254
[https://nvbugs/6290345][fix] Fix allreduce benchmark input setup by @nv-lschneider in #15427
[None][feat] DSv4 prep: IndexerTopK and TopK primitives by @lfr-0531 in #15381
[None][perf] Cutedsl NVF4 MOE: grouped/swiglu GEMM: Fix acc pipeline release arrive threads + FC2 meta stage code clean by @liyuhannnnn in #15258
[https://nvbugs/6271740][test] Update llm_perf_core.yml to include new performance test for DeepSeek R1 0528 FP4 model by @yufeiwu-nv in #15453
[None][fix] Stabilize perf-sanity tests by @chenfeiz0326 in #15440
[None][test] fix Cosmos3 tests after VisualGen config split by @bobboli in #15170
[None][feat] DSv4 prep: compressor and mHC primitives by @lfr-0531 in #15379
[None][infra] Waive 3 failed cases for main in post-merge 2802 by @ZhanruiSunCh in #15571
[https://nvbugs/6264844][fix] Fix wrong NCCL fallback in nemotron-h by @Wanli-Jiang in #15294
[None][test] Waive 6 failed cases for main in QA CI by @tensorrt-cicd in #15570
[https://nvbugs/6344108][fix] skip TestNemotron3Super120B on pre-blackwell by @bo-nv in #15539
[None][fix] Fix passing scaled timestep to time_embedder in Cosmos3 by @bastefaniak in #15545
[None][chore] Remove nv-internal-release guardword comments in mega_moe_nvfp4 by @xxi-nv in #15575
[None][ci] move more test cases to post merge by @QiJune in #15568
[https://nvbugs/6185146][fix] Use mat_a.new_empty([m, n_out//2]) / input_scale.new_empty([sf_size]) in the by @tensorrt-cicd in #14710
[TRTLLM-35882][feat] cute dsl gvr-topk load-balance optimization by @limin2021 in #15304
[None][test] Waive 2 failed cases for main in QA CI by @tensorrt-cicd in #15579
[None][test] waive hang issues by @xinhe-nv in #15576
[None][test] waive hang issues by @xinhe-nv in #15581
[#14874][feat] AutoDeploy : Perf optimization for gpt-oss-120b for low conc by @taylor-yb-lee in #15531
[TRTLLM-12982][perf] reuse multi-item scoring position_ids and params by @ixlmar in #15413
[TRTLLM-13599][test] Refine Qwen3.5 test cases by @nv-guomingz in #15544
[TRTLLMINF-111][infra] Reuse image sqsh file by @EmmaQiaoCh in #15147
[None][feat] DSv4 prep: MoE routing and backend support by @lfr-0531 in #15402
[None][feat] DSv4 prep: runtime cache foundations by @lfr-0531 in #15378
[https://nvbugs/6156233][fix] Lower GSM8K reference for the three GPT-OSS/20B-MXFP4 entries with… by @tensorrt-cicd in #15393
[None][chore] Small cleanups to MultimodalModelMixin by @2ez4bz in #15322
[TRTLLM-13123][feat] CUDA graph wrapper for multimodal encoders by @2ez4bz in #14829
[TRTLLM-12622][feat] Add native post-processing hook to trtllm-serve by @xwang233 in #15239
[None][feat] Add Qwen Image visual generation examples by @yibinl-nvidia in #15235
[TRTLLM-13490][feat] Support cross-attention with FlashInfer TRTLLM-Gen kernels on Blackwell by @cascade812 in #15429
[None][fix] LTX-2: re-enable Ulysses for v2a cross-attention by @luyiyun1021 in #15303
[TRTLLM-13246][feat] Wave 1: migrate aliases to setup_aliases and stage GMS RO load by @chienchunhung in #15014
[None][feat] Support post-norm and per-aux fc_norm for Eagle3 draft models by @Dogacel in #14988
[None][fix] fix FA4 install in devel docker by @o-stoner in #14706
[https://nvbugs/6276842][test] Loosen rtol/atol on encoder CUDA graph logits parity check by @tingyangk in #15527
[#15179][fix] Add necessary methods for guided decoding in Kimi K2.5 by @chungen04 in #15180
[None][test] Waive failed unittest on all devices (nvbugs/6335726) by @guqiqi in #15585
[None][infra] add blossom-ci authorized users by @niukuo in #15549
[https://nvbugs/6166097][fix] Fix CuteDSL NVFP4 EPLB weight layout by @nv-xtf in #15538
[None][test] GPT-OSS disagg test for transceiver v2 by @Shixiaowei02 in #15301
[None][feat] Add BaseResourceManager-based KV-cache compression manager framework by @Hudayday in #15106
[None][infra] use default split when CBTS test-db download fails by @crazydemo in #15592
[#12715][fix] Disable NCCL window buffers on GB10 by @nv-lschneider in #15559
[TRTLLM-11353][feat] API to configure TeaCache coefficients by @o-stoner in #13170
[TRTLLM-12242][feat] Add Marlin NVFP4 backend for MoE and Linear on Hopper by @xuantengh in #13476
[https://nvbugs/6094068][fix] Fix Qwen3-Next bf16 4gpu test by @JadoTu in #15206
[None][feat] Dis-agg transceiver mass integration from the DSV4 branch by @Shixiaowei02 in #15222
[https://nvbugs/6224637][fix] Enable CuTe DSL BF16 kernels for SM100 PP by @yuxianq in #14993
[https://nvbugs/6256531][test] Unwaive Llama guided decoding xgrammar by @sunnyqgg in #15240
[None][feat] DSv4: sparse cache manager adapter by @lfr-0531 in #15394
[TRTLLM-12982][chore] relocate torch_multi_arange by @ixlmar in #15416
[None][infra] Waive 15 failed cases for main in post-merge 2804 by @ZhanruiSunCh in #15620
[None][test] Waive hang issues by @xinhe-nv in #15609
[TRTLLM-13371][perf] LTX-2 FA4: enable split-KV heuristic (num_splits=0) for low-occupancy cross-attn by @luyiyun1021 in #15399
[https://nvbugs/6346546][fix] fix mRoPE CUDA graph gate for text requests by @yechank-nvidia in #15589
[https://nvbugs/6274932] [fix] Fix and unwaive step3p7 test cases by @kaiyux in #15583
[TRTLLM-13600][test] Clean up Qwen3 test cases by @nv-guomingz in #15591
[TRTLLM-13601][test] Clean up Nemotron test cases by @nv-guomingz in #15586
[None][infra] Fix node list query failing on tcsh login nodes by @yiqingy0 in #15623
Revert "[TRTLLM-12622][feat] Add native post-processing hook to trtllm-serve" by @tburt-nv in #15629
[TRTLLM-13612][test] Remove unreferenced accuracy tests and orphaned … by @nv-guomingz in #15593
[https://nvbugs/6215688][fix] Fix visual gen test leaked issue by @yibinl-nvidia in #15236
[https://nvbugs/6021427][fix] BREAKING CHANGE: Make request chat_template opt-in by @yibinl-nvidia in #14646
[None][infra] AutoDeploy: Add trtllm runner for standalone llm-c by @bmarimuthu-nv in #15630
[https://nvbugs/6274614][fix] remove spec tokens env for stress test by @chuangz0 in #15153
[TRTLLM-13444][test] Add Qwen-Image text-to-image unit tests by @yingguo-trt in #15580
[TRTLLM-13247][feat] Wave 2: stage Linear and Attention transforms by @chienchunhung in #15288
[https://nvbugs/6368480][fix] Cache the SM count once in FmhaDispatcher's constructor and reuse the cached… by @chenfeiz0326 in #15611
[None][test] Add modularized perf tests (attention + MoE discrete/continuous) by @ruodil in #15541
[#15565][fix] AutoDeploy: Fix Super MTP IMA introduced by checkpointing replay by @galagam in #15622
[#15613][fix] Gemma4 multimodal: fix vision TP and xgrammar startup crashes by @Thachnh in #15566
[TRTLLM-12762][test] Add Test coverage for MiniMax Model with multi-node, M2.5 checkpoints eval by @jieli-matrix in #15361
[TRTLLM-13575][feat] Add EPLB support for Qwen3.5 by @nv-guomingz in #15543
[None][test] add GLM nvfp4 stress test by @xinhe-nv in #15437
[TRTLLM-12982][chore] improve multi-item scoring request validation by @ixlmar in #15627
[None][test] Add Qwen3.5-397B-A17B-NVFP4 B200 aggregated perf-sanity tests by @chenfeiz0326 in #15650
[None][infra] take test durations into account to determine cbts splits num by @crazydemo in #15614
[None][doc] Add deploy guide for Minimax M3 by @WeiHaocheng in #15587
[None][chore] Update .gitattributes by @ziyixiong-nv in #15606
[https://nvbugs/6239637][fix] Unwaive Qwen3.5 cases on A100 platform by @nv-guomingz in #15481
[TRTLLM-13712][feat] Add Qwen-Image-Bench evaluator by @yibinl-nvidia in #14837
[https://nvbugs/6248783][test] Unwaive Qwen3 skip softmax test by @bobboli in #15652
[None][fix] User/tjohnsen/evict empty blocks first by @thorjohnsen in #11685
[https://nvbugs/6293536][fix] Stage KV block offsets through a fresh host buffer by @thorjohnsen in #15546
[TRTLLM-13370][perf] LTX2 + WAN: Fuse MLP up-GEMM + bias + GELU(tanh) + NVFP4-quant into CuteDSL epilogue by @luyiyun1021 in #15299
[https://nvbugs/6248837][chore] waive memory polluters by @tburt-nv in #15665
[https://nvbugs/6269778][fix] Fix overallocation of draft KV cache by @mikeiovine in #15017
[None][feat] add MXFP8 weight format + CUTLASS W8A8 Linear and MoE by @WeiHaocheng in #14962
[https://nvbugs/6344612][test] relax GPT-OSS GPQA references due to high variance in random sampling by @dongfengy in #15567
[https://nvbugs/6062416][fix] Cache NCCL window allocation failures by size by @nv-lschneider in #15596

New Contributors

@nikJ13 made their first contribution in #12294
@bastefaniak made their first contribution in #15545
@Dogacel made their first contribution in #14988
@guqiqi made their first contribution in #15585
@nv-xtf made their first contribution in #15538
@xuantengh made their first contribution in #13476
@Thachnh made their first contribution in #15566

Full Changelog: v1.3.0rc19...v1.3.0rc20