This RC is the last version to support the TensorRT backend. The TensorRT backend will be removed in the next version!
Known Issues
- DeepSeek V3/V3.2 can crash with an illegal memory access or hang during warm up.
- Autotuning for Qwen3-family models can crash with "Assertion failed: Failed to initialize cutlass TMA WS grouped gemm."
API
Feature
- Add DeepSeek V4 preparation (#15378, #15379, #15381, #15394, #15402, #15222)
- Add MXFP8 weight format plus CUTLASS W8A8 Linear and MoE (#14962)
- Add Marlin NVFP4 backend for MoE and Linear on Hopper (#13476)
- Add CUDA graph wrapper for multimodal encoders (#14829)
- Support cross-attention with FlashInfer TRT-LLM Gen kernels on Blackwell (#15429)
- Support post-norm and per-aux fc_norm for Eagle3 draft models (Eagle 3.1) (#14988)
- Add EPLB support for Qwen3.5 (#15543)
- Optimize CuteDSL NVFP4 MoE grouped/SwiGLU GEMM accumulation pipeline (#15258)
- Add CuTe DSL GVR-TopK load-balance optimization (#15304)
- Enable split-KV heuristic for low-occupancy cross-attention in LTX-2 FA4 (#15399)
- Fuse MLP up-GEMM + bias + GELU(tanh) + NVFP4-quant into the CuteDSL epilogue for LTX2 and WAN (#15299)
- Add async mp4 encode and configurable noise latent via env vars in VisualGen (#15229)
Fix
- Harden disagg cache transceiver teardown (#15422)
- Fix encoder-decoder beam search corruption via per-slot fragmentPointerDevice (#15461)
- Fix overallocation of draft KV cache (#15017)
- Disable NCCL window buffers on GB10 (#15559)
- Fix wrong NCCL fallback in nemotron-h (#15294)
- Fix CuteDSL NVFP4 EPLB weight layout (#15538)
- Enable CuTe DSL BF16 kernels for SM100 PP (#14993)
- Fix Gemma4 multimodal vision TP and xgrammar startup crashes (#15566)
- Add necessary methods for guided decoding in Kimi K2.5 (#15180)
- Re-enable Ulysses for LTX-2 v2a cross-attention (#15303)
- Fix passing scaled timestep to time_embedder in Cosmos3 (#15545)
- Clarify and align trtllm-bench runtime logging (#15254)
Documentation
Benchmark
Test & Infra
- Move more test cases to post-merge (#15568)
- Stabilize perf-sanity tests (#15440)
- Avoid type checking failures due to pip dependency resolution (#15517)
- Gate GPT-OSS TRT-LLM Gen MoE tests to SM100/SM103 (#15128)
- Add GPT-OSS disagg test for transceiver v2 (#15301)
- Fix Cosmos3 tests after VisualGen config split (#15170)
- Fix visual gen test leak issue (#15236)
- Fix Qwen3-Next bf16 4gpu test (#15206)
- Clean up Nemotron test cases (#15586)
- Fix and unwaive step3p7 test cases (#15583)
- Add test coverage for the MiniMax model with multi-node M2.5 checkpoint evaluation (#15361)
- Add GLM NVFP4 stress test (#15437)
- Remove unreferenced accuracy tests and orphaned entries (#15593)
- Update .gitattributes (#15606)
What's Changed
- [None][fix] AutoDeploy: Fixed wrong dist_backend AUTO detection when using trtllm-llmapi-launch by @MrGeva in #15423
- [None][test] Waive 2 failed cases for main in QA CI by @tensorrt-cicd in #15341
- [TRTLLMINF-81][feat] Avoid failed runners on infra retry by @dpitman-nvda in #15237
- [https://nvbugs/6179661][fix] Harden disagg cache transceiver teardown by @chienchunhung in #15422
- [https://nvbugs/6273846][test] gate GPT-OSS TRTLLM Gen MoE tests to SM100/SM103 by @dongfengy in #15128
- [None][fix] avoid type checking failures due to pip dependency resolution by @ixlmar in #15517
- [None][feat] VisualGen: async mp4 encode + fixed noise latent via env vars by @wu6u3tw in #15229
- [https://nvbugs/6337235][test] Fix MX/GMS model loader fixtures by @chienchunhung in #15471
- [None][test] Un-waive K2.5 Thinking FP4 disagg-NIXL e2e/gen_only tests by @chenfeiz0326 in #15443
- [None][test] Waive 3 failed cases for main in QA CI by @tensorrt-cicd in #15509
- [None][test] Waive 11 failed cases for main in QA CI by @tensorrt-cicd in #15506
- [None][test] Waive 4 failed cases for main in QA CI by @tensorrt-cicd in #15505
- [TRTLLM-13550][feat] WideEP FT: add MPI signal handler replacement (1d.0) by @chienchunhung in #14160
- [None][test] Remove 60 closed-bug waive entries for main by @tensorrt-cicd in #15511
- [#3237][fix] Support negative numbers in MajorityVote digit validation by @nikJ13 in #12294
- [None][test] Waive 10 failed cases for main in post-merge by @tensorrt-cicd in #15535
- [None][test] Waive 9 failed cases for main in QA CI by @tensorrt-cicd in #15504
- [None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #15499
- [None][test] Waive 4 failed cases for main in QA CI by @tensorrt-cicd in #15510
- [None][fix] AutoDeploy: handle torch dist all_gather in multi_stream MLA transform by @MrGeva in #15456
- [None][feat] Add Gemma-4 NVFP4 quantized models to AutoDeploy registry by @marinayanov in #15382
- [None][fix] Fix encoder-decoder beam search corruption via per-slot fragmentPointerDevice by @achartier in #15461
- [https://nvbugs/6306936][test] Re-enable AutoDeploy disagg tests by @govind-ramnarayan in #15325
- [None][infra] split single-node perf sanity GB200 by @tburt-nv in #15548
- [None][chore] Bump version to 1.3.0rc20 by @yuanjingx87 in #15551
- [#10710][fix] clarify and align trtllm-bench runtime logging by @marinayanov in #15254
- [https://nvbugs/6290345][fix] Fix allreduce benchmark input setup by @nv-lschneider in #15427
- [None][feat] DSv4 prep: IndexerTopK and TopK primitives by @lfr-0531 in #15381
- [None][perf] Cutedsl NVF4 MOE: grouped/swiglu GEMM: Fix acc pipeline release arrive threads + FC2 meta stage code clean by @liyuhannnnn in #15258
- [https://nvbugs/6271740][test] Update llm_perf_core.yml to include new performance test for DeepSeek R1 0528 FP4 model by @yufeiwu-nv in #15453
- [None][fix] Stabilize perf-sanity tests by @chenfeiz0326 in #15440
- [None][test] fix Cosmos3 tests after VisualGen config split by @bobboli in #15170
- [None][feat] DSv4 prep: compressor and mHC primitives by @lfr-0531 in #15379
- [None][infra] Waive 3 failed cases for main in post-merge 2802 by @ZhanruiSunCh in #15571
- [https://nvbugs/6264844][fix] Fix wrong NCCL fallback in nemotron-h by @Wanli-Jiang in #15294
- [None][test] Waive 6 failed cases for main in QA CI by @tensorrt-cicd in #15570
- [https://nvbugs/6344108][fix] skip TestNemotron3Super120B on pre-blackwell by @bo-nv in #15539
- [None][fix] Fix passing scaled timestep to time_embedder in Cosmos3 by @bastefaniak in #15545
- [None][chore] Remove nv-internal-release guardword comments in mega_moe_nvfp4 by @xxi-nv in #15575
- [None][ci] move more test cases to post merge by @QiJune in #15568
- [https://nvbugs/6185146][fix] Use mat_a.new_empty([m, n_out//2])/input_scale.new_empty([sf_size]) in the by @tensorrt-cicd in #14710
- [TRTLLM-35882][feat] cute dsl gvr-topk load-balance optimization by @limin2021 in #15304
- [None][test] Waive 2 failed cases for main in QA CI by @tensorrt-cicd in #15579
- [None][test] waive hang issues by @xinhe-nv in #15576
- [None][test] waive hang issues by @xinhe-nv in #15581
- [#14874][feat] AutoDeploy : Perf optimization for gpt-oss-120b for low conc by @taylor-yb-lee in #15531
- [TRTLLM-12982][perf] reuse multi-item scoring position_ids and params by @ixlmar in #15413
- [TRTLLM-13599][test] Refine Qwen3.5 test cases by @nv-guomingz in #15544
- [TRTLLMINF-111][infra] Reuse image sqsh file by @EmmaQiaoCh in #15147
- [None][feat] DSv4 prep: MoE routing and backend support by @lfr-0531 in #15402
- [None][feat] DSv4 prep: runtime cache foundations by @lfr-0531 in #15378
- [https://nvbugs/6156233][fix] Lower GSM8K reference for the three GPT-OSS/20B-MXFP4 entries with… by @tensorrt-cicd in #15393
- [None][chore] Small cleanups to MultimodalModelMixin by @2ez4bz in #15322
- [TRTLLM-13123][feat] CUDA graph wrapper for multimodal encoders by @2ez4bz in #14829
- [TRTLLM-12622][feat] Add native post-processing hook to trtllm-serve by @xwang233 in #15239
- [None][feat] Add Qwen Image visual generation examples by @yibinl-nvidia in #15235
- [TRTLLM-13490][feat] Support cross-attention with FlashInfer TRTLLM-Gen kernels on Blackwell by @cascade812 in #15429
- [None][fix] LTX-2: re-enable Ulysses for v2a cross-attention by @luyiyun1021 in #15303
- [TRTLLM-13246][feat] Wave 1: migrate aliases to setup_aliases and stage GMS RO load by @chienchunhung in #15014
- [None][feat] Support post-norm and per-aux fc_norm for Eagle3 draft models by @Dogacel in #14988
- [None][fix] fix FA4 install in devel docker by @o-stoner in #14706
- [https://nvbugs/6276842][test] Loosen rtol/atol on encoder CUDA graph logits parity check by @tingyangk in #15527
- [#15179][fix] Add necessary methods for guided decoding in Kimi K2.5 by @chungen04 in #15180
- [None][test] Waive failed unittest on all devices (nvbugs/6335726) by @guqiqi in #15585
- [None][infra] add blossom-ci authorized users by @niukuo in #15549
- [https://nvbugs/6166097][fix] Fix CuteDSL NVFP4 EPLB weight layout by @nv-xtf in #15538
- [None][test] GPT-OSS disagg test for transceiver v2 by @Shixiaowei02 in #15301
- [None][feat] Add BaseResourceManager-based KV-cache compression manager framework by @Hudayday in #15106
- [None][infra] use default split when CBTS test-db download fails by @crazydemo in #15592
- [#12715][fix] Disable NCCL window buffers on GB10 by @nv-lschneider in #15559
- [TRTLLM-11353][feat] API to configure TeaCache coefficients by @o-stoner in #13170
- [TRTLLM-12242][feat] Add Marlin NVFP4 backend for MoE and Linear on Hopper by @xuantengh in #13476
- [https://nvbugs/6094068][fix] Fix Qwen3-Next bf16 4gpu test by @JadoTu in #15206
- [None][feat] Dis-agg transceiver mass integration from the DSV4 branch by @Shixiaowei02 in #15222
- [https://nvbugs/6224637][fix] Enable CuTe DSL BF16 kernels for SM100 PP by @yuxianq in #14993
- [https://nvbugs/6256531][test] Unwaive Llama guided decoding xgrammar by @sunnyqgg in #15240
- [None][feat] DSv4: sparse cache manager adapter by @lfr-0531 in #15394
- [TRTLLM-12982][chore] relocate torch_multi_arange by @ixlmar in #15416
- [None][infra] Waive 15 failed cases for main in post-merge 2804 by @ZhanruiSunCh in #15620
- [None][test] Waive hang issues by @xinhe-nv in #15609
- [TRTLLM-13371][perf] LTX-2 FA4: enable split-KV heuristic (num_splits=0) for low-occupancy cross-attn by @luyiyun1021 in #15399
- [https://nvbugs/6346546][fix] fix mRoPE CUDA graph gate for text requests by @yechank-nvidia in #15589
- [https://nvbugs/6274932] [fix] Fix and unwaive step3p7 test cases by @kaiyux in #15583
- [TRTLLM-13600][test] Clean up Qwen3 test cases by @nv-guomingz in #15591
- [TRTLLM-13601][test] Clean up Nemotron test cases by @nv-guomingz in #15586
- [None][infra] Fix node list query failing on tcsh login nodes by @yiqingy0 in #15623
- Revert "[TRTLLM-12622][feat] Add native post-processing hook to trtllm-serve" by @tburt-nv in #15629
- [TRTLLM-13612][test] Remove unreferenced accuracy tests and orphaned … by @nv-guomingz in #15593
- [https://nvbugs/6215688][fix] Fix visual gen test leaked issue by @yibinl-nvidia in #15236
- [https://nvbugs/6021427][fix] BREAKING CHANGE: Make request chat_template opt-in by @yibinl-nvidia in #14646
- [None][infra] AutoDeploy: Add trtllm runner for standalone llm-c by @bmarimuthu-nv in #15630
- [https://nvbugs/6274614][fix] remove spec tokens env for stress test by @chuangz0 in #15153
- [TRTLLM-13444][test] Add Qwen-Image text-to-image unit tests by @yingguo-trt in #15580
- [TRTLLM-13247][feat] Wave 2: stage Linear and Attention transforms by @chienchunhung in #15288
- [https://nvbugs/6368480][fix] Cache the SM count once in FmhaDispatcher's constructor and reuse the cached… by @chenfeiz0326 in #15611
- [None][test] Add modularized perf tests (attention + MoE discrete/continuous) by @ruodil in #15541
- [#15565][fix] AutoDeploy: Fix Super MTP IMA introduced by checkpointing replay by @galagam in #15622
- [#15613][fix] Gemma4 multimodal: fix vision TP and xgrammar startup crashes by @Thachnh in #15566
- [TRTLLM-12762][test] Add Test coverage for MiniMax Model with multi-node, M2.5 checkpoints eval by @jieli-matrix in #15361
- [TRTLLM-13575][feat] Add EPLB support for Qwen3.5 by @nv-guomingz in #15543
- [None][test] add GLM nvfp4 stress test by @xinhe-nv in #15437
- [TRTLLM-12982][chore] improve multi-item scoring request validation by @ixlmar in #15627
- [None][test] Add Qwen3.5-397B-A17B-NVFP4 B200 aggregated perf-sanity tests by @chenfeiz0326 in #15650
- [None][infra] take test durations into account to determine cbts splits num by @crazydemo in #15614
- [None][doc] Add deploy guide for Minimax M3 by @WeiHaocheng in #15587
- [None][chore] Update .gitattributes by @ziyixiong-nv in #15606
- [https://nvbugs/6239637][fix] Unwaive Qwen3.5 cases on A100 platform by @nv-guomingz in #15481
- [TRTLLM-13712][feat] Add Qwen-Image-Bench evaluator by @yibinl-nvidia in #14837
- [https://nvbugs/6248783][test] Unwaive Qwen3 skip softmax test by @bobboli in #15652
- [None][fix] User/tjohnsen/evict empty blocks first by @thorjohnsen in #11685
- [https://nvbugs/6293536][fix] Stage KV block offsets through a fresh host buffer by @thorjohnsen in #15546
- [TRTLLM-13370][perf] LTX2 + WAN: Fuse MLP up-GEMM + bias + GELU(tanh) + NVFP4-quant into CuteDSL epilogue by @luyiyun1021 in #15299
- [https://nvbugs/6248837][chore] waive memory polluters by @tburt-nv in #15665
- [https://nvbugs/6269778][fix] Fix overallocation of draft KV cache by @mikeiovine in #15017
- [None][feat] add MXFP8 weight format + CUTLASS W8A8 Linear and MoE by @WeiHaocheng in #14962
- [https://nvbugs/6344612][test] relax GPT-OSS GPQA references due to high variance in random sampling by @dongfengy in #15567
- [https://nvbugs/6062416][fix] Cache NCCL window allocation failures by size by @nv-lschneider in #15596
New Contributors
- @nikJ13 made their first contribution in #12294
- @bastefaniak made their first contribution in #15545
- @Dogacel made their first contribution in #14988
- @guqiqi made their first contribution in #15585
- @nv-xtf made their first contribution in #15538
- @xuantengh made their first contribution in #13476
- @Thachnh made their first contribution in #15566
Full Changelog: v1.3.0rc19...v1.3.0rc20