NVIDIA/TensorRT-LLM v1.3.0rc17 on GitHub

Highlights

Known Issues
- DeepSeek V3.2 will crash with an illegal memory access during long-running performance tests under various agg/disagg configurations.
Model Support
- Add MoT World Model support (#14012)
- Enable multi-node tensor parallelism for MiniMax-M2 (#14314)
- Restore Mistral Large 3 text-only processor (#14248)
- Support Gemma4 multi-head_dim pools and host-side slicing for SWA Triton kernels (#13745)
- Add a reasoning parser for Qwen3.5 (#14659)
- Add LTX-2 Ulysses cross-attention for v2a with audio padding (#14044)
- Add Poolside Laguna tool parser (#14638)
- Replace Parakeet audio encoder with native TensorRT-LLM layers (#14474)
- Set Mamba SSM cache to fp32 for NemotronV2 (#14448)
- API
- Allow content: null in CustomChatCompletionMessageParam (#14368)
- Enforce trust_remote_code flag (#13527)
- Add thinking token budget control (#14665)
- Expose host/GPU per-iter time and clarify iter labeling in /metrics (#14127)
- Make attention backend case-insensitive (#14635)
Feature
- Add FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron (#13773)
- Integrate the FlashInfer GDN prefill kernel for Qwen3.5 (#13644)
- Add LoRA support to LLMAPI Triton backend (#14079)
- Log KV cache utilization and context tokens per iteration (#14206)
- Remove one-warp-per-token policy from MoE A2A kernels (#14550)
- Support non-divisible expert parallelism in MoE all-to-all and Slurm benchmark (#13888)
- Add CuTe DSL attention via exported binaries in VisualGen (#13721)
- Enable NVFP4 KV cache support in trtllm-gen attention (#12544)
- Add GMS-only weight sharing support (#13926)
- Add VisualGen tensor parallelism support (#13614)
- Enable NCCL symmetric zero-copy by default (#14472)
- Improve disaggregated TTFT (#14719)
Fix
- Restore K2.5 multimodal dep8 accuracy test on Transformers 5.5.x (#14392)
- Remove sync after FlashInfer attention plan() (#14634)
- Add a compatibility shim in load_hf_tokenizer for bytes_to_unicode (#14090)
- Route trtllm-bench and trtllm-serve tokenizer load through TransformersTokenizer (#14452)
- Fix crash in deep_ep.pyby falling back to the pre-quant dispatch path when hidden_states_sf is missing (#14404)
- Fix gpt-oss accuracy issue by moving TinyGEMM PDL release after reduction (#14537)
- Fix Mistral-Large-3 weight loading crash (#14033)
- Bypass FlashInfer SSD prefill to fix state dtype precision (#14600)
- Fix qwen3 hang on SM120/121 (#14424)
- Fix NVFP4 engine size estimation and attention DP batch size in trtllm-bench (#13498)
- Catch OSError in config_file_lock for NFS compatibility (#11960)
- Fix MoE DeepGEMM workspace size with attention DP (#13310)
- Fix inf/NaN issues in Triton Mamba softplus (#14652)
- Cap per-rank max_num_active_requests by max_num_tokens under attention DP (#14481)
- Propagate external SWA window to FMHA kernel in V2 KV cache (#13719)
- Resolve NVML device index mismatch in get_numa_aware_cpu_affinity when CUDA_VISIBLE_DEVICES is set (#12985)
- Replace fixed disagg fill throttle with slow-start ramp (#14475)
- Reuse batch_indices_cuda across CUDA graph captures in EAGLE3 (#14381)
- Make FA4 a proper pip dependency (#13788)
- Fix GSM8K accuracy tests for LagunaXS on B200/GB200/B300 (#14580)
Documentation
- Add CUTLASS DSL uninstall step to installation guide (#14621)
- Add deprecation notice to legacy support-matrix.md (#14495)
- Fix incorrect auto sampler behavior description for beam search (#14487)
- Add VisualGen context to AGENTS.md (#14732)
Test & Infra
- Update flashinfer-python from 0.6.11.post1 to 0.6.12rc2 (#14512, #14607)
- Add disagg local one-step run script for CI submit (#14557)
- Update model path definitions in test_perf.py and clean up waives.txt (#14393)
- Dedup executor unit tests on H100/B200 (#14556)
- Add disagg cancellation stress-test harness skeleton (#14375)
- Add UCX TLS env in disagg-related tests (#14626)
- Replace ONNX spec with onnx>=1.21.0 in requirements.txt (#14577)
- Add test lists with multi-GPU tests to CI multi-GPU test trigger files (#14087)
- Add offline equivalence test for sharding IR (#13963)
- Enable kv_cache_manager_v2 test for A10 (#12885)
- Remove two-model EAGLE3 spec-decoding tests (#14735)
- Add TLLM_SPEC_DECODE_FORCE_NUM_ACCEPTED_TOKENS in spec decoding perf test (#14438)

What's Changed

[https://nvbugs/6182617][fix] Restore K2.5 multimodal dep8 accuracy test on transformers 5.5.x by @tianyuxbear in #14392
[None][feat] FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron … by @farazkh80 in #13773
[None][perf] Integrate the flashinfer gdn prefill kernel for qwen3.5 by @nv-guomingz in #13644
[None][chore] Update flashinfer-python from 0.6.11.post1 to 0.6.12rc1 by @yihwang-nv in #14512
[https://nvbugs/6162328][fix] Add a tiny compat shim in load_hf_tokenizer that, when bytes_to_unicode is m by @tensorrt-cicd in #14090
[https://nvbugs/6114610][test] unwaive disagg tests fixed by UCX_TLS setter by @xwang233 in #14440
[None][fix] Route trtllm-bench and trtllm-serve tokenizer load through TransformersTokenizer by @dc3671 in #14452
[https://nvbugs/6184914][test] Unwaive related tests by @yuxianq in #14523
[https://nvbugs/6186880][fix] In deep_ep.py, fall back to the pre-quant dispatch path when hidden_states_sf is by @tensorrt-cicd in #14404
[None][infra] Waive 2 failed cases for main in post-merge 2734 by @ZhanruiSunCh in #14526
[None][infra] Waive 1 failed cases for main in post-merge 2735 by @ZhanruiSunCh in #14542
[#11257][feat] Add LoRA support to llmapi triton backend by @karljang in #14079
[None][chore] Include layer_idx in MoE backend fallback warnings by @dc3671 in #13409
[None][chore] Add disagg local one-step run script for CI submit by @fredricz-20070104 in #14557
[https://nvbugs/5974335][refactor] Update model path definitions in test_perf.py and clean up waives.txt by @yufeiwu-nv in #14393
[TRTLLM-12968][ci] Dedup executor unit tests on H100/B200 by @YihuiLu512 in #14556
[TRTLLM-12949][refactor] visual_gen: unify fused QK-norm+rope dispatch by @luyiyun1021 in #14529
[https://nvbugs/6143579][fix] Allow content: null in CustomChatCompletionMessageParam by @tijyojwad in #14368
[None][chore] log KV cache utilization and context tokens per iter by @pcicotti in #14206
[https://nvbugs/6168859][fix] move tinygemm PDL release after reduction by @dongfengy in #14537
[None][chore] Unwaive test_cp_tp_broadcast_object by @brb-nv in #14328
[https://nvbugs/6211185][fix] Fix failed GSM8K accuracy tests for LagunaXS on B200/GB200/B300 by @DomBrown in #14580
[TRTLLMINF-106][infra] Use B300 frontend platforms by @mlefeb01 in #14581
[None] [refactor] Unify compressed-tensors quant config parsing by @DomBrown in #14468
[None][feat] AutoDeploy push the rope buffer to later stage by @nvchenghaoz in #13859
[https://nvbugs/6215736][infra] Unwaive test_fp8_blockscale[throughput_mtp] by @bobboli in #14541
[https://nvbugs/6175923][test] Revert gpt_oss_20b perf MoE-backend pin by @ruodil in #14612
[https://nvbugs/6221621][test] Update trust_remote to nemotron and phi4 models by @yufeiwu-nv in #14570
[None][chore] update VisualGen codeowner settings by @zhenhuaw-me in #14530
[None][infra] Waive 8 failed cases for main in post-merge 2738 by @ZhanruiSunCh in #14615
[None][perf] Fuse FlashInfer GDN prefill state I/O into Triton kernels by @nv-guomingz in #14548
[https://nvbugs/6164924][fix] Lower free_gpu_memory_fraction for Exaone tests by @tensorrt-cicd in #14486
[https://nvbugs/6163033][fix] Guard q_a_proj.weight dict access behind nvfp4_fused_a; update test to `chec by @tensorrt-cicd in #14033
[None][fix] Bypass FlashInfer SSD prefill to fix state dtype precision by @tijyojwad in #14600
[None][fix] Exclude Qwen3 VL vision model from quantization by @2ez4bz in #12851
[https://nvbugs/6162860][fix] Set free_gpu_memory_fraction=0.6 only when torch_compile=True for test_bfloat16_ by @tensorrt-cicd in #14109
[None][chore] Remove one-warp-per-token policy from MoE A2A kernels by @bobboli in #14550
[None][test] Waive 7 failed cases for main in QA CI by @xinhe-nv in #14498
[None][doc] Add CUTLASS DSL uninstall step to installation guide by @yihwang-nv in #14621
[https://nvbugs/6099723][fix] Gate supports_mnnvl() False on SM120/121 in _mnnvl_utils.py and add the same Mnn by @tensorrt-cicd in #14424
[https://nvbugs/6114464][fix] Add kv_cache_config to TestQwen3VL_MOE::test_auto_dtype by @tensorrt-cicd in #13668
[None][chore] Update flashinfer-python from 0.6.12rc1 to 0.6.12rc2 by @yihwang-nv in #14607
[https://nvbugs/6109750][test] Unwaive passing GPTOSS tests by @dongfengy in #14596
[https://nvbugs/6215690][fix] AutoDeploy: FlashInfer 128-byte alignment for Mamba inputs (also addresses nvbugs/6162114) by @galagam in #14535
[None][docs] Add deprecation notice to legacy support-matrix.md by @fuergaosi233 in #14495
[None][fix] make FA4 proper pip dependency by @o-stoner in #13788
[TRTLLM-12648][test] add disagg cancellation stress-test harness skeleton by @chienchunhung in #14375
[None][feat] support non-divisible EP in MoE alltoall and slurm benchmark by @JacobHu-NV in #13888
[https://nvbugs/6094100][fix] add ucx tls env in disagg related tests by @chuangz0 in #14626
[None][test] Update stress tests by @xinhe-nv in #14454
[None][chore] Bump version to 1.3.0rc17 by @VALLIS-NERIA in #14657
[None][infra] Fix hang when generating report by @EmmaQiaoCh in #14625
[None][infra] Update blossom-ci allowlist: add nv-anants, guqiqi, jonghyunchoe, belgarten-nv by @yiqingy0 in #14662
[https://nvbugs/6115560][fix] catch OSError in config_file_lock for NFS compatibility by @sara4dev in #11960
[None][fix] Fix OSRB source header and provenance issues in AutoDeploy modeling code by @bmarimuthu-nv in #14670
[https://nvbugs/6185173][fix] Set mamba ssm cache to fp32 for NemotronV2 by @tensorrt-cicd in #14448
[https://nvbugs/5800725][fix] Restore Mistral Large 3 text-only processor by @byshiue in #14248
[None][test] Unwaive fp8 blockscale baseline mtp1 by @sunnyqgg in #14666
[None][docs] fix incorrect auto sampler behavior description for beam search by @fuergaosi233 in #14487
[None][feat] Expose host/GPU per-iter time and clarify iter labeling in /metrics by @eopXD in #14127
[None][chore] Add test lists with multi-gpu test to CI multi-gpu test trigger files by @pengbowang-nv in #14087
[None][refactor] Add derived properties for the thop.attention call site by @yuxianq in #14279
[https://nvbugs/6160085][fix] At tensorrt_llm/tokenizer/tokenizer.py import time, re-export `bytes_to_unicod by @tensorrt-cicd in #14116
[None][infra] Waive 1 failed cases for main in post-merge 2740 by @ZhanruiSunCh in #14688
[https://nvbugs/6221483][fix] Revert auto_deploy _mamba_ssm_prepare_metadata to pre-#13566 state by @greg-kwasniewski1 in #14640
[TRTLLM-12762][fix] Enable multi-node TP for MiniMax-M2 by @pcicotti in #14314
[https://nvbugs/6043248][fix] Validate tensor payload size on deserialization by @yibinl-nvidia in #14648
[None][perf] Add AutoDeploy NVFP4 RMSNorm quant fusion by @tcherckez-nvidia in #14361
[None][feat] Support Gemma4 multi-head_dim pools and host-side slicing to provide local view to Triton kernels for SWA by @eopXD in #13745
[TRTLLM-13960][test] Offline equivalence test for sharding IR by @greg-kwasniewski1 in #13963
[TRTLLM-11410][feat] MoT World Model Support by @NVShreyas in #14012
[https://nvbugs/6115036][fix] Fix NVFP4 engine size estimation and attention DP batch size in trtllm-bench by @hyukn in #13498
[https://nvbugs/5972776][fix] Pass IPC HMAC key through file descriptor by @yibinl-nvidia in #14378
[https://nvbugs/5911594][fix] Restrict HTTP cluster storage to loopback by @yibinl-nvidia in #14161
[None][fix] Exclude post-merge stages from CBTS force-keep filters by @achartier in #14594
[TRTLLMINF-67][infra] use pre-configured idle GPU exemption by @tburt-nv in #14587
[https://nvbugs/6207749][fix] Replace the spec with onnx>=1.21.0 in requirements.txt; mirror in `security_ by @tensorrt-cicd in #14577
[https://nvbugs/6185480][fix] Autodeploy skip the GLM accuracy test for pre-hopper by @nvchenghaoz in #14656
[https://nvbugs/6165866][infra] Waive 1 failed cases for main in pre-merge 40081 by @taylor-yb-lee in #14653
[https://nvbugs/6187185][fix] Apply the existing low_memory_overrides() helper in `TestNemotronV2.test_auto_ by @tensorrt-cicd in #14584
[https://nvbugs/6192201][fix] AutoDeploy: unwaive llama perf test and increase its concurrency to 256 by @MrGeva in #14691
[TRTLLM-12982][feat] improve attention backend selection by @ixlmar in #14635
[None][test] Enable test for kv_cache_manager_v2 for A10 by @lowsfer in #12885
[None][infra] Generate json with cmake fetched contents in build stage by @yuanjingx87 in #13607
[TRTLLM-12436][feat] visual_gen: add CuTe DSL attention via exported binaries by @xrq-phys in #13721
[https://nvbugs/6229221][fix] Add a reasoning parser for qwen3_5 by @moraxu in #14659
[None][fix] Reuse batch_indices_cuda across CUDA graph captures in EAGLE3 by @achartier in #14381
[https://nvbugs/6084447][fix] Fix MoE DeepGEMM workspace size with attention_dp by @tensorrt-cicd in #13310
[https://nvbugs/6189416][fix] Add a Blackwell-specific reference entry (extra_acc_spec=sm100_fp8, accuracy=46. by @tensorrt-cicd in #14484
[#14619][perf] AutoDeploy: tune Llama-3.1-8B-Instruct-FP8 TP=2/4 config and handle CG max bs when it is unset in the yaml by @MrGeva in #14622
[#13561][feat] AutoDeploy: enable MLIR elementwise fusion and trtllm_gen MoE on Nano NVFP4 by @MrGeva in #14554
[https://nvbugs/6221841][fix] Detect via the raw config_dict whether the user actually set a top-level rope_th by @tensorrt-cicd in #14624
[None][perf] Replace Parakeet audio encoder with native trtllm layers by @aswinvisva in #14474
[https://nvbugs/6194552][fix] stabilize Triton Mamba softplus by @hnover-nv in #14652
[TRTLLM-13043][chroe] add VisualGen context to AGENTS.md by @zhenhuaw-me in #14732
[None][feature] Add thinking token budget control by @tijyojwad in #14665
[TRTLLM-13050][test] Remove two-model eagle3 spec-decoding tests by @QiJune in #14735
[TRTLLM-12653][feat] LTX-2 Ulysses cross-attention for v2a with audio padding by @luyiyun1021 in #14044
[None][refactor] Flatten thop.attention sequence kwargs + rename rotary_embedding_* to rope_* by @yuxianq in #14569
[https://nvbugs/6162857][fix] Use generation metrics for VisualGen perf sanity by @taianz-nv in #14176
[https://nvbugs/6045177][fix] resolve mypy error by @ixlmar in #14689
[None][feat] add Poolside Laguna tool parser by @DomBrown in #14638
[TRTLLM-10004][chore] Enable NCCL symmetric zero-copy by default by @nv-lschneider in #14472
[None][infra] Fix cbts tokenmacro b64 by @crazydemo in #14718
[TRTLLM-12982][perf] remove sync after FlashInfer attention plan() by @ixlmar in #14634
[https://nvbugs/6156233][test] unwaive GPT-OSS dflash test since bug has been closed as fix unknown by @dongfengy in #14713
[TRTLLM-12901][fix] cap per-rank max_num_active_requests by max_num_tokens under attention DP by @xwang233 in #14481
[None][feat] Enable NVFP4 KV cache support in trtllm-gen attention by @yihwang-nv in #12544
[https://nvbugs/6185480][fix] autodeploy unwaive the test by @nvchenghaoz in #14716
[https://nvbugs/6136737][fix] Propagate external SWA window to FMHA kernel in V2 KV cache by @tensorrt-cicd in #13719
[https://nvbugs/5996024][fix] Enforce trust_remote_code flag by @yibinl-nvidia in #13527
[https://nvbugs/5979710][fix] Bound transfer destinations by @yibinl-nvidia in #13525
[None][fix] Resolve NVML device index mismatch in get_numa_aware_cpu_affinity when CUDA_VISIBLE_DEVICES is set by @YPxHolic in #12985
[None][infra] revert #13607 by @tburt-nv in #14757
[TRTLLM-12535][chore] Refactor fast path (token ID space) preprocessing logic out to the input preprocessor methods only by @moraxu in #14370
[None][infra] Waive 2 failed cases for main in post-merge 2741 by @ZhanruiSunCh in #14737
[None][infra] Waive 1 failed cases for main in pre-merge 40562 by @ZhanruiSunCh in #14776
[TRTLLM-12440][feat] Add GMS-only weight sharing support by @chienchunhung in #13926
[None][test] Unwaive some Perf Tests by @chenfeiz0326 in #14664
[https://nvbugs/6204488][fix] Replace fixed disagg fill throttle with slow-start ramp by @chienchunhung in #14475
[None][chore] Waive failing multi-gpu test by @brb-nv in #14788
[TRTLLM-11408][feat] Add VisualGen TP Support by @belgarten-nv in #13614
[None][test] Add TLLM_SPEC_DECODE_FORCE_NUM_ACCEPTED_TOKENS in Spec Decoding Perf Test by @chenfeiz0326 in #14438
[https://nvbugs/6165866][infra] Waive 1 failed cases for main in pre-merge 40081 - Fix prefix by @taylor-yb-lee in #14756
[https://nvbugs/6196391][fix] Carryover disagg TTFT improvements by @brb-nv in #14719
[https://nvbugs/6244695][fix] Revert Pass IPC HMAC key through file descriptor by @chenfeiz0326 in #14782

New Contributors

@fuergaosi233 made their first contribution in #14495
@sara4dev made their first contribution in #11960
@YPxHolic made their first contribution in #12985
@belgarten-nv made their first contribution in #13614

Full Changelog: v1.3.0rc16...v1.3.0rc17