github NVIDIA/TensorRT-LLM v1.3.0rc17

pre-release7 hours ago

Highlights

  • Known Issues
    • DeepSeek V3.2 will crash with an illegal memory access during long-running performance tests under various agg/disagg configurations.
  • Model Support
    • Add MoT World Model support (#14012)
    • Enable multi-node tensor parallelism for MiniMax-M2 (#14314)
    • Restore Mistral Large 3 text-only processor (#14248)
    • Support Gemma4 multi-head_dim pools and host-side slicing for SWA Triton kernels (#13745)
    • Add a reasoning parser for Qwen3.5 (#14659)
    • Add LTX-2 Ulysses cross-attention for v2a with audio padding (#14044)
    • Add Poolside Laguna tool parser (#14638)
    • Replace Parakeet audio encoder with native TensorRT-LLM layers (#14474)
    • Set Mamba SSM cache to fp32 for NemotronV2 (#14448)
    • API
    • Allow content: null in CustomChatCompletionMessageParam (#14368)
    • Enforce trust_remote_code flag (#13527)
    • Add thinking token budget control (#14665)
    • Expose host/GPU per-iter time and clarify iter labeling in /metrics (#14127)
    • Make attention backend case-insensitive (#14635)
  • Feature
    • Add FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron (#13773)
    • Integrate the FlashInfer GDN prefill kernel for Qwen3.5 (#13644)
    • Add LoRA support to LLMAPI Triton backend (#14079)
    • Log KV cache utilization and context tokens per iteration (#14206)
    • Remove one-warp-per-token policy from MoE A2A kernels (#14550)
    • Support non-divisible expert parallelism in MoE all-to-all and Slurm benchmark (#13888)
    • Add CuTe DSL attention via exported binaries in VisualGen (#13721)
    • Enable NVFP4 KV cache support in trtllm-gen attention (#12544)
    • Add GMS-only weight sharing support (#13926)
    • Add VisualGen tensor parallelism support (#13614)
    • Enable NCCL symmetric zero-copy by default (#14472)
    • Improve disaggregated TTFT (#14719)
  • Fix
    • Restore K2.5 multimodal dep8 accuracy test on Transformers 5.5.x (#14392)
    • Remove sync after FlashInfer attention plan() (#14634)
    • Add a compatibility shim in load_hf_tokenizer for bytes_to_unicode (#14090)
    • Route trtllm-bench and trtllm-serve tokenizer load through TransformersTokenizer (#14452)
    • Fix crash in deep_ep.pyby falling back to the pre-quant dispatch path when hidden_states_sf is missing (#14404)
    • Fix gpt-oss accuracy issue by moving TinyGEMM PDL release after reduction (#14537)
    • Fix Mistral-Large-3 weight loading crash (#14033)
    • Bypass FlashInfer SSD prefill to fix state dtype precision (#14600)
    • Fix qwen3 hang on SM120/121 (#14424)
    • Fix NVFP4 engine size estimation and attention DP batch size in trtllm-bench (#13498)
    • Catch OSError in config_file_lock for NFS compatibility (#11960)
    • Fix MoE DeepGEMM workspace size with attention DP (#13310)
    • Fix inf/NaN issues in Triton Mamba softplus (#14652)
    • Cap per-rank max_num_active_requests by max_num_tokens under attention DP (#14481)
    • Propagate external SWA window to FMHA kernel in V2 KV cache (#13719)
    • Resolve NVML device index mismatch in get_numa_aware_cpu_affinity when CUDA_VISIBLE_DEVICES is set (#12985)
    • Replace fixed disagg fill throttle with slow-start ramp (#14475)
    • Reuse batch_indices_cuda across CUDA graph captures in EAGLE3 (#14381)
    • Make FA4 a proper pip dependency (#13788)
    • Fix GSM8K accuracy tests for LagunaXS on B200/GB200/B300 (#14580)
  • Documentation
    • Add CUTLASS DSL uninstall step to installation guide (#14621)
    • Add deprecation notice to legacy support-matrix.md (#14495)
    • Fix incorrect auto sampler behavior description for beam search (#14487)
    • Add VisualGen context to AGENTS.md (#14732)
  • Test & Infra
    • Update flashinfer-python from 0.6.11.post1 to 0.6.12rc2 (#14512, #14607)
    • Add disagg local one-step run script for CI submit (#14557)
    • Update model path definitions in test_perf.py and clean up waives.txt (#14393)
    • Dedup executor unit tests on H100/B200 (#14556)
    • Add disagg cancellation stress-test harness skeleton (#14375)
    • Add UCX TLS env in disagg-related tests (#14626)
    • Replace ONNX spec with onnx>=1.21.0 in requirements.txt (#14577)
    • Add test lists with multi-GPU tests to CI multi-GPU test trigger files (#14087)
    • Add offline equivalence test for sharding IR (#13963)
    • Enable kv_cache_manager_v2 test for A10 (#12885)
    • Remove two-model EAGLE3 spec-decoding tests (#14735)
    • Add TLLM_SPEC_DECODE_FORCE_NUM_ACCEPTED_TOKENS in spec decoding perf test (#14438)

What's Changed

  • [https://nvbugs/6182617][fix] Restore K2.5 multimodal dep8 accuracy test on transformers 5.5.x by @tianyuxbear in #14392
  • [None][feat] FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron … by @farazkh80 in #13773
  • [None][perf] Integrate the flashinfer gdn prefill kernel for qwen3.5 by @nv-guomingz in #13644
  • [None][chore] Update flashinfer-python from 0.6.11.post1 to 0.6.12rc1 by @yihwang-nv in #14512
  • [https://nvbugs/6162328][fix] Add a tiny compat shim in load_hf_tokenizer that, when bytes_to_unicode is m by @tensorrt-cicd in #14090
  • [https://nvbugs/6114610][test] unwaive disagg tests fixed by UCX_TLS setter by @xwang233 in #14440
  • [None][fix] Route trtllm-bench and trtllm-serve tokenizer load through TransformersTokenizer by @dc3671 in #14452
  • [https://nvbugs/6184914][test] Unwaive related tests by @yuxianq in #14523
  • [https://nvbugs/6186880][fix] In deep_ep.py, fall back to the pre-quant dispatch path when hidden_states_sf is by @tensorrt-cicd in #14404
  • [None][infra] Waive 2 failed cases for main in post-merge 2734 by @ZhanruiSunCh in #14526
  • [None][infra] Waive 1 failed cases for main in post-merge 2735 by @ZhanruiSunCh in #14542
  • [#11257][feat] Add LoRA support to llmapi triton backend by @karljang in #14079
  • [None][chore] Include layer_idx in MoE backend fallback warnings by @dc3671 in #13409
  • [None][chore] Add disagg local one-step run script for CI submit by @fredricz-20070104 in #14557
  • [https://nvbugs/5974335][refactor] Update model path definitions in test_perf.py and clean up waives.txt by @yufeiwu-nv in #14393
  • [TRTLLM-12968][ci] Dedup executor unit tests on H100/B200 by @YihuiLu512 in #14556
  • [TRTLLM-12949][refactor] visual_gen: unify fused QK-norm+rope dispatch by @luyiyun1021 in #14529
  • [https://nvbugs/6143579][fix] Allow content: null in CustomChatCompletionMessageParam by @tijyojwad in #14368
  • [None][chore] log KV cache utilization and context tokens per iter by @pcicotti in #14206
  • [https://nvbugs/6168859][fix] move tinygemm PDL release after reduction by @dongfengy in #14537
  • [None][chore] Unwaive test_cp_tp_broadcast_object by @brb-nv in #14328
  • [https://nvbugs/6211185][fix] Fix failed GSM8K accuracy tests for LagunaXS on B200/GB200/B300 by @DomBrown in #14580
  • [TRTLLMINF-106][infra] Use B300 frontend platforms by @mlefeb01 in #14581
  • [None] [refactor] Unify compressed-tensors quant config parsing by @DomBrown in #14468
  • [None][feat] AutoDeploy push the rope buffer to later stage by @nvchenghaoz in #13859
  • [https://nvbugs/6215736][infra] Unwaive test_fp8_blockscale[throughput_mtp] by @bobboli in #14541
  • [https://nvbugs/6175923][test] Revert gpt_oss_20b perf MoE-backend pin by @ruodil in #14612
  • [https://nvbugs/6221621][test] Update trust_remote to nemotron and phi4 models by @yufeiwu-nv in #14570
  • [None][chore] update VisualGen codeowner settings by @zhenhuaw-me in #14530
  • [None][infra] Waive 8 failed cases for main in post-merge 2738 by @ZhanruiSunCh in #14615
  • [None][perf] Fuse FlashInfer GDN prefill state I/O into Triton kernels by @nv-guomingz in #14548
  • [https://nvbugs/6164924][fix] Lower free_gpu_memory_fraction for Exaone tests by @tensorrt-cicd in #14486
  • [https://nvbugs/6163033][fix] Guard q_a_proj.weight dict access behind nvfp4_fused_a; update test to `chec by @tensorrt-cicd in #14033
  • [None][fix] Bypass FlashInfer SSD prefill to fix state dtype precision by @tijyojwad in #14600
  • [None][fix] Exclude Qwen3 VL vision model from quantization by @2ez4bz in #12851
  • [https://nvbugs/6162860][fix] Set free_gpu_memory_fraction=0.6 only when torch_compile=True for test_bfloat16_ by @tensorrt-cicd in #14109
  • [None][chore] Remove one-warp-per-token policy from MoE A2A kernels by @bobboli in #14550
  • [None][test] Waive 7 failed cases for main in QA CI by @xinhe-nv in #14498
  • [None][doc] Add CUTLASS DSL uninstall step to installation guide by @yihwang-nv in #14621
  • [https://nvbugs/6099723][fix] Gate supports_mnnvl() False on SM120/121 in _mnnvl_utils.py and add the same Mnn by @tensorrt-cicd in #14424
  • [https://nvbugs/6114464][fix] Add kv_cache_config to TestQwen3VL_MOE::test_auto_dtype by @tensorrt-cicd in #13668
  • [None][chore] Update flashinfer-python from 0.6.12rc1 to 0.6.12rc2 by @yihwang-nv in #14607
  • [https://nvbugs/6109750][test] Unwaive passing GPTOSS tests by @dongfengy in #14596
  • [https://nvbugs/6215690][fix] AutoDeploy: FlashInfer 128-byte alignment for Mamba inputs (also addresses nvbugs/6162114) by @galagam in #14535
  • [None][docs] Add deprecation notice to legacy support-matrix.md by @fuergaosi233 in #14495
  • [None][fix] make FA4 proper pip dependency by @o-stoner in #13788
  • [TRTLLM-12648][test] add disagg cancellation stress-test harness skeleton by @chienchunhung in #14375
  • [None][feat] support non-divisible EP in MoE alltoall and slurm benchmark by @JacobHu-NV in #13888
  • [https://nvbugs/6094100][fix] add ucx tls env in disagg related tests by @chuangz0 in #14626
  • [None][test] Update stress tests by @xinhe-nv in #14454
  • [None][chore] Bump version to 1.3.0rc17 by @VALLIS-NERIA in #14657
  • [None][infra] Fix hang when generating report by @EmmaQiaoCh in #14625
  • [None][infra] Update blossom-ci allowlist: add nv-anants, guqiqi, jonghyunchoe, belgarten-nv by @yiqingy0 in #14662
  • [https://nvbugs/6115560][fix] catch OSError in config_file_lock for NFS compatibility by @sara4dev in #11960
  • [None][fix] Fix OSRB source header and provenance issues in AutoDeploy modeling code by @bmarimuthu-nv in #14670
  • [https://nvbugs/6185173][fix] Set mamba ssm cache to fp32 for NemotronV2 by @tensorrt-cicd in #14448
  • [https://nvbugs/5800725][fix] Restore Mistral Large 3 text-only processor by @byshiue in #14248
  • [None][test] Unwaive fp8 blockscale baseline mtp1 by @sunnyqgg in #14666
  • [None][docs] fix incorrect auto sampler behavior description for beam search by @fuergaosi233 in #14487
  • [None][feat] Expose host/GPU per-iter time and clarify iter labeling in /metrics by @eopXD in #14127
  • [None][chore] Add test lists with multi-gpu test to CI multi-gpu test trigger files by @pengbowang-nv in #14087
  • [None][refactor] Add derived properties for the thop.attention call site by @yuxianq in #14279
  • [https://nvbugs/6160085][fix] At tensorrt_llm/tokenizer/tokenizer.py import time, re-export `bytes_to_unicod by @tensorrt-cicd in #14116
  • [None][infra] Waive 1 failed cases for main in post-merge 2740 by @ZhanruiSunCh in #14688
  • [https://nvbugs/6221483][fix] Revert auto_deploy _mamba_ssm_prepare_metadata to pre-#13566 state by @greg-kwasniewski1 in #14640
  • [TRTLLM-12762][fix] Enable multi-node TP for MiniMax-M2 by @pcicotti in #14314
  • [https://nvbugs/6043248][fix] Validate tensor payload size on deserialization by @yibinl-nvidia in #14648
  • [None][perf] Add AutoDeploy NVFP4 RMSNorm quant fusion by @tcherckez-nvidia in #14361
  • [None][feat] Support Gemma4 multi-head_dim pools and host-side slicing to provide local view to Triton kernels for SWA by @eopXD in #13745
  • [TRTLLM-13960][test] Offline equivalence test for sharding IR by @greg-kwasniewski1 in #13963
  • [TRTLLM-11410][feat] MoT World Model Support by @NVShreyas in #14012
  • [https://nvbugs/6115036][fix] Fix NVFP4 engine size estimation and attention DP batch size in trtllm-bench by @hyukn in #13498
  • [https://nvbugs/5972776][fix] Pass IPC HMAC key through file descriptor by @yibinl-nvidia in #14378
  • [https://nvbugs/5911594][fix] Restrict HTTP cluster storage to loopback by @yibinl-nvidia in #14161
  • [None][fix] Exclude post-merge stages from CBTS force-keep filters by @achartier in #14594
  • [TRTLLMINF-67][infra] use pre-configured idle GPU exemption by @tburt-nv in #14587
  • [https://nvbugs/6207749][fix] Replace the spec with onnx>=1.21.0 in requirements.txt; mirror in `security_ by @tensorrt-cicd in #14577
  • [https://nvbugs/6185480][fix] Autodeploy skip the GLM accuracy test for pre-hopper by @nvchenghaoz in #14656
  • [https://nvbugs/6165866][infra] Waive 1 failed cases for main in pre-merge 40081 by @taylor-yb-lee in #14653
  • [https://nvbugs/6187185][fix] Apply the existing low_memory_overrides() helper in `TestNemotronV2.test_auto_ by @tensorrt-cicd in #14584
  • [https://nvbugs/6192201][fix] AutoDeploy: unwaive llama perf test and increase its concurrency to 256 by @MrGeva in #14691
  • [TRTLLM-12982][feat] improve attention backend selection by @ixlmar in #14635
  • [None][test] Enable test for kv_cache_manager_v2 for A10 by @lowsfer in #12885
  • [None][infra] Generate json with cmake fetched contents in build stage by @yuanjingx87 in #13607
  • [TRTLLM-12436][feat] visual_gen: add CuTe DSL attention via exported binaries by @xrq-phys in #13721
  • [https://nvbugs/6229221][fix] Add a reasoning parser for qwen3_5 by @moraxu in #14659
  • [None][fix] Reuse batch_indices_cuda across CUDA graph captures in EAGLE3 by @achartier in #14381
  • [https://nvbugs/6084447][fix] Fix MoE DeepGEMM workspace size with attention_dp by @tensorrt-cicd in #13310
  • [https://nvbugs/6189416][fix] Add a Blackwell-specific reference entry (extra_acc_spec=sm100_fp8, accuracy=46. by @tensorrt-cicd in #14484
  • [#14619][perf] AutoDeploy: tune Llama-3.1-8B-Instruct-FP8 TP=2/4 config and handle CG max bs when it is unset in the yaml by @MrGeva in #14622
  • [#13561][feat] AutoDeploy: enable MLIR elementwise fusion and trtllm_gen MoE on Nano NVFP4 by @MrGeva in #14554
  • [https://nvbugs/6221841][fix] Detect via the raw config_dict whether the user actually set a top-level rope_th by @tensorrt-cicd in #14624
  • [None][perf] Replace Parakeet audio encoder with native trtllm layers by @aswinvisva in #14474
  • [https://nvbugs/6194552][fix] stabilize Triton Mamba softplus by @hnover-nv in #14652
  • [TRTLLM-13043][chroe] add VisualGen context to AGENTS.md by @zhenhuaw-me in #14732
  • [None][feature] Add thinking token budget control by @tijyojwad in #14665
  • [TRTLLM-13050][test] Remove two-model eagle3 spec-decoding tests by @QiJune in #14735
  • [TRTLLM-12653][feat] LTX-2 Ulysses cross-attention for v2a with audio padding by @luyiyun1021 in #14044
  • [None][refactor] Flatten thop.attention sequence kwargs + rename rotary_embedding_* to rope_* by @yuxianq in #14569
  • [https://nvbugs/6162857][fix] Use generation metrics for VisualGen perf sanity by @taianz-nv in #14176
  • [https://nvbugs/6045177][fix] resolve mypy error by @ixlmar in #14689
  • [None][feat] add Poolside Laguna tool parser by @DomBrown in #14638
  • [TRTLLM-10004][chore] Enable NCCL symmetric zero-copy by default by @nv-lschneider in #14472
  • [None][infra] Fix cbts tokenmacro b64 by @crazydemo in #14718
  • [TRTLLM-12982][perf] remove sync after FlashInfer attention plan() by @ixlmar in #14634
  • [https://nvbugs/6156233][test] unwaive GPT-OSS dflash test since bug has been closed as fix unknown by @dongfengy in #14713
  • [TRTLLM-12901][fix] cap per-rank max_num_active_requests by max_num_tokens under attention DP by @xwang233 in #14481
  • [None][feat] Enable NVFP4 KV cache support in trtllm-gen attention by @yihwang-nv in #12544
  • [https://nvbugs/6185480][fix] autodeploy unwaive the test by @nvchenghaoz in #14716
  • [https://nvbugs/6136737][fix] Propagate external SWA window to FMHA kernel in V2 KV cache by @tensorrt-cicd in #13719
  • [https://nvbugs/5996024][fix] Enforce trust_remote_code flag by @yibinl-nvidia in #13527
  • [https://nvbugs/5979710][fix] Bound transfer destinations by @yibinl-nvidia in #13525
  • [None][fix] Resolve NVML device index mismatch in get_numa_aware_cpu_affinity when CUDA_VISIBLE_DEVICES is set by @YPxHolic in #12985
  • [None][infra] revert #13607 by @tburt-nv in #14757
  • [TRTLLM-12535][chore] Refactor fast path (token ID space) preprocessing logic out to the input preprocessor methods only by @moraxu in #14370
  • [None][infra] Waive 2 failed cases for main in post-merge 2741 by @ZhanruiSunCh in #14737
  • [None][infra] Waive 1 failed cases for main in pre-merge 40562 by @ZhanruiSunCh in #14776
  • [TRTLLM-12440][feat] Add GMS-only weight sharing support by @chienchunhung in #13926
  • [None][test] Unwaive some Perf Tests by @chenfeiz0326 in #14664
  • [https://nvbugs/6204488][fix] Replace fixed disagg fill throttle with slow-start ramp by @chienchunhung in #14475
  • [None][chore] Waive failing multi-gpu test by @brb-nv in #14788
  • [TRTLLM-11408][feat] Add VisualGen TP Support by @belgarten-nv in #13614
  • [None][test] Add TLLM_SPEC_DECODE_FORCE_NUM_ACCEPTED_TOKENS in Spec Decoding Perf Test by @chenfeiz0326 in #14438
  • [https://nvbugs/6165866][infra] Waive 1 failed cases for main in pre-merge 40081 - Fix prefix by @taylor-yb-lee in #14756
  • [https://nvbugs/6196391][fix] Carryover disagg TTFT improvements by @brb-nv in #14719
  • [https://nvbugs/6244695][fix] Revert Pass IPC HMAC key through file descriptor by @chenfeiz0326 in #14782

New Contributors

Full Changelog: v1.3.0rc16...v1.3.0rc17

Don't miss a new TensorRT-LLM release

NewReleases is sending notifications on new releases.