NVIDIA/TensorRT-LLM v1.3.0rc1

Pre-release · 16 hours ago

Highlights

  • Model Support

    • GLM-4.5-Air support (#10653)
    • K-EXAONE MTP support (#10796)
  • API

    • Refactor AutoDeployConfig into LlmArgs (#10613)
    • Support model_kwargs for pytorch backend (#10351; see the sketch after this list)
  • Feature

    • Update disagg slurm scripts (#10712)
    • Re-implement MicroBatchScheduler and CapacityScheduler in Python (#10273)
    • Fix sharding dashboard errors (#10786)
    • Async Transfer Manager (#9891)
    • Speculative One Model: FlashInfer sampling (#10284)
    • Refactor speculative decoding workers (#10768)
    • Use global unique id as disagg request id (#10187)
    • Enable guided decoding with reasoning parsers (#10890; see the sketch after this list)
    • Support partial update weight for fp8 (#10456)
    • Multi-LoRA serving with CUDA Graph (#8279; see the sketch after this list)
    • Support logprobs for Completions API (#10809; see the sketch after this list)
    • Eagle3 Specdec UX improvements (#10124)
    • Python transceiver components (step 2) (#10494)
    • Upgrade NIXL to v0.9.0 (#10896)
    • KV Connector Support for MTP (#10932)
    • Support overlap scheduler for disagg ctx instances (#10755)
    • Adding implementation of KVCacheManagerV2 (#10736)
    • Switch to ConfigurableMoE as the default path (#10792)
  • Fix

    • Enable system memory to transfer active message in NIXL ucx (#10602)
    • Fix the potential misaligned access due to vectorized ld/st instructions in NVLinkOneSided A2A (#10539)
    • Default disable gemm+allreduce fusion (#10656)
    • Fix vulnerability urllib3 and nbconvert (#10551)
    • Fix overlap scheduler race condition (#10610)
    • Replace pickle.load with restricted Unpickler (#10622; see the sketch after this list)
    • Fix copy start_logs in disagg slurm scripts (#10840)
    • Cherry-pick: Disable short profile for tunable ops with MERGE strategy (#10844, #10715)
    • Lock resource to fix potential access to released data (#10827)
    • Cherry-pick: Fix accuracy issue of TWO-SHOT AllReduce kernel (#10841, #10654)
    • Remove weight tensor holder to release memory earlier (#10876)
    • Add missing dist strategy param and fix typo for ad_logger (#10892)
    • Update RMSNorm custom op plumbing (#10843)
    • Fix hmac launch (#10434)
    • Avoid Double update for previous batch (#9888)
    • Re-init TRTLLM sampler to use sample stream in multi-stream cases (#10918)
    • Fix MTP with the async scheduler (#10941)
    • Fix buffer reuse (#10716)
    • Cherry-pick: Fix hanging issue for MNNVL Allreduce under PP (#10750, #10633)
    • Workaround for flashinfer.sampling.sampling_from_logits (#10713)
    • Fix port 8000 being used issue in stress test (#10756)
  • Documentation

    • Clarify LoRA is not supported with --use_fp8_rowwise in Fp8RowwiseAttention (see #2603) (#10320)
    • Add NIXL as a Python attribution (step 4) (#10910)
    • 1.2 Release Notes Headers (#10722)
  • Test & Infra

    • Upload regression info to artifactory (#10599)
    • Add sonarqube scanning in lockfile generation pipeline (#10700)
    • Add Nemotron Nano v3 FP8 autodeploy perf test (#10603)
    • Remove trt flow tests in NIM (#10731)
    • Update config.yaml of slurm scripts to align with submit.py change (#10802)
    • Add a timeout in MNNVL throughput to prevent hangs if one rank crashes (#9532)
    • Trigger multi-gpu tests when install_nixl/ucx.sh is modified (#10624)
    • Add DGX-Spark VLM accuracy and perf spec dec cases (#10804)
    • Fix test list llm_spark_func.txt (#10921)
    • Add test configurable moe module multi gpu (#10699)
    • NVFP4 MoE - Move weights transformation to fusion phase (#10803)
    • Update flashinfer-python to 0.6.1 (#10872)
    • Improve disagg acc tests (#10833)
    • Refine placement group in ray executor (#10235)
    • Regenerate outdated lock file (#10940)
    • Remove long-running sanity check tests on GH200 (#10924, #10969)
    • Add dgx-spark beta notes (#10766)
    • Modify ctx config in 128k8k disagg cases (#10779)
    • Balanced random MoE workload generator for CuteDSL kernel UT, autotuner and layerwise benchmark (#10279)
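
The `model_kwargs` addition (#10351) is worth a sketch. Below is a minimal, assumed usage with the LLM API; the checkpoint and the `attn_implementation` key are placeholders rather than values taken from the PR:

```python
# Sketch only, assuming #10351 forwards these kwargs to the underlying
# PyTorch model; the checkpoint and kwarg key are placeholders.
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",      # placeholder checkpoint
    model_kwargs={"attn_implementation": "sdpa"},  # assumed pass-through kwarg
)

for output in llm.generate(["Hello, my name is"]):
    print(output.outputs[0].text)
```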
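
Next, a sketch of guided decoding through the LLM API, which #10890 now enables alongside reasoning parsers (so a schema can constrain the final answer rather than the reasoning text). The model name, backend choice, and schema are illustrative, and the server-side reasoning-parser flags are not shown:

```python
# Hedged sketch: guided decoding via the LLM API. #10890 makes this
# compatible with reasoning parsers; exact server flags are not shown.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.sampling_params import GuidedDecodingParams

schema = '{"type": "object", "properties": {"answer": {"type": "string"}}}'

llm = LLM(model="Qwen/Qwen3-8B", guided_decoding_backend="xgrammar")  # placeholders
params = SamplingParams(guided_decoding=GuidedDecodingParams(json=schema))

print(llm.generate(["Reply in JSON: what is 2+2?"], params)[0].outputs[0].text)
```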
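
For multi-LoRA serving with CUDA Graph (#8279), the change is to the execution path rather than the API; a minimal serving sketch under that assumption follows. The import locations, adapter names, and paths are placeholders modeled on the llm-api examples and may differ across versions:

```python
# Assumed multi-LoRA usage of the LLM API; #8279 should make this path
# CUDA Graph-compatible. Adapter names/paths are placeholders, and the
# import locations may vary between TensorRT-LLM versions.
from tensorrt_llm import LLM
from tensorrt_llm.executor import LoRARequest
from tensorrt_llm.lora_manager import LoraConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    lora_config=LoraConfig(lora_dir=["/path/to/adapter_a"], max_lora_rank=8),
)

outputs = llm.generate(
    ["Translate to French: cheese"],
    lora_request=LoRARequest("adapter_a", 0, "/path/to/adapter_a"),
)
print(outputs[0].outputs[0].text)
```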
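
Since the Completions API now accepts `logprobs` (#10809), a client-side sketch against a trtllm-serve OpenAI-compatible endpoint may help; the URL, API key, and model name are placeholders:

```python
# Client-side sketch for the Completions API `logprobs` support (#10809).
# Endpoint, key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

resp = client.completions.create(
    model="my-served-model",           # placeholder
    prompt="The capital of France is",
    max_tokens=8,
    logprobs=2,                        # top-2 log-probabilities per token
)
print(resp.choices[0].text)
print(resp.choices[0].logprobs.token_logprobs)
```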
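
Finally, the restricted-Unpickler hardening (#10622) follows a standard Python pattern; the snippet below is an illustrative version of that pattern, not the exact TensorRT-LLM code, with a deliberately tiny allow-list:

```python
# Illustrative restricted unpickling (the pattern behind #10622, not the
# exact TRT-LLM code): only allow-listed globals can be resolved, so an
# untrusted payload cannot import arbitrary callables.
import io
import pickle

ALLOWED = {("builtins", "dict"), ("builtins", "list"), ("builtins", "str")}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) not in ALLOWED:
            raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
        return super().find_class(module, name)

def restricted_loads(data: bytes):
    """Drop-in replacement for pickle.loads with an allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

assert restricted_loads(pickle.dumps({"ok": ["yes"]})) == {"ok": ["yes"]}
```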

What's Changed

  • [#10696][fix] AutoDeploy prevent torch.export from specializing batch dimension when max_batch_size=1 by @MrGeva in #10697
  • [None][infra] Add sonarqube scanning in lockfile generation pipeline by @yuanjingx87 in #10700
  • [https://nvbugs/5769712][fix] fix timeout in AutoDeploy llama accuracy test by @lucaslie in #10461
  • [#10688][fix] AutoDeploy Fix CUDA graph batch sizes exceeding max_batch_size by @MrGeva in #10687
  • [#10642][feat] AutoDeploy: optimized canonicalize_graph utilities [1/2] by @lucaslie in #10675
  • [https://nvbugs/5769890][fix] enable system memory to transfer active message in NIXL ucx by @chuangz0 in #10602
  • [https://nvbugs/5814247][fix] unwaive AutoDeploy multi-gpu unit tests by @lucaslie in #10769
  • [TRTLLM-10300][feat] Upload regression info to artifactory by @chenfeiz0326 in #10599
  • [None][chore] Add release/1.2 branch into lockfile generation schedule by @yiqingy0 in #10790
  • [TRTLLM-9581][infra] Use /home/scratch.trt_llm_data_ci in computelab by @ZhanruiSunCh in #10616
  • [None][infra] Waive failed cases for main on 01/19 by @EmmaQiaoCh in #10794
  • [#10607][chore] Add Nemotron Nano v3 FP8 autodeploy perf test by @MrGeva in #10603
  • [None][feat] Update disagg slurm scripts by @qiaoxj07 in #10712
  • [None][test] adjust the dis-agg test timeout threshold by @Shixiaowei02 in #10800
  • [None][chore] docs: clarify LoRA is not supported with --use_fp8_rowwise in Fp8RowwiseAttention (see #2603) by @ssam18 in #10320
  • [None][chore] Remove trt flow tests in NIM by @jieli-matrix in #10731
  • [None][chore] update config.yaml of slurm scripts to align with submit.py change by @dc3671 in #10802
  • [https://nvbugs/5776445][chore] unwaive test by @reasonsolo in #10667
  • [TRTLLM-10029][scheduler] Re-implement MicroBatchScheduler and CapacityScheduler in Python by @lancelly in #10273
  • [TRTLLM-10296][fix] Fix the potential misaligned access due to vectorized ld/st instructions in NVLinkOneSided A2A. by @bobboli in #10539
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10776
  • [None][fix] default disable gemm+allreduce fusion by @benzh-2025 in #10656
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10787
  • [None][fix] Fix vulnerability urllib3 and nbconvert by @yiqingy0 in #10551
  • [None][test] Update sanity test list by @xinhe-nv in #10825
  • [None][fix] Remove unused params in attn by @yizhang-nv in #10652
  • [TRTLLM-10785][feat] Fix sharding dashboard errors by @greg-kwasniewski1 in #10786
  • [https://nvbugs/5701445][chore] unwaive test. by @yuxianq in #10806
  • [None][infra] trigger multi-gpu tests when install_nixl/ucx.sh is modified by @bo-nv in #10624
  • [None][infra] Waive failed cases for main branch on 01/20 by @EmmaQiaoCh in #10829
  • [None][chore] Reduce tedious logs by @chzblych in #10847
  • [#10707][fix] AutoDeploy: Super accuracy test fixes by @galagam in #10717
  • [None][chore] Async Transfer Manager by @jthomson04 in #9891
  • [None][fix] fix duplicate entry in waives.txt by @lucaslie in #10853
  • [None][feat] Speculative One Model: FlashInfer sampling by @IzzyPutterman in #10284
  • [https://nvbugs/5670108][fix] Fix overlap scheduler race condition in… by @SimengLiu-nv in #10610
  • [https://nvbugs/5760737][test] only skip mooncake+indexerkcache test by @zhengd-nv in #10266
  • [https://nvbugs/5759698][fix] unwaive test_base_worker by @Superjomn in #10669
  • [None][fix] Add a timeout in MNNVL throughput to prevent hangs if one rank crashes by @djns99 in #9532
  • [https://nvbugs/5670458][chore] Unwaive reward model test by @shuyixiong in #10831
  • [None][chore] Revert #10847 by @chzblych in #10869
  • [https://nvbugs/5775021][fix] Replace pickle.load with restricted Unpickler by @yibinl-nvidia in #10622
  • [None][fix] Fix copy start_logs in disagg slurm scripts by @qiaoxj07 in #10840
  • [None][fix] Cherry-pick #10715: Disable short profile for tunable ops with MERGE strategy by @hyukn in #10844
  • [https://nvbugs/5740377][fix] Lock resource to fix potential access to released data by @HuiGao-NV in #10827
  • [https://nvbugs/5814253][fix] unwaive test_autotuner_distributed_strategy tests by @hyukn in #10793
  • [None][chore] switch to ConfigurableMoE as the default path by @xxi-nv in #10792
  • [None][infra] Waive failed cases for main branch on 01/21 by @EmmaQiaoCh in #10882
  • [https://nvbugs/5636916][fix] Cherry-pick #10654: Fix accuracy issue of TWO-SHOT AllReduce kernel by @hyukn in #10841
  • [None][chore] unwaive qwen3 235B accuracy test by @kris1025 in #10493
  • [TRTLLM-10325][feat] Refactor speculative decoding workers by @cascade812 in #10768
  • [None][infra] Fix SonarQube job hang by creating the Jenkins home folder if it does not exist by @yuanjingx87 in #10830
  • [https://nvbugs/5816267][fix] Remove weight tensor holder to release memory earlier by @dongxuy04 in #10876
  • [https://nvbugs/5784543][chore] unwaive test. by @yuxianq in #10835
  • [None][feat] GLM-4.5-Air support by @videodanchik in #10653
  • [TRTLLM-10059][feat] Use global unique id as disagg request id by @reasonsolo in #10187
  • [None][chore] Add DGX-Spark VLM accuracy and perf spec dec cases by @JennyLiu-nv in #10804
  • [None][feat] K-EXAONE MTP support by @yechank-nvidia in #10796
  • [#8241][feat] Support model_kwargs for pytorch backend by @taylor-yb-lee in #10351
  • [TRTLLM-10154][feat] Enable guided decoding with reasoning parsers by @syuoni in #10890
  • [None][fix] Fix waived tests for Nemotron-h models by @Wanli-Jiang in #10758
  • [TRTLLM-9771][feat] Support partial update weight for fp8 by @shuyixiong in #10456
  • [None][feat] Add KV cache cleanup by @pengbowang-nv in #7439
  • [https://nvbugs/5811159][fix] Unwaive bug 5811159. by @bobboli in #10903
  • [#10838][fix] Add missing dist strategy param and fix typo for ad_logger by @tcherckez-nvidia in #10892
  • [None][ci] Fix test list llm_spark_func.txt by @syuoni in #10921
  • [None][chore] Bump version to 1.3.0rc1 by @yiqingy0 in #10923
  • [None][chore] NVFP4 MoE - Move weights transformation to fusion phase by @tcherckez-nvidia in #10803
  • [https://nvbugs/5741304][chore] Update flashinfer-python to 0.6.1 by @yihwang-nv in #10872
  • [https://nvbugs/5322131][feat] Multi-LoRA serving with CUDA Graph by @JyChang012 in #8279
  • [None][fix] Update RMSNorm custom op plumbing by @JintaoPengCS in #10843
  • [TRTLLM-10388][feat] Support logprobs for Completions API by @LinPoly in #10809
  • [https://nvbugs/5768068][chore] improve disagg acc tests by @bo-nv in #10833
  • [https://nvbugs/5783876][fix] fix hmac launch by @Superjomn in #10434
  • [TRTLLM-10590][feat] Eagle3 Specdec UX improvements by @venkywonka in #10124
  • [TRTLLM-9527][doc] Add NIXL as a Python attribution (step 2) by @Shixiaowei02 in #10910
  • [TRTLLM-9527][feat] Python transceiver components (step 2) by @Shixiaowei02 in #10494
  • [None][fix] Avoid Double update for previous batch by @yizhang-nv in #9888
  • [https://nvbugs/5819002][fix] fix sharding tests by @greg-kwasniewski1 in #10775
  • [#9306][refactor] Refactor AutoDeployConfig into LlmArgs by @2ez4bz in #10613
  • [https://nvbugs/5688721][fix] unwaive NemotronH accuracy test by @lucaslie in #10852
  • [None][infra] Update CI allowlist by @yuanjingx87 in #10936
  • [TRTLLM-9108][feat] Add test configurable moe module multi gpu by @leslie-fang25 in #10699
  • [None][test] Remove unused test list by @StanleySun639 in #10916
  • [None][feat] Upgrade NIXL to v0.9.0 by @zackyoray in #10896
  • [None][infra] Waive a failed case in pre-merge stage by @EmmaQiaoCh in #10948
  • [https://nvbugs/5833795][chore] Waive test test_e2e.py::test_ptp_quickstart_advanced[GPT-OSS-120B-gpt_oss/gpt-oss-120b] by @yihwang-nv in #10953
  • [None][chore] refine placement group in ray executor by @Superjomn in #10235
  • [https://nvbugs/5814215][fix] Unwaive test_trtllm_flashinfer_symbol_collision.py::test_flashinfer_fused_moe_matches_torch_moe by @yihwang-nv in #10930
  • [None][infra] Regenerate outdated lock file by @yuanjingx87 in #10940
  • [https://nvbugs/5707359][fix] Unwaive the test that due to flashinfer… by @liji-nv in #10570
  • [None][feat] AutoDeploy: Enhance memory consumption for MoE fusion transform by @taylor-yb-lee in #10772
  • [None][feat] KV Connector Support for MTP by @jthomson04 in #10932
  • [TRTLLM-10334][feat] Support overlap scheduler for disagg ctx instances by @kaiyux in #10755
  • [None][ci] Remove long-running sanity check tests on GH200 (#10924) by @chzblych in #10969
  • [None][infra] Fix TRT-LLM data scratch mount point for gb10x by @EmmaQiaoCh in #10880
  • [https://nvbugs/5829097][fix] Re-init TRTLLM sampler to use sample stream in multi-stream cases. by @yuxianq in #10918
  • [TRTLLM-7738][feat] Adding implementation of KVCacheManagerV2 by @lowsfer in #10736
  • [None][fix] Fix MTP with the async scheduler by @pcastonguay in #10941
  • [None][chore] Mass integration of release/1.2 by @dominicshanshan in #10888
  • [TRTLLM-10147][perf] Balanced random MoE workload generator for CuteDSL kernel UT, autotuner and layerwise benchmark by @syuoni in #10279
  • [None][test] Waive failed tests on main 1/25 by @chzblych in #10984

Full Changelog: v1.3.0rc0...v1.3.0rc1
