NVIDIA/TensorRT-LLM v1.3.0rc1

Pre-release · 16 hours ago

Highlights

  • Model Support

    • GLM-4.5-Air support (#10653)
    • K-EXAONE MTP support (#10796)
  • API

    • Refactor AutoDeployConfig into LlmArgs (#10613)
    • Support model_kwargs for pytorch backend (#10351; see the sketch after this list)
  • Feature

    • Update disagg slurm scripts (#10712)
    • Re-implement MicroBatchScheduler and CapacityScheduler in Python (#10273)
    • Fix sharding dashboard errors (#10786)
    • Async Transfer Manager (#9891)
    • Speculative One Model: FlashInfer sampling (#10284)
    • Refactor speculative decoding workers (#10768)
    • Use global unique id as disagg request id (#10187)
    • Enable guided decoding with reasoning parsers (#10890; see the sketch after this list)
    • Support partial update weight for fp8 (#10456)
    • Multi-LoRA serving with CUDA Graph (#8279; see the sketch after this list)
    • Support logprobs for Completions API (#10809; see the sketch after this list)
    • Eagle3 Specdec UX improvements (#10124)
    • Python transceiver components (step 2) (#10494)
    • Upgrade NIXL to v0.9.0 (#10896)
    • KV Connector Support for MTP (#10932)
    • Support overlap scheduler for disagg ctx instances (#10755)
    • Adding implementation of KVCacheManagerV2 (#10736)
    • Switch to ConfigurableMoE as the default path (#10792)
  • Fix

    • Enable system memory to transfer active message in NIXL ucx (#10602)
    • Fix the potential misaligned access due to vectorized ld/st instructions in NVLinkOneSided A2A (#10539)
    • Default disable gemm+allreduce fusion (#10656)
    • Fix vulnerability urllib3 and nbconvert (#10551)
    • Fix overlap scheduler race condition (#10610)
    • Replace pickle.load with restricted Unpickler (#10622; see the sketch after this list)
    • Fix copy start_logs in disagg slurm scripts (#10840)
    • Cherry-pick: Disable short profile for tunable ops with MERGE strategy (#10844, #10715)
    • Lock resource to fix potential access to released data (#10827)
    • Cherry-pick: Fix accuracy issue of TWO-SHOT AllReduce kernel (#10841, #10654)
    • Remove weight tensor holder to release memory earlier (#10876)
    • Add missing dist strategy param and fix typo for ad_logger (#10892)
    • Update RMSNorm custom op plumbing (#10843)
    • Fix hmac launch (#10434)
    • Avoid Double update for previous batch (#9888)
    • Re-init TRTLLM sampler to use sample stream in multi-stream cases (#10918)
    • Fix MTP with the async scheduler (#10941)
    • Fix buffer reuse (#10716)
    • Cherry-pick: Fix hanging issue for MNNVL Allreduce under PP (#10750, #10633)
    • Workaround for flashinfer.sampling.sampling_from_logits (#10713)
    • Fix port 8000 being used issue in stress test (#10756)
  • Documentation

    • Clarify LoRA is not supported with --use_fp8_rowwise in Fp8RowwiseAttention (see #2603) (#10320)
    • Add NIXL as a Python attribution (step 4) (#10910)
    • 1.2 Release Notes Headers (#10722)
  • Test & Infra

    • Upload regression info to artifactory (#10599)
    • Add sonarqube scanning in lockfile generation pipeline (#10700)
    • Add Nemotron Nano v3 FP8 autodeploy perf test (#10603)
    • Remove trt flow tests in NIM (#10731)
    • Update config.yaml of slurm scripts to align with submit.py change (#10802)
    • Add a timeout in MNNVL throughput to prevent hangs if one rank crashes (#9532)
    • Trigger multi-gpu tests when install_nixl/ucx.sh is modified (#10624)
    • Add DGX-Spark VLM accuracy and perf spec dec cases (#10804)
    • Fix test list llm_spark_func.txt (#10921)
    • Add test configurable moe module multi gpu (#10699)
    • NVFP4 MoE - Move weights transformation to fusion phase (#10803)
    • Update flashinfer-python to 0.6.1 (#10872)
    • Improve disagg acc tests (#10833)
    • Refine placement group in ray executor (#10235)
    • Regenerate outdated lock file (#10940)
    • Remove long-running sanity check tests on GH200 (#10924, #10969)
    • Add dgx-spark beta notes (#10766)
    • Modify ctx config in 128k8k disagg cases (#10779)
    • Balanced random MoE workload generator for CuteDSL kernel UT, autotuner and layerwise benchmark (#10279)
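
The `model_kwargs` addition (#10351) is worth a sketch. Below is a minimal, assumed usage with the LLM API; the checkpoint and the `attn_implementation` key are placeholders rather than values taken from the PR:

```python
# Sketch only, assuming #10351 forwards these kwargs to the underlying
# PyTorch model; the checkpoint and kwarg key are placeholders.
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",      # placeholder checkpoint
    model_kwargs={"attn_implementation": "sdpa"},  # assumed pass-through kwarg
)

for output in llm.generate(["Hello, my name is"]):
    print(output.outputs[0].text)
```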
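
Next, a sketch of guided decoding through the LLM API, which #10890 now enables alongside reasoning parsers (so a schema can constrain the final answer rather than the reasoning text). The model name, backend choice, and schema are illustrative, and the server-side reasoning-parser flags are not shown:

```python
# Hedged sketch: guided decoding via the LLM API. #10890 makes this
# compatible with reasoning parsers; exact server flags are not shown.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.sampling_params import GuidedDecodingParams

schema = '{"type": "object", "properties": {"answer": {"type": "string"}}}'

llm = LLM(model="Qwen/Qwen3-8B", guided_decoding_backend="xgrammar")  # placeholders
params = SamplingParams(guided_decoding=GuidedDecodingParams(json=schema))

print(llm.generate(["Reply in JSON: what is 2+2?"], params)[0].outputs[0].text)
```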
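
For multi-LoRA serving with CUDA Graph (#8279), the change is to the execution path rather than the API; a minimal serving sketch under that assumption follows. The import locations, adapter names, and paths are placeholders modeled on the llm-api examples and may differ across versions:

```python
# Assumed multi-LoRA usage of the LLM API; #8279 should make this path
# CUDA Graph-compatible. Adapter names/paths are placeholders, and the
# import locations may vary between TensorRT-LLM versions.
from tensorrt_llm import LLM
from tensorrt_llm.executor import LoRARequest
from tensorrt_llm.lora_manager import LoraConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    lora_config=LoraConfig(lora_dir=["/path/to/adapter_a"], max_lora_rank=8),
)

outputs = llm.generate(
    ["Translate to French: cheese"],
    lora_request=LoRARequest("adapter_a", 0, "/path/to/adapter_a"),
)
print(outputs[0].outputs[0].text)
```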
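
Since the Completions API now accepts `logprobs` (#10809), a client-side sketch against a trtllm-serve OpenAI-compatible endpoint may help; the URL, API key, and model name are placeholders:

```python
# Client-side sketch for the Completions API `logprobs` support (#10809).
# Endpoint, key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

resp = client.completions.create(
    model="my-served-model",           # placeholder
    prompt="The capital of France is",
    max_tokens=8,
    logprobs=2,                        # top-2 log-probabilities per token
)
print(resp.choices[0].text)
print(resp.choices[0].logprobs.token_logprobs)
```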
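
Finally, the restricted-Unpickler hardening (#10622) follows a standard Python pattern; the snippet below is an illustrative version of that pattern, not the exact TensorRT-LLM code, with a deliberately tiny allow-list:

```python
# Illustrative restricted unpickling (the pattern behind #10622, not the
# exact TRT-LLM code): only allow-listed globals can be resolved, so an
# untrusted payload cannot import arbitrary callables.
import io
import pickle

ALLOWED = {("builtins", "dict"), ("builtins", "list"), ("builtins", "str")}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if (module, name) not in ALLOWED:
            raise pickle.UnpicklingError(f"blocked global: {module}.{name}")
        return super().find_class(module, name)

def restricted_loads(data: bytes):
    """Drop-in replacement for pickle.loads with an allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()

assert restricted_loads(pickle.dumps({"ok": ["yes"]})) == {"ok": ["yes"]}
```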

What's Changed

  • [#10696][fix] AutoDeploy prevent torch.export from specializing batch dimension when max_batch_size=1 by @MrGeva in #10697
  • [None][infra] Add sonarqube scanning in lockfile generation pipeline by @yuanjingx87 in #10700
  • [https://nvbugs/5769712][fix] fix timeout in AutoDeploy llama accuracy test by @lucaslie in #10461
  • [#10688][fix] AutoDeploy Fix CUDA graph batch sizes exceeding max_batch_size by @MrGeva in #10687
  • [#10642][feat] AutoDeploy: optimized canonicalize_graph utilities [1/2] by @lucaslie in #10675
  • [https://nvbugs/5769890][fix] enable system memory to transfer active message in NIXL ucx by @chuangz0 in #10602
  • [https://nvbugs/5814247][fix] unwaive AutoDeploy multi-gpu unit tests by @lucaslie in #10769
  • [TRTLLM-10300][feat] Upload regression info to artifactory by @chenfeiz0326 in #10599
  • [None][chore] Add release/1.2 branch into lockfile generation schedule by @yiqingy0 in #10790
  • [TRTLLM-9581][infra] Use /home/scratch.trt_llm_data_ci in computelab by @ZhanruiSunCh in #10616
  • [None][infra] Waive failed cases for main on 01/19 by @EmmaQiaoCh in #10794
  • [#10607][chore] Add Nemotron Nano v3 FP8 autodeploy perf test by @MrGeva in #10603
  • [None][feat] Update disagg slurm scripts by @qiaoxj07 in #10712
  • [None][test] adjust the dis-agg test timeout threshold by @Shixiaowei02 in #10800
  • [None][chore] docs: clarify LoRA is not supported with --use_fp8_rowwise in Fp8RowwiseAttention (see #2603) by @ssam18 in #10320
  • [None][chore] Remove trt flow tests in NIM by @jieli-matrix in #10731
  • [None][chore] update config.yaml of slurm scripts to align with submit.py change by @dc3671 in #10802
  • [https://nvbugs/5776445][chore] unwaive test by @reasonsolo in #10667
  • [TRTLLM-10029][scheduler] Re-implement MicroBatchScheduler and CapacityScheduler in Python by @lancelly in #10273
  • [TRTLLM-10296][fix] Fix the potential misaligned access due to vectorized ld/st instructions in NVLinkOneSided A2A. by @bobboli in #10539
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10776
  • [None][fix] default disable gemm+allreduce fusion by @benzh-2025 in #10656
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10787
  • [None][fix] Fix vulnerability urllib3 and nbconvert by @yiqingy0 in #10551
  • [None][test] Update sanity test list by @xinhe-nv in #10825
  • [None][fix] Remove unused params in attn by @yizhang-nv in #10652
  • [TRTLLM-10785][feat] Fix sharding dashboard errors by @greg-kwasniewski1 in #10786
  • [https://nvbugs/5701445][chore] unwaive test. by @yuxianq in #10806
  • [None][infra] trigger multi-gpu tests when install_nixl/ucx.sh is modified by @bo-nv in #10624
  • [None][infra] Waive failed cases for main branch on 01/20 by @EmmaQiaoCh in #10829
  • [None][chore] Reduce tedious logs by @chzblych in #10847
  • [#10707][fix] AutoDeploy: Super accuracy test fixes by @galagam in #10717
  • [None][chore] Async Transfer Manager by @jthomson04 in #9891
  • [None][fix] fix duplicate entry in waives.txt by @lucaslie in #10853
  • [None][feat] Speculative One Model: FlashInfer sampling by @IzzyPutterman in #10284
  • [https://nvbugs/5670108][fix] Fix overlap scheduler race condition in… by @SimengLiu-nv in #10610
  • [https://nvbugs/5760737][test] only skip mooncake+indexerkcache test by @zhengd-nv in #10266
  • [https://nvbugs/5759698][fix] unwaive test_base_worker by @Superjomn in #10669
  • [None][fix] Add a timeout in MNNVL throughput to prevent hangs if one rank crashes by @djns99 in #9532
  • [https://nvbugs/5670458][chore] Unwaive reward model test by @shuyixiong in #10831
  • [None][chore] Revert #10847 by @chzblych in #10869
  • [https://nvbugs/5775021][fix] Replace pickle.load with restricted Unpickler by @yibinl-nvidia in #10622
  • [None][fix] Fix copy start_logs in disagg slurm scripts by @qiaoxj07 in #10840
  • [None][fix] Cherry-pick #10715: Disable short profile for tunable ops with MERGE strategy by @hyukn in #10844
  • [https://nvbugs/5740377][fix] Lock resource to fix potential access to released data by @HuiGao-NV in #10827
  • [https://nvbugs/5814253][fix] unwaive test_autotuner_distributed_strategy tests by @hyukn in #10793
  • [None][chore] switch to ConfigurableMoE as the default path by @xxi-nv in #10792
  • [None][infra] Waive failed cases for main branch on 01/21 by @EmmaQiaoCh in #10882
  • [https://nvbugs/5636916][fix] Cherry-pick #10654: Fix accuracy issue of TWO-SHOT AllReduce kernel by @hyukn in #10841
  • [None][chore] unwaive qwen3 235B accuracy test by @kris1025 in #10493
  • [TRTLLM-10325][feat] Refactor speculative decoding workers by @cascade812 in #10768
  • [None][infra] Fix SonarQube job hang by creating the Jenkins home folder if it does not exist by @yuanjingx87 in #10830
  • [https://nvbugs/5816267][fix] Remove weight tensor holder to release memory earlier by @dongxuy04 in #10876
  • [https://nvbugs/5784543][chore] unwaive test. by @yuxianq in #10835
  • [None][feat] GLM-4.5-Air support by @videodanchik in #10653
  • [TRTLLM-10059][feat] Use global unique id as disagg request id by @reasonsolo in #10187
  • [None][chore] Add DGX-Spark VLM accuracy and perf spec dec cases by @JennyLiu-nv in #10804
  • [None][feat] K-EXAONE MTP support by @yechank-nvidia in #10796
  • [#8241][feat] Support model_kwargs for pytorch backend by @taylor-yb-lee in #10351
  • [TRTLLM-10154][feat] Enable guided decoding with reasoning parsers by @syuoni in #10890
  • [None][fix] Fix waived tests for Nemotron-h models by @Wanli-Jiang in #10758
  • [TRTLLM-9771][feat] Support partial update weight for fp8 by @shuyixiong in #10456
  • [None][feat] Add KV cache cleanup by @pengbowang-nv in #7439
  • [https://nvbugs/5811159][fix] Unwaive bug 5811159. by @bobboli in #10903
  • [#10838][fix] Add missing dist strategy param and fix typo for ad_logger by @tcherckez-nvidia in #10892
  • [None][ci] Fix test list llm_spark_func.txt by @syuoni in #10921
  • [None][chore] Bump version to 1.3.0rc1 by @yiqingy0 in #10923
  • [None][chore] NVFP4 MoE - Move weights transformation to fusion phase by @tcherckez-nvidia in #10803
  • [https://nvbugs/5741304][chore] Update flashinfer-python to 0.6.1 by @yihwang-nv in #10872
  • [https://nvbugs/5322131][feat] Multi-LoRA serving with CUDA Graph by @JyChang012 in #8279
  • [None][fix] Update RMSNorm custom op plumbing by @JintaoPengCS in #10843
  • [TRTLLM-10388][feat] Support logprobs for Completions API by @LinPoly in #10809
  • [https://nvbugs/5768068][chore] improve disagg acc tests by @bo-nv in #10833
  • [https://nvbugs/5783876][fix] fix hmac launch by @Superjomn in #10434
  • [TRTLLM-10590][feat] Eagle3 Specdec UX improvements by @venkywonka in #10124
  • [TRTLLM-9527][doc] Add NIXL as a Python attribution (step 2) by @Shixiaowei02 in #10910
  • [TRTLLM-9527][feat] Python transceiver components (step 2) by @Shixiaowei02 in #10494
  • [None][fix] Avoid Double update for previous batch by @yizhang-nv in #9888
  • [https://nvbugs/5819002][fix] fix sharding tests by @greg-kwasniewski1 in #10775
  • [#9306][refactor] Refactor AutoDeployConfig into LlmArgs by @2ez4bz in #10613
  • [https://nvbugs/5688721][fix] unwaive NemotronH accuracy test by @lucaslie in #10852
  • [None][infra] Update CI allowlist by @yuanjingx87 in #10936
  • [TRTLLM-9108][feat] Add test configurable moe module multi gpu by @leslie-fang25 in #10699
  • [None][test] Remove unused test list by @StanleySun639 in #10916
  • [None][feat] Upgrade NIXL to v0.9.0 by @zackyoray in #10896
  • [None][infra] Waive a failed case in pre-merge stage by @EmmaQiaoCh in #10948
  • [https://nvbugs/5833795][chore] Waive test test_e2e.py::test_ptp_quickstart_advanced[GPT-OSS-120B-gpt_oss/gpt-oss-120b] by @yihwang-nv in #10953
  • [None][chore] refine placement group in ray executor by @Superjomn in #10235
  • [https://nvbugs/5814215][fix] Unwaive test_trtllm_flashinfer_symbol_collision.py::test_flashinfer_fused_moe_matches_torch_moe by @yihwang-nv in #10930
  • [None][infra] Regenerate outdated lock file by @yuanjingx87 in #10940
  • [https://nvbugs/5707359][fix] Unwaive the test that due to flashinfer… by @liji-nv in #10570
  • [None][feat] AutoDeploy: Enhance memory consumption for MoE fusion transform by @taylor-yb-lee in #10772
  • [None][feat] KV Connector Support for MTP by @jthomson04 in #10932
  • [TRTLLM-10334][feat] Support overlap scheduler for disagg ctx instances by @kaiyux in #10755
  • [None][ci] Remove long-running sanity check tests on GH200 (#10924) by @chzblych in #10969
  • [None][infra] Fix TRT-LLM data scratch mount point for gb10x by @EmmaQiaoCh in #10880
  • [https://nvbugs/5829097][fix] Re-init TRTLLM sampler to use sample stream in multi-stream cases. by @yuxianq in #10918
  • [TRTLLM-7738][feat] Adding implementation of KVCacheManagerV2 by @lowsfer in #10736
  • [None][fix] Fix MTP with the async scheduler by @pcastonguay in #10941
  • [None][chore] Mass integration of release/1.2 by @dominicshanshan in #10888
  • [TRTLLM-10147][perf] Balanced random MoE workload generator for CuteDSL kernel UT, autotuner and layerwise benchmark by @syuoni in #10279
  • [None][test] Waive failed tests on main 1/25 by @chzblych in #10984

Full Changelog: v1.3.0rc0...v1.3.0rc1
