NVIDIA/TensorRT-LLM v1.3.0rc0 (pre-release)

Highlights

  • Model Support

    • Added support for K-EXAONE models (#10355)
    • Integrated MiniMax M2 model (#10532)
    • Added Spark QA functional and performance test cases (#10564)
    • Added support for new Transformers RoPE configuration format (#10636)
    • Added support for a customized sequence length larger than the model config (#10600); see the sketch after the Highlights list
  • API Improvements

    • Added support for image_embeds in OpenAI API (#9715)
    • Covered LLM API multi_modal_embeddings (#9963)
    • Implemented GET/DELETE v1/responses/{response_id} endpoints (#9937); see the sketch after the Highlights list
    • Used RequestError for validation errors to prevent engine shutdown (#9761)
  • Performance Optimizations

    • Added Hopper XQA decode support for skip softmax attention (#10264)
    • Enabled attention data parallelism for Nemotron Super v3 (#10347)
    • Added fp4 GEMM with AllReduce support (#9729)
    • Enabled the XQA JIT implementation by default, with a sliding-window performance optimization (#10335)
    • Reduced host overhead for unified nvfp4 GEMM tuning path (#10503)
    • Implemented fused Triton kernel for e8m0 resmooth to reduce memory footprint (#10327)
  • MoE (Mixture of Experts) Enhancements

    • Added ExpertStatistic and DUMMY_ALLREDUCE for configurable MoE (#10401)
    • Added a test for the configurable MoE module (#10575)
    • Implemented empty-chunk padding for the configurable MoE (#10451)
    • Enabled EPLB for DEEPGEMM (#10617)
    • Extended MoE quantization test utilities with comprehensive quant algorithm support (#10691)
  • Disaggregation Features

    • New request states and KV cache transceiver APIs in generation-first disaggregation (#10406)
    • Fixed cancellation with chunked prefill and disaggregation (#10111)
  • Auto Deploy

    • Refactored memory usage logging in AutoDeploy (#8505)
    • Separated RMS pattern detection from fusion (#9969)
    • Added automatic download of speculative models from Hugging Face for the PyTorch backend (#10099)
  • Fixes

    • Fixed PP loop hang caused by i-sending new requests (#10665)
    • Avoided write-write race for async PP send (#10488)
    • Fixed hang issue when enabling skip softmax on Blackwell (#10490)
    • Fixed hanging issue for MNNVL Allreduce under PP (#10633)
    • Implemented PP skip forward for all spec workers (#10578)
    • Added a warning for the generation-only paused state (#10664)
    • Used uint64_t as dtype of lamport_buffer_size to avoid overflow (#10499)
    • Fixed HelixCpMnnvlMemory initialization with PP (#10533)
    • Fixed regression in KV cache resize memory estimation (#10726)
    • Prevented out-of-bounds read (#9879)
    • Solved pillow version conflict (#10537)
    • Added support for parsing the modules_to_not_convert keyword in the HF model config (#10527)
    • Used correct model names for config database regression tests (#10192)
    • Support GuidedDecoder with sharded logits (#10698)
    • Fixed Piecewise CUDA Graph for GPTOSS (#10631)
    • Fixed AutoDeploy EP sharding test (#10460)
    • Fixed the nvfp4 fused_moe in AutoDeploy (#10727)
    • Added quantization check for DeepEP LL low precision combine in new MoE comm API (#10072)
    • Fixed AIPerf issue (#10666)
    • Disabled TinyGEMM PDL due to accuracy issues (WAR) (#10619)
    • Kept only a limited amount of performance statistics data (#10569)
    • Converted to a CUDA tensor before calling _resmooth_kernel (#10770)
  • Test & Infra

    • Added hang detection for executor loop and worker (#10480)
    • Implemented a bot to send performance-regression messages to a Slack channel (#10489)
    • Made model initialization more general and supported weight loading in layer-wise benchmarks (#10562)
    • Updated trtllm-gen to support groupsTokensHeadsQ (#10261)
    • Added support to export data in trtllm-eval (#10075)
    • Added Torch extension API for FusedAddRMSNormQuant kernel (#9905)
    • Enabled ray tests (#10272)
    • Prevented flaky failures in C++ test_e2e.py by using local cached datasets (#10638)
    • Enabled partial reuse in Gemma and GPT OSS test (#10559)
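
For illustration, a minimal sketch of requesting a custom sequence length through the LLM API. It assumes the max_seq_len argument of tensorrt_llm.LLM is the setting extended by #10600; the checkpoint name and length below are placeholders, not values taken from this release.

```python
# Minimal sketch, not an official example. Assumptions: the LLM API's
# max_seq_len argument is the setting extended by #10600; the checkpoint
# and sequence length are placeholders.
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    max_seq_len=131072,  # customized length, possibly larger than the model config default
)
outputs = llm.generate(["Hello, TensorRT-LLM!"])
print(outputs[0].outputs[0].text)
```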
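
A minimal sketch of exercising the new response endpoints against an OpenAI-compatible server. The host, port, and response ID are assumptions; only the GET/DELETE v1/responses/{response_id} routes come from #9937.

```python
# Minimal sketch, not an official example. Assumptions: an OpenAI-compatible
# server at localhost:8000 and a response ID returned by an earlier
# /v1/responses request.
import requests

BASE_URL = "http://localhost:8000/v1"
response_id = "resp_abc123"  # hypothetical ID from a prior Responses API call

# Retrieve a stored response by ID (GET v1/responses/{response_id}).
resp = requests.get(f"{BASE_URL}/responses/{response_id}")
print(resp.status_code, resp.json() if resp.ok else resp.text)

# Delete the stored response once it is no longer needed (DELETE v1/responses/{response_id}).
resp = requests.delete(f"{BASE_URL}/responses/{response_id}")
print(resp.status_code)
```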

What's Changed

  • [TRTLLM-10195][feat] K-EXAONE support by @yechank-nvidia in #10355
  • [None][test] update core test list by @crazydemo in #10538
  • [#8391][chore] removed llama and added deepseek to AutoDeploy's L0 perf test by @MrGeva in #10585
  • [TRTLLM-10022][feat] Add hopper xqa decode support for skip softmax attention by @pengbowang-nv in #10264
  • [None][chore] update waive list by @jieli-matrix in #10577
  • [None][feat] Add ExpertStatistic and DUMMY_ALLREDUCE for configurable_moe by @qiaoxj07 in #10401
  • [TRTLLM-10248][feat] Support Bot to Send Perf Regression Msg to Slack Channel by @chenfeiz0326 in #10489
  • [None][chore] update deepseekv3.2 test parameter by @yingguo-trt in #10595
  • [None][test] Remove most TRT-backend test cases in llm_perf_nim.yml by @yufeiwu-nv in #10572
  • [https://nvbugs/5794796][chore] waive test blocking premerge by @dc3671 in #10593
  • [None][fix] Solve pillow version conflict by @Wanli-Jiang in #10537
  • [TRTLLM-9522][test] cover LLM API multi_modal_embeddings by @ixlmar in #9963
  • [None][infra] Waive failed tests for main 01/12 by @EmmaQiaoCh in #10604
  • [#10580][fix] re-enable NemotronH MOE MMLU test by @suyoggupta in #10594
  • [https://nvbugs/5761391][fix] Use correct model names for config database regression tests by @anish-shanbhag in #10192
  • [None][chore] Print correct backend name in benchmark report by @galagam in #10597
  • [https://nvbugs/5689235][fix] Fix cancellation+chunked prefill+disagg by @Tabrizian in #10111
  • [https://nvbugs/5762336][fix] support to parse the keyword modules_to_not_convert of the HF model config by @xxi-nv in #10527
  • [None][chore] Fix disagg assert by @fredricz-20070104 in #10596
  • [TRTLLM-10271][test] Add Spark QA functional and performance cases by @JennyLiu-nv in #10564
  • [None][infra] try removing shared cache dir mount by @tburt-nv in #10609
  • [None][infra] Update allowlist 2026.01.08 by @niukuo in #10535
  • [None][feat] Hang detection for executor loop and worker. by @yuxianq in #10480
  • [TRTLLM-8462][feat] Support GET/DELETE v1/responses/{response_id} by @JunyiXu-nv in #9937
  • [TRTLLM-10060][feat] Enable attention dp for Nemotron Super v3. by @nv-guomingz in #10347
  • [https://nvbugs/5788127][fix] Use uint64_t as the dtype of lamport_buffer_size to avoid overflow by @yilin-void in #10499
  • [NVBUG-5670458][chore] Unwaive lp tests by @hchings in #10524
  • [TRTLLM-8425][doc] document Torch Sampler details by @ixlmar in #10606
  • [None][feat] Layer-wise benchmarks: make model init more general and support weights loading by @yuantailing in #10562
  • [None][test] Unwaive qwen3 next test case. by @nv-guomingz in #9877
  • [None][feat] add fp4 gemm + allreduce by @benzh-2025 in #9729
  • [None][infra] support overriding nspect version by @niukuo in #10402
  • [https://nvbugs/5772396][fix] WAR: Disable TinyGEMM PDL due to accuracy issues by @dongfengy in #10619
  • [None][feat] AutoDeploy: refactor memory usage logging by @nzmora-nvidia in #8505
  • [#9283][feat] AutoDeploy: separate rms pattern detection from fusion by @Fridah-nv in #9969
  • [https://nvbugs/5791900][fix] Fix HelixCpMnnvlMemory init with PP by @brb-nv in #10533
  • [None][chore] Add test configurable moe module by @leslie-fang25 in #10575
  • [https://nvbugs/5781589][fix] Implement pp skip forward for all spec workers. by @yuxianq in #10578
  • [None][fix] Avoid write-write race for async pp send. by @yuxianq in #10488
  • [https://nvbugs/5753788][chore] Padding empty chunk for configurable moe by @leslie-fang25 in #10451
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10589
  • [None][chore] update allowlist 2026-01-13 by @tburt-nv in #10645
  • [None][test] add test into qa test list by @xinhe-nv in #10627
  • [None][test] Spark - Change testlist name and perf yml format by @JennyLiu-nv in #10626
  • [None][chore] waive the CI failure by @xxi-nv in #10655
  • [None][refactor] Unify the usage of MPIDist and TorchDist. by @yuxianq in #10380
  • [None][fix] Reduce host overhead for unified nvfp4 gemm tuning path. by @hyukn in #10503
  • [https://nvbugs/5637220][ci] unwaive TestQwen3_235B_A22B::test_nvfp4[latency_moe_trtllm_attention_dp] by @QiJune in #9870
  • [None][test] add log_samples and output_path for trtllm_eval by @dc3671 in #10629
  • [https://nvbugs/5664904][fix] Update test name MNNVL->NVLinkTwoSided and unwaive tests. by @bobboli in #9672
  • [TRTLLM-9522][feat] support image_embeds in OpenAI API by @ixlmar in #9715
  • [None][feat] MiniMax M2 support by @jmydurant in #10532
  • [None][fix] fix L0 issues by @xinhe-nv in #10670
  • [None][chore] enable EPLB for DEEPGEMM by @xxi-nv in #10617
  • [None][chore] improve the readability of log for cutlass can only sup… by @xxi-nv in #10630
  • [None][feat] Support new Transformers RoPE configuration format by @lkm2835 in #10636
  • [https://nvbugs/5760740][fix] Enable ray tests by @shuyixiong in #10272
  • [https://nvbugs/5774869][infra] Use 2 GPUs to test skip softmax attention on H100. by @bobboli in #10420
  • [https://nvbugs/5787566][fix] Only keep a limited number of performance statistic data by @HuiGao-NV in #10569
  • [None][infra] Waive failed cases in post-merge on 01/14 by @EmmaQiaoCh in #10668
  • [TRTLLM-9849][infra] Update dependencies to 25.12 by @EmmaQiaoCh in #9818
  • [#9760][fix] Use RequestError for validation errors to prevent engine shutdown by @tzulingk in #9761
  • [None][feat] Adding torch ext API for FusedAddRMSNormQuant kernel by @JintaoPengCS in #9905
  • [https://nvbugs/5800725][infra] Update waives.txt by @byshiue in #10625
  • [https://nvbugs/5766952][fix] Fix AIPerf issue. by @dominicshanshan in #10666
  • [TRTLLM-10245][feat] Add accuracy tests for super v3 fp8 model by @Wanli-Jiang in #10482
  • [https://nvbugs/5630196] [fix] Prevent flaky failures in C++ test_e2e.py by using local cached datasets for benchmarking by @DomBrown in #10638
  • [https://nvbugs/5777041][fix] fix AutoDeploy ep sharding test by @lucaslie in #10460
  • [None][fix] add quantization check for DeepEP LL low precision combine in new moe comm api by @yilin-void in #10072
  • [None][infra] separate AutoDeploy tests into own stages by @lucaslie in #10634
  • [None][feat] Auto download speculative models from HF for pytorch backend, add speculative_model field alias by @anish-shanbhag in #10099
  • [None][infra] Waive failed tests on main 01/15 by @EmmaQiaoCh in #10683
  • [None][test] store per user output and per gpu output metric in csv file by @ruodil in #10658
  • [None][chore] Bump version to 1.3.0rc0 by @yiqingy0 in #10681
  • [https://nvbugs/5741392][fix] [chore] Remove test exemptions from waivers file by @nv-lschneider in #10517
  • [None][feat] update trtllm-gen to support groupsTokensHeadsQ by @PerkzZheng in #10261
  • [None][feat] Use XQA JIT impl by default and mitigate perf loss with sliding window by @pengbowang-nv in #10335
  • [None][test] Remove NIM perf test by @yufeiwu-nv in #10657
  • [https://nvbugs/5791830][fix] fix pp loop hang caused by i-sending new requests by @reasonsolo in #10665
  • [None][doc] doc updates by @juney-nvidia in #10704
  • [TRTLLM-9942][feat] new request states and kvcache transceiver APIs in generation-first disagg by @reasonsolo in #10406
  • [None][doc] doc updates by @forrestl111 in #10711
  • [None][feat] Support to export data in trtllm-eval by @heyuhhh in #10075
  • [https://nvbugs/5721661][fix] Prevent out-of-bounds read by @thorjohnsen in #9879
  • [https://nvbugs/5738168][fix] unwaive test accuracy/test_disaggregated_serving.py::TestDeepSeekV32Exp::test_auto_dtype[False] by @Tabrizian in #10584
  • [None][bug] AutoDeploy: fix regression in kv cache resize memory estimation by @lucaslie in #10726
  • [https://nvbugs/5701445][chore] isolate test. by @yuxianq in #10444
  • [None][chore] Waive star attention unittests by @heyuhhh in #10439
  • [https://nvbugs/5598674][fix] enable partial reuse in gemma and gpt oss test by @chuangz0 in #10559
  • [TRTLLM-9111][feat] MoE test refactor: Extend MoE quantization test utilities with comprehensive quant algorithm support by @xxi-nv in #10691
  • [https://nvbugs/5800521][ci] Move test_openai_chat_guided_decoding to H100 stage (to avoid potential OOM) by @syuoni in #10703
  • [https://nvbugs/5669671][fix] Support GuidedDecoder with sharded logits by @syuoni in #10698
  • [https://nvbugs/5791936][fix] Add warning for gen-only paused by @chuangz0 in #10664
  • [https://nvbugs/5810940][chore] Update waive lists for nvbugs/5810940. by @bobboli in #10737
  • [None][infra] Waive failed cases for main branch on 01/16 by @EmmaQiaoCh in #10738
  • [https://nvbugs/5782112][fix] Fix hanging issue for MNNVL Allreduce under PP by @hyukn in #10633
  • [None][fix] impl fused triton kernel for e8m0 resmooth to reduce memory footprint by @Nekofish-L in #10327
  • [None][doc] update doc (add minimax model) by @jmydurant in #10746
  • [None][chore] Remove closed bugs by @xinhe-nv in #10586
  • [None][fix] Fix Piecewise Cuda Graph for GPTOSS by @dongfengy in #10631
  • [None] [feat] Support multiple accuracy tasks for slurm scripts by @kaiyux in #10500
  • [TRTLLM-10305][feat] Support customized seq len larger than model config by @Wanli-Jiang in #10600
  • [TRTLLM-8425][doc] Update sampling documentation by @stnie in #10083
  • [None][fix] waive tests on sm89 by @xinhe-nv in #10753
  • [https://nvbugs/5783509][fix] Fix a hang issue when enabling skip softmax on Blackwell by @Tom-Zheng in #10490
  • [TRTLLM-9735][feat] Add processed logprobs functionality to TorchSampler by @stnie in #9675
  • [None][fix] AutoDeploy: Fix the nvfp4 fused_moe by @nvchenghaoz in #10727
  • [None][chore] update flashinfer to 0.6.0 by @nvchenghaoz in #10522
  • [None][fix] AutoDeploy: skip mxfp4_moe test unless on Hopper by @Fridah-nv in #10729
  • [TRTLLM-8263][feat] Add Aggregated Perf Tests by @chenfeiz0326 in #10598
  • [None][fix] Fix tmp dir being deleted too early in unit test. by @hyukn in #10740
  • [https://nvbugs/5794313][chore] unwaive tests. by @yuxianq in #10660
  • [None][fix] convert to CUDA tensor before calling _resmooth_kernel. by @yuxianq in #10770
  • [None][feat] AutoDeploy: improved sharding utilities by @greg-kwasniewski1 in #10319
  • [None][infra] Update upgrade related docs for release 1.2 (#10760) by @chzblych in #10773
  • [None][test] Waive main post-merge test failures 1/18 by @chzblych in #10777

New Contributors

Full Changelog: v1.2.0rc8...v1.3.0rc0
