NVIDIA/TensorRT-LLM v1.3.0rc0 (pre-release)

Highlights

  • Model Support

    • Added support for K-EXAONE models (#10355)
    • Integrated MiniMax M2 model (#10532)
    • Added Spark QA functional and performance test cases (#10564)
    • Added support for new Transformers RoPE configuration format (#10636)
    • Added support for a customized sequence length larger than the model config (#10600); see the sketch after the Highlights list
  • API Improvements

    • Added support for image_embeds in OpenAI API (#9715)
    • Covered LLM API multi_modal_embeddings (#9963)
    • Implemented GET/DELETE v1/responses/{response_id} endpoints (#9937); see the sketch after the Highlights list
    • Used RequestError for validation errors to prevent engine shutdown (#9761)
  • Performance Optimizations

    • Added Hopper XQA decode support for skip softmax attention (#10264)
    • Enabled attention data parallelism for Nemotron Super v3 (#10347)
    • Added fp4 GEMM with AllReduce support (#9729)
    • Enabled the XQA JIT implementation by default, with a sliding-window performance optimization (#10335)
    • Reduced host overhead for unified nvfp4 GEMM tuning path (#10503)
    • Implemented fused Triton kernel for e8m0 resmooth to reduce memory footprint (#10327)
  • MoE (Mixture of Experts) Enhancements

    • Added ExpertStatistic and DUMMY_ALLREDUCE for configurable MoE (#10401)
    • Added a test for the configurable MoE module (#10575)
    • Implemented empty-chunk padding for the configurable MoE (#10451)
    • Enabled EPLB for DEEPGEMM (#10617)
    • Extended MoE quantization test utilities with comprehensive quant algorithm support (#10691)
  • Disaggregation Features

    • New request states and KV cache transceiver APIs in generation-first disaggregation (#10406)
    • Fixed cancellation with chunked prefill and disaggregation (#10111)
  • Auto Deploy

    • Refactored memory usage logging in AutoDeploy (#8505)
    • Separated RMS pattern detection from fusion (#9969)
    • Added automatic download of speculative models from Hugging Face for the PyTorch backend (#10099)
  • Fixes

    • Fixed PP loop hang caused by i-sending new requests (#10665)
    • Avoided write-write race for async PP send (#10488)
    • Fixed hang issue when enabling skip softmax on Blackwell (#10490)
    • Fixed hanging issue for MNNVL Allreduce under PP (#10633)
    • Implemented PP skip forward for all spec workers (#10578)
    • Added a warning for the generation-only paused state (#10664)
    • Used uint64_t as dtype of lamport_buffer_size to avoid overflow (#10499)
    • Fixed HelixCpMnnvlMemory initialization with PP (#10533)
    • Fixed regression in KV cache resize memory estimation (#10726)
    • Prevented out-of-bounds read (#9879)
    • Solved pillow version conflict (#10537)
    • Added support for parsing the modules_to_not_convert keyword in the HF model config (#10527)
    • Used correct model names for config database regression tests (#10192)
    • Support GuidedDecoder with sharded logits (#10698)
    • Fixed Piecewise CUDA Graph for GPTOSS (#10631)
    • Fixed AutoDeploy EP sharding test (#10460)
    • Fixed the nvfp4 fused_moe in AutoDeploy (#10727)
    • Added quantization check for DeepEP LL low precision combine in new MoE comm API (#10072)
    • Fixed AIPerf issue (#10666)
    • Disabled TinyGEMM PDL due to accuracy issues (WAR) (#10619)
    • Kept only a limited amount of performance statistics data (#10569)
    • Converted to a CUDA tensor before calling _resmooth_kernel (#10770)
  • Test & Infra

    • Added hang detection for executor loop and worker (#10480)
    • Implemented a bot to send performance-regression messages to a Slack channel (#10489)
    • Made model initialization more general and supported weight loading in layer-wise benchmarks (#10562)
    • Updated trtllm-gen to support groupsTokensHeadsQ (#10261)
    • Added support to export data in trtllm-eval (#10075)
    • Added Torch extension API for FusedAddRMSNormQuant kernel (#9905)
    • Enabled ray tests (#10272)
    • Prevented flaky failures in C++ test_e2e.py by using local cached datasets (#10638)
    • Enabled partial reuse in Gemma and GPT OSS test (#10559)
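
For illustration, a minimal sketch of requesting a custom sequence length through the LLM API. It assumes the max_seq_len argument of tensorrt_llm.LLM is the setting extended by #10600; the checkpoint name and length below are placeholders, not values taken from this release.

```python
# Minimal sketch, not an official example. Assumptions: the LLM API's
# max_seq_len argument is the setting extended by #10600; the checkpoint
# and sequence length are placeholders.
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    max_seq_len=131072,  # customized length, possibly larger than the model config default
)
outputs = llm.generate(["Hello, TensorRT-LLM!"])
print(outputs[0].outputs[0].text)
```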
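
A minimal sketch of exercising the new response endpoints against an OpenAI-compatible server. The host, port, and response ID are assumptions; only the GET/DELETE v1/responses/{response_id} routes come from #9937.

```python
# Minimal sketch, not an official example. Assumptions: an OpenAI-compatible
# server at localhost:8000 and a response ID returned by an earlier
# /v1/responses request.
import requests

BASE_URL = "http://localhost:8000/v1"
response_id = "resp_abc123"  # hypothetical ID from a prior Responses API call

# Retrieve a stored response by ID (GET v1/responses/{response_id}).
resp = requests.get(f"{BASE_URL}/responses/{response_id}")
print(resp.status_code, resp.json() if resp.ok else resp.text)

# Delete the stored response once it is no longer needed (DELETE v1/responses/{response_id}).
resp = requests.delete(f"{BASE_URL}/responses/{response_id}")
print(resp.status_code)
```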

What's Changed

  • [TRTLLM-10195][feat] K-EXAONE support by @yechank-nvidia in #10355
  • [None][test] update core test list by @crazydemo in #10538
  • [#8391][chore] removed llama and added deepseek to AutoDeploy's L0 perf test by @MrGeva in #10585
  • [TRTLLM-10022][feat] Add hopper xqa decode support for skip softmax attention by @pengbowang-nv in #10264
  • [None][chore] update waive list by @jieli-matrix in #10577
  • [None][feat] Add ExpertStatistic and DUMMY_ALLREDUCE for configurable_moe by @qiaoxj07 in #10401
  • [TRTLLM-10248][feat] Support Bot to Send Perf Regression Msg to Slack Channel by @chenfeiz0326 in #10489
  • [None][chore] update deepseekv3.2 test parameter by @yingguo-trt in #10595
  • [None][test] Remove most TRT-backend test cases in llm_perf_nim.yml by @yufeiwu-nv in #10572
  • [https://nvbugs/5794796][chore] waive test blocking premerge by @dc3671 in #10593
  • [None][fix] Solve pillow version conflict by @Wanli-Jiang in #10537
  • [TRTLLM-9522][test] cover LLM API multi_modal_embeddings by @ixlmar in #9963
  • [None][infra] Waive failed tests for main 01/12 by @EmmaQiaoCh in #10604
  • [#10580][fix] re-enable NemotronH MOE MMLU test by @suyoggupta in #10594
  • [https://nvbugs/5761391][fix] Use correct model names for config database regression tests by @anish-shanbhag in #10192
  • [None][chore] Print correct backend name in benchmark report by @galagam in #10597
  • [https://nvbugs/5689235][fix] Fix cancellation+chunked prefill+disagg by @Tabrizian in #10111
  • [https://nvbugs/5762336][fix] support to parse the keyword modules_to_not_convert of the HF model config by @xxi-nv in #10527
  • [None][chore] Fix disagg assert by @fredricz-20070104 in #10596
  • [TRTLLM-10271][test] Add Spark QA functional and performance cases by @JennyLiu-nv in #10564
  • [None][infra] try removing shared cache dir mount by @tburt-nv in #10609
  • [None][infra] Update allowlist 2026.01.08 by @niukuo in #10535
  • [None][feat] Hang detection for executor loop and worker. by @yuxianq in #10480
  • [TRTLLM-8462][feat] Support GET/DELETE v1/responses/{response_id} by @JunyiXu-nv in #9937
  • [TRTLLM-10060][feat] Enable attention dp for Nemotron Super v3. by @nv-guomingz in #10347
  • [https://nvbugs/5788127][fix] Use uint64_t as the dtype of lamport_buffer_size to avoid overflow by @yilin-void in #10499
  • [NVBUG-5670458][chore] Unwaive lp tests by @hchings in #10524
  • [TRTLLM-8425][doc] document Torch Sampler details by @ixlmar in #10606
  • [None][feat] Layer-wise benchmarks: make model init more general and support weights loading by @yuantailing in #10562
  • [None][test] Unwaive qwen3 next test case. by @nv-guomingz in #9877
  • [None][feat] add fp4 gemm + allreduce by @benzh-2025 in #9729
  • [None][infra] support overriding nspect version by @niukuo in #10402
  • [https://nvbugs/5772396][fix] WAR: Disable TinyGEMM PDL due to accuracy issues by @dongfengy in #10619
  • [None][feat] AutoDeploy: refactor memory usage logging by @nzmora-nvidia in #8505
  • [#9283][feat] AutoDeploy: separate rms pattern detection from fusion by @Fridah-nv in #9969
  • [https://nvbugs/5791900][fix] Fix HelixCpMnnvlMemory init with PP by @brb-nv in #10533
  • [None][chore] Add test configurable moe module by @leslie-fang25 in #10575
  • [https://nvbugs/5781589][fix] Implement pp skip forward for all spec workers. by @yuxianq in #10578
  • [None][fix] Avoid write-write race for async pp send. by @yuxianq in #10488
  • [https://nvbugs/5753788][chore] Padding empty chunk for configurable moe by @leslie-fang25 in #10451
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10589
  • [None][chore] update allowlist 2026-01-13 by @tburt-nv in #10645
  • [None][test] add test into qa test list by @xinhe-nv in #10627
  • [None][test] Spark - Change testlist name and perf yml format by @JennyLiu-nv in #10626
  • [None][chore] waive the CI failure by @xxi-nv in #10655
  • [None][refactor] Unify the usage of MPIDist and TorchDist. by @yuxianq in #10380
  • [None][fix] Reduce host overhead for unified nvfp4 gemm tuning path. by @hyukn in #10503
  • [https://nvbugs/5637220][ci] unwaive TestQwen3_235B_A22B::test_nvfp4[latency_moe_trtllm_attention_dp] by @QiJune in #9870
  • [None][test] add log_samples and output_path for trtllm_eval by @dc3671 in #10629
  • [https://nvbugs/5664904][fix] Update test name MNNVL->NVLinkTwoSided and unwaive tests. by @bobboli in #9672
  • [TRTLLM-9522][feat] support image_embeds in OpenAI API by @ixlmar in #9715
  • [None][feat] MiniMax M2 support by @jmydurant in #10532
  • [None][fix] fix L0 issues by @xinhe-nv in #10670
  • [None][chore] enable EPLB for DEEPGEMM by @xxi-nv in #10617
  • [None][chore] improve the readability of log for cutlass can only sup… by @xxi-nv in #10630
  • [None][feat] Support new Transformers RoPE configuration format by @lkm2835 in #10636
  • [https://nvbugs/5760740][fix] Enable ray tests by @shuyixiong in #10272
  • [https://nvbugs/5774869][infra] Use 2 GPUs to test skip softmax attention on H100. by @bobboli in #10420
  • [https://nvbugs/5787566][fix] Only keep a limited number of performance statistic data by @HuiGao-NV in #10569
  • [None][infra] Waive failed cases in post-merge on 01/14 by @EmmaQiaoCh in #10668
  • [TRTLLM-9849][infra] Update dependencies to 25.12 by @EmmaQiaoCh in #9818
  • [#9760][fix] Use RequestError for validation errors to prevent engine shutdown by @tzulingk in #9761
  • [None][feat] Adding torch ext API for FusedAddRMSNormQuant kernel by @JintaoPengCS in #9905
  • [https://nvbugs/5800725][infra] Update waives.txt by @byshiue in #10625
  • [https://nvbugs/5766952][fix] Fix AIPerf issue. by @dominicshanshan in #10666
  • [TRTLLM-10245][feat] Add accuracy tests for super v3 fp8 model by @Wanli-Jiang in #10482
  • [https://nvbugs/5630196] [fix] Prevent flaky failures in C++ test_e2e.py by using local cached datasets for benchmarking by @DomBrown in #10638
  • [https://nvbugs/5777041][fix] fix AutoDeploy ep sharding test by @lucaslie in #10460
  • [None][fix] add quantization check for DeepEP LL low precision combine in new moe comm api by @yilin-void in #10072
  • [None][infra] separate AutoDeploy tests into own stages by @lucaslie in #10634
  • [None][feat] Auto download speculative models from HF for pytorch backend, add speculative_model field alias by @anish-shanbhag in #10099
  • [None][infra] Waive failed tests on main 01/15 by @EmmaQiaoCh in #10683
  • [None][test] store per user output and per gpu output metric in csv file by @ruodil in #10658
  • [None][chore] Bump version to 1.3.0rc0 by @yiqingy0 in #10681
  • [https://nvbugs/5741392][fix] [chore] Remove test exemptions from waivers file by @nv-lschneider in #10517
  • [None][feat] update trtllm-gen to support groupsTokensHeadsQ by @PerkzZheng in #10261
  • [None][feat] Use XQA JIT impl by default and mitigate perf loss with sliding window by @pengbowang-nv in #10335
  • [None][test] Remove NIM perf test by @yufeiwu-nv in #10657
  • [https://nvbugs/5791830][fix] fix pp loop hang caused by i-sending new requests by @reasonsolo in #10665
  • [None][doc] doc updates by @juney-nvidia in #10704
  • [TRTLLM-9942][feat] new request states and kvcache transceiver APIs in generation-first disagg by @reasonsolo in #10406
  • [None][doc] doc updates by @forrestl111 in #10711
  • [None][feat] Support to export data in trtllm-eval by @heyuhhh in #10075
  • [https://nvbugs/5721661][fix] Prevent out-of-bounds read by @thorjohnsen in #9879
  • [https://nvbugs/5738168][fix] unwaive test accuracy/test_disaggregated_serving.py::TestDeepSeekV32Exp::test_auto_dtype[False] by @Tabrizian in #10584
  • [None][bug] AutoDeploy: fix regression in kv cache resize memory estimation by @lucaslie in #10726
  • [https://nvbugs/5701445][chore] isolate test. by @yuxianq in #10444
  • [None][chore] Waive star attention unittests by @heyuhhh in #10439
  • [https://nvbugs/5598674][fix] enable partial reuse in gemma and gpt oss test by @chuangz0 in #10559
  • [TRTLLM-9111][feat] MoE test refactor: Extend MoE quantization test utilities with comprehensive quant algorithm support by @xxi-nv in #10691
  • [https://nvbugs/5800521][ci] Move test_openai_chat_guided_decoding to H100 stage (to avoid potential OOM) by @syuoni in #10703
  • [https://nvbugs/5669671][fix] Support GuidedDecoder with sharded logits by @syuoni in #10698
  • [https://nvbugs/5791936][fix] Add warning for gen-only paused by @chuangz0 in #10664
  • [https://nvbugs/5810940][chore] Update waive lists for nvbugs/5810940. by @bobboli in #10737
  • [None][infra] Waive failed cases for main branch on 01/16 by @EmmaQiaoCh in #10738
  • [https://nvbugs/5782112][fix] Fix hanging issue for MNNVL Allreduce under PP by @hyukn in #10633
  • [None][fix] impl fused triton kernel for e8m0 resmooth to reduce memory footprint by @Nekofish-L in #10327
  • [None][doc] update doc (add minimax model) by @jmydurant in #10746
  • [None][chore] Remove closed bugs by @xinhe-nv in #10586
  • [None][fix] Fix Piecewise Cuda Graph for GPTOSS by @dongfengy in #10631
  • [None] [feat] Support multiple accuracy tasks for slurm scripts by @kaiyux in #10500
  • [TRTLLM-10305][feat] Support customized seq len larger than model config by @Wanli-Jiang in #10600
  • [TRTLLM-8425][doc] Update sampling documentation by @stnie in #10083
  • [None][fix] waive tests on sm89 by @xinhe-nv in #10753
  • [https://nvbugs/5783509][fix] Fix a hang issue when enabling skip softmax on Blackwell by @Tom-Zheng in #10490
  • [TRTLLM-9735][feat] Add processed logprobs functionality to TorchSampler by @stnie in #9675
  • [None][fix] AutoDeploy: Fix the nvfp4 fused_moe by @nvchenghaoz in #10727
  • [None][chore] update flashinfer to 0.6.0 by @nvchenghaoz in #10522
  • [None][fix] AutoDeploy: skip mxfp4_moe test unless on Hopper by @Fridah-nv in #10729
  • [TRTLLM-8263][feat] Add Aggregated Perf Tests by @chenfeiz0326 in #10598
  • [None][fix] Fix tmp dir being deleted too early in unit test. by @hyukn in #10740
  • [https://nvbugs/5794313][chore] unwaive tests. by @yuxianq in #10660
  • [None][fix] convert to CUDA tensor before calling _resmooth_kernel. by @yuxianq in #10770
  • [None][feat] AutoDeploy: improved sharding utilities by @greg-kwasniewski1 in #10319
  • [None][infra] Update upgrade related docs for release 1.2 (#10760) by @chzblych in #10773
  • [None][test] Waive main post-merge test failures 1/18 by @chzblych in #10777

New Contributors

Full Changelog: v1.2.0rc8...v1.3.0rc0
