NVIDIA/TensorRT-LLM v1.3.0rc2

Pre-release

Highlights:

  • Model Support

    • Enable MTP for Nemotron Super (#10754)
    • Make TRTLLM MoE the default for GPTOSS on Blackwell (#11074)
    • Add missing absolute position embeddings in Qwen3-VL vision encoder (#11065)
  • API

    • Change context params and disagg params (#10495)
    • Add KVCacheManagerV2 APIs for Transceiver (#11003)
  • Feature

    • Add Skip Softmax MLA kernels for Blackwell and fix NVFP4 KV accuracy bug (#10813)
    • Fuse AllGather for expert statistics required by EPLB (#10885)
    • Add first-iteration streaming for GPT-OSS in trtllm-serve (#10808)
    • Integrate CuteDSL argmax kernel (#10476)
    • Update Mamba decode kernel to FlashInfer (#10757)
    • Improve effective memory bandwidth with TMA.RED (#10987)
    • Reorganize AutoTuner cache file for distributed tuning (#10956)
    • Support attention DP + Helix CP (#10477)
    • Improve performance of _write_finish_reasons in TorchSampler (#10459)
    • Add gRPC server for high-performance external router integration (#11037)
    • Prepare for future KVCacheV2 MTP support (#11029)
  • Fix

    • Fix CuteDSL MoE unit test (#10983)
    • Fix overlap scheduler pause() timing (#10943)
    • Fix Pydantic deepcopy bug (#11004)
    • Restore IPv6 support in serve.py (#10929)
    • Fix conditional compilation for sm10x cubins (#10839)
    • Add graceful fallbacks for NCCL symmetric mode (#11042)
    • Fix enable_alltoall passed to CutlassFusedMoE (#11016)
    • Fix kvCacheManager isLeaf() assertion failure (#10922)
    • Add null pointer check to parseNpyHeader (#10944)
    • Fix attention DP scheduling sort order to prioritize non-relaxed requests (#11106)
  • Documentation

    • Update Qwen2/3-VL models in supported_models.md (#10797)
  • Benchmark

    • Add performance alignment to layer-wise benchmarks (#11018)
    • Clean up layer-wise benchmarks code (#11092)
    • Add DGX-Spark VLM Gemma3-12B bf16/fp4/fp8 accuracy and perf cases (#11096)
  • Test & Infra

    • Add 250K-token NVFP4 MoE + PDL regression tests (#10911)
    • Add timeout for SeedOSS test (#8683)
    • Add Fake Ops for one-sided AlltoAll (#11002)
    • Refactor setup for RNN cache transceiver (#10957)
    • Change SLURM config access to use resolvePlatform (#11006)
    • Update CI allowList (#11040)
    • Add Mamba and MLA layers to sharding tests (#10364)
    • Remove pybind11 bindings and references (#10550, #11026)
    • Add multi-acc and Lyris GB200 test support (#11024)
    • Package triton-kernels as a dependency (#10471)
    • Fix Qwen3 Eagle test (#11030)
    • Dump thread stacks for hanging tests before timeout (#10708)
    • Remove -ccache from build_wheel.py args (#11064)
    • Fix trtllm-serve guided decoding test (#11101)
    • Remove invalid account for Blossom CI (#11126)
    • Add source code pulse scan to PLC nightly pipeline (#10961)

Full Changelog: v1.3.0rc1...v1.3.0rc2
