Highlights:
-
Model Support
-
API
-
Feature
- Add Skip Softmax MLA kernels for Blackwell and fix NVFP4 KV accuracy bug (#10813)
- Fuse AllGather for expert statistics required by EPLB (#10885)
- Add first-iteration streaming for GPT-OSS in trtllm-serve (#10808)
- Integrate CuteDSL argmax kernel (#10476)
- Update Mamba decode kernel to FlashInfer (#10757)
- Improve effective memory bandwidth with TMA.RED (#10987)
- Reorganize AutoTuner cache file for distributed tuning (#10956)
- Support attention DP + Helix CP (#10477)
- Improve performance of _write_finish_reasons in TorchSampler (#10459)
- Add gRPC server for high-performance external router integration (#11037)
- Prepare for future KVCacheV2 MTP support (#11029)
-
Fix
- Fix CuteDSL MoE unit test (#10983)
- Fix overlap scheduler pause() timing (#10943)
- Fix Pydantic deepcopy bug (#11004)
- Restore IPv6 support in serve.py (#10929)
- Fix conditional compilation for sm10x cubins (#10839)
- Add graceful fallbacks for NCCL symmetric mode (#11042)
- Fix enable_alltoall passed to CutlassFusedMoE (#11016)
- Fix kvCacheManager isLeaf() assertion failure (#10922)
- Add null pointer check to parseNpyHeader (#10944)
- Fix attention DP scheduling sort order to prioritize non-relaxed requests (#11106)
-
Documentation
- Update Qwen2/3-VL models in supported_models.md (#10797)
-
Benchmark
-
Test & Infra
- Add 250K-token NVFP4 MoE + PDL regression tests (#10911)
- Add timeout for SeedOSS test (#8683)
- Add Fake Ops for one-sided AlltoAll (#11002)
- Refactor setup for RNN cache transceiver (#10957)
- Change SLURM config access to use resolvePlatform (#11006)
- Update CI allowList (#11040)
- Add Mamba and MLA layers to sharding tests (#10364)
- Remove pybind11 bindings and references (#10550, #11026)
- Add multi-acc and Lyris GB200 test support (#11024)
- Package triton-kernels as a dependency (#10471)
- Fix Qwen3 Eagle test (#11030)
- Dump thread stacks for hanging tests before timeout (#10708)
- Remove -ccache from build_wheel.py args (#11064)
- Fix trtllm-serve guided decoding test (#11101)
- Remove invalid account for Blossom CI (#11126)
- Add source code pulse scan to PLC nightly pipeline (#10961)
What's Changed
- [None][fix] Fix CuteDSL MoE unittest by @syuoni in #10983
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10974
- [https://nvbugs/5661741][feat] Add 250K-token NVFP4 MoE + PDL regression tests by @yingguo-trt in #10911
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10976
- [None][infra] Waive failed case for main branch on 01/26 by @EmmaQiaoCh in #10994
- [None][feat] Add Skip Softmax MLA kernels for Blackwell and Fix an accuracy bug of NVFP4 KV by @Tom-Zheng in #10813
- [TRTLLM-10048][feat] Fuse the AllGather for expert statistics required by the EPLB. by @bobboli in #10885
- [https://nvbugs/5794796][fix] Cherry-pick #10855: Unwaive Llama 3.3 related multi GPU tests by @pengbowang-nv in #10942
- [#10614][fix] gpt_oss first iteration streaming in trtllm-serve by @LinPoly in #10808
- [None][chore] Removing pybind11 bindings and references by @Linda-Stadter in #10550
- [#8982][feat] AutoDeploy attention dp support by @lucaslie in #10728
- [None][chore] update AD model list by @tcherckez-nvidia in #10981
- [TRTLLM-10062][feat] Enable MTP for Nemotron Super by @sunnyqgg in #10754
- [TRTLLM-10276][feat] Integrate cutedsl argmax kernel by @ameynaik-hub in #10476
- [TRTLLM-10453][feat] Update mamba decode kernel to flashinfer by @Wanli-Jiang in #10757
- [TRTLLM-10560][fix] Fix the time of pause() for overlap scheduler by @yuantailing in #10943
- [https://nvbugs/5612438][fix] Add timeout for SeedOSS test by @zhhuang-nv in #8683
- [None][infra] Waive failed cases for main on 01/27 by @EmmaQiaoCh in #11017
- [None][chore] Bump version to 1.3.0rc2 by @yiqingy0 in #11021
- [None][chore] Remove closed bugs by @xinhe-nv in #10982
- [#10889][fix] fix pydantic deepcopy bug by @reasonsolo in #11004
- [TRTLLM-9390][chore] Add Fake OPs for One-Sided AlltoAll. by @bobboli in #11002
- [TRTLLM-9831][perf] Use TMA.RED to improve effective memory bandwidth by @sherry-1001 in #10987
- [TRTLLM-9527][feat] change context params and disagg params (step3) by @chuangz0 in #10495
- [TRTLLM-10308][feat] AutoTuner Cache: reorganize cache file for distributed tuning by @hyukn in #10956
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10993
- [https://nvbugs/5843316][chore] waive overlap_scheduler test by @galagam in #11025
- [#10013][feat] AutoDeploy: native cache manager integration by @lucaslie in #10635
- [https://nvbugs/5721661][chore] Unwaive fixed bug. by @SimengLiu-nv in #11009
- [#10877][fix] restore ipv6 support in serve.py by @Evgueni-Petrov-aka-espetrov in #10929
- [TRTLLM-10197][chore] Refactor to setup for RNN cache transceiver by @NVShreyas in #10957
- [TRTINFRA-7379][infra] Change SLURM config access to use resolvePlatform by @mlefeb01 in #11006
- [None][fix] Proper conditional compilation of sm10x cubins by @tongyuantongyu in #10839
- [https://nvbugs/5756804][fix] Re-enable passing test by @dongfengy in #10986
- [None][fix] unwaive tests by @xinhe-nv in #11047
- [https://nvbugs/5779536][fix] Cherry-pick #10902: Unwaive DeepSeekR1 nvfp4 pp4 mtp test case (#10902) by @pengbowang-nv in #11000
- [None][infra] Update CI allowList by @yuanjingx87 in #11040
- [TRTLLM-10362][feat] Added Mamba and MLA layers to the sharding tests by @greg-kwasniewski1 in #10364
- [None][chore] Removing cpp/tensorrt_llm/pybind by @Linda-Stadter in #11026
- [None][feat] support multi_acc and Lyris GB200 test by @yingguo-trt in #11024
- [None][infra] Waive failed cases for main on 1/28 by @EmmaQiaoCh in #11053
- [None][chore] AutoDeploy: Eagle One-Model [1/n]: PyTorch impl for Eagle3 checkpoint by @govind-ramnarayan in #10674
- [#10245][feat] AutoDeploy: Add Minimax M2 support by @bmarimuthu-nv in #10525
- [None][fix] nccl symmetric with graceful fallbacks by @nv-lschneider in #11042
- [None][fix] fix Qwen2/3 export for AutoDeploy by @Fridah-nv in #11007
- [None][fix] No need to remove the original waive list by @yiqingy0 in #11060
- [https://nvbugs/5761391][fix] Include triton-kernels as a packaged dependency by @anish-shanbhag in #10471
- [None][fix] Fix enable_alltoall passed to CutlassFusedMoE by @syuoni in #11016
- [None][feat] Add performance alignment to layer-wise benchmarks by @yuantailing in #11018
- [https://nvbugs/5813452][fix] Fix "Assertion failed: isLeaf() in kvCacheManager.cpp:465" by @Boreas618 in #10922
- [None][infra] Waived flaky tests by @ZhanruiSunCh in #11091
- [TRTLLM-10264][feat] Support attention DP + Helix CP by @brb-nv in #10477
- [TRTLLM-10415][feat] Dump thread stacks for hanging tests before time… by @WeiHaocheng in #10708
- [TRTLLM-10312][perf] Improve performance of _write_finish_reasons in TorchSampler by @stnie in #10459
- [None][chore] Consolidate duplicate kv cache reuse variables. by @hnover-nv in #10935
- [None][chore] Clean up layer-wise benchmarks code by @yuantailing in #11092
- [None][fix] AutoDeploy: remove mem check for a log unit test by @lucaslie in #11120
- [https://nvbugs/5836592][fix] Fix qwen3 eagle test by @mikeiovine in #11030
- [None][feat] AutoDeploy: Flashinfer kernels bringup by @nvchenghaoz in #10867
- [None][feat] Add gRPC server for high-performance external router integration by @CatherineSue in #11037
- [None][infra] Remove invalid account for blossom CI by @yuanjingx87 in #11126
- [None][fix] Add missing absolute pe in Qwen3-VL Vision Encoder by @Nekofish-L in #11065
- [https://nvbugs/5775544][fix] Unwaive test by @eopXD in #11023
- [None][test] Add DGX-Spark VLM gemm3-12b bfp16/fp4/fp8 accuracy and perf cases by @JennyLiu-nv in #11096
- [TRTLLM-9904][feat] Changes for future KVCacheV2 MTP support by @liji-nv in #11029
- [TRTLLM-10733][feat] Make TRTLLM MOE the default one for GPTOSS on Blackwell by @dongfengy in #11074
- [https://nvbugs/5825514][fix] Add null pointer check to parseNpyHeader by @yibinl-nvidia in #10944
- [None][chore] Correct sorting order for attention DP scheduling to prioritize non-relaxed requests by @lancelly in #11106
- [None][fix] Remove -ccache from build_wheel.py args by @yuantailing in #11064
- [https://nvbugs/5837281][fix] Fix trtllm-serve guided decoding test by @syuoni in #11101
- [None][feat] New KVCacheManagerV2 APIs for Transceiver by @lowsfer in #11003
- [None][doc] Update Qwen2/3-VL's model on supported_models.md by @yechank-nvidia in #10797
- [https://nvbugs/5853997][chore] Waive test by @dominicshanshan in #11132
- [None][infra] Add source code pulse scan to PLC nightly pipeline by @yuanjingx87 in #10961
New Contributors
- @Evgueni-Petrov-aka-espetrov made their first contribution in #10929
- @bmarimuthu-nv made their first contribution in #10525
- @hnover-nv made their first contribution in #10935
- @CatherineSue made their first contribution in #11037
Full Changelog: v1.3.0rc1...v1.3.0rc2