Highlights
Model Support
- Nemotron 3 Super support
- Add tool parser support for GLM-4 models (#11986)
- Implement dynamic resolution for Nemotron VL (#11894)
- Enable mixed quantization support for Nemotron-H Mamba (#11972)
- Add VisualGen FA4 attention backend support (#11697)
- VisualGen support for LTX-2, Wan and FLUX (#12009)
- Add TRTLLM-Gen kernels for GLM4.7 and support `groupsTokensHeadsQ` and e2m1 output (#11643)
- Support attention-DP for TRTLLM-Gen NVFP4 MoE (#12156)
API
Feature
- Add basic SSM support in `KVCacheManagerV2` (#11976)
- Improve KV event batching (#11883)
- Add 2FP4 / Arcquant support (#11333)
- Adapt the transceiver to manager v2 (step 6) (#11978)
- Add shared expert LoRA support for MoE models in the PyTorch backend (#11760)
- Add dynamic draft length on the one-model speculative decoding path (#10860)
- Enable configurable warmup shapes for VisualGen (#12107)
- Add FlashInfer API support for `TRTLLMGenFusedMoE` (#10453)
- Add Python cache transceiver support for gen-first workflow (#11941)
Fix
- Upgrade Cutlass version (#11956)
- Fix DS v32 tool calling type and parse errors (#11935)
- Fix protobuf and `aiohttp` vulnerabilities (#11898)
- Fix NVFP4 sharding (#11618)
- Fix Kimi-K2.5 accuracy test skip condition and reference configs (#11930)
- Pass `sparse_attn_config` from `effective_draft_config` for one-model draft KV cache (#12032)
- Fix MTP advanced sampling top-k IMA (#12088)
- Revert refactor of the KV connector integration in py_executor, which caused issues with KVBM (#11872)
- Fix sharding overwrite with multiple graph modules (#12051)
- Fix various agentic flow issues (#12061)
- Split `mContextChunkSize` into per-target and per-draft fields (#12058)
- Fix `ValueError` and missing decoding statistics for MTP (#12063)
- Improve NCCL library load stability (#12015)
- Disable TRTLLM-Gen routing PDL due to NaN issues (#11994)
- Enforce a minimum `NVSHMEM_QP_DEPTH` of 128 for DeepEP low latency (#12100)
- Narrow a bare `except` clause and use identity checks for `None` (#12041)
- Fix MoE DeepEP hangs caused by non-deterministic GC (#12060)
- Fix `KVCacheManagerV2` shrink behavior for the last level and improve `init_ratio` (#12112)
- Fix Mamba cache handling for PP > 1 (#12146)
- Handle `anyOf` parameter schemas in the Qwen3Coder tool parser (#12173)
- Add explicit errors for intermediate-size misalignment with the FP8 block size (#12101)
- Fix DeepEP with the TRTLLM MoE backend for sequence length 1 (#12158)
- Improve port retry loops and exception handling (#12225)
- Add streaming support for no `</think>` on Nemotron models (#12176)
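One fix above concerns `anyOf` parameter schemas in tool-call definitions (#12173): `anyOf` describes a union of allowed types, which a parser expecting a plain `type` field can trip over. As an illustration only (the helper below is hypothetical and not the actual Qwen3Coder parser code), such a schema can be flattened like this:

```python
def flatten_any_of(schema: dict) -> dict:
    """Hypothetical sketch: collapse an `anyOf` parameter schema into a
    single schema listing the union of allowed types, so downstream code
    that only reads a plain `type` field can still handle it."""
    if "anyOf" not in schema:
        return schema
    types = []
    for sub in schema["anyOf"]:
        t = sub.get("type")
        if t and t not in types:
            types.append(t)
    # Keep everything else (description, default, ...) untouched.
    merged = {k: v for k, v in schema.items() if k != "anyOf"}
    merged["type"] = types if len(types) > 1 else types[0]
    return merged

param = {"anyOf": [{"type": "string"}, {"type": "null"}],
         "description": "optional name"}
print(flatten_any_of(param))
# {'description': 'optional name', 'type': ['string', 'null']}
```

This drops per-branch constraints (e.g. differing `enum` values across branches), so it is only a first-order approximation of full `anyOf` semantics.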
Documentation
Benchmark
- Add QA perf test cases with L0 local mode (#12022)
- Align performance benchmark output format (#12067)
- Improve sampler performance by replacing `torch.where` with `masked_fill_` (#11949)
- Add a fused `cat` + `fp8_quantize` CUDA kernel for the DSA indexer (#11899)
- Optimize long-sequence token-parallel prefill for the DSA indexer (#11871)
- Reduce `logprobs=0` overhead in `TorchSampler` (#11983)
- Refine AlltoAll benchmark scripts (#11649)
- Optimize the Q3N decode kernel with IO reads (#11344)
- Fix disaggregated gen-only benchmark coverage (#12091)
- Fix MPI issues and port conflicts in disaggregated performance tests (#12020)
- Add GB200 performance sanity tests to the QA test database (#11882)
- Refactor parallel VAE support (#12123)
- Optimize 6KD FP8 blockscale GEMM (#11502)
- Optimize Qwen3.5 performance (#11581)
- Restore 3 disaggregated gen-only tests (#12159)
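The sampler optimization above (#11949) swaps an allocating `torch.where` for an in-place `masked_fill_`. A minimal sketch of the pattern (illustrative only, not the actual sampler code): both produce the same result, but the in-place form avoids materializing a fill tensor and a new output tensor.

```python
import torch

logits = torch.randn(4, 8)
mask = torch.zeros(4, 8, dtype=torch.bool)
mask[:, 5:] = True  # e.g. positions to exclude from sampling

# Allocating path: builds a full fill tensor plus a new output tensor.
out_where = torch.where(mask, torch.full_like(logits, float("-inf")), logits)

# In-place path: writes -inf directly into the existing buffer.
out_masked = logits.clone()
out_masked.masked_fill_(mask, float("-inf"))

assert torch.equal(out_where, out_masked)
```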
Test & Infra
- Fix disaggregated SKU coverage (#12065)
- Fix upload build info branch handling and ensure it always runs in post steps (#12025)
- Fix the CI issue for Mistral Large3 (#12073)
- Enable more KV connector priority tests in CI (#11892)
- Add speculative decoding tests for `exclude_input_in_output=true` (#12080)
- Add E2E tests for the KV cache connector async loading path (#12053)
- Change the image used for the CI preparation step (#12086)
- Add the `verl` stage in CI (#11306)
- Add multi-node E2E and accuracy cases on DGX-Spark (#12110)
- Update NumPy to version 2 (#11280)
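The NumPy 2 requirement bump (#11280) matters for downstream code: NumPy 2.0 removed several long-deprecated aliases and tightened `copy` semantics. A small compatibility sketch (general NumPy 2 guidance, not taken from the PR):

```python
import numpy as np

# NumPy 2 removed aliases such as np.float_ and np.unicode_;
# use the canonical dtype names instead.
x = np.array([1.0, 2.0], dtype=np.float64)

# In NumPy 2, np.array(a, copy=False) raises if a copy is unavoidable;
# np.asarray is the portable "copy only if needed" spelling.
y = np.asarray(x)

assert y.dtype == np.float64
```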
What's Changed
- [None][feat] Add Auto-Deploy dashboard failures analysis skill by @tcherckez-nvidia in #12033
- [https://nvbugs/5820511][fix] Upgrade Cutlass version by @pamelap-nvidia in #11956
- [None][feat] Add AD model list validation checks to pre-commit and PR… by @tcherckez-nvidia in #12036
- [None][chore] Clarify DCO sign-off and co-author guidelines in AGENTS.md by @kaiyux in #12034
- [TRTLLM-7784][feat] Basic SSM support in KVCacheManagerV2 by @lowsfer in #11976
- [None][test] Add QA's perf test cases with L0 local mode by @fredricz-20070104 in #12022
- [TRTLLM-11246][feat] Add tool parser support for GLM-4 models by @JunyiXu-nv in #11986
- [https://nvbugs/5937478][fix] Fix DS v32 tool calling type and parse error by @JunyiXu-nv in #11935
- [TRTLLM-11135][fix] Fix vulnerabilities protobuf and aiohttp by @yiqingy0 in #11898
- [None][chore] Align perf benchmark output format by @yingguo-trt in #12067
- [None][chore] Improve sampler performance by replacing torch.where with masked_fill_ by @stnie in #11949
- [None][infra] Waive 1 failed cases for main in post-merge 2582 by @ZhanruiSunCh in #12069
- [TRTLLM-10421][perf] Add fused cat+fp8_quantize CUDA kernel for DSA indexer by @kaiyux in #11899
- [None][test] Fix disagg sku by @fredricz-20070104 in #12065
- [https://nvbugs/5892646][perf] Long-sequence token-parallel optimization for DSA indexer prefill by @nvxuanyuc in #11871
- [TRTLLM-11265][feat] Implement dynamic resolution for Nemotron VL by @2ez4bz in #11894
- [https://nvbugs/5708901][perf] reduce logprobs=0 overhead in TorchSampler by @ixlmar in #11983
- [None][feat] NVFP4 TRTLLM-Gen MoE for AutoDeploy (Nemotron Super) by @tcherckez-nvidia in #11652
- [https://nvbugs/5963896][fix] Remove test `test_visual_gen_quickstart` on A10 by @chang-l in #12048
- [TRTLLM-11535][feat] Fixed NVFP4 sharding by @greg-kwasniewski1 in #11618
- [None][fix] Improve KV Event Batching by @jthomson04 in #11883
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12047
- [TRTLLM-11276][fix] Fix Kimi-K2.5 accuracy test skip condition and reference configs by @lancelly in #11930
- [https://nvbugs/5919026][fix] Pass sparse_attn_config from effective_draft_config for one-model draft KV cache by @chenfeiz0326 in #12032
- [None][fix] MTP Advanced Sampling Topk IMA by @IzzyPutterman in #12088
- [None][fix] Revert "[None][chore] KV Connector Refactor (#11078)" by @jthomson04 in #11872
- [None][chore] Bump version to 1.3.0rc8 by @yuanjingx87 in #12090
- [None][chore] Refine AlltoAll benchmark scripts. by @bobboli in #11649
- [None][feat] 2FP4 / Arcquant. by @Tracin in #11333
- [None][fix] Fix Upload Build Info branch and run in post always by @mzweilz in #12025
- [TRTLLM-11366][feat] Add dedicated virtual memory tag for model weights, configurable restore mode by @tongyuantongyu in #11889
- [https://nvbugs/5961430][fix] Fix CI issue of Mistral Large3 by @byshiue in #12073
- [None][test] add Perf sanity gb200 test into QA test db by @xinhe-nv in #11882
- [None][infra] Waive 2 failed cases for main in post-merge 2584 by @ZhanruiSunCh in #12108
- [None][chore] Waive mpi hang test case by @jieli-matrix in #12077
- [None][chore] re-enable benchmark test in post merge by @zhenhuaw-me in #12035
- [None][feat] Mamba optimization and mixed quantization support for nemotron-h by @Wanli-Jiang in #11972
- [None][fix] Various fixes for agentic flow by @2ez4bz in #12061
- [https://nvbugs/5936322][fix] Fix sporadic port collision in multigpu AutoDeploy tests by @MrGeva in #11913
- [TRTLLM-9523][feat] Adapting the transceiver to manager v2 (step 6) by @Shixiaowei02 in #11978
- [TRTLLM-11928][feat] Fix sharding overwrite with multiple graph module by @greg-kwasniewski1 in #12051
- [https://nvbugs/5948539][fix] Fix disagg gen-only benchmark by @Tabrizian in #12091
- [None][fix] Split mContextChunkSize into per-target/draft fields by @Hrithvik-Alex in #12058
- [None][fix] Fix ValueError and missing decoding statistics for MTP by @cascade812 in #12063
- [None][fix] Enable more KV connector priority tests in CI by @jthomson04 in #11892
- [https://nvbugs/5923949][fix] Improve NCCL library load stability by @nv-lschneider in #12015
- [None][feat] Enable non-gated activation to the new MoE test by @IwakuraRein in #11996
- [None][infra] Update CI allow list by @yuanjingx87 in #12119
- [None][chore] Unwaiving disagg tests failing with address in use error by @pcastonguay in #12085
- [https://nvbugs/5955170][fix] Disable TRTLLM GEN Routing PDL due to nan issue by @dongfengy in #11994
- [None][fix] Enforce minimum NVSHMEM_QP_DEPTH of 128 for DeepEP low latency by @Tabrizian in #12100
- [None][refactor] parallel vae refactor by @NVShreyas in #12123
- [https://nvbugs/5826604][test] Remove test waive for Llama3.1 8B bfloat16 4gpu timeout … by @syuoni in #12092
- [TRTLLM-11257][infra] Unwaive TestDeepSeekR1::test_fp8_blockscale[throughput_mtp] test case by @zhaoyangwang-nvidia in #12059
- [None][infra] Waive 2 failed cases for main in post-merge 2586 by @ZhanruiSunCh in #12134
- [None][feat] Optimize the q3n decode kernel with IO read by @JadoTu in #11344
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12093
- [TRTLLM-11092][feat] add support for visual gen FA4 attention backend by @o-stoner in #11697
- [https://nvbugs/5955173][fix] Add abort method for GenerationResultBase by @JunyiXu-nv in #11970
- [None][test] Add speculative decoding test with exclude_input_in_output=true by @StanleySun639 in #12080
- [None][feat] Add shared expert LoRA support for MoE models in PyTorch backend by @achartier in #11760
- [https://nvbugs/5846166][bug] Fix Disagg Perf Test's MPI Issue and Port Conflict by @chenfeiz0326 in #12020
- [TRTLLM-10244][doc] Add deployment guide for Nemotron 3 Super by @nv-guomingz in #12129
- [None][fix] Narrow bare except clause and use identity check for None by @edenfunf in #12041
- [TRTLLM-10303][feat] Deprecate trtllm-serve CLI options by @JunyiXu-nv in #12106
- [#11800][fix] Add keepalive ping tolerance and context.abort to gRPC server by @CatherineSue in #11992
- [None][test] Add e2e tests for KV cache connector async loading path by @Tabrizian in #12053
- [TRTLLMINF-11][chore] Change image used for Preparation step of CI by @dpitman-nvda in #12086
- [https://nvbugs/5973199][fix] support attn-dp TRTLLM-Gen NVFP4 MoE fu… by @tcherckez-nvidia in #12156
- [TRTLLM-10617][feat] LTX-2 Model Support by @yibinl-nvidia in #12009
- [TRTLLM-10695][ci] add verl stage in CI by @Superjomn in #11306
- [None][feat] Optimize 6KD fp8 blockscale gemm by @CarstyYou in #11502
- [https://nvbugs/5949033][fix] Add 3 Disagg gen_only tests back by @chenfeiz0326 in #12159
- [TRTLLM-11037][bug] Fix MoE DeepEP hang caused by non-deterministic GC by @xxi-nv in #12060
- [None][feat] Add flashinfer api for TRTLLMGenFusedMoE by @rosong11 in #10453
- [None][chore] Add multinode e2e and accuracy cases on DGX-Spark by @JennyLiu-nv in #12110
- [TRTLLM-11207][requirements] Update numpy version to 2 by @Funatiq in #11280
- [None][chore] Fix KVCacheManagerV2 shrink for last level and improve init_ratio by @lowsfer in #12112
- [TRTLLM-10319][feat] Dynamic draft length on spec decode one-model path by @zheyuf in #10860
- [TRTLLM-11288][feat] Configurable warmup shapes for VisualGen by @luyiyun1021 in #12107
- [None][feat] add trtllm-gen kernels for glm4.7 and support groupsTokensHeadsQ + e2m1 output by @PerkzZheng in #11643
- [None][fix] Fixed mamba cache issue for pp>1 by @Wanli-Jiang in #12146
- [None][feat] Qwen3.5 perf optimizations by @suyoggupta in #11581
- [None][feat] Add mix-precision checkpoint support in AutoDeploy by @Fridah-nv in #12175
- [https://nvbugs/5944411][fix] Handle anyOf parameter schemas in Qwen3Coder tool parser by @tijyojwad in #12173
- [None][infra] Waive failed A10-PyTorch-1 test in pre-merge by @yuanjingx87 in #12207
- [None][fix] Add streaming support to no for nemotron model by @tijyojwad in #12176
- [None][chore] Add explicit error for intermediate size misalignment with fp8 block size by @leslie-fang25 in #12101
- [https://nvbugs/5973316][fix] fix deepep with trtllm moe backend and seqlen one by @leslie-fang25 in #12158
- [TRTLLM-8922][feat] py cache transceiver for gen-first workflow by @reasonsolo in #11941
- [None][fix] remove test_llm_api_autodeploy.py::TestNemotronSuperV3::t… by @tcherckez-nvidia in #12193
- [None][infra] Waive 9 failed cases for main in post-merge 2593 by @ZhanruiSunCh in #12224
- [None][fix] port retry loop and exception handling by @MrGeva in #12225
New Contributors
- @Hrithvik-Alex made their first contribution in #12058
- @zhaoyangwang-nvidia made their first contribution in #12059
- @edenfunf made their first contribution in #12041
Full Changelog: v1.3.0rc7...v1.3.0rc8