NVIDIA/TensorRT-LLM v1.3.0rc8

Pre-release

Highlights

  • Model Support

    • Nemotron 3 Super support
    • Add tool parser support for GLM-4 models (#11986)
    • Implement dynamic resolution for Nemotron VL (#11894)
    • Enable mixed quantization support for Nemotron-H Mamba (#11972)
    • Add VisualGen FA4 attention backend support (#11697)
    • VisualGen support for LTX-2, Wan and FLUX (#12009)
    • Add TRTLLM-Gen kernels for GLM4.7 and support groupsTokensHeadsQ and e2m1 output (#11643)
    • Support attention-DP for TRTLLM-Gen NVFP4 MoE (#12156)
  • API

    • Add dedicated virtual memory tags for model weights and configurable restore mode (#11889)
    • Add abort method for GenerationResultBase (#11970)
    • Deprecate trtllm-serve CLI options (#12106)
    • Add keepalive ping tolerance and context.abort support to the gRPC server (#11992)
  • Feature

    • Add basic SSM support in KVCacheManagerV2 (#11976)
    • Improve KV event batching (#11883)
    • Add 2FP4 / Arcquant support (#11333)
    • Adapt the transceiver to manager v2 (step 6) (#11978)
    • Add shared expert LoRA support for MoE models in the PyTorch backend (#11760)
    • Add dynamic draft length on the one-model speculative decoding path (#10860)
    • Enable configurable warmup shapes for VisualGen (#12107)
    • Add FlashInfer API support for TRTLLMGenFusedMoE (#10453)
    • Add Python cache transceiver support for gen-first workflow (#11941)
  • Fix

    • Upgrade Cutlass version (#11956)
    • Fix DS v32 tool calling type and parse errors (#11935)
    • Fix protobuf and aiohttp vulnerabilities (#11898)
    • Fix NVFP4 sharding (#11618)
    • Fix Kimi-K2.5 accuracy test skip condition and reference configs (#11930)
    • Pass sparse_attn_config from effective_draft_config for one-model draft KV cache (#12032)
    • Fix MTP advanced sampling top-k IMA (#12088)
    • Revert refactor of the KV connector integration in py_executor, which caused issues with KVBM (#11872)
    • Fix sharding overwrite with multiple graph modules (#12051)
    • Fix various agentic flow issues (#12061)
    • Split mContextChunkSize into per-target and per-draft fields (#12058)
    • Fix ValueError and missing decoding statistics for MTP (#12063)
    • Improve NCCL library load stability (#12015)
    • Disable TRTLLM-Gen routing PDL due to NaN issues (#11994)
    • Enforce a minimum NVSHMEM_QP_DEPTH of 128 for DeepEP low latency (#12100)
    • Narrow a bare except clause and use identity checks for None (#12041)
    • Fix MoE DeepEP hangs caused by non-deterministic GC (#12060)
    • Fix KVCacheManagerV2 shrink behavior for the last level and improve init_ratio (#12112)
    • Fix Mamba cache handling for PP > 1 (#12146)
    • Handle anyOf parameter schemas in the Qwen3Coder tool parser (#12173)
    • Add explicit errors for intermediate-size misalignment with the FP8 block size (#12101)
    • Fix DeepEP with the TRTLLM MoE backend for sequence length 1 (#12158)
    • Improve port retry loops and exception handling (#12225)
    • Add streaming support for no </think> on Nemotron models (#12176)
  • Documentation

    • Clarify DCO sign-off and co-author guidelines in AGENTS.md (#12034)
    • Add a deployment guide for Nemotron 3 Super (#12129)
  • Benchmark

    • Add QA perf test cases with L0 local mode (#12022)
    • Align performance benchmark output format (#12067)
    • Improve sampler performance by replacing torch.where with masked_fill_ (#11949)
    • Add a fused cat + fp8_quantize CUDA kernel for the DSA indexer (#11899)
    • Optimize long-sequence token-parallel prefill for the DSA indexer (#11871)
    • Reduce logprobs=0 overhead in TorchSampler (#11983)
    • Refine AlltoAll benchmark scripts (#11649)
    • Optimize the Q3N decode kernel with IO reads (#11344)
    • Fix disaggregated gen-only benchmark coverage (#12091)
    • Fix MPI issues and port conflicts in disaggregated performance tests (#12020)
    • Add GB200 performance sanity tests to the QA test database (#11882)
    • Refactor parallel VAE support (#12123)
    • Optimize 6KD FP8 blockscale GEMM (#11502)
    • Optimize Qwen3.5 performance (#11581)
    • Restore 3 disaggregated gen-only tests (#12159)
  • Test & Infra

    • Fix disaggregated SKU coverage (#12065)
    • Fix upload build info branch handling and ensure it always runs in post steps (#12025)
    • Fix the CI issue for Mistral Large3 (#12073)
    • Enable more KV connector priority tests in CI (#11892)
    • Add speculative decoding tests for exclude_input_in_output=true (#12080)
    • Add E2E tests for the KV cache connector async loading path (#12053)
    • Change the image used for the CI preparation step (#12086)
    • Add the verl stage in CI (#11306)
    • Add multi-node E2E and accuracy cases on DGX-Spark (#12110)
    • Update NumPy to version 2 (#11280)
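The sampler change in #11949 names a general PyTorch pattern: replacing out-of-place `torch.where` with in-place `Tensor.masked_fill_` to avoid allocating an intermediate tensor on hot sampling paths. The following is an illustrative sketch of that pattern only, not the actual TensorRT-LLM sampler code; the tensor names and values are made up:

```python
import torch

# Hypothetical logits and mask, standing in for the sampler's hot path.
logits = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
mask = torch.tensor([[True, False, True], [False, True, False]])

# Out-of-place: torch.where allocates a fresh result tensor on every call.
out_where = torch.where(mask, torch.full_like(logits, float("-inf")), logits)

# In-place: masked_fill_ writes into `logits` directly, skipping the
# intermediate allocation. The masked positions become -inf, as above.
logits.masked_fill_(mask, float("-inf"))

assert torch.equal(out_where, logits)
```

The two forms are equivalent only when the original tensor may be safely mutated; the in-place variant trades that restriction for one fewer allocation per call.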

What's Changed

  • [None][feat] Add Auto-Deploy dashboard failures analysis skill by @tcherckez-nvidia in #12033
  • [https://nvbugs/5820511][fix] Upgrade Cutlass version by @pamelap-nvidia in #11956
  • [None][feat] Add AD model list validation checks to pre-commit and PR… by @tcherckez-nvidia in #12036
  • [None][chore] Clarify DCO sign-off and co-author guidelines in AGENTS.md by @kaiyux in #12034
  • [TRTLLM-7784][feat] Basic SSM support in KVCacheManagerV2 by @lowsfer in #11976
  • [None][test] Add QA's perf test cases with L0 local mode by @fredricz-20070104 in #12022
  • [TRTLLM-11246][feat] Add tool parser support for GLM-4 models by @JunyiXu-nv in #11986
  • [https://nvbugs/5937478][fix] Fix DS v32 tool calling type and parse error by @JunyiXu-nv in #11935
  • [TRTLLM-11135][fix] Fix vulnerabilities protobuf and aiohttp by @yiqingy0 in #11898
  • [None][chore] Align perf benchmark output format by @yingguo-trt in #12067
  • [None][chore] Improve sampler performance by replacing torch.where with masked_fill_ by @stnie in #11949
  • [None][infra] Waive 1 failed cases for main in post-merge 2582 by @ZhanruiSunCh in #12069
  • [TRTLLM-10421][perf] Add fused cat+fp8_quantize CUDA kernel for DSA indexer by @kaiyux in #11899
  • [None][test] Fix disagg sku by @fredricz-20070104 in #12065
  • [https://nvbugs/5892646][perf] Long-sequence token-parallel optimization for DSA indexer prefill by @nvxuanyuc in #11871
  • [TRTLLM-11265][feat] Implement dynamic resolution for Nemotron VL by @2ez4bz in #11894
  • [https://nvbugs/5708901][perf] reduce logprobs=0 overhead in TorchSampler by @ixlmar in #11983
  • [None][feat] NVFP4 TRTLLM-Gen MoE for AutoDeploy (Nemotron Super) by @tcherckez-nvidia in #11652
  • [https://nvbugs/5963896][fix] Remove test test_visual_gen_quickstart on A10 by @chang-l in #12048
  • [TRTLLM-11535][feat] Fixed NVFP4 sharding by @greg-kwasniewski1 in #11618
  • [None][fix] Improve KV Event Batching by @jthomson04 in #11883
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12047
  • [TRTLLM-11276][fix] Fix Kimi-K2.5 accuracy test skip condition and reference configs by @lancelly in #11930
  • [https://nvbugs/5919026][fix] Pass sparse_attn_config from effective_draft_config for one-model draft KV cache by @chenfeiz0326 in #12032
  • [None][fix] MTP Advanced Sampling Topk IMA by @IzzyPutterman in #12088
  • [None][fix] Revert "[None][chore] KV Connector Refactor (#11078)" by @jthomson04 in #11872
  • [None][chore] Bump version to 1.3.0rc8 by @yuanjingx87 in #12090
  • [None][chore] Refine AlltoAll benchmark scripts. by @bobboli in #11649
  • [None][feat] 2FP4 / Arcquant. by @Tracin in #11333
  • [None][fix] Fix Upload Build Info branch and run in post always by @mzweilz in #12025
  • [TRTLLM-11366][feat] Add dedicated virtual memory tag for model weights, configurable restore mode by @tongyuantongyu in #11889
  • [https://nvbugs/5961430][fix] Fix CI issue of Mistral Large3 by @byshiue in #12073
  • [None][test] add Perf sanity gb200 test into QA test db by @xinhe-nv in #11882
  • [None][infra] Waive 2 failed cases for main in post-merge 2584 by @ZhanruiSunCh in #12108
  • [None][chore] Waive mpi hang test case by @jieli-matrix in #12077
  • [None][chore] re-enable benchmark test in post merge by @zhenhuaw-me in #12035
  • [None][feat] Mamba optimization and mixed quantization support for nemotron-h by @Wanli-Jiang in #11972
  • [None][fix] Various fixes for agentic flow by @2ez4bz in #12061
  • [https://nvbugs/5936322][fix] Fix sporadic port collision in multigpu AutoDeploy tests by @MrGeva in #11913
  • [TRTLLM-9523][feat] Adapting the transceiver to manager v2 (step 6) by @Shixiaowei02 in #11978
  • [TRTLLM-11928][feat] Fix sharding overwrite with multiple graph module by @greg-kwasniewski1 in #12051
  • [https://nvbugs/5948539][fix] Fix disagg gen-only benchmark by @Tabrizian in #12091
  • [None][fix] Split mContextChunkSize into per-target/draft fields by @Hrithvik-Alex in #12058
  • [None][fix] Fix ValueError and missing decoding statistics for MTP by @cascade812 in #12063
  • [None][fix] Enable more KV connector priority tests in CI by @jthomson04 in #11892
  • [https://nvbugs/5923949][fix] Improve NCCL library load stability by @nv-lschneider in #12015
  • [None][feat] Enable non-gated activation to the new MoE test by @IwakuraRein in #11996
  • [None][infra] Update CI allow list by @yuanjingx87 in #12119
  • [None][chore] Unwaiving disagg tests failing with address in use error by @pcastonguay in #12085
  • [https://nvbugs/5955170][fix] Disable TRTLLM GEN Routing PDL due to nan issue by @dongfengy in #11994
  • [None][fix] Enforce minimum NVSHMEM_QP_DEPTH of 128 for DeepEP low latency by @Tabrizian in #12100
  • [None][refactor] parallel vae refactor by @NVShreyas in #12123
  • [https://nvbugs/5826604][test] Remove test waive for Llama3.1 8B bfloat16 4gpu timeout … by @syuoni in #12092
  • [TRTLLM-11257][infra] Unwaive TestDeepSeekR1::test_fp8_blockscale[throughput_mtp] test case by @zhaoyangwang-nvidia in #12059
  • [None][infra] Waive 2 failed cases for main in post-merge 2586 by @ZhanruiSunCh in #12134
  • [None][feat] Optimize the q3n decode kernel with IO read by @JadoTu in #11344
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12093
  • [TRTLLM-11092][feat] add support for visual gen FA4 attention backend by @o-stoner in #11697
  • [https://nvbugs/5955173][fix] Add abort method for GenerationResultBase by @JunyiXu-nv in #11970
  • [None][test] Add speculative decoding test with exclude_input_in_output=true by @StanleySun639 in #12080
  • [None][feat] Add shared expert LoRA support for MoE models in PyTorch backend by @achartier in #11760
  • [https://nvbugs/5846166][bug] Fix Disagg Perf Test's MPI Issue and Port Conflict by @chenfeiz0326 in #12020
  • [TRTLLM-10244][doc] Add deployment guide for Nemotron 3 Super by @nv-guomingz in #12129
  • [None][fix] Narrow bare except clause and use identity check for None by @edenfunf in #12041
  • [TRTLLM-10303][feat] Deprecate trtllm-serve CLI options by @JunyiXu-nv in #12106
  • [#11800][fix] Add keepalive ping tolerance and context.abort to gRPC server by @CatherineSue in #11992
  • [None][test] Add e2e tests for KV cache connector async loading path by @Tabrizian in #12053
  • [TRTLLMINF-11][chore] Change image used for Preparation step of CI by @dpitman-nvda in #12086
  • [https://nvbugs/5973199][fix] support attn-dp TRTLLM-Gen NVFP4 MoE fu… by @tcherckez-nvidia in #12156
  • [TRTLLM-10617][feat] LTX-2 Model Support by @yibinl-nvidia in #12009
  • [TRTLLM-10695][ci] add verl stage in CI by @Superjomn in #11306
  • [None][feat] Optimize 6KD fp8 blockscale gemm by @CarstyYou in #11502
  • [https://nvbugs/5949033][fix] Add 3 Disagg gen_only tests back by @chenfeiz0326 in #12159
  • [TRTLLM-11037][bug] Fix MoE DeepEP hang caused by non-deterministic GC by @xxi-nv in #12060
  • [None][feat] Add flashinfer api for TRTLLMGenFusedMoE by @rosong11 in #10453
  • [None][chore] Add multinode e2e and accuracy cases on DGX-Spark by @JennyLiu-nv in #12110
  • [TRTLLM-11207][requirements] Update numpy version to 2 by @Funatiq in #11280
  • [None][chore] Fix KVCacheManagerV2 shrink for last level and improve init_ratio by @lowsfer in #12112
  • [TRTLLM-10319][feat] Dynamic draft length on spec decode one-model path by @zheyuf in #10860
  • [TRTLLM-11288][feat] Configurable warmup shapes for VisualGen by @luyiyun1021 in #12107
  • [None][feat] add trtllm-gen kernels for glm4.7 and support groupsTokensHeadsQ + e2m1 output by @PerkzZheng in #11643
  • [None][fix] Fixed mamba cache issue for pp>1 by @Wanli-Jiang in #12146
  • [None][feat] Qwen3.5 perf optimizations by @suyoggupta in #11581
  • [None][feat] Add mix-precision checkpoint support in AutoDeploy by @Fridah-nv in #12175
  • [https://nvbugs/5944411][fix] Handle anyOf parameter schemas in Qwen3Coder tool parser by @tijyojwad in #12173
  • [None][infra] Waive failed A10-PyTorch-1 test in pre-merge by @yuanjingx87 in #12207
  • [None][fix] Add streaming support for no </think> on Nemotron models by @tijyojwad in #12176
  • [None][chore] Add explicit error for intermediate size misalignment with fp8 block size by @leslie-fang25 in #12101
  • [https://nvbugs/5973316][fix] fix deepep with trtllm moe backend and seqlen one by @leslie-fang25 in #12158
  • [TRTLLM-8922][feat] py cache transceiver for gen-first workflow by @reasonsolo in #11941
  • [None][fix] remove test_llm_api_autodeploy.py::TestNemotronSuperV3::t… by @tcherckez-nvidia in #12193
  • [None][infra] Waive 9 failed cases for main in post-merge 2593 by @ZhanruiSunCh in #12224
  • [None][fix] port retry loop and exception handling by @MrGeva in #12225
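The cleanup in #12041 applies two standard Python hygiene fixes: narrowing a bare `except:` to the specific exceptions it is meant to catch, and comparing against `None` with `is` instead of `==`. The sketch below illustrates the pattern with a hypothetical helper (the function name and defaults are invented for illustration, not taken from the repository):

```python
def parse_port(value, default=None):
    """Parse a port number, falling back to 8000 when the input is unusable."""
    try:
        port = int(value)
    except (TypeError, ValueError):  # was: bare `except:`, which also
        port = default               # swallowed KeyboardInterrupt/SystemExit
    if port is None:                 # was: `port == None`; identity check is
        return 8000                  # the idiomatic (PEP 8) comparison
    return port
```

A bare `except:` hides unrelated failures such as `KeyboardInterrupt`; naming the expected exceptions keeps everything else propagating, and `is None` avoids surprises from objects that overload `__eq__`.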

Full Changelog: v1.3.0rc7...v1.3.0rc8
