NVIDIA/TensorRT-LLM v1.3.0rc8

Pre-release

Highlights

  • Model Support

    • Nemotron 3 Super support
    • Add tool parser support for GLM-4 models (#11986)
    • Implement dynamic resolution for Nemotron VL (#11894)
    • Enable mixed quantization support for Nemotron-H Mamba (#11972)
    • Add VisualGen FA4 attention backend support (#11697)
    • VisualGen support for LTX-2, Wan and FLUX (#12009)
    • Add TRTLLM-Gen kernels for GLM4.7 and support groupsTokensHeadsQ and e2m1 output (#11643)
    • Support attention-DP for TRTLLM-Gen NVFP4 MoE (#12156)
  • API

    • Add dedicated virtual memory tags for model weights and configurable restore mode (#11889)
    • Add abort method for GenerationResultBase (#11970)
    • Deprecate trtllm-serve CLI options (#12106)
    • Add keepalive ping tolerance and context.abort support to the gRPC server (#11992)
  • Feature

    • Add basic SSM support in KVCacheManagerV2 (#11976)
    • Improve KV event batching (#11883)
    • Add 2FP4 / Arcquant support (#11333)
    • Adapt the transceiver to manager v2 (step 6) (#11978)
    • Add shared expert LoRA support for MoE models in the PyTorch backend (#11760)
    • Add dynamic draft length on the one-model speculative decoding path (#10860)
    • Enable configurable warmup shapes for VisualGen (#12107)
    • Add FlashInfer API support for TRTLLMGenFusedMoE (#10453)
    • Add Python cache transceiver support for gen-first workflow (#11941)
  • Fix

    • Upgrade Cutlass version (#11956)
    • Fix DS v32 tool calling type and parse errors (#11935)
    • Fix protobuf and aiohttp vulnerabilities (#11898)
    • Fix NVFP4 sharding (#11618)
    • Fix Kimi-K2.5 accuracy test skip condition and reference configs (#11930)
    • Pass sparse_attn_config from effective_draft_config for one-model draft KV cache (#12032)
    • Fix MTP advanced sampling top-k IMA (#12088)
    • Revert refactor of the KV connector integration in py_executor, which caused issues with KVBM (#11872)
    • Fix sharding overwrite with multiple graph modules (#12051)
    • Fix various agentic flow issues (#12061)
    • Split mContextChunkSize into per-target and per-draft fields (#12058)
    • Fix ValueError and missing decoding statistics for MTP (#12063)
    • Improve NCCL library load stability (#12015)
    • Disable TRTLLM-Gen routing PDL due to NaN issues (#11994)
    • Enforce a minimum NVSHMEM_QP_DEPTH of 128 for DeepEP low latency (#12100)
    • Narrow a bare except clause and use identity checks for None (#12041)
    • Fix MoE DeepEP hangs caused by non-deterministic GC (#12060)
    • Fix KVCacheManagerV2 shrink behavior for the last level and improve init_ratio (#12112)
    • Fix Mamba cache handling for PP > 1 (#12146)
    • Handle anyOf parameter schemas in the Qwen3Coder tool parser (#12173)
    • Add explicit errors for intermediate-size misalignment with the FP8 block size (#12101)
    • Fix DeepEP with the TRTLLM MoE backend for sequence length 1 (#12158)
    • Improve port retry loops and exception handling (#12225)
    • Add streaming support for no </think> on Nemotron models (#12176)
  • Documentation

    • Clarify DCO sign-off and co-author guidelines in AGENTS.md (#12034)
    • Add a deployment guide for Nemotron 3 Super (#12129)
  • Benchmark

    • Add QA perf test cases with L0 local mode (#12022)
    • Align performance benchmark output format (#12067)
    • Improve sampler performance by replacing torch.where with masked_fill_ (#11949)
    • Add a fused cat + fp8_quantize CUDA kernel for the DSA indexer (#11899)
    • Optimize long-sequence token-parallel prefill for the DSA indexer (#11871)
    • Reduce logprobs=0 overhead in TorchSampler (#11983)
    • Refine AlltoAll benchmark scripts (#11649)
    • Optimize the Q3N decode kernel with IO reads (#11344)
    • Fix disaggregated gen-only benchmark coverage (#12091)
    • Fix MPI issues and port conflicts in disaggregated performance tests (#12020)
    • Add GB200 performance sanity tests to the QA test database (#11882)
    • Refactor parallel VAE support (#12123)
    • Optimize 6KD FP8 blockscale GEMM (#11502)
    • Optimize Qwen3.5 performance (#11581)
    • Restore 3 disaggregated gen-only tests (#12159)
  • Test & Infra

    • Fix disaggregated SKU coverage (#12065)
    • Fix upload build info branch handling and ensure it always runs in post steps (#12025)
    • Fix the CI issue for Mistral Large3 (#12073)
    • Enable more KV connector priority tests in CI (#11892)
    • Add speculative decoding tests for exclude_input_in_output=true (#12080)
    • Add E2E tests for the KV cache connector async loading path (#12053)
    • Change the image used for the CI preparation step (#12086)
    • Add the verl stage in CI (#11306)
    • Add multi-node E2E and accuracy cases on DGX-Spark (#12110)
    • Update NumPy to version 2 (#11280)
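The sampler change in #11949 names a general PyTorch pattern: replacing out-of-place `torch.where` with in-place `Tensor.masked_fill_` to avoid allocating an intermediate tensor on hot sampling paths. The following is an illustrative sketch of that pattern only, not the actual TensorRT-LLM sampler code; the tensor names and values are made up:

```python
import torch

# Hypothetical logits and mask, standing in for the sampler's hot path.
logits = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
mask = torch.tensor([[True, False, True], [False, True, False]])

# Out-of-place: torch.where allocates a fresh result tensor on every call.
out_where = torch.where(mask, torch.full_like(logits, float("-inf")), logits)

# In-place: masked_fill_ writes into `logits` directly, skipping the
# intermediate allocation. The masked positions become -inf, as above.
logits.masked_fill_(mask, float("-inf"))

assert torch.equal(out_where, logits)
```

The two forms are equivalent only when the original tensor may be safely mutated; the in-place variant trades that restriction for one fewer allocation per call.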

What's Changed

  • [None][feat] Add Auto-Deploy dashboard failures analysis skill by @tcherckez-nvidia in #12033
  • [https://nvbugs/5820511][fix] Upgrade Cutlass version by @pamelap-nvidia in #11956
  • [None][feat] Add AD model list validation checks to pre-commit and PR… by @tcherckez-nvidia in #12036
  • [None][chore] Clarify DCO sign-off and co-author guidelines in AGENTS.md by @kaiyux in #12034
  • [TRTLLM-7784][feat] Basic SSM support in KVCacheManagerV2 by @lowsfer in #11976
  • [None][test] Add QA's perf test cases with L0 local mode by @fredricz-20070104 in #12022
  • [TRTLLM-11246][feat] Add tool parser support for GLM-4 models by @JunyiXu-nv in #11986
  • [https://nvbugs/5937478][fix] Fix DS v32 tool calling type and parse error by @JunyiXu-nv in #11935
  • [TRTLLM-11135][fix] Fix vulnerabilities protobuf and aiohttp by @yiqingy0 in #11898
  • [None][chore] Align perf benchmark output format by @yingguo-trt in #12067
  • [None][chore] Improve sampler performance by replacing torch.where with masked_fill_ by @stnie in #11949
  • [None][infra] Waive 1 failed cases for main in post-merge 2582 by @ZhanruiSunCh in #12069
  • [TRTLLM-10421][perf] Add fused cat+fp8_quantize CUDA kernel for DSA indexer by @kaiyux in #11899
  • [None][test] Fix disagg sku by @fredricz-20070104 in #12065
  • [https://nvbugs/5892646][perf] Long-sequence token-parallel optimization for DSA indexer prefill by @nvxuanyuc in #11871
  • [TRTLLM-11265][feat] Implement dynamic resolution for Nemotron VL by @2ez4bz in #11894
  • [https://nvbugs/5708901][perf] reduce logprobs=0 overhead in TorchSampler by @ixlmar in #11983
  • [None][feat] NVFP4 TRTLLM-Gen MoE for AutoDeploy (Nemotron Super) by @tcherckez-nvidia in #11652
  • [https://nvbugs/5963896][fix] Remove test test_visual_gen_quickstart on A10 by @chang-l in #12048
  • [TRTLLM-11535][feat] Fixed NVFP4 sharding by @greg-kwasniewski1 in #11618
  • [None][fix] Improve KV Event Batching by @jthomson04 in #11883
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12047
  • [TRTLLM-11276][fix] Fix Kimi-K2.5 accuracy test skip condition and reference configs by @lancelly in #11930
  • [https://nvbugs/5919026][fix] Pass sparse_attn_config from effective_draft_config for one-model draft KV cache by @chenfeiz0326 in #12032
  • [None][fix] MTP Advanced Sampling Topk IMA by @IzzyPutterman in #12088
  • [None][fix] Revert "[None][chore] KV Connector Refactor (#11078)" by @jthomson04 in #11872
  • [None][chore] Bump version to 1.3.0rc8 by @yuanjingx87 in #12090
  • [None][chore] Refine AlltoAll benchmark scripts. by @bobboli in #11649
  • [None][feat] 2FP4 / Arcquant. by @Tracin in #11333
  • [None][fix] Fix Upload Build Info branch and run in post always by @mzweilz in #12025
  • [TRTLLM-11366][feat] Add dedicated virtual memory tag for model weights, configurable restore mode by @tongyuantongyu in #11889
  • [https://nvbugs/5961430][fix] Fix CI issue of Mistral Large3 by @byshiue in #12073
  • [None][test] add Perf sanity gb200 test into QA test db by @xinhe-nv in #11882
  • [None][infra] Waive 2 failed cases for main in post-merge 2584 by @ZhanruiSunCh in #12108
  • [None][chore] Waive mpi hang test case by @jieli-matrix in #12077
  • [None][chore] re-enable benchmark test in post merge by @zhenhuaw-me in #12035
  • [None][feat] Mamba optimization and mixed quantization support for nemotron-h by @Wanli-Jiang in #11972
  • [None][fix] Various fixes for agentic flow by @2ez4bz in #12061
  • [https://nvbugs/5936322][fix] Fix sporadic port collision in multigpu AutoDeploy tests by @MrGeva in #11913
  • [TRTLLM-9523][feat] Adapting the transceiver to manager v2 (step 6) by @Shixiaowei02 in #11978
  • [TRTLLM-11928][feat] Fix sharding overwrite with multiple graph module by @greg-kwasniewski1 in #12051
  • [https://nvbugs/5948539][fix] Fix disagg gen-only benchmark by @Tabrizian in #12091
  • [None][fix] Split mContextChunkSize into per-target/draft fields by @Hrithvik-Alex in #12058
  • [None][fix] Fix ValueError and missing decoding statistics for MTP by @cascade812 in #12063
  • [None][fix] Enable more KV connector priority tests in CI by @jthomson04 in #11892
  • [https://nvbugs/5923949][fix] Improve NCCL library load stability by @nv-lschneider in #12015
  • [None][feat] Enable non-gated activation to the new MoE test by @IwakuraRein in #11996
  • [None][infra] Update CI allow list by @yuanjingx87 in #12119
  • [None][chore] Unwaiving disagg tests failing with address in use error by @pcastonguay in #12085
  • [https://nvbugs/5955170][fix] Disable TRTLLM GEN Routing PDL due to nan issue by @dongfengy in #11994
  • [None][fix] Enforce minimum NVSHMEM_QP_DEPTH of 128 for DeepEP low latency by @Tabrizian in #12100
  • [None][refactor] parallel vae refactor by @NVShreyas in #12123
  • [https://nvbugs/5826604][test] Remove test waive for Llama3.1 8B bfloat16 4gpu timeout … by @syuoni in #12092
  • [TRTLLM-11257][infra] Unwaive TestDeepSeekR1::test_fp8_blockscale[throughput_mtp] test case by @zhaoyangwang-nvidia in #12059
  • [None][infra] Waive 2 failed cases for main in post-merge 2586 by @ZhanruiSunCh in #12134
  • [None][feat] Optimize the q3n decode kernel with IO read by @JadoTu in #11344
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12093
  • [TRTLLM-11092][feat] add support for visual gen FA4 attention backend by @o-stoner in #11697
  • [https://nvbugs/5955173][fix] Add abort method for GenerationResultBase by @JunyiXu-nv in #11970
  • [None][test] Add speculative decoding test with exclude_input_in_output=true by @StanleySun639 in #12080
  • [None][feat] Add shared expert LoRA support for MoE models in PyTorch backend by @achartier in #11760
  • [https://nvbugs/5846166][bug] Fix Disagg Perf Test's MPI Issue and Port Conflict by @chenfeiz0326 in #12020
  • [TRTLLM-10244][doc] Add deployment guide for Nemotron 3 Super by @nv-guomingz in #12129
  • [None][fix] Narrow bare except clause and use identity check for None by @edenfunf in #12041
  • [TRTLLM-10303][feat] Deprecate trtllm-serve CLI options by @JunyiXu-nv in #12106
  • [#11800][fix] Add keepalive ping tolerance and context.abort to gRPC server by @CatherineSue in #11992
  • [None][test] Add e2e tests for KV cache connector async loading path by @Tabrizian in #12053
  • [TRTLLMINF-11][chore] Change image used for Preparation step of CI by @dpitman-nvda in #12086
  • [https://nvbugs/5973199][fix] support attn-dp TRTLLM-Gen NVFP4 MoE fu… by @tcherckez-nvidia in #12156
  • [TRTLLM-10617][feat] LTX-2 Model Support by @yibinl-nvidia in #12009
  • [TRTLLM-10695][ci] add verl stage in CI by @Superjomn in #11306
  • [None][feat] Optimize 6KD fp8 blockscale gemm by @CarstyYou in #11502
  • [https://nvbugs/5949033][fix] Add 3 Disagg gen_only tests back by @chenfeiz0326 in #12159
  • [TRTLLM-11037][bug] Fix MoE DeepEP hang caused by non-deterministic GC by @xxi-nv in #12060
  • [None][feat] Add flashinfer api for TRTLLMGenFusedMoE by @rosong11 in #10453
  • [None][chore] Add multinode e2e and accuracy cases on DGX-Spark by @JennyLiu-nv in #12110
  • [TRTLLM-11207][requirements] Update numpy version to 2 by @Funatiq in #11280
  • [None][chore] Fix KVCacheManagerV2 shrink for last level and improve init_ratio by @lowsfer in #12112
  • [TRTLLM-10319][feat] Dynamic draft length on spec decode one-model path by @zheyuf in #10860
  • [TRTLLM-11288][feat] Configurable warmup shapes for VisualGen by @luyiyun1021 in #12107
  • [None][feat] add trtllm-gen kernels for glm4.7 and support groupsTokensHeadsQ + e2m1 output by @PerkzZheng in #11643
  • [None][fix] Fixed mamba cache issue for pp>1 by @Wanli-Jiang in #12146
  • [None][feat] Qwen3.5 perf optimizations by @suyoggupta in #11581
  • [None][feat] Add mix-precision checkpoint support in AutoDeploy by @Fridah-nv in #12175
  • [https://nvbugs/5944411][fix] Handle anyOf parameter schemas in Qwen3Coder tool parser by @tijyojwad in #12173
  • [None][infra] Waive failed A10-PyTorch-1 test in pre-merge by @yuanjingx87 in #12207
  • [None][fix] Add streaming support for no </think> on Nemotron models by @tijyojwad in #12176
  • [None][chore] Add explicit error for intermediate size misalignment with fp8 block size by @leslie-fang25 in #12101
  • [https://nvbugs/5973316][fix] fix deepep with trtllm moe backend and seqlen one by @leslie-fang25 in #12158
  • [TRTLLM-8922][feat] py cache transceiver for gen-first workflow by @reasonsolo in #11941
  • [None][fix] remove test_llm_api_autodeploy.py::TestNemotronSuperV3::t… by @tcherckez-nvidia in #12193
  • [None][infra] Waive 9 failed cases for main in post-merge 2593 by @ZhanruiSunCh in #12224
  • [None][fix] port retry loop and exception handling by @MrGeva in #12225
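The cleanup in #12041 applies two standard Python hygiene fixes: narrowing a bare `except:` to the specific exceptions it is meant to catch, and comparing against `None` with `is` instead of `==`. The sketch below illustrates the pattern with a hypothetical helper (the function name and defaults are invented for illustration, not taken from the repository):

```python
def parse_port(value, default=None):
    """Parse a port number, falling back to 8000 when the input is unusable."""
    try:
        port = int(value)
    except (TypeError, ValueError):  # was: bare `except:`, which also
        port = default               # swallowed KeyboardInterrupt/SystemExit
    if port is None:                 # was: `port == None`; identity check is
        return 8000                  # the idiomatic (PEP 8) comparison
    return port
```

A bare `except:` hides unrelated failures such as `KeyboardInterrupt`; naming the expected exceptions keeps everything else propagating, and `is None` avoids surprises from objects that overload `__eq__`.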

Full Changelog: v1.3.0rc7...v1.3.0rc8
