NVIDIA/TensorRT-LLM v1.3.0rc14

Pre-release

Highlights

  • Model Support

    • Add prefix caching for Mamba hybrid models including Qwen3.5 and Nemotron Super V3 (#12185)
    • Improve Qwen3.5 support with custom MoE routing and fixes for dense and NVFP4 weight loading (#13433, #13090, #13716)
    • Improve Nemotron and Nemotron Nano support with GEMM tuning and multimodal placeholder expansion (#13160, #13069)
    • Add Wan 2.2 5B TI2V support and refine LTX-2 FP4 stage handling (#13256, #13244)
  • API

    • Embed VisualGenParams in DiffusionRequest and simplify generate() inputs (#13313)
    • Add llm.encode() fast path support for encoder-only models (#12801; a usage sketch follows the Highlights list)
    • Add per-iteration request-aggregate counters to InflightBatchingStats (#13199)
    • Add ASGI middleware support for Serve (#13378; a middleware sketch follows the Highlights list)
    • Introduce cancellation support in transceiver v2 (#12734)
    • Fix Triton backend handling of the promptIgnoreLength, lengthPenalty, and earlyStopping/early_stopping generation parameters (#13633, #13692)
  • Feature

    • Improve VisualGen serving with fast PNG compression, multi-node diffusion workers, non-contiguous multimodal chunked prefill, and Attention2D sequence parallelism (#13074, #13140, #12944, #12943)
    • Improve disaggregated serving and routing with gen-first ADP serving, KV-aware hit-rate gates and fair-share caps, and consolidated aiohttp session handling (#13112, #13198, #13408)
    • Expand kernel and runtime performance with GEMM-to-allreduce registered buffers, CuteDSL bf16 dense GEMMs, sparse-attention GVR Top-K dispatchers, fused add-norm-FP8 quantization, TF32 DSA GEMMs, sampler optimizations, and leaner MPI collectives (#11589, #12074, #13477, #12674, #13452, #13480, #13380, #13089)
    • Improve speculative decoding with DFlash one-model support, Mamba-2 rollback replay, radix-based SWA cleanup, and trtllm-gen routing refactoring (#12794, #13453, #13346, #13328)
    • Support NVFP4 weight updates (#12320)
    • Add per-rank torch profile traces for distributed profiling (#13536)
  • Fix

    • Fix KV cache and scheduler correctness issues, including WindowBlockManager statistics, Mamba cache handling under MTP with CUDA graph padding, free-block counter corruption, V2 extra_tokens accounting, PEFT page accumulation, and temporary attention-window cleanup (#12448, #13151, #12834, #13619, #13709, #13528, #12450)
    • Fix disaggregated serving and worker reliability by resolving aggregate PP4 hangs, preventing zombie worker pods, and correcting cached-token usage accounting (#12888, #12718, #13620)
    • Fix OpenAI and Triton generation flows for None tokenizers, prompt ignore lengths, early stopping, and terminateRequest handling from background logits threads (#13184, #13633, #13692, #13059)
    • Fix attention and VisualGen runtime issues, including UlyssesAttention sequence lengths, Ulysses plus Sage execution, TRTLLM-Gen GmemReduction illegal memory access, and low-memory Qwen3 skip-softmax behavior (#13486, #13440, #13541, #13581)
    • Fix distributed runtime stability with corrected pipeline-parallel layer distribution, reduced host-memory regression in speculative decoding, and MoE communication fallback after init exceptions (#13066, #13130, #13331)
    • Fix cache memory estimation for Qwen3 hybrid models in trtllm-bench and lower Eagle3 one-model acceptance thresholds for H20 (#13268, #13565)
  • Documentation

    • Add batch-size tuning guidance for CUDA graph padding and a GVR Top-K technical blog (#13393, #13714)
    • Remove outdated news items and clean up llmc licensing documentation (#13603, #13700)
  • Test & Infra

    • Add and refresh coverage for disaggregated post-merge performance, GPT-OSS 20B MHA, prefix-aware scheduling, cascade-prune repros, and issue-specific regressions (#13343, #12796, #13578, #13572, #13553)
    • Improve CI triage and failure analysis with Perf Triage Bot integration, rendered HTML failure reports, K8s infrastructure retry, PR base freshness checks, static test validation, and clearer Slurm pending logs (#12429, #13526, #13530, #13430, #13423, #13586)
    • Improve CI and build stability with lower test memory pressure, adjusted DeepEP token limits, CUDA line info defaults, Debug CUDA flag fixes, module-level skips, and longer FMHA timeouts (#13402, #13484, #13334, #13598, #13223, #12860)
    • Refresh test organization and dependencies with post-merge test moves, updated constraints, FlashInfer Python updates, B200 multimodal unit-test deduplication, and sorted waive enforcement (#13376, #13482, #13064, #13631, #13584, #12672)
    • Improve distributed and QA infrastructure with free-port FLUX/WAN test initialization, multinode fallback handling, NIXL-based perf sanity tests, QA popen workarounds, and KVCacheManager connector helper fixes (#13364, #13537, #13654, #13634, #13749)
    • Improve package and release infrastructure with llmc standalone package cleanup, release-scanning PLC nightly adjustments, devel-stage apt cache mounts, and pip cache reuse (#13466, #13694, #13245, #13510)
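
The encoder-only fast path called out above (#12801) is reached through the LLM API. Below is a minimal sketch, assuming encode() takes a list of prompts and returns per-prompt embedding outputs; the exact signature, return type, and the model name shown are assumptions, not confirmed by these notes:

```python
from tensorrt_llm import LLM

# Illustrative encoder-only embedding checkpoint; substitute a model
# that the encoder-only fast path actually supports.
llm = LLM(model="BAAI/bge-m3")

# encode() (added in #12801) is assumed here to bypass the
# decoder-oriented generation loop and return embedding outputs.
outputs = llm.encode(["TensorRT-LLM accelerates inference."])
for out in outputs:
    print(out)  # inspect the returned embedding payload
```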
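
Serve's new middleware hook (#13378) accepts standard ASGI middlewares. Here is a minimal sketch of a spec-compliant middleware; how trtllm-serve registers it (CLI flag vs. Python entry point) is not described in these notes, so consult the serve documentation for the supported hook:

```python
class RequestLoggingMiddleware:
    """A plain ASGI middleware: wraps the downstream app and logs requests."""

    def __init__(self, app):
        self.app = app  # the wrapped ASGI application (e.g. the serve app)

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http":
            # scope carries parsed request metadata per the ASGI spec
            print(f"{scope['method']} {scope['path']}")
        await self.app(scope, receive, send)  # delegate to the wrapped app
```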

What's Changed

  • [https://nvbugs/6093714][fix] Reduce batch size and add memory guard for test by @govind-ramnarayan in #13402
  • [TRTLLM-11373][refactor] Embed VisualGenParams in DiffusionRequest and simplify generate() inputs by @zhenhuaw-me in #13313
  • [None][test] Update CI Post-Merge Disagg Perf Tests by @chenfeiz0326 in #13343
  • [None][chore] AutoDeploy: Refactor finegrained FP8 scale sharding helpers by @galagam in #12999
  • [https://nvbugs/6076564][fix] unwaive TestNemotronH::test_auto_dtype[trtllm-flashinfer_ssm-False] by @tcherckez-nvidia in #13187
  • [TRTLLM-10061][feat] Prefix caching support for mamba hybrid models by @VALLIS-NERIA in #12185
  • [None][cleanup] remove legacy addSequence path by @liji-nv in #13280
  • [None][infra] Waive 1 failed cases for main in pre-merge 35790 by @ZhanruiSunCh in #13483
  • [None][fix] Fix bugs in WindowBlockManager destructor statistics by @eopXD in #12448
  • [None][chore] Update CI allowlist 2026-04-23 by @ZhanruiSunCh in #13381
  • [None][fix] Consolidate aiohttp session management in disagg router by @reasonsolo in #13408
  • [None][test] Remove SLACK Bot and Modify Update Perf Data into CI Pipeline by @chenfeiz0326 in #12429
  • [None][infra] Waive 1 failed cases for main in post-merge 2694 by @ZhanruiSunCh in #13485
  • [None][infra] Waive 1 failed cases for main in post-merge 2695 by @ZhanruiSunCh in #13502
  • [https://nvbugs/6064029][perf] Use fast PNG compression for visual gen serving by @karljang in #13074
  • [None] [chore] Update skills by @kaiyux in #13507
  • [None][feat] Add llm.encode() fast path for encoder-only models by @tingyangk in #12801
  • [TRTLLM-12123][feat] Add per-iteration request-aggregate counters to InflightBatchingStats by @nv-yna in #13199
  • [None][fix] Fix Mamba cache correctness under MTP + CUDA-graph padding by @Wanli-Jiang in #13151
  • [TRTLLM-10004][feat] Enable GEMM -> AR with GEMM output in registered buffers by @nv-lschneider in #11589
  • [https://nvbugs/6043291][fix] Add fatal error detection to prevent zombie worker pods by @chienchunhung in #12718
  • [TRTLLM-11228][feat] Support DFlash in one-model spec dec by @ziyixiong-nv in #12794
  • [None][doc] Add blog post for tuning batch sizes for CUDA graph padding and increasing the default batch size granularity for it by @yijingl-nvidia in #13393
  • [None][feat] Assert attention DP disabled when KV connector is in use by @jthomson04 in #13448
  • [https://nvbugs/6050489][fix] fix agg pp4 hang issue by @bo-nv in #12888
  • [https://nvbugs/6095953][fix] Fix cache memory estimation for Qwen3 hybrid models in trtllm-bench by @hyukn in #13268
  • [None][test] add unit test and e2e test for gpt_oss_20b MHA kernel by @ruodil in #12796
  • [https://nvbugs/6037654][fix] Set DeepEP low-latency token limit for qwen3 CI to prevent OOM by @byshiue in #13484
  • [None][infra] Move some tests to post-merge by @EmmaQiaoCh in #13376
  • [TRTLLM-10491][test] unwaive DeepSeekV3Lite nvfp4 4gpus test (flaky, self-healed) by @tianyuxbear in #13196
  • [None][chore] Waive accuracy/test_disaggregated_serving.py::TestDeepSeekV32Exp::test_auto_dtype[False] by @yihwang-nv in #13539
  • [None][feat] Reduce sampler overhead with min_tokens by @galagam in #13480
  • [None][infra] enable CUDA line info by default for Debug/RelWithDebInfo by @bobboli in #13334
  • [None][test] Waive failed cases for main in QA CI by @crazydemo in #13504
  • [None][test] Waive 2 failed cases for main in QA CI by @xinhe-nv in #13508
  • [None][chore] Introduce flashinfer-upgrade skill for automated version bumps by @yihwang-nv in #12987
  • [None][infra] Waive 2 failed cases for main in post-merge 2696 by @ZhanruiSunCh in #13548
  • [TRTLLM-12090][infra] add static tests validation hook by @xinhe-nv in #13423
  • [https://nvbugs/5880745][test] GPT-OSS piecewise CUDA graph regression by @crazydemo in #13406
  • [TRTLLM-12092][infra] Add PR Base Freshness Check Action by @crazydemo in #13430
  • [#13535][chore] AutoDeploy: Relax standalone test timeout by @govind-ramnarayan in #13514
  • [None][refactor] Remove EdgeLLM ONNX export pipeline from AutoDeploy by @nvyocox in #13418
  • [TRTLLMINF-45][infra] Upload rendered HTML failure analysis by @dpitman-nvda in #13526
  • [None][tests] Add TestServePrefixAwareScheduling base on LMBenchmark/synthetic-multi-round-qa by @SimengLiu-nv in #13243
  • [None][fix] Revert 'Add TestServePrefixAwareScheduling base on LMBenchmark/synthetic-multi-round-qa' by @tburt-nv in #13573
  • [None][fix] Handle None tokenizer in OpenAI server by @galagam in #13184
  • [None][ci] Fix misleading still running log when Slurm job is PENDING by @QiJune in #13586
  • [None][perf] Extend customMoeRouting kernel to support Qwen3.5 by @nv-guomingz in #13433
  • [None][feat] Add hit-rate gate and fair-share cap to KV-aware ADP router by @lancelly in #13198
  • [None][feat] Add multi-node support for VisualGen diffusion workers via torchrun/SLURM by @venmugil in #13140
  • [TRTLLM-12358][chore] Dedup multimodal unit tests on B200 by @QiJune in #13584
  • [None][chore] Bump version to 1.3.0rc14 by @VALLIS-NERIA in #13602
  • [https://nvbugs/6059036][fix] Fix AutoDeploy max_batch_size vs cuda_graph_config validation mismatch by @marinayanov in #13093
  • [None][infra] disable -G in default Debug CUDA flags to fix CI OOM by @bobboli in #13598
  • [None][infra] Waive 1 failed cases for main in pre-merge 36256 by @ZhanruiSunCh in #13613
  • [None][fix] visual_gen UlyssesAttention: pass post-A2A seq_len to inner backend by @karljang in #13486
  • [None][chore] Update flashinfer-python from 0.6.6 to 0.6.8 by @yihwang-nv in #13064
  • [https://nvbugs/6018058][fix] Increase test timeout for test_fmha by @djns99 in #12860
  • [https://nvbugs/6080037][fix] pytest.skip(allow_module_level=True) in `tests/unittest/_torch/ray_orchestrator` by @tensorrt-cicd in #13223
  • [https://nvbugs/6111076][fix] ulysses+sage by @xrq-phys in #13440
  • [None][fix] Optimize TorchSampler process_logprobs by @tongyuantongyu in #13380
  • [None][feat] Use a replay method for state rollback in Mamba-2 speculative decoding by @hnover-nv in #13453
  • [https://nvbugs/6094066][fix] Skip Qwen3 skip-softmax on low-memory GPUs by @xxi-nv in #13581
  • [None][cleanup] Remove unused code path by @2ez4bz in #13622
  • [None][fix] Qwen3.5 dense weight loading by @amukkara in #13090
  • [https://nvbugs/6065680][fix] Fixed layer distribution across pipeline-parallel ranks by @ziyixiong-nv in #13066
  • [https://nvbugs/6109719][doc] Remove outdated items in previous news sections. by @nv-guomingz in #13603
  • [None][test] Waive 9 failed cases for main in QA CI by @xinhe-nv in #13540
  • [None][test] Waive 2 failed cases for main in QA CI by @xinhe-nv in #13643
  • [None][chore] Update blossom-ci allowlist by @yuanjingx87 in #13621
  • [TRTLLM-11289][feat] Integrate CuteDSL's bf16 dense GEMMs by @peaceh-nv in #12074
  • [https://nvbugs/6098442][fix] Add fix for IMA with TRTLLM-Gen GmemReductionWithSeparateKernel by @pengbowang-nv in #13541
  • [#11823][feat] AutoDeploy trtllm_mla attention backend by @MrGeva in #13222
  • [None][perf] Scheme X L2-aware dispatcher and PDL launchers for sparse-attention GVR Top-K by @longcheng-nv in #13477
  • [None][feat] AutoDeploy: add Gemma 4 reasoning and tool-call parsers by @suyoggupta in #13248
  • [https://nvbugs/6114727][fix] Unwaive deepseek r1 fp4 v2 grace_blackwell r1 fp4 v2 tep4 mtp3 1k1k by @chenfeiz0326 in #13496
  • [TRTLLM-11946][feat] Disaggregated gen-first serving with ADP by @reasonsolo in #13112
  • [None][chore] Update flashinfer-python from 0.6.8 to 0.6.9 by @yihwang-nv in #13631
  • [None][test] refresh test constraints by @crazydemo in #13482
  • [None][feat] Add bf16 trtllm-gen moe support through flashinfer. by @nv-guomingz in #12738
  • [None][infra] post warning reply when /bot command is preceded by lea… by @niukuo in #13659
  • [None][fix] AutoDeploy logger fix by @suyoggupta in #13403
  • [TRTLLM-11951][feat] Chunked prefill for non-contiguous multimodal data + preproc fixes by @venkywonka in #12944
  • [#12633][feat] AutoDeploy: Support torch-cudagraph for Eagle by @govind-ramnarayan in #12745
  • [None][fix] Fix disaggregated cached token usage accounting by @v-shobhit in #13620
  • [https://nvbugs/6117814][fix] Lower Eagle3 one-model acceptance rate threshold for H20 GPU by @tensorrt-cicd in #13565
  • [None][infra] Waive 1 failed cases for main in pre-merge 36341 by @ZhanruiSunCh in #13648
  • [#13209][feat] Add config id to AD models registry by @tcherckez-nvidia in #13375
  • [None][fix] Use bf16 for LTX-2 FP4 stage 2 by @yibinl-nvidia in #13244
  • [https://nvbugs/6084743][fix] Use free port for FLUX/WAN multi-GPU test distributed init by @karljang in #13364
  • [https://nvbugs/6035425][fix] Fix host memory usage regression with spec dec by @mikeiovine in #13130
  • [None][fix] repair test lists according to new check by @tburt-nv in #13676
  • [TRTLLM-11471][fix] Eliminate redundant serialization and MPI collectives in safe_allgather/safe_gather by @chienchunhung in #13089
  • [https://nvbugs/6105769][fix] Skip DSV3Lite test on L40S by @brb-nv in #13467
  • [https://nvbugs/6132301][infra] Waive 1 failed cases for main in pre-merge 36112 by @ZhanruiSunCh in #13679
  • [TRTLLM-11285][perf] Force enable TF32 tensor cores for DSA indexer fused GEMM by @peihu-nv in #13452
  • [TRTLLM-12365][ci] Dedup AutoDeploy unit tests on B200 by @QiJune in #13593
  • [TRTLLM-11421][fix] fix data races in speculative decoding fast-logits handoff by @eopXD in #13059
  • [#11879][fix] Fix free-block counter corruption in getFreeBlock offload path by @eopXD in #12834
  • [None][feat] llmc: standalone package improvements and enforce import discipline by @lucaslie in #13466
  • [TRTLLMINF-43][feat] Extend infrastructure-failure retry to K8s test stages by @dpitman-nvda in #13530
  • [None][infra] Waive 1 failed cases for main in pre-merge 36527 by @ZhanruiSunCh in #13685
  • [None][fix] Revert 'Add bf16 trtllm-gen moe support through flashinfer.' by @tburt-nv in #13688
  • [None][fix] Plumb promptIgnoreLength through Triton backend to fix silently-dropped lengthPenalty and earlyStopping by @jhaotingc in #13633
  • [None][chore] improve gemm perf for nemotron in spark by @ttyio in #13160
  • [None][fix] Fix early_stopping type and plumb through Triton ensemble… by @jhaotingc in #13692
  • [None][fix] Continue MoE comm fallback on init exceptions by @xxi-nv in #13331
  • [TRTLLM-11635][feat] Introduce cancellation in transceiver v2 by @Shixiaowei02 in #12734
  • [None][feat] Support update weight for nvfp4 by @shuyixiong in #12320
  • [None][infra] llmc: stop managing .github/ in standalone package generator by @lucaslie in #13694
  • [TRTLLM-11160][feat] Clean up SWA work-arounds with the new radix sea… by @SimengLiu-nv in #13346
  • [None][fix] Add TestServePrefixAwareScheduling base on LMBenchmark/synthetic-multi-round-qa by @SimengLiu-nv in #13578
  • [https://nvbugs/6114821][fix] Fix extra_tokens in V2 KV cache by @dongfengy in #13619
  • [None][fix] write per-rank torch profile traces by @GavinZhu-GMI in #13536
  • [https://nvbugs/5839028][test] Unwaive DeepSeekR1 fp8 blockscale throughput_mtp by @xxi-nv in #13627
  • [None][docs] add GVR Top-K technical blog by @longcheng-nv in #13714
  • [https://nvbugs/6093715][fix] AutoDeploy: skip nvfp4 test pre-blackwell by @galagam in #13494
  • [https://nvbugs/6114821][fix] Fix extra_tokens in V2 KV cache (test unwaive) by @dongfengy in #13709
  • [#11823][feat] AutoDeploy MLA with PWCG support on Deepseek R1 by @MrGeva in #13497
  • [None][infra] Waive 1 failed cases for main in pre-merge 36642 by @ZhanruiSunCh in #13717
  • [#13320][test] Test coverage and repro for #13320 by @eopXD in #13553
  • [None][feat] Add Attention2D sequence parallelism for visual-gen models by @venmugil in #12943
  • [None][fix] Fix Qwen3.5 NVFP4 weight loading by preserving weight_scales by @achartier in #13716
  • [TRTLLM-11974][feat] Handle multimodal placeholder expansion in token space for Nemotron Nano by @moraxu in #13069
  • [https://nvbugs/6072808][fix] Retry on EADDRINUSE in AutoDeploy allreduce-fusion test by @MrGeva in #13610
  • [https://nvbugs/6112500][fix] AD: L40S coverage for Nemotron-Nano-V3 by @MrGeva in #13715
  • [None][perf] Fuse add + norm + fp8 quant pattern by @amukkara in #12674
  • [None][infra] Adjust PLC nightly to handle release scanning by @yuanjingx87 in #13245
  • [#13099][feat] Support Wan 2.2 5B TI2V model by @abc99lr in #13256
  • [None][fix] Clean up llmc licensing docs by @bmarimuthu-nv in #13700
  • [None][feat] AutoDeploy: Add Gemma4 vision support by @bmarimuthu-nv in #12861
  • [None][chore] Sort waives.txt and add pre-commit hook to enforce ordering by @chienchunhung in #12672
  • [https://nvbugs/6104831][test] Add cascade-prune reproducer tests by @chienchunhung in #13572
  • [None][feat] Serve should support AGSI middlewares by @faucct in #13378
  • [None][chore] Remove temp attention window concept in KV cache manager by @eopXD in #12450
  • [None][test] rename test case and add fallback for multinode cases by @ruodil in #13537
  • [None][infra] Add apt cache mounts to devel stage and use existing pip cache by @eopXD in #13510
  • [https://nvbugs/6112508][fix] WAR for popen in QA env in disagg_test_utils by @xwang233 in #13634
  • [#11823][feat] AutoDeploy's mla chunked prefill loop support by @MrGeva in #13677
  • [None][infra] Waive 4 failed cases for main in post-merge 2705 by @ZhanruiSunCh in #13747
  • [None][test] update perf qa sanity tests, use NIXL instead of UCX by @xinhe-nv in #13654
  • [None][fix] Fix KVCacheManager constructor call in connector test helper by @eopXD in #13749
  • [None][feat] Resubmission of the routing refactor in trtllmgen by @ChristinaZ in #13328
  • [None][fix] fix PEFT page accumulation in MaxUtilizationPolicy scheduler by @achartier in #13528

Full Changelog: v1.3.0rc13...v1.3.0rc14
