Highlights
Model Support
- Add prefix caching for Mamba hybrid models including Qwen3.5 and Nemotron Super V3 (#12185)
- Improve Qwen3.5 support with custom MoE routing and fixes for dense and NVFP4 weight loading (#13433, #13090, #13716)
- Improve Nemotron and Nemotron Nano support with GEMM tuning and multimodal placeholder expansion (#13160, #13069)
- Add Wan 2.2 5B TI2V support and refine LTX-2 FP4 stage handling (#13256, #13244)
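Prefix caching, as added for Mamba hybrid models in #12185, lets the engine reuse state computed for a shared prompt prefix instead of recomputing it. A toy block-hash sketch of the general idea follows; this is an illustration of the technique, not TensorRT-LLM's implementation, and the function names are hypothetical:

```python
def insert_blocks(tokens, cache, block_size=4):
    """Populate the cache with every full block-sized prefix of `tokens`.

    `cache` maps a hashable prefix key to stored state; real systems
    typically hash (parent_block_hash, block_tokens) instead of the
    whole prefix, but the lookup semantics are the same.
    """
    for start in range(0, len(tokens) - block_size + 1, block_size):
        cache[tuple(tokens[: start + block_size])] = True


def longest_cached_prefix(tokens, cache, block_size=4):
    """Return how many leading tokens are covered by cached full blocks."""
    matched = 0
    for start in range(0, len(tokens) - block_size + 1, block_size):
        if tuple(tokens[: start + block_size]) not in cache:
            break  # first cache miss ends the reusable prefix
        matched = start + block_size
    return matched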
API
- Embed VisualGenParams in DiffusionRequest and simplify generate() inputs (#13313)
- Add llm.encode() fast path support for encoder-only models (#12801)
- Add per-iteration request-aggregate counters to InflightBatchingStats (#13199)
- Add ASGI middleware support for Serve (#13378)
- Introduce cancellation support in transceiver v2 (#12734)
- Fix Triton backend generation parameter handling for promptIgnoreLength, lengthPenalty, and earlyStopping/early_stopping (#13633, #13692)
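The ASGI middleware support above (#13378) follows the standard ASGI wrapping pattern. Below is a minimal generic sketch of such a middleware; the class, app, and header names are hypothetical, and trtllm-serve's actual middleware registration API is not shown and may differ:

```python
import asyncio


class HeaderMiddleware:
    """Generic ASGI middleware that appends one response header."""

    def __init__(self, app, header=(b"x-served-by", b"demo")):
        self.app = app
        self.header = header

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass non-HTTP events (e.g. lifespan) through untouched.
            return await self.app(scope, receive, send)

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                message["headers"] = list(message.get("headers", [])) + [self.header]
            await send(message)

        await self.app(scope, receive, send_wrapper)


async def hello_app(scope, receive, send):
    """Tiny ASGI app used only for the demo below."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def _demo():
    sent = []

    async def record(message):
        sent.append(message)

    await HeaderMiddleware(hello_app)({"type": "http"}, None, record)
    return sent


messages = asyncio.run(_demo())
```

Any middleware written against this interface can wrap the server app without knowledge of its internals, which is what makes ASGI middleware support composable.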
Feature
- Improve VisualGen serving with fast PNG compression, multi-node diffusion workers, non-contiguous multimodal chunked prefill, and Attention2D sequence parallelism (#13074, #13140, #12944, #12943)
- Improve disaggregated serving and routing with gen-first ADP serving, KV-aware hit-rate gates and fair-share caps, and consolidated aiohttp session handling (#13112, #13198, #13408)
- Expand kernel and runtime performance with GEMM-to-allreduce registered buffers, CuteDSL bf16 dense GEMMs, sparse-attention GVR Top-K dispatchers, fused add-norm-FP8 quantization, TF32 DSA GEMMs, sampler optimizations, and leaner MPI collectives (#11589, #12074, #13477, #12674, #13452, #13480, #13380, #13089)
- Improve speculative decoding with DFlash one-model support, Mamba-2 rollback replay, radix-based SWA cleanup, and trtllm-gen routing refactoring (#12794, #13453, #13346, #13328)
- Support NVFP4 weight updates (#12320)
- Add per-rank torch profile traces for distributed profiling (#13536)
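Per-rank profile traces (#13536) mean each distributed rank writes its own trace file so concurrent ranks do not clobber one another's output. A minimal sketch of the naming scheme, assuming the rank comes from the `RANK` environment variable (the torch.distributed launcher convention); the helper name is hypothetical and this is not TensorRT-LLM's actual code:

```python
import os


def per_rank_trace_path(base_dir: str, prefix: str = "trace") -> str:
    """Build a per-rank trace filename.

    Rank defaults to 0 so single-process runs work without a launcher.
    """
    rank = int(os.environ.get("RANK", "0"))
    return os.path.join(base_dir, f"{prefix}_rank{rank}.json")


# With torch.profiler this would typically be used as (not executed here):
#   prof.export_chrome_trace(per_rank_trace_path("/tmp/profiles"))
```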
Fix
- Fix KV cache and scheduler correctness issues, including WindowBlockManager statistics, Mamba cache handling under MTP with CUDA graph padding, free-block counter corruption, V2 extra_tokens accounting, PEFT page accumulation, and temporary attention-window cleanup (#12448, #13151, #12834, #13619, #13709, #13528, #12450)
- Fix disaggregated serving and worker reliability by resolving aggregate PP4 hangs, preventing zombie worker pods, and correcting cached-token usage accounting (#12888, #12718, #13620)
- Fix OpenAI and Triton generation flows for None tokenizers, prompt ignore lengths, early stopping, and terminateRequest handling from background logits threads (#13184, #13633, #13692, #13059)
- Fix attention and VisualGen runtime issues, including UlyssesAttention sequence lengths, Ulysses plus Sage execution, TRTLLM-Gen GmemReduction illegal memory access, and low-memory Qwen3 skip-softmax behavior (#13486, #13440, #13541, #13581)
- Fix distributed runtime stability with corrected pipeline-parallel layer distribution, reduced host-memory regression in speculative decoding, and MoE communication fallback after init exceptions (#13066, #13130, #13331)
- Fix cache memory estimation for Qwen3 hybrid models in trtllm-bench and lower Eagle3 one-model acceptance thresholds for H20 (#13268, #13565)
Documentation
- Add technical blogs on tuning batch sizes for CUDA graph padding and on the GVR Top-K kernels (#13393, #13714)
- Remove outdated items from previous news sections and clean up llmc licensing docs (#13603, #13700)
Test & Infra
- Add and refresh coverage for disaggregated post-merge performance, GPT-OSS 20B MHA, prefix-aware scheduling, cascade-prune repros, and issue-specific regressions (#13343, #12796, #13578, #13572, #13553)
- Improve CI triage and failure analysis with Perf Triage Bot integration, rendered HTML failure reports, K8s infrastructure retry, PR base freshness checks, static test validation, and clearer Slurm pending logs (#12429, #13526, #13530, #13430, #13423, #13586)
- Improve CI and build stability with lower test memory pressure, adjusted DeepEP token limits, CUDA line info defaults, Debug CUDA flag fixes, module-level skips, and longer FMHA timeouts (#13402, #13484, #13334, #13598, #13223, #12860)
- Refresh test organization and dependencies with post-merge test moves, updated constraints, FlashInfer Python updates, B200 multimodal unit-test deduplication, and sorted waive enforcement (#13376, #13482, #13064, #13631, #13584, #12672)
- Improve distributed and QA infrastructure with free-port FLUX/WAN test initialization, multinode fallback handling, NIXL-based perf sanity tests, QA popen workarounds, and KVCacheManager connector helper fixes (#13364, #13537, #13654, #13634, #13749)
- Improve package and release infrastructure with llmc standalone package cleanup, release-scanning PLC nightly adjustments, devel-stage apt cache mounts, and pip cache reuse (#13466, #13694, #13245, #13510)
What's Changed
- [https://nvbugs/6093714][fix] Reduce batch size and add memory guard for test by @govind-ramnarayan in #13402
- [TRTLLM-11373][refactor] Embed VisualGenParams in DiffusionRequest and simplify generate() inputs by @zhenhuaw-me in #13313
- [None][test] Update CI Post-Merge Disagg Perf Tests by @chenfeiz0326 in #13343
- [None][chore] AutoDeploy: Refactor finegrained FP8 scale sharding helpers by @galagam in #12999
- [https://nvbugs/6076564][fix] unwaive TestNemotronH::test_auto_dtype[trtllm-flashinfer_ssm-False] by @tcherckez-nvidia in #13187
- [TRTLLM-10061][feat] Prefix caching support for mamba hybrid models by @VALLIS-NERIA in #12185
- [None][cleanup] remove legacy addSequence path by @liji-nv in #13280
- [None][infra] Waive 1 failed cases for main in pre-merge 35790 by @ZhanruiSunCh in #13483
- [None][fix] Fix bugs in WindowBlockManager destructor statistics by @eopXD in #12448
- [None][chore] Update CI allowlist 2026-04-23 by @ZhanruiSunCh in #13381
- [None][fix] Consolidate aiohttp session management in disagg router by @reasonsolo in #13408
- [None][test] Remove SLACK Bot and Modify Update Perf Data into CI Pipeline by @chenfeiz0326 in #12429
- [None][infra] Waive 1 failed cases for main in post-merge 2694 by @ZhanruiSunCh in #13485
- [None][infra] Waive 1 failed cases for main in post-merge 2695 by @ZhanruiSunCh in #13502
- [https://nvbugs/6064029][perf] Use fast PNG compression for visual gen serving by @karljang in #13074
- [None] [chore] Update skills by @kaiyux in #13507
- [None][feat] Add llm.encode() fast path for encoder-only models by @tingyangk in #12801
- [TRTLLM-12123][feat] Add per-iteration request-aggregate counters to InflightBatchingStats by @nv-yna in #13199
- [None][fix] Fix Mamba cache correctness under MTP + CUDA-graph padding by @Wanli-Jiang in #13151
- [TRTLLM-10004][feat] Enable GEMM -> AR with GEMM output in registered buffers by @nv-lschneider in #11589
- [https://nvbugs/6043291][fix] Add fatal error detection to prevent zombie worker pods by @chienchunhung in #12718
- [TRTLLM-11228][feat] Support DFlash in one-model spec dec by @ziyixiong-nv in #12794
- [None][doc] Add blog post for tuning batch sizes for CUDA graph padding and increasing the default batch size granularity for it by @yijingl-nvidia in #13393
- [None][feat] Assert attention DP disabled when KV connector is in use by @jthomson04 in #13448
- [https://nvbugs/6050489][fix] fix agg pp4 hang issue by @bo-nv in #12888
- [https://nvbugs/6095953][fix] Fix cache memory estimation for Qwen3 hybrid models in trtllm-bench by @hyukn in #13268
- [None][test] add unit test and e2e test for gpt_oss_20b MHA kernel by @ruodil in #12796
- [https://nvbugs/6037654][fix] Set DeepEP low-latency token limit for qwen3 CI to prevent OOM by @byshiue in #13484
- [None][infra] Move some tests to post-merge by @EmmaQiaoCh in #13376
- [TRTLLM-10491][test] unwaive DeepSeekV3Lite nvfp4 4gpus test (flaky, self-healed) by @tianyuxbear in #13196
- [None][chore] Waive accuracy/test_disaggregated_serving.py::TestDeepSeekV32Exp::test_auto_dtype[False] by @yihwang-nv in #13539
- [None][feat] Reduce sampler overhead with min_tokens by @galagam in #13480
- [None][infra] enable CUDA line info by default for Debug/RelWithDebInfo by @bobboli in #13334
- [None][test] Waive failed cases for main in QA CI by @crazydemo in #13504
- [None][test] Waive 2 failed cases for main in QA CI by @xinhe-nv in #13508
- [None][chore] Introduce flashinfer-upgrade skill for automated version bumps by @yihwang-nv in #12987
- [None][infra] Waive 2 failed cases for main in post-merge 2696 by @ZhanruiSunCh in #13548
- [TRTLLM-12090][infra] add static tests validation hook by @xinhe-nv in #13423
- [https://nvbugs/5880745][test] GPT-OSS piecewise CUDA graph regression by @crazydemo in #13406
- [TRTLLM-12092][infra] Add PR Base Freshness Check Action by @crazydemo in #13430
- [#13535][chore] AutoDeploy: Relax standalone test timeout by @govind-ramnarayan in #13514
- [None][refactor] Remove EdgeLLM ONNX export pipeline from AutoDeploy by @nvyocox in #13418
- [TRTLLMINF-45][infra] Upload rendered HTML failure analysis by @dpitman-nvda in #13526
- [None][tests] Add TestServePrefixAwareScheduling base on LMBenchmark/synthetic-multi-round-qa by @SimengLiu-nv in #13243
- [None][fix] Revert 'Add TestServePrefixAwareScheduling base on LMBenchmark/synthetic-multi-round-qa' by @tburt-nv in #13573
- [None][fix] Handle None tokenizer in OpenAI server by @galagam in #13184
- [None][ci] Fix misleading still running log when Slurm job is PENDING by @QiJune in #13586
- [None][perf] Extend customMoeRouting kernel to support Qwen3.5 by @nv-guomingz in #13433
- [None][feat] Add hit-rate gate and fair-share cap to KV-aware ADP router by @lancelly in #13198
- [None][feat] Add multi-node support for VisualGen diffusion workers via torchrun/SLURM by @venmugil in #13140
- [TRTLLM-12358][chore] Dedup multimodal unit tests on B200 by @QiJune in #13584
- [None][chore] Bump version to 1.3.0rc14 by @VALLIS-NERIA in #13602
- [https://nvbugs/6059036][fix] Fix AutoDeploy max_batch_size vs cuda_graph_config validation mismatch by @marinayanov in #13093
- [None][infra] disable -G in default Debug CUDA flags to fix CI OOM by @bobboli in #13598
- [None][infra] Waive 1 failed cases for main in pre-merge 36256 by @ZhanruiSunCh in #13613
- [None][fix] visual_gen UlyssesAttention: pass post-A2A seq_len to inner backend by @karljang in #13486
- [None][chore] Update flashinfer-python from 0.6.6 to 0.6.8 by @yihwang-nv in #13064
- [https://nvbugs/6018058][fix] Increase test timeout for test_fmha by @djns99 in #12860
- [https://nvbugs/6080037][fix] pytest.skip(allow_module_level=True) in `tests/unittest/_torch/ray_orchestrator` by @tensorrt-cicd in #13223
- [https://nvbugs/6111076][fix] ulysses+sage by @xrq-phys in #13440
- [None][fix] Optimize TorchSampler process_logprobs by @tongyuantongyu in #13380
- [None][feat] Use a replay method for state rollback in Mamba-2 speculative decoding by @hnover-nv in #13453
- [https://nvbugs/6094066][fix] Skip Qwen3 skip-softmax on low-memory GPUs by @xxi-nv in #13581
- [None][cleanup] Remove unused code path by @2ez4bz in #13622
- [None][fix] Qwen3.5 dense weight loading by @amukkara in #13090
- [https://nvbugs/6065680][fix] Fixed layer distribution across pipeline-parallel ranks by @ziyixiong-nv in #13066
- [https://nvbugs/6109719][doc] Remove outdated items in previous news sections. by @nv-guomingz in #13603
- [None][test] Waive 9 failed cases for main in QA CI by @xinhe-nv in #13540
- [None][test] Waive 2 failed cases for main in QA CI by @xinhe-nv in #13643
- [None][chore] Update blossom-ci allowlist by @yuanjingx87 in #13621
- [TRTLLM-11289][feat] Integrate CuteDSL's bf16 dense GEMMs by @peaceh-nv in #12074
- [https://nvbugs/6098442][fix] Add fix for IMA with TRTLLM-Gen GmemReductionWithSeparateKernel by @pengbowang-nv in #13541
- [#11823][feat] AutoDeploy trtllm_mla attention backend by @MrGeva in #13222
- [None][perf] Scheme X L2-aware dispatcher and PDL launchers for sparse-attention GVR Top-K by @longcheng-nv in #13477
- [None][feat] AutoDeploy: add Gemma 4 reasoning and tool-call parsers by @suyoggupta in #13248
- [https://nvbugs/6114727][fix] Unwaive deepseek r1 fp4 v2 grace_blackwell r1 fp4 v2 tep4 mtp3 1k1k by @chenfeiz0326 in #13496
- [TRTLLM-11946][feat] Disaggregated gen-first serving with ADP by @reasonsolo in #13112
- [None][chore] Update flashinfer-python from 0.6.8 to 0.6.9 by @yihwang-nv in #13631
- [None][test] refresh test constraints by @crazydemo in #13482
- [None][feat] Add bf16 trtllm-gen moe support through flashinfer. by @nv-guomingz in #12738
- [None][infra] post warning reply when /bot command is preceded by lea… by @niukuo in #13659
- [None][fix] AutoDeploy logger fix by @suyoggupta in #13403
- [TRTLLM-11951][feat] Chunked prefill for non-contiguous multimodal data + preproc fixes by @venkywonka in #12944
- [#12633][feat] AutoDeploy: Support torch-cudagraph for Eagle by @govind-ramnarayan in #12745
- [None][fix] Fix disaggregated cached token usage accounting by @v-shobhit in #13620
- [https://nvbugs/6117814][fix] Lower Eagle3 one-model acceptance rate threshold for H20 GPU by @tensorrt-cicd in #13565
- [None][infra] Waive 1 failed cases for main in pre-merge 36341 by @ZhanruiSunCh in #13648
- [#13209][feat] Add config id to AD models registry by @tcherckez-nvidia in #13375
- [None][fix] Use bf16 for LTX-2 FP4 stage 2 by @yibinl-nvidia in #13244
- [https://nvbugs/6084743][fix] Use free port for FLUX/WAN multi-GPU test distributed init by @karljang in #13364
- [https://nvbugs/6035425][fix] Fix host memory usage regression with spec dec by @mikeiovine in #13130
- [None][fix] repair test lists according to new check by @tburt-nv in #13676
- [TRTLLM-11471][fix] Eliminate redundant serialization and MPI collectives in safe_allgather/safe_gather by @chienchunhung in #13089
- [https://nvbugs/6105769][fix] Skip DSV3Lite test on L40S by @brb-nv in #13467
- [https://nvbugs/6132301][infra] Waive 1 failed cases for main in pre-merge 36112 by @ZhanruiSunCh in #13679
- [TRTLLM-11285][perf] Force enable TF32 tensor cores for DSA indexer fused GEMM by @peihu-nv in #13452
- [TRTLLM-12365][ci] Dedup AutoDeploy unit tests on B200 by @QiJune in #13593
- [TRTLLM-11421][fix] fix data races in speculative decoding fast-logits handoff by @eopXD in #13059
- [#11879][fix] Fix free-block counter corruption in getFreeBlock offload path by @eopXD in #12834
- [None][feat] llmc: standalone package improvements and enforce import discipline by @lucaslie in #13466
- [TRTLLMINF-43][feat] Extend infrastructure-failure retry to K8s test stages by @dpitman-nvda in #13530
- [None][infra] Waive 1 failed cases for main in pre-merge 36527 by @ZhanruiSunCh in #13685
- [None][fix] Revert 'Add bf16 trtllm-gen moe support through flashinfer.' by @tburt-nv in #13688
- [None][fix] Plumb promptIgnoreLength through Triton backend to fix silently-dropped lengthPenalty and earlyStopping by @jhaotingc in #13633
- [None][chore] improve gemm perf for nemotron in spark by @ttyio in #13160
- [None][fix] Fix early_stopping type and plumb through Triton ensemble… by @jhaotingc in #13692
- [None][fix] Continue MoE comm fallback on init exceptions by @xxi-nv in #13331
- [TRTLLM-11635][feat] Introduce cancellation in transceiver v2 by @Shixiaowei02 in #12734
- [None][feat] Support update weight for nvfp4 by @shuyixiong in #12320
- [None][infra] llmc: stop managing .github/ in standalone package generator by @lucaslie in #13694
- [TRTLLM-11160][feat] Clean up SWA work-arounds with the new radix sea… by @SimengLiu-nv in #13346
- [None][fix] Add TestServePrefixAwareScheduling base on LMBenchmark/synthetic-multi-round-qa by @SimengLiu-nv in #13578
- [https://nvbugs/6114821][fix] Fix extra_tokens in V2 KV cache by @dongfengy in #13619
- [None][fix] write per-rank torch profile traces by @GavinZhu-GMI in #13536
- [https://nvbugs/5839028][test] Unwaive DeepSeekR1 fp8 blockscale throughput_mtp by @xxi-nv in #13627
- [None][docs] add GVR Top-K technical blog by @longcheng-nv in #13714
- [https://nvbugs/6093715][fix] AutoDeploy: skip nvfp4 test pre-blackwell by @galagam in #13494
- [https://nvbugs/6114821][fix] Fix extra_tokens in V2 KV cache (test unwaive) by @dongfengy in #13709
- [#11823][feat] AutoDeploy MLA with PWCG support on Deepseek R1 by @MrGeva in #13497
- [None][infra] Waive 1 failed cases for main in pre-merge 36642 by @ZhanruiSunCh in #13717
- [#13320][test] Test coverage and repro for #13320 by @eopXD in #13553
- [None][feat] Add Attention2D sequence parallelism for visual-gen models by @venmugil in #12943
- [None][fix] Fix Qwen3.5 NVFP4 weight loading by preserving weight_scales by @achartier in #13716
- [TRTLLM-11974][feat] Handle multimodal placeholder expansion in token space for Nemotron Nano by @moraxu in #13069
- [https://nvbugs/6072808][fix] Retry on EADDRINUSE in AutoDeploy allreduce-fusion test by @MrGeva in #13610
- [https://nvbugs/6112500][fix] AD: L40S coverage for Nemotron-Nano-V3 by @MrGeva in #13715
- [None][perf] Fuse add + norm + fp8 quant pattern by @amukkara in #12674
- [None][infra] Adjust PLC nightly to handle release scanning by @yuanjingx87 in #13245
- [#13099][feat] Support Wan 2.2 5B TI2V model by @abc99lr in #13256
- [None][fix] Clean up llmc licensing docs by @bmarimuthu-nv in #13700
- [None][feat] AutoDeploy: Add Gemma4 vision support by @bmarimuthu-nv in #12861
- [None][chore] Sort waives.txt and add pre-commit hook to enforce ordering by @chienchunhung in #12672
- [https://nvbugs/6104831][test] Add cascade-prune reproducer tests by @chienchunhung in #13572
- [None][feat] Serve should support ASGI middlewares by @faucct in #13378
- [None][chore] Remove temp attention window concept in KV cache manager by @eopXD in #12450
- [None][test] rename test case and add fallback for multinode cases by @ruodil in #13537
- [None][infra] Add apt cache mounts to devel stage and use existing pip cache by @eopXD in #13510
- [https://nvbugs/6112508][fix] WAR for popen in QA env in disagg_test_utils by @xwang233 in #13634
- [#11823][feat] AutoDeploy's mla chunked prefill loop support by @MrGeva in #13677
- [None][infra] Waive 4 failed cases for main in post-merge 2705 by @ZhanruiSunCh in #13747
- [None][test] update perf qa sanity tests, use NIXL instead of UCX by @xinhe-nv in #13654
- [None][fix] Fix KVCacheManager constructor call in connector test helper by @eopXD in #13749
- [None][feat] Resubmission of the routing refactor in trtllmgen by @ChristinaZ in #13328
- [None][fix] fix PEFT page accumulation in MaxUtilizationPolicy scheduler by @achartier in #13528
New Contributors
- @tingyangk made their first contribution in #12801
- @tianyuxbear made their first contribution in #13196
- @venmugil made their first contribution in #13140
- @GavinZhu-GMI made their first contribution in #13536
- @abc99lr made their first contribution in #13256
- @faucct made their first contribution in #13378
- @xwang233 made their first contribution in #13634
Full Changelog: v1.3.0rc13...v1.3.0rc14