NVIDIA/TensorRT-LLM v1.3.0rc11

Pre-release · 21 hours ago

Highlights

  • Model Support
    • Add Mistral 4-small support to AutoDeploy (#12266)
    • Add GlmMoeDsaForCausalLM to EPLB supported model list (#12607)
    • Add Qwen3-Next MTP (#11370)
    • Enable sliding window attention for Mistral/Mixtral (#12597)
  • API
    • Support include_stop_token_in_output in gRPC request manager (#12517)
    • Add deprecation warnings on TRT backend entrypoints (#11723)
    • Accept strict field in tools and store field in chat requests (#12482)
    • Mark TRTLLMSampler as deprecated and update documentation (#11938)
    • Move VisualGen APIs to a separate directory (#12538)
    • Remove some fields with redefined defaults (#11671)
  • Feature
    • Apply norm before FC in Eagle (#12561)
    • Split MLA DSA custom op for piecewise CUDA graph capture (#12503)
    • Optimize host performance for Python cache transceiver (#12273)
    • Add Mamba2 MTP SSM cache CUDA kernel for tree-based speculative decoding (#12537)
    • Add serve-config-guide skill for basic aggregate single-node serving configs (#12054)
    • Add FORCE_CHUNK context chunking policy (#12483)
    • Add dense GEMM backend for MoE (#10479)
    • Implement gen-first disaggregated scheduling, part 2 (#12239)
    • Support EPLB with various MoE backends for Nemotron-H models (#12280)
    • Skip softmax via sparsity ratio (#11995)
    • Add DWDP (distributed weight data parallelism) support for MoE inference (#12136)
    • Add AutoDeploy Super V3 MTP support (#12326)
    • Introduce fast path (token IDs + multimodal) for VLMs without re-tokenizing encoded prompts (#11708)
    • Add global pool support for suffix automaton speculative decoding (#12130)
    • Add Triton paged attention for AutoDeploy (#12642)
    • Refactor VisualGen attention backend (#12663)
    • Add linear attention state support to the C++ KV cache manager (#12531)
    • Add temporally-correlated heuristic-guided indexer TopK for sparse attention (#12385)
    • Support MLA generation in TrtllmGen attention backend (#12606)
    • Extend Python cache transceiver to support Nemotron (#12150)
    • Handle different chat template types (#12336)
    • Add multi-turn support for trtllm-bench (#12468)
    • Add fused DiT QK Norm + RoPE CUDA kernel for FLUX (#11869)
    • Support cache reuse for SSM in KVCacheManagerV2 (#12644)
    • Add MLIR-based auto-generated elementwise fusion for AutoDeploy (#12427)
    • Add --custom_tokenizer CLI option to trtllm-bench (#12586)
    • Support LoRA adapter for Nemotron-H models (#12154)
    • Apply multiple host performance optimizations for DSA (#12581)
    • Reuse Triton slicing kernel for GDN prefill transpose (#12737)
    • Add Trtllm-gen FMHA JIT support (#12612)
    • Retune causalConv1d forward dispatch for variable-length and short sequences (#12739)
    • Update configuration to enable NVFP4 (#12776)
    • Fuse SiLU+Mul in AutoDeploy transform (#12497)
  • Fix
    • Fix Triton kernels in wheel (#12569)
    • Fix DSACacheManager and RocketCacheManager KV cache estimation ignoring num_layers for draft models (#12571)
    • Reorder generation_logits to align with final beam search output ordering (#12268)
    • Handle CUDA_ERROR_INVALID_VALUE in kv_cache_v2 _is_prop_supported (#12613)
    • Fix autotuner OOM for trtllmGen MoE runners at large context length (#12523)
    • Always sync sampler_event in update_requests (#12585)
    • Avoid counting KV cache uses during warmup for Prometheus KV cache metrics (#12132)
    • Fix lost requests (#12348)
    • Fix GPTOSS CUTLASS MoE on Hopper NVLink one-sided workspace overflow (#12666)
    • Fix Mooncake dynamic load in transfer_agent_binding (#12181)
    • Fix disaggregated pipeline-parallel hang (#12528)
    • Correct reused block counting in corner case (#12404)
    • Clamp block indices to prevent out-of-bounds in DSA with MTP (#12657)
    • Synchronize NCCL memory allocation error handling (#12125)
    • Adjust prompt logprobs to use the correct prompt token id (#12499)
    • Improve NIXL agent import error diagnostics (#12446)
    • Fix disaggregated serving hang on block reuse after eviction (#12667)
    • Use the first non-None result returned by Hugging Face download workers (#12259)
    • Replace assertions with warnings for unsupported logits/logprobs in speculative sampler (#12547)
    • Address H20 weights loading OOM for GPTOSS (#11321)
    • Improve Harmony parser (delta grouping, reuse report, test coverage) (#12467)
    • Fix hang issues on DGX B200 8-GPU PyTorch configurations (#12656)
    • Fix disaggregated KV cache router for chat API; add disaggregated benchmark for ai_perf (#12337)
    • Fix CUDA event crash with performance metrics (#12639)
    • Update Nemotron-H handling for corner cases (#12620)
    • Fix KV cache issue (#12673)
    • Fix wrong token suppressed with ignore_eos in Torch sampler (#12358)
    • Fix GPTOSS chat template for disaggregated tests (#12724)
    • Fix top-K logprobs size for pipeline parallelism (#12623)
    • Remove clone in FP8 quantization (#12687)
    • Fix Qwen2.5 mixed precision accuracy issue (#12609)
    • Fix Mamba metadata prefill bubble in chunked prefill serving (#12736)
    • Fix outdated README argument for executorExampleDisaggregated.cpp (#12276)
  • Documentation
    • Add MoE developer guide for fused_moe module (#12534)
    • Update supported models to include Kimi K2/K2.5 and GLM-5 (#12654)
    • Publish blog post for DWDP (#12725)
    • Add visual generation models to supported models page (#12464)
    • Clean up latest news and blogs; update overview and highlight visual generation (#12753)
    • Update C++ coding guidelines (#12577)
  • Test & Infra
    • Use shared utility for node labels (#9095)
    • Adjust RocketKV test threshold (#12527)
    • Enhance performance tests with GPU availability check in test_perf.py (#12535)
    • Move AD performance regression tests to AD pre- and post-merge jobs (#12461)
    • Remove Model Registry Check from workflows; check runs in pre-commit (#12590)
    • Add Ubuntu 24.04 wheel image for SBSA (#12436)
    • Pin mypy version due to dependency conflicts (#12650)
    • Fix Pyxis error in disaggregated performance test (#12575)
    • Skip already-applied patches gracefully in third-party FetchContent (#12550)
    • Add container scanning to PLC nightly pipeline (#12549)
    • Use JobBuilder to trigger downstream job (#7079)
    • Prefer GitHub then GitLab for TOT waive list (#11063)
    • Isolate single-GPU Ray orchestrator tests to avoid CI timeouts (#12616)
    • Add workaround for trtllm-bench hang and improve robustness (#12655)
    • Bump tornado and black in container (#12600)
    • Remove OOM test case from L40S test list (#12685)
    • Temporarily disable warn_unused_ignores (#12728)
    • Add supplemental Ruff lint for legacy files via ruff-legacy hook (#11469)
    • Add port conflict retry for disaggregated multi-process tests (#12618)
    • Add CI agent failure analysis to L0 merge request pipeline (#12543)
    • Fix source code scanning (#12773)
    • Remove gpu-shell tool from ad-run-agent (#12418)
    • Move to FlexCache in Austin for 5080 nodes (#12615)
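Several scheduling features above, such as the FORCE_CHUNK context chunking policy (#12483), revolve around splitting a long prompt's prefill into fixed-size chunks. The sketch below only illustrates that general idea; the `chunk_context` helper, its parameters, and its behavior are hypothetical and do not reflect TensorRT-LLM's actual implementation.

```python
def chunk_context(token_ids, chunk_size, force_chunk=False):
    """Split a prompt's token IDs into prefill chunks.

    Illustrative only: with force_chunk=True the prompt is always split
    into chunk_size pieces, even when it would fit in a single prefill
    pass, mirroring the idea behind a FORCE_CHUNK-style policy.
    """
    if not force_chunk and len(token_ids) <= chunk_size:
        return [token_ids]  # fits in one pass; no chunking needed
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]

# A 10-token prompt with chunk size 4 yields chunks of 4, 4, and 2 tokens.
chunks = chunk_context(list(range(10)), chunk_size=4, force_chunk=True)
print([len(c) for c in chunks])  # [4, 4, 2]
```

Forcing chunking even for short prompts can be useful for benchmarking or for keeping per-iteration prefill cost bounded and predictable, which is the trade-off such a policy targets.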

What's Changed

  • [https://nvbugs/5882636][fix] Fix triton_kernels in wheel by @dongfengy in #12569
  • [https://nvbugs/5919796][test] AutoDeploy: unwaive Super V3 autodeploy failure by @galagam in #12556
  • [None][test] Waive another flaky test case on Dis-agg serving with Ne… by @nv-guomingz in #12587
  • [#11992][fix] Support include_stop_token_in_output in gRPC request manager by @CatherineSue in #12517
  • [None][feat] Eagle: Norm before FC by @IzzyPutterman in #12561
  • [#10607][fix] moved AD perf regression tests to AD jobs pre and post merge by @MrGeva in #12461
  • [None][infra] Waive 1 failed cases for main in post-merge 2626 by @ZhanruiSunCh in #12592
  • [TRTLLM-7335] [infra] Use shared utility for node labels by @niukuo in #9095
  • [None][infra] Waive 1 failed cases for main in pre-merge 31714 by @ZhanruiSunCh in #12589
  • [https://nvbugs/6007197][fix] Adjust RocketKV test threshold by @heyuhhh in #12527
  • [None][test] Enhance performance tests by adding GPU availability check in test_perf.py by @yufeiwu-nv in #12535
  • [None][infra] Waive 2 failed cases for main in post-merge 2627 by @ZhanruiSunCh in #12605
  • [None][fix] Fix DSACacheManager and RocketCacheManager KV cache estimation ignoring num_layers for draft models by @lancelly in #12571
  • [None][doc] Add MoE developer guide for fused_moe module by @xxi-nv in #12534
  • [None][chore] Remove Model Registry Check from workflows, the check already runs in pre-commit by @tcherckez-nvidia in #12590
  • [https://nvbugs/5983390][perf] Split MLA DSA custom op for piecewise CUDA graph capture by @liji-nv in #12503
  • [None][fix] Reorder generation_logits to align with final beam search output ordering by @achartier in #12268
  • [TRTC-351][chore] Deprecation warnings on TRT backend entrypoints by @venkywonka in #11723
  • [TRTLLM-10804][infra] add ubuntu2404 wheel image for SBSA by @niukuo in #12436
  • [#12288][feat] Add Mistral 4-small support to AutoDeploy by @bmarimuthu-nv in #12266
  • [None][infra] waive failed case for main by @EmmaQiaoCh in #12621
  • [https://nvbugs/5920751][chore] Unwaive a test that has been fixed by @longlee0622 in #12610
  • [TRTLLM-9526][feat] optimize host perf for python cache transceiver by @chuangz0 in #12273
  • [None][test] Fix kimi-k2 test issue by @yufeiwu-nv in #12604
  • [None][feat] Add Mamba2 MTP SSM cache CUDA kernel for tree-based speculative decoding by @JadoTu in #12537
  • [None][chore] Bump version to 1.3.0rc11 by @ZhanruiSunCh in #12627
  • [https://nvbugs/6013692][fix] handle CUDA_ERROR_INVALID_VALUE in kv_cache_v2 _is_prop_supported by @lfr-0531 in #12613
  • [https://nvbugs/6011517][fix] Fix autotuner OOM for trtllmGen MoE runners at large context length by @hyukn in #12523
  • [None][feat] serve-config-guide skill for basic aggregate single-node serving configs by @venkywonka in #12054
  • [None][fix] always sync sampler_event in update_requests by @Funatiq in #12585
  • [None][infra] Pin the version of mypy due to dependency conflicts by @EmmaQiaoCh in #12650
  • [https://nvbugs/5996645][fix] Fix Pyxis Error in Disagg Perf Test by @chenfeiz0326 in #12575
  • [TRTLLM-10061][feat] Add FORCE_CHUNK context chunking policy by @VALLIS-NERIA in #12483
  • [None] [feat] Add densegemm backend for MoE by @zongfeijing in #10479
  • [TRTLLM-8922][feat] gen-first disagg scheduling, part 2 by @reasonsolo in #12239
  • [https://nvbugs/5972362][fix] Avoid counting KV cache uses during warmup for Prometheus KV cache metrics by @yijingl-nvidia in #12132
  • [https://nvbugs/6007352][fix] Accept strict field in tools and store field in chat requests by @JunyiXu-nv in #12482
  • [TRTLLM-11551][feat] Support EPLB with various MoE backends for nemotron-h models by @Wanli-Jiang in #12280
  • [None][infra] Skip already-applied patches gracefully in 3rdparty FetchContent by @achartier in #12550
  • [TRTLLM-11385][chore] Mark TRTLLMSampler as deprecated and update documentation by @Funatiq in #11938
  • [None][feat] Skip softmax via sparsity ratio by @rohansjoshi in #11995
  • [None][infra] Add container scanning to plc nightly pipeline by @yuanjingx87 in #12549
  • [https://nvbugs/5948878][fix] fix lost requests by @bo-nv in #12348
  • [TRTLLM-7335][infra] use JobBuilder to trigger downstream job by @niukuo in #7079
  • [https://nvbugs/5800672][fix] Unwaive tests fixed by Austin Lab GPU topo config resolution by @peaceh-nv in #12453
  • [TRTLLM-11119][feat] Blackwell SageAttention, Integrate into AttentionOp API by @xrq-phys in #11718
  • [TRTLLM-9970][infra] Get TOT waive list from github repo first and then gitlab repo by @yiqingy0 in #11063
  • [https://nvbugs/5850183][fix] Re-enable passing tests by @dongfengy in #12568
  • [None][feat] Add DWDP (Distributed Weight Data Parallelism) support for MoE inference by @tianyuz-nv in #12136
  • [https://nvbugs/6038228][test] Add WA for trtllm-bench hang issue and improve its robustness by @yufeiwu-nv in #12655
  • [https://nvbugs/5836828][fix] Fix GPTOSS CUTLASS MOE on Hopper nvlink one-sided workspace overflow by @dongfengy in #12666
  • [https://nvbugs/5996656][fix] unwaive qwen3 ci test by @byshiue in #12652
  • [None][fix] fix mooncake dynamic load in transfer_agent_binding by @chuangz0 in #12181
  • [None][fix] Add GlmMoeDsaForCausalLM to EPLB supported model list by @qiaoxj07 in #12607
  • [None][doc] update supported models to include Kimi K2/K2.5 and GLM-5 by @dc3671 in #12654
  • [https://nvbugs/5911788][fix] Isolate single_gpu ray orchestrator tests to avoid CI timeouts by @shuyixiong in #12616
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12645
  • [https://nvbugs/6007967][fix] fix disagg pp hang issue by @bo-nv in #12528
  • [TRTLLM-10232][feat] Support LoRA adapter for nemotron-h models by @Wanli-Jiang in #12154
  • [https://nvbugs/5983390][perf] Multiple host perf optimizations for DSA part by @hyukn in #12581
  • [None][fix] Correct reused block counting on corner case by @tongyuantongyu in #12404
  • [https://nvbugs/6032056][fix] Clamp block indices to prevent OOB in DSA with MTP by @sunnyqgg in #12657
  • [None][revert] Revert "[TRTLLM-11119][feat] Blackwell SageAttention, Integrate into … by @yunruis in #12679
  • [#12332][feat] AutoDeploy: SuperV3 MTP Support by @govind-ramnarayan in #12326
  • [TRTLLM-11163][feat] Introduce a fast path (token IDs + MM) for VLMs instead of de-tokenizing already encoded prompt by @moraxu in #11708
  • [TRTLLM-11237][fix] [fix] Synchronize NCCL memory allocation error handling by @nv-lschneider in #12125
  • [https://nvbugs/6008710][fix] Adjust prompt logprobs to use the correct prompt token id by @stnie in #12499
  • [None][infra] Bump tornado and black in container by @yuanjingx87 in #12600
  • [TRTLLM-11043][feat] Add global pool support for suffix automaton speculative decoding by @cascade812 in #12130
  • [https://nvbugs/5979673][fix] improve NIXL agent import error diagnostics by @Shixiaowei02 in #12446
  • [None][feat] Add triton paged attention for AutoDeploy by @nvchenghaoz in #12642
  • [#12560][fix] Fix disaggserving hang on block reuse after eviction by @Tabrizian in #12667
  • [None][refactor] VisualGen attention backend refactor by @NVShreyas in #12663
  • [TRTLLM-11318][feat] move VisualGen APIs to a separate dir by @zhenhuaw-me in #12538
  • [None][infra] Waive failed test by @yuanjingx87 in #12714
  • [#11538][fix] Enable sliding window attention for Mistral/Mixtral by @karljang in #12597
  • [None][infra] Waive failure test by @yiqingy0 in #12726
  • [TRTLLM-10061][feat] Add support of linear attention state for C++ KV cache manager by @VALLIS-NERIA in #12531
  • [None][feat] Temporally-Correlated Heuristic-guided Indexer TopK for Sparse Attention by @longcheng-nv in #12385
  • [None][test] Remove OOM test case from L40S test list by @yufeiwu-nv in #12685
  • [None][feat] Support MLA generation in TrtllmGen attention backend by @yihwang-nv in #12606
  • [None][infra] No warn_unused_ignores temporarily by @EmmaQiaoCh in #12728
  • [None][feat] Qwen3-Next MTP by @IzzyPutterman in #11370
  • [None][doc] Blog19 for DWDP. by @wanqian-nv in #12725
  • [https://nvbugs/5800591][chore] Unwaive a deepseek MTP test by @mikeiovine in #12327
  • [None][doc] Add visual generation models to supported models page by @chang-l in #12464
  • [TRTLLM-11146][feat] Extend python cache transceiver to support nemotron by @bo-nv in #12150
  • [TRTLLM-11523][feat] Handle different chat template types by @2ez4bz in #12336
  • [#12257][fix] Use the first non-None result returned by hf download workers by @kev-bi in #12259
  • [None][feat] Add multi-turn support for trtllm-bench by @cascade812 in #12468
  • [None][fix] Replace assertions with warnings for unsupported logits/logprobs in speculative sampler by @yifjiang in #12547
  • [None][cleanup] Add supplemental ruff lint for legacy files via ruff-legacy hook by @venkywonka in #11469
  • [https://nvbugs/5864187][fix] Address H20 Weights Loading OOM for GPTOSS by @dongfengy in #11321
  • [None][feat] Add fused DiT QK Norm + RoPE CUDA kernel for FLUX by @karljang in #11869
  • [TRTLLM-9772][feat] Support cache reuse for SSM in KVCacheManagerV2 by @lowsfer in #12644
  • [None][fix] Harmony Parser Delta Grouping + Reuse Report + Better Test Coverage by @dongfengy in #12467
  • [None][feat] MLIR-based auto-generated elementwise fusion for AutoDeploy by @suyoggupta in #12427
  • [None][doc] Clean up latest news + blogs, update overview, highlight visual gen by @laikhtewari in #12753
  • [None][feat] Add --custom_tokenizer CLI option to trtllm-bench by @qiaoxj07 in #12586
  • [https://nvbugs/6027560][fix] fix hang issues on DGX_B200-8_GPUs-PyTo… by @bo-nv in #12656
  • [TRTLLM-11597][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf by @reasonsolo in #12337
  • [None][fix] Fix Cuda event crash with perf metrics by @jthomson04 in #12639
  • [None][fix] Update codes to support nemotron-h corner cases by @Wanli-Jiang in #12620
  • [None][infra] Waive 10 failed cases for main in post-merge 2636 by @ZhanruiSunCh in #12767
  • [https://nvbugs/6018051][fix] Add port conflict retry for disaggregated MP tests by @reasonsolo in #12618
  • [https://nvbugs/6025177][fix] Fix KV cache issue by @thorjohnsen in #12673
  • [TRTLLMINF-37][feat] Add CI agent failure analysis to L0_MergeRequest… by @dpitman-nvda in #12543
  • [None][doc] Update C++ coding guidelines. by @hnover-nv in #12577
  • [#12324][fix] Fixed wrong token suppressed with ignore_eos in torch sampler by @MrGeva in #12358
  • [https://nvbugs/5849648][fix] Fix GPTOSS Chat Template for Disagg Tests by @dongfengy in #12724
  • [#11094][feat] AutoDeploy transform to fuse silu+mul by @MrGeva in #12497
  • [None][infra] Fix source code scanning by @yuanjingx87 in #12773
  • [None][chore] Remove gpu-shell tool from ad-run-agent by @govind-ramnarayan in #12418
  • [#9306][cleanup] Remove some fields with redefined defaults by @2ez4bz in #11671
  • [None][feat] reuse triton slicing kernel for GDN prefill transpose by @nv-guomingz in #12737
  • [None][feat] fix mamba metadata prefill bubble in chunked prefill serving by @nv-guomingz in #12736
  • [https://nvbugs/5781731][fix] Unwaive Ray test by @dominicshanshan in #9654
  • [None][fix] Fix outdated argument of readme.md for executorExampleDisaggregated.cpp by @Fan-Yunfan in #12276
  • [None][feat] Trtllm-gen FMHA JIT support by @yunruis in #12612
  • [None][feat] retune causalConv1d fwd dispatch for varlen and short sequences by @nv-guomingz in #12739
  • [TRTLLM-9948][infra] Move to use FlexCache in Austin for 5080 nodes by @EmmaQiaoCh in #12615
  • [TRTLLM-11768][fix] Config updates to enable NVFP4 by @2ez4bz in #12776
  • [https://nvbugs/6008468][fix] Fix top-K logprobs size for PP by @pengbowang-nv in #12623
  • [None][fix] Remove clone in fp8 quant. by @Tracin in #12687
  • [https://nvbugs/6011284][fix] Fix Qwen2.5 mixed precision accuracy issue. by @Tracin in #12609

Full Changelog: v1.3.0rc10...v1.3.0rc11
