NVIDIA/TensorRT-LLM v1.3.0rc11

Pre-release · 21 hours ago

Highlights

  • Model Support
    • Add Mistral 4-small support to AutoDeploy (#12266)
    • Add GlmMoeDsaForCausalLM to EPLB supported model list (#12607)
    • Add Qwen3-Next MTP (#11370)
    • Enable sliding window attention for Mistral/Mixtral (#12597)
  • API
    • Support include_stop_token_in_output in gRPC request manager (#12517)
    • Add deprecation warnings on TRT backend entrypoints (#11723)
    • Accept strict field in tools and store field in chat requests (#12482)
    • Mark TRTLLMSampler as deprecated and update documentation (#11938)
    • Move VisualGen APIs to a separate directory (#12538)
    • Remove some fields with redefined defaults (#11671)
  • Feature
    • Apply norm before FC in Eagle (#12561)
    • Split MLA DSA custom op for piecewise CUDA graph capture (#12503)
    • Optimize host performance for Python cache transceiver (#12273)
    • Add Mamba2 MTP SSM cache CUDA kernel for tree-based speculative decoding (#12537)
    • Add serve-config-guide skill for basic aggregate single-node serving configs (#12054)
    • Add FORCE_CHUNK context chunking policy (#12483)
    • Add dense GEMM backend for MoE (#10479)
    • Implement gen-first disaggregated scheduling, part 2 (#12239)
    • Support EPLB with various MoE backends for Nemotron-H models (#12280)
    • Skip softmax via sparsity ratio (#11995)
    • Add DWDP (distributed weight data parallelism) support for MoE inference (#12136)
    • Add AutoDeploy Super V3 MTP support (#12326)
    • Introduce fast path (token IDs + multimodal) for VLMs without re-tokenizing encoded prompts (#11708)
    • Add global pool support for suffix automaton speculative decoding (#12130)
    • Add Triton paged attention for AutoDeploy (#12642)
    • Refactor VisualGen attention backend (#12663)
    • Add linear attention state support to the C++ KV cache manager (#12531)
    • Add temporally-correlated heuristic-guided indexer TopK for sparse attention (#12385)
    • Support MLA generation in TrtllmGen attention backend (#12606)
    • Extend Python cache transceiver to support Nemotron (#12150)
    • Handle different chat template types (#12336)
    • Add multi-turn support for trtllm-bench (#12468)
    • Add fused DiT QK Norm + RoPE CUDA kernel for FLUX (#11869)
    • Support cache reuse for SSM in KVCacheManagerV2 (#12644)
    • Add MLIR-based auto-generated elementwise fusion for AutoDeploy (#12427)
    • Add --custom_tokenizer CLI option to trtllm-bench (#12586)
    • Support LoRA adapter for Nemotron-H models (#12154)
    • Apply multiple host performance optimizations for DSA (#12581)
    • Reuse Triton slicing kernel for GDN prefill transpose (#12737)
    • Add Trtllm-gen FMHA JIT support (#12612)
    • Retune causalConv1d forward dispatch for variable-length and short sequences (#12739)
    • Update configuration to enable NVFP4 (#12776)
    • Fuse SiLU+Mul in AutoDeploy transform (#12497)
  • Fix
    • Fix Triton kernels in wheel (#12569)
    • Fix DSACacheManager and RocketCacheManager KV cache estimation ignoring num_layers for draft models (#12571)
    • Reorder generation_logits to align with final beam search output ordering (#12268)
    • Handle CUDA_ERROR_INVALID_VALUE in kv_cache_v2 _is_prop_supported (#12613)
    • Fix autotuner OOM for trtllmGen MoE runners at large context length (#12523)
    • Always sync sampler_event in update_requests (#12585)
    • Avoid counting KV cache uses during warmup for Prometheus KV cache metrics (#12132)
    • Fix lost requests (#12348)
    • Fix GPTOSS CUTLASS MoE on Hopper NVLink one-sided workspace overflow (#12666)
    • Fix Mooncake dynamic load in transfer_agent_binding (#12181)
    • Fix disaggregated pipeline-parallel hang (#12528)
    • Correct reused block counting in corner case (#12404)
    • Clamp block indices to prevent out-of-bounds in DSA with MTP (#12657)
    • Synchronize NCCL memory allocation error handling (#12125)
    • Adjust prompt logprobs to use the correct prompt token id (#12499)
    • Improve NIXL agent import error diagnostics (#12446)
    • Fix disaggregated serving hang on block reuse after eviction (#12667)
    • Use the first non-None result returned by Hugging Face download workers (#12259)
    • Replace assertions with warnings for unsupported logits/logprobs in speculative sampler (#12547)
    • Address H20 weights loading OOM for GPTOSS (#11321)
    • Improve Harmony parser (delta grouping, reuse report, test coverage) (#12467)
    • Fix hang issues on DGX B200 8-GPU PyTorch configurations (#12656)
    • Fix disaggregated KV cache router for chat API; add disaggregated benchmark for ai_perf (#12337)
    • Fix CUDA event crash with performance metrics (#12639)
    • Update Nemotron-H handling for corner cases (#12620)
    • Fix KV cache issue (#12673)
    • Fix wrong token suppressed with ignore_eos in Torch sampler (#12358)
    • Fix GPTOSS chat template for disaggregated tests (#12724)
    • Fix top-K logprobs size for pipeline parallelism (#12623)
    • Remove clone in FP8 quantization (#12687)
    • Fix Qwen2.5 mixed precision accuracy issue (#12609)
    • Fix Mamba metadata prefill bubble in chunked prefill serving (#12736)
    • Fix outdated README argument for executorExampleDisaggregated.cpp (#12276)
  • Documentation
    • Add MoE developer guide for fused_moe module (#12534)
    • Update supported models to include Kimi K2/K2.5 and GLM-5 (#12654)
    • Publish blog post for DWDP (#12725)
    • Add visual generation models to supported models page (#12464)
    • Clean up latest news and blogs; update overview and highlight visual generation (#12753)
    • Update C++ coding guidelines (#12577)
  • Test & Infra
    • Use shared utility for node labels (#9095)
    • Adjust RocketKV test threshold (#12527)
    • Enhance performance tests with GPU availability check in test_perf.py (#12535)
    • Move AD performance regression tests to AD pre- and post-merge jobs (#12461)
    • Remove Model Registry Check from workflows; check runs in pre-commit (#12590)
    • Add Ubuntu 24.04 wheel image for SBSA (#12436)
    • Pin mypy version due to dependency conflicts (#12650)
    • Fix Pyxis error in disaggregated performance test (#12575)
    • Skip already-applied patches gracefully in third-party FetchContent (#12550)
    • Add container scanning to PLC nightly pipeline (#12549)
    • Use JobBuilder to trigger downstream job (#7079)
    • Prefer GitHub then GitLab for TOT waive list (#11063)
    • Isolate single-GPU Ray orchestrator tests to avoid CI timeouts (#12616)
    • Add workaround for trtllm-bench hang and improve robustness (#12655)
    • Bump tornado and black in container (#12600)
    • Remove OOM test case from L40S test list (#12685)
    • Temporarily disable warn_unused_ignores (#12728)
    • Add supplemental Ruff lint for legacy files via ruff-legacy hook (#11469)
    • Add port conflict retry for disaggregated multi-process tests (#12618)
    • Add CI agent failure analysis to L0 merge request pipeline (#12543)
    • Fix source code scanning (#12773)
    • Remove gpu-shell tool from ad-run-agent (#12418)
    • Move to FlexCache in Austin for 5080 nodes (#12615)
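Several scheduling features above, such as the FORCE_CHUNK context chunking policy (#12483), revolve around splitting a long prompt's prefill into fixed-size chunks. The sketch below only illustrates that general idea; the `chunk_context` helper, its parameters, and its behavior are hypothetical and do not reflect TensorRT-LLM's actual implementation.

```python
def chunk_context(token_ids, chunk_size, force_chunk=False):
    """Split a prompt's token IDs into prefill chunks.

    Illustrative only: with force_chunk=True the prompt is always split
    into chunk_size pieces, even when it would fit in a single prefill
    pass, mirroring the idea behind a FORCE_CHUNK-style policy.
    """
    if not force_chunk and len(token_ids) <= chunk_size:
        return [token_ids]  # fits in one pass; no chunking needed
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]

# A 10-token prompt with chunk size 4 yields chunks of 4, 4, and 2 tokens.
chunks = chunk_context(list(range(10)), chunk_size=4, force_chunk=True)
print([len(c) for c in chunks])  # [4, 4, 2]
```

Forcing chunking even for short prompts can be useful for benchmarking or for keeping per-iteration prefill cost bounded and predictable, which is the trade-off such a policy targets.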

What's Changed

  • [https://nvbugs/5882636][fix] Fix triton_kernels in wheel by @dongfengy in #12569
  • [https://nvbugs/5919796][test] AutoDeploy: unwaive Super V3 autodeploy failure by @galagam in #12556
  • [None][test] Waive another flaky test case on Dis-agg serving with Ne… by @nv-guomingz in #12587
  • [#11992][fix] Support include_stop_token_in_output in gRPC request manager by @CatherineSue in #12517
  • [None][feat] Eagle: Norm before FC by @IzzyPutterman in #12561
  • [#10607][fix] moved AD perf regression tests to AD jobs pre and post merge by @MrGeva in #12461
  • [None][infra] Waive 1 failed cases for main in post-merge 2626 by @ZhanruiSunCh in #12592
  • [TRTLLM-7335] [infra] Use shared utility for node labels by @niukuo in #9095
  • [None][infra] Waive 1 failed cases for main in pre-merge 31714 by @ZhanruiSunCh in #12589
  • [https://nvbugs/6007197][fix] Adjust RocketKV test threshold by @heyuhhh in #12527
  • [None][test] Enhance performance tests by adding GPU availability check in test_perf.py by @yufeiwu-nv in #12535
  • [None][infra] Waive 2 failed cases for main in post-merge 2627 by @ZhanruiSunCh in #12605
  • [None][fix] Fix DSACacheManager and RocketCacheManager KV cache estimation ignoring num_layers for draft models by @lancelly in #12571
  • [None][doc] Add MoE developer guide for fused_moe module by @xxi-nv in #12534
  • [None][chore] Remove Model Registry Check from workflows, the check already runs in pre-commit by @tcherckez-nvidia in #12590
  • [https://nvbugs/5983390][perf] Split MLA DSA custom op for piecewise CUDA graph capture by @liji-nv in #12503
  • [None][fix] Reorder generation_logits to align with final beam search output ordering by @achartier in #12268
  • [TRTC-351][chore] Deprecation warnings on TRT backend entrypoints by @venkywonka in #11723
  • [TRTLLM-10804][infra] add ubuntu2404 wheel image for SBSA by @niukuo in #12436
  • [#12288][feat] Add Mistral 4-small support to AutoDeploy by @bmarimuthu-nv in #12266
  • [None][infra] waive failed case for main by @EmmaQiaoCh in #12621
  • [https://nvbugs/5920751][chore] Unwaive a test that has been fixed by @longlee0622 in #12610
  • [TRTLLM-9526][feat] optimize host perf for python cache transceiver by @chuangz0 in #12273
  • [None][test] Fix kimi-k2 test issue by @yufeiwu-nv in #12604
  • [None][feat] Add Mamba2 MTP SSM cache CUDA kernel for tree-based speculative decoding by @JadoTu in #12537
  • [None][chore] Bump version to 1.3.0rc11 by @ZhanruiSunCh in #12627
  • [https://nvbugs/6013692][fix] handle CUDA_ERROR_INVALID_VALUE in kv_cache_v2 _is_prop_supported by @lfr-0531 in #12613
  • [https://nvbugs/6011517][fix] Fix autotuner OOM for trtllmGen MoE runners at large context length by @hyukn in #12523
  • [None][feat] serve-config-guide skill for basic aggregate single-node serving configs by @venkywonka in #12054
  • [None][fix] always sync sampler_event in update_requests by @Funatiq in #12585
  • [None][infra] Pin the version of mypy due to dependency conflicts by @EmmaQiaoCh in #12650
  • [https://nvbugs/5996645][fix] Fix Pyxis Error in Disagg Perf Test by @chenfeiz0326 in #12575
  • [TRTLLM-10061][feat] Add FORCE_CHUNK context chunking policy by @VALLIS-NERIA in #12483
  • [None] [feat] Add densegemm backend for MoE by @zongfeijing in #10479
  • [TRTLLM-8922][feat] gen-first disagg scheduling, part 2 by @reasonsolo in #12239
  • [https://nvbugs/5972362][fix] Avoid counting KV cache uses during warmup for Prometheus KV cache metrics by @yijingl-nvidia in #12132
  • [https://nvbugs/6007352][fix] Accept strict field in tools and store field in chat requests by @JunyiXu-nv in #12482
  • [TRTLLM-11551][feat] Support EPLB with various MoE backends for nemotron-h models by @Wanli-Jiang in #12280
  • [None][infra] Skip already-applied patches gracefully in 3rdparty FetchContent by @achartier in #12550
  • [TRTLLM-11385][chore] Mark TRTLLMSampler as deprecated and update documentation by @Funatiq in #11938
  • [None][feat] Skip softmax via sparsity ratio by @rohansjoshi in #11995
  • [None][infra] Add container scanning to plc nightly pipeline by @yuanjingx87 in #12549
  • [https://nvbugs/5948878][fix] fix lost requests by @bo-nv in #12348
  • [TRTLLM-7335][infra] use JobBuilder to trigger downstream job by @niukuo in #7079
  • [https://nvbugs/5800672][fix] Unwaive tests fixed by Austin Lab GPU topo config resolution by @peaceh-nv in #12453
  • [TRTLLM-11119][feat] Blackwell SageAttention, Integrate into AttentionOp API by @xrq-phys in #11718
  • [TRTLLM-9970][infra] Get TOT waive list from github repo first and then gitlab repo by @yiqingy0 in #11063
  • [https://nvbugs/5850183][fix] Re-enable passing tests by @dongfengy in #12568
  • [None][feat] Add DWDP (Distributed Weight Data Parallelism) support for MoE inference by @tianyuz-nv in #12136
  • [https://nvbugs/6038228][test] Add WA for trtllm-bench hang issue and improve its robustness by @yufeiwu-nv in #12655
  • [https://nvbugs/5836828][fix] Fix GPTOSS CUTLASS MOE on Hopper nvlink one-sided workspace overflow by @dongfengy in #12666
  • [https://nvbugs/5996656][fix] unwaive qwen3 ci test by @byshiue in #12652
  • [None][fix] fix mooncake dynamic load in transfer_agent_binding by @chuangz0 in #12181
  • [None][fix] Add GlmMoeDsaForCausalLM to EPLB supported model list by @qiaoxj07 in #12607
  • [None][doc] update supported models to include Kimi K2/K2.5 and GLM-5 by @dc3671 in #12654
  • [https://nvbugs/5911788][fix] Isolate single_gpu ray orchestrator tests to avoid CI timeouts by @shuyixiong in #12616
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12645
  • [https://nvbugs/6007967][fix] fix disagg pp hang issue by @bo-nv in #12528
  • [TRTLLM-10232][feat] Support LoRA adapter for nemotron-h models by @Wanli-Jiang in #12154
  • [https://nvbugs/5983390][perf] Multiple host perf optimizations for DSA part by @hyukn in #12581
  • [None][fix] Correct reused block counting on corner case by @tongyuantongyu in #12404
  • [https://nvbugs/6032056][fix] Clamp block indices to prevent OOB in DSA with MTP by @sunnyqgg in #12657
  • [None][revert] Revert "[TRTLLM-11119][feat] Blackwell SageAttention, Integrate into … by @yunruis in #12679
  • [#12332][feat] AutoDeploy: SuperV3 MTP Support by @govind-ramnarayan in #12326
  • [TRTLLM-11163][feat] Introduce a fast path (token IDs + MM) for VLMs instead of de-tokenizing already encoded prompt by @moraxu in #11708
  • [TRTLLM-11237][fix] [fix] Synchronize NCCL memory allocation error handling by @nv-lschneider in #12125
  • [https://nvbugs/6008710][fix] Adjust prompt logprobs to use the correct prompt token id by @stnie in #12499
  • [None][infra] Bump tornado and black in container by @yuanjingx87 in #12600
  • [TRTLLM-11043][feat] Add global pool support for suffix automaton speculative decoding by @cascade812 in #12130
  • [https://nvbugs/5979673][fix] improve NIXL agent import error diagnostics by @Shixiaowei02 in #12446
  • [None][feat] Add triton paged attention for AutoDeploy by @nvchenghaoz in #12642
  • [#12560][fix] Fix disaggserving hang on block reuse after eviction by @Tabrizian in #12667
  • [None][refactor] VisualGen attention backend refactor by @NVShreyas in #12663
  • [TRTLLM-11318][feat] move VisualGen APIs to a separate dir by @zhenhuaw-me in #12538
  • [None][infra] Waive failed test by @yuanjingx87 in #12714
  • [#11538][fix] Enable sliding window attention for Mistral/Mixtral by @karljang in #12597
  • [None][infra] Waive failure test by @yiqingy0 in #12726
  • [TRTLLM-10061][feat] Add support of linear attention state for C++ KV cache manager by @VALLIS-NERIA in #12531
  • [None][feat] Temporally-Correlated Heuristic-guided Indexer TopK for Sparse Attention by @longcheng-nv in #12385
  • [None][test] Remove OOM test case from L40S test list by @yufeiwu-nv in #12685
  • [None][feat] Support MLA generation in TrtllmGen attention backend by @yihwang-nv in #12606
  • [None][infra] No warn_unused_ignores temporarily by @EmmaQiaoCh in #12728
  • [None][feat] Qwen3-Next MTP by @IzzyPutterman in #11370
  • [None][doc] Blog19 for DWDP. by @wanqian-nv in #12725
  • [https://nvbugs/5800591][chore] Unwaive a deepseek MTP test by @mikeiovine in #12327
  • [None][doc] Add visual generation models to supported models page by @chang-l in #12464
  • [TRTLLM-11146][feat] Extend python cache transceiver to support nemotron by @bo-nv in #12150
  • [TRTLLM-11523][feat] Handle different chat template types by @2ez4bz in #12336
  • [#12257][fix] Use the first non-None result returned by hf download workers by @kev-bi in #12259
  • [None][feat] Add multi-turn support for trtllm-bench by @cascade812 in #12468
  • [None][fix] Replace assertions with warnings for unsupported logits/logprobs in speculative sampler by @yifjiang in #12547
  • [None][cleanup] Add supplemental ruff lint for legacy files via ruff-legacy hook by @venkywonka in #11469
  • [https://nvbugs/5864187][fix] Address H20 Weights Loading OOM for GPTOSS by @dongfengy in #11321
  • [None][feat] Add fused DiT QK Norm + RoPE CUDA kernel for FLUX by @karljang in #11869
  • [TRTLLM-9772][feat] Support cache reuse for SSM in KVCacheManagerV2 by @lowsfer in #12644
  • [None][fix] Harmony Parser Delta Grouping + Reuse Report + Better Test Coverage by @dongfengy in #12467
  • [None][feat] MLIR-based auto-generated elementwise fusion for AutoDeploy by @suyoggupta in #12427
  • [None][doc] Clean up latest news + blogs, update overview, highlight visual gen by @laikhtewari in #12753
  • [None][feat] Add --custom_tokenizer CLI option to trtllm-bench by @qiaoxj07 in #12586
  • [https://nvbugs/6027560][fix] fix hang issues on DGX_B200-8_GPUs-PyTo… by @bo-nv in #12656
  • [TRTLLM-11597][fix] fix disagg kvcache router for chat API and add disagg benchmark for ai_perf by @reasonsolo in #12337
  • [None][fix] Fix Cuda event crash with perf metrics by @jthomson04 in #12639
  • [None][fix] Update codes to support nemotron-h corner cases by @Wanli-Jiang in #12620
  • [None][infra] Waive 10 failed cases for main in post-merge 2636 by @ZhanruiSunCh in #12767
  • [https://nvbugs/6018051][fix] Add port conflict retry for disaggregated MP tests by @reasonsolo in #12618
  • [https://nvbugs/6025177][fix] Fix KV cache issue by @thorjohnsen in #12673
  • [TRTLLMINF-37][feat] Add CI agent failure analysis to L0_MergeRequest… by @dpitman-nvda in #12543
  • [None][doc] Update C++ coding guidelines. by @hnover-nv in #12577
  • [#12324][fix] Fixed wrong token suppressed with ignore_eos in torch sampler by @MrGeva in #12358
  • [https://nvbugs/5849648][fix] Fix GPTOSS Chat Template for Disagg Tests by @dongfengy in #12724
  • [#11094][feat] AutoDeploy transform to fuse silu+mul by @MrGeva in #12497
  • [None][infra] Fix source code scanning by @yuanjingx87 in #12773
  • [None][chore] Remove gpu-shell tool from ad-run-agent by @govind-ramnarayan in #12418
  • [#9306][cleanup] Remove some fields with redefined defaults by @2ez4bz in #11671
  • [None][feat] reuse triton slicing kernel for GDN prefill transpose by @nv-guomingz in #12737
  • [None][feat] fix mamba metadata prefill bubble in chunked prefill serving by @nv-guomingz in #12736
  • [https://nvbugs/5781731][fix] Unwaive Ray test by @dominicshanshan in #9654
  • [None][fix] Fix outdated argument of readme.md for executorExampleDisaggregated.cpp by @Fan-Yunfan in #12276
  • [None][feat] Trtllm-gen FMHA JIT support by @yunruis in #12612
  • [None][feat] retune causalConv1d fwd dispatch for varlen and short sequences by @nv-guomingz in #12739
  • [TRTLLM-9948][infra] Move to use FlexCache in Austin for 5080 nodes by @EmmaQiaoCh in #12615
  • [TRTLLM-11768][fix] Config updates to enable NVFP4 by @2ez4bz in #12776
  • [https://nvbugs/6008468][fix] Fix top-K logprobs size for PP by @pengbowang-nv in #12623
  • [None][fix] Remove clone in fp8 quant. by @Tracin in #12687
  • [https://nvbugs/6011284][fix] Fix Qwen2.5 mixed precision accuracy issue. by @Tracin in #12609

Full Changelog: v1.3.0rc10...v1.3.0rc11
