github NVIDIA/TensorRT-LLM v1.3.0rc16

pre-release, 3 hours ago

Highlights

  • Model Support

    • Add Gemma4 multimodal support with native vision and audio towers (#14300)
    • Add Qwen3.5 MTP and Qwen3.6-27B-FP8 model support (#12646, #14359)
    • Add EXAONE-4.5 and Laguna model support (#12873, #13559)
    • Switch DeepSeek, NemotronH, Qwen3, and Qwen3.5-MoE to sharding-IR canonical models (#13478)
  • API

    • Refactor the VisualGenArgs API and registry (#14175)
    • Drop sink_token_length from the PyTorch attention surface (#14275)
    • Add OpenAI chat logit bias validation (#13518)
    • Reject incompatible KV connector configurations at construction time (#13577)
  • Feature

    • Add exact multimodal KV block hashing and KV cache reuse probing (#13815, #14333)
    • Add KV cache manager v2 with Python transceiver updates (#12928)
    • Add disaggregated serving support with block reuse enabled for hybrid models (#14060)
    • Add FlashInfer MLA attention backend support and SkipSoftmax sparse attention support for visual generation (#13428, #12947)
    • Add Ring Attention and unified context parallelism for VisualGen (#13821)
    • Add legacy and TensorRT-LLM 1.x modelopt quantization config support (#14088)
    • Add debugging environment variables for mamba modules (#14170)
    • Add single-rank MPI sleep/wakeup support and a rank-0 collective_rpc shim (#14052)
    • Add opentelemetry metrics for disaggregated serving with multiple postprocessing workers (#12637)
    • Support SWA scratch reuse rewind (#14412)
    • Improve FMHA, FlashInfer TRTLLM-Gen, and KV cache buffer calculation paths (#14291, #12525)
    • Improve fused-kernel and attention performance with shared-expert combine fusion, paged MQA logits decode tuning, LTX2 fused RMSNorm/RoPE, EAGLE3 dynamic tree kernel optimizations, and cu_seqlens conversion updates (#14306, #14133, #13985, #13426, #13566)
    • Optimize beam search candidate reconstruction by skipping prompt-prefix copies (#14197)
    • Update cubins to resolve the FMHA PDL issue (#14462)
    • Use CUDA 13 CUTLASS DSL package (#14354)
  • Fix

    • Fix disaggregated benchmark, usage propagation, and worker registration stability issues (#13347, #14177, #14289)
    • Fix DeepSeek-V3 OOM handling and artifacts paths (#14232)
    • Fix missing get_draft_token_length import in py_executor (#14366)
    • Fix LoRA load failure handling (#13517)
    • Fix Kimi K2.5 speculative decoding behavior (#14379)
    • Fix Qwen3HybridConfig layer_types derivation and route load_hf_model_config through AutoConfig (#13832, #14410)
    • Fix CppMambaHybridCacheManager functional and performance issues (#14003)
    • Fix MTP disaggregated speculative_config coverage (#14391)
    • Fix KVCacheTransfer divide-by-zero and KV cache grain slot refinement issues (#13618, #14442)
    • Fix memory usage during refit and EPLB config model loading (#14331, #11962)
    • Fix MPI worker allocator configuration and GB300 cluster environment setup (#14152, #14460)
    • Fix profiler runner exception handling with synchronized CUDA cleanup (#13469)
    • Disable mamba replay by default (#14471)
  • Documentation

    • Add a Claude skill for multimodal model onboarding (#13842)
    • Update Gemma 4 entries in supported-models.md (#14463)
    • Fix invalid documentation and deployment guide links (#14337, #14522)
  • Benchmark

    • Add LPIPS scoring for visual generation model regression tests (#13567)
    • Add a bench_moe microbenchmark (#14507)
    • Update visual generation and accuracy thresholds for Wan 2.2, Qwen3.5-4B DFlash, and Nano V3 (#14372, #14411, #14078)
    • Disable ignore-eos when using speculative decoding in performance tests (#14347)
  • Test & Infra

    • Split verl tests into fine-grained per-case wrappers (#14037)
    • Add new stress cases (#14390)
    • Clean outdated test duration entries and remove deprecated disaggregated sampler and spark test cases (#14340, #14335, #14380)
    • Isolate ray tests to avoid GCS timeout in a single pytest session (#14342)
    • Improve L0 retry timeout budgeting and cap infra retry attempts (#14323, #14415)
    • Handle sacct errors when checking Slurm job status (#14367)
    • Fix B300 MegaMoE and MoE test selection (#14362, #14401)
    • Fix container scanning according to the latest security team guidance (#14430)
    • Deduplicate miscellaneous unit tests on B200 (#14525)

What's Changed

  • [None][chore] Update Claude Code agents and skills by @kaiyux in #14344
  • [None][perf] Fuse sigmoid+mul+add shared-expert combine into one Trit… by @nv-guomingz in #14306
  • [None][infra] Waive 1 failed cases for main in pre-merge 38925 by @ZhanruiSunCh in #14346
  • [None][infra] Revert Mingyang back to mingyangHao in allowlist by @ZhanruiSunCh in #14349
  • [None][cleanup] MistralSmall related cleanups by @2ez4bz in #14271
  • [None][chore] Clean test_durations file by removing outdated items. by @nv-guomingz in #14340
  • [None][infra] Waive 2 failed cases for main in post-merge 2725 by @ZhanruiSunCh in #14357
  • [None][feat] Exact multimodal KV blockhashing by @venkywonka in #13815
  • [None][infra] Waive 1 failed cases for main in pre-merge 38987 by @ZhanruiSunCh in #14350
  • [None][feat] Update the logic of FMHA JIT path by @heyuhhh in #14291
  • [None][feat] opentelemetry metrics for num_postproc_workers > 0 disagg by @karen-sy in #12637
  • [TRTLLM-12385][feat] Use LPIPS score for visual gen model regression test by @yibinl-nvidia in #13567
  • [None][chore] Remove closed bugs by @xinhe-nv in #14217
  • [https://nvbugs/6133201][fix] Bump GEN max_num_tokens in disagg perf YAMLs by @xwang233 in #14191
  • [None][feat] add single-rank MPI sleep/wakeup and rank-0 collective_rpc shim by @hhzhang16 in #14052
  • [https://nvbugs/6093911][fix] Fix disagg gen-only benchmark hang under ADP router imbalance by @chienchunhung in #13347
  • [None][fix] Import missing get_draft_token_length in py_executor by @nv-guomingz in #14366
  • [TRTLLM-12342][feat] Ring Attention, Unified Context Parallel for VisualGen by @NVShreyas in #13821
  • [None][test] Split verl tests into 19 fine-grained per-case wrappers by @Superjomn in #14037
  • [TRTLLM-11547][feat] Add Qwen3.5 MTP support. by @nv-guomingz in #12646
  • [https://nvbugs/6143599][fix] DeepSeek-V3 OOM and artifacts path by @dominicshanshan in #14232
  • [https://nvbugs/6114141][test] Remove deprecated disagg trtllm_sampler test by @Shixiaowei02 in #14335
  • [None][doc] Add Claude skill for multimodal model onboarding by @yechank-nvidia in #13842
  • [https://nvbugs/6141803][fix] Skip Qwen3.5-4B tests pre-hopper by @amukkara in #14055
  • [None][fix] ADP router crashes on serve when scheduling_params.attent… by @nv-guomingz in #14267
  • [https://nvbugs/6185190][doc] fix invalid links in doc by @nv-guomingz in #14337
  • [None][feat] Refactor to support legacy and 1.x modelopt quant config format by @Wanli-Jiang in #14088
  • [None][feature] Add env variables to help debugging mamba modules. by @Wanli-Jiang in #14170
  • [None][infra] Handle sacct error when checking slurm job status by @yuanjingx87 in #14367
  • [https://nvbugs/6027594][fix] Unwaive testcase by @YihuiLu512 in #14383
  • [None][chore] Remove unnecessary buffer to save memory during refit by @shuyixiong in #14331
  • [https://nvbugs/6153638][fix] unwaive tests for testing the flaky issue by @JunyiXu-nv in #14284
  • [https://nvbugs/6171743][fix] Set PYTORCH_ALLOC_CONF=expandable_segments:True on MPI workers via `patch_mpi_ by @tensorrt-cicd in #14152
  • [None][test] Add new stress cases by @fredricz-20070104 in #14390
  • [None][feat] Gemma4 MM: native vision + audio towers by @Hudayday in #14300
  • [TRTLLM-12719][cbts] Add core code related rule by @crazydemo in #14266
  • [None][test] Update bug ID for test_all_optimizations_combined waiver by @mzweilz in #14402
  • [None][infra] Waive 6 failed cases for main in post-merge 2726 by @ZhanruiSunCh in #14405
  • [None][test] Disable ignore-eos when Spec Decoding in Perf Test by @chenfeiz0326 in #14347
  • [None][fix] Isolate ray tests to avoid GCS timeout in one pytest session by @shuyixiong in #14342
  • [https://nvbugs/6110638][fix] Mark AutoDeploy attention DP world sizes by GPU count by @galagam in #14148
  • [None][feat] EXAONE-4.5 Support by @yechank-nvidia in #12873
  • [https://nvbugs/6004530][fix] Unwaive Qwen3.5 35B A3B FP8 test case by @nv-guomingz in #14406
  • [None][feat] add KV cache reuse probe by @lowsfer in #14333
  • [TRTLLM-12706][perf] Optimize beam search candidate reconstruction by skipping prompt-prefix copies by @xuanzic in #14197
  • [#12359][feat] AutoDeploy: MTP performance: Integrate FI kernel for extend path by @galagam in #13711
  • [TRTLLMINF-89][feat] Make L0 retries timeout-budget aware by @dpitman-nvda in #14323
  • [TRTLLM-12500][feat] Add support for Qwen3.5 VL MoE by @moraxu in #14164
  • [None][chore] Bump version to 1.3.0rc16 by @VALLIS-NERIA in #14422
  • [None][feat] Add Laguna model support (Poolside Laguna-XS.2) by @DomBrown in #13559
  • [None][fix] Reduce host memory usage during EPLB config model loading. by @jthomson04 in #11962
  • [https://nvbugs/6079440][test] Unwaive MTP speculative decoding test by @sunnyqgg in #14341
  • [https://nvbugs/5911709][fix] Wrap lora load failures by @yibinl-nvidia in #13517
  • [https://nvbugs/6175060][fix] Fix B300 MegaMoE test selection by @xxi-nv in #14362
  • [None][fix] Reject incompatible KV connector configurations at construction time by @jthomson04 in #13577
  • [https://nvbugs/6115562][fix] defer worker registration until HTTP server is accepting by @reasonsolo in #14289
  • [None][chore] Fix Kimi_k25 with spec dec by @ziyixiong-nv in #14379
  • [None][feat] KV cache manager v2 + python transceiver bug fix by @chuangz0 in #12928
  • [https://nvbugs/6141606][fix] Move the layer_types derivation into Qwen3HybridConfig.from_hf (where `pretr by @tensorrt-cicd in #13832
  • [None][fix] Fix CppMambaHybridCacheManager functional and perf issues by @VALLIS-NERIA in #14003
  • [TRTLLM-35237][feat] Tune cute dsl paged MQA logits decode kernel by @limin2021 in #14133
  • [https://nvbugs/6185182][fix] Adjust H20 accuracy for Qwen3.5-4B DFlash by @tensorrt-cicd in #14411
  • [None][test] Remove the testcases for spark by @JennyLiu-nv in #14380
  • [None][fix] Add MTP speculative_config to wideep dep48 mtp3 disagg yaml by @yingguo-trt in #14391
  • [https://nvbugs/6185234][fix] route load_hf_model_config via AutoConfig by @Hudayday in #14410
  • [https://nvbugs/6110074][fix] Add torch.cuda.synchronize() wrapped in try/except in the _profile_runners excep by @tensorrt-cicd in #13469
  • [None][feat] Disable shared paged index in flashinfer trtllm-gen fmha kernel and unify kv cache buffer calculation with thop.attention by @yihwang-nv in #12525
  • [https://nvbugs/6112497][test] Unwaive passing test by @yihwang-nv in #14387
  • [None][fix] cold-start warmup for KV-aware ADP router by @lancelly in #14307
  • [https://nvbugs/6185212][fix] Fix B300 MoE test list ids by @xxi-nv in #14401
  • [None][infra] Waive 1 failed cases for main in pre-merge 39395 by @ZhanruiSunCh in #14455
  • [https://nvbugs/6156492][fix] Fix disaggregated usage propagation by @reasonsolo in #14177
  • [https://nvbugs/6106659][fix] Add Qwen3.6-27B-FP8 support. by @nv-guomingz in #14359
  • [None][chore] Use CUDA 13 CUTLASS DSL package by @xxi-nv in #14354
  • [None][doc] Update Gemma 4 entries in supported-models.md by @Hudayday in #14463
  • [None][infra] Waive 2 failed cases for main in post-merge by @xinhe-nv in #14450
  • [None][fix] Cap infra-retry budget at 2 attempts total by @dpitman-nvda in #14415
  • [None][infra] Fix container scanning according to security teams latest update by @yuanjingx87 in #14430
  • [https://nvbugs/6184143][fix] AutoDeploy: Fix newly added unit tests for Transformers 5.5.3 by @govind-ramnarayan in #14273
  • [https://nvbugs/6185192][fix] raise Wan 2.2 VBench threshold by @o-stoner in #14372
  • [TRTLLM-11320][refactor] Refactor VisualGenArgs API and registry by @zhenhuaw-me in #14175
  • [None][chore] bump transformers to 5.5.4 by @longlee0622 in #14456
  • [https://nvbugs/5914391][fix] Add OpenAI chat logit bias validation by @yibinl-nvidia in #13518
  • [None][feat] Revert Add support for Qwen3.5 VL MoE (#14164) by @nv-guomingz in #14465
  • [None][feat] Disable mamba replay by default by @tijyojwad in #14471
  • [None][feat] Update cubins to resolve FMHA PDL issue by @heyuhhh in #14462
  • [None][infra] Waive 1 failed cases for main in pre-merge 39582 by @ZhanruiSunCh in #14485
  • [TRTLLM-12580][perf] ltx2: fused RMSNorm+RoPE across all attention paths + PE pre-shard by @luyiyun1021 in #13985
  • [None][perf] EAGLE3 dynamic tree kernel optimizations by @sunnyqgg in #13426
  • [https://nvbugs/6079901][fix] Avoid divide-by-zero in KVCacheTransfer… by @farazkh80 in #13618
  • [None][feat] support SWA scratch reuse rewind by @lowsfer in #14412
  • [#14173][tests] move autodeploy accuracy tests to post merge and use model registry by @MrGeva in #14352
  • [TRTLLM-12027][feat] Disagg serving support with block reuse ON for hybrid models by @bo-nv in #14060
  • [None][chore] Drop sink_token_length from PyTorch attention surface by @yuxianq in #14275
  • [https://nvbugs/6185248][test] Unwaive K2.5 thinking MTP3 perf sanity test by @tianyuxbear in #14461
  • [None][feat] Add FlashInfer MLA attention backend support by @Tracin in #13428
  • [None][feat] Add SkipSoftmax sparse attention support for visual generation by @karljang in #12947
  • [https://nvbugs/6190759][fix] set env on some gb300 cluster by @chuangz0 in #14460
  • [https://nvbugs/6157131][fix] lower the GSM8K accuracy grade for Nano V3 by @tcherckez-nvidia in #14078
  • [None][fix] Fix KV cache grain slot refinement by @lowsfer in #14442
  • [None][test] Waive 7 failed cases for main in QA CI by @xinhe-nv in #14504
  • [None][test] Waive 1 failed cases for main in QA CI by @xinhe-nv in #14503
  • [None][infra] Waive 10 failed cases for main in post-merge 2733 by @ZhanruiSunCh in #14514
  • [https://nvbugs/6094070][fix] Skip ray-marked integration tests when --run-ray is not set by @shikicloud in #14226
  • [TRTLLM-12635][feat] add bench_moe microbenchmark by @xxi-nv in #14507
  • [None][fix] Unwaive Qwen3.5 bf16 mtp on case by @nv-guomingz in #14511
  • [None][fix] fix typo by @bo-nv in #14510
  • [https://nvbugs/6215684][fix] Fix invalid links in deployment guide by @nv-guomingz in #14522
  • [https://nvbugs/6120981][fix] Switch to cu_seqlens_to_chunk_indices_offsets_triton with total_seqlens/extra_ch by @tensorrt-cicd in #13566
  • [None][infra] Waive 5 failed cases for main in post-merge 2733 by @ZhanruiSunCh in #14516
  • [TRTLLM-13429][feat] Switch DeepSeek/NemotronH/Qwen3/Qwen3.5-MoE to sharding-IR canonical models by @greg-kwasniewski1 in #13478
  • [TRTLLM-12942][ci] Dedup misc unit tests on B200 by @YihuiLu512 in #14525
  • [None][infra] Waive 9 failed cases for main in post-merge by @xinhe-nv in #14515

Full Changelog: v1.3.0rc15...v1.3.0rc16
