github NVIDIA/TensorRT-LLM v1.3.0rc18

pre-release, 9 hours ago
  • Known Issues

    • DSV3.2 crashes with an illegal memory access (IMA) in various long-running perf tests on GB200/GB300 when the CuteDSL MoE backend is used. To work around this issue, use a different MoE backend.
  • Model Support

    • Support Nemotron-H NVFP4 checkpoint on Hopper (#14775)
    • Add Qwen image support (#13449)
    • Support Step-3.7-Flash model (#14711)
    • Add Cosmos3-Nano and Cosmos3-Super support (#14824)
    • Add AFMoE Trinity support (#13148)
  • API

    • Add logprobs_simple_format option to return logprobs as a flat list[float] (#13972)
    • trtllm-serve, trtllm-eval, trtllm-bench: Make CLI flags take precedence over --config / --extra_llm_api_options YAML (#14812)
  • Feature

    • Upgrade NIXL to v1.0.1 and UCX to 1.21 (#14436)
    • Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL (#14453)
    • Enable FlashInfer GDN decoding kernel for Qwen3.5 (#13645)
    • Add per-expert LoRA support with Cutlass backend (#14801)
    • Reduce OpenAI stream postprocess overhead (#14708)
    • Add encoder CUDA graph support to llm.encode() (#14326)
    • Use a Triton kernel for C++ mamba hybrid state update (#14869)
    • Fuse masked gather + finalize-scale into one Triton kernel in DeepGemmFusedMoE (#14592)
    • Support KVCacheManagerV2 adjust() in single GPU + agg PyExecutor loop (#14578)
    • Add disk cache config for KVCacheManagerV2 (#14845)
    • Add Wan I2V generation example (#14981)
    • Add LTX-2 visual generation example (#14976)
    • Update flashinfer-python from 0.6.12rc2 to 0.6.12 (#14805)
  • Fix

    • Fix mamba-out-of-block error with ADP + BS=1 + disagg (#14853)
    • Fix XQA IMA for invalid pages with sliding window (#14459)
    • Propagate event loop errors to await_responses callers (#12735)
    • Fix Mamba replay mode accuracy issues (#14509)
    • Fix PyExecutor hang in disagg TP prefill (#14020)
    • Fix stale runtime metadata issues during MLA fallback transitions (#14049)
    • Fix KVCacheManagerV2 block counting correctness issues (#14725)
    • Canonicalize multimodal cache-key serialization to prevent hash collisions (#14800)
    • Fix LTX-2 audio PE padding issues (#14818)
    • Release KVCacheManagerV1 blocks on MAX_UTILIZATION pause (#14723)
    • Fix config sharing issue for Qwen3-VL (#14766)
    • Enforce request and buffer index lifecycle integrity (#14768)
    • Add nemotron-v3 as the proper nemotron-h reasoning parser (#14900)
    • Clamp KV pool window sizes to max_seq_len (#14905)
    • Fix mamba block calculation (#14524)
    • Add trust_remote_code=True to the LLM(...) constructor to fix various model loading issues (#14892)
    • Fix deep EP partial warp sync for GPT-OSS shapes (#14977)
    • Add warmup for trtllm-gen fmha JIT kernels (#14851)
  • Documentation

    • Add VisualGen API walkthrough example and docs page (#14685)
    • Add Nemotron 3 Ultra doc (#14964, #15113)
  • Test & Infra

    • Pipe stderr separately in subprocess calls to improve error reporting in Allure (#14750)
    • Remove obsolete tests (#14995, #14660, #14992, #14952, #14749)
    • Parallelize post stages: Rerun Report, Test Coverage, and AI Failure Analysis (#14528)
    • Relocate tests to right-sized stages (#14684)
    • Move non-default-feature tests to post merge (#15038)
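
The precedence rule from #14812 (explicit CLI flags win over values loaded from --config / --extra_llm_api_options YAML) can be illustrated with a minimal sketch. This is not the TensorRT-LLM implementation: `resolve_options`, the option names, and the convention that `None` means "flag not passed" are assumptions made only for illustration.

```python
from typing import Any, Optional


def resolve_options(defaults: dict[str, Any],
                    yaml_options: dict[str, Any],
                    cli_flags: dict[str, Optional[Any]]) -> dict[str, Any]:
    """Merge option sources so that explicit CLI flags win over YAML values.

    Hypothetical helper: a CLI value of None is treated as "flag not passed".
    """
    resolved = dict(defaults)
    resolved.update(yaml_options)  # YAML overrides built-in defaults
    # Only flags the user actually passed override the YAML values.
    resolved.update({k: v for k, v in cli_flags.items() if v is not None})
    return resolved


# Illustrative option names; they are not taken from the release notes.
defaults = {"max_batch_size": 8, "tp_size": 1}
yaml_options = {"max_batch_size": 64, "tp_size": 2}
cli_flags = {"max_batch_size": 16, "tp_size": None}  # only max_batch_size was passed

resolved = resolve_options(defaults, yaml_options, cli_flags)
print(resolved)  # {'max_batch_size': 16, 'tp_size': 2}
```

In this sketch, a flag left unset on the command line falls through to the YAML value, and a YAML key left unset falls through to the default.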

What's Changed

  • [None][test] Update datasets path by @JennyLiu-nv in #14671
  • [None][infra] Update new .test_durations by @EmmaQiaoCh in #14661
  • [TRTLLM-13015][feat] drop complex visual_gen CLI example scripts by @zhenhuaw-me in #14632
  • [https://nvbugs/6117811][fix] Fix XQA IMA for invalid pages with sliding window by @pengbowang-nv in #14459
  • [None][feat] Tune mamba config by env variables by @Wanli-Jiang in #14730
  • [None][test] Update moe backend for ctx and acceptance length env by @fredricz-20070104 in #14803
  • [None][test] Update precision of previous device step time by @fredricz-20070104 in #14809
  • [None][infra] Waive 12 failed cases for main in post-merge 2749 by @ZhanruiSunCh in #14802
  • [TRTLLM-12971][infra] Fix parse classname logic in timeout result by @yiqingy0 in #14559
  • [https://nvbugs/6038228][fix] Propagate event loop errors to await_responses callers by @JunyiXu-nv in #12735
  • [TRTLLM-12288][feat] Support Nemotron-H nvfp4 ckpt on Hopper by @JadoTu in #14775
  • [TRTLLM-12596][feat] Support simple logprob format by @tongyuantongyu in #13972
  • [None][fix] Stabilize Mamba replay state update by @sunnyqgg in #14509
  • [None][feat] Upgrade NIXL to v1.0.1 and UCX to 1.21 by @chuangz0 in #14436
  • [None][feat] Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL composite VA by @tianyuz-nv in #14453
  • [TRTLLM-10947][perf] eagle3: use cudaMemcpy2DAsync custom op for hidden-state capture by @pcicotti in #14479
  • [None][fix] Fix PyExecutor hang in disagg TP prefill by @jthomson04 in #14020
  • [https://nvbugs/6240561][fix] Autodeploy fix the deepseek accuracy drop by @nvchenghaoz in #14774
  • [#12702][feat] Autodeploy deprecate the legacy triton attention by @nvchenghaoz in #14194
  • [None][test] Waive 5 failed cases for main in QA CI by @tensorrt-cicd in #14789
  • [None][test] Waive 7 failed cases for main in QA CI by @tensorrt-cicd in #14791
  • [https://nvbugs/6240561][fix] Fix AutoDeploy DeepSeek-R1 accuracy drop by @taylor-yb-lee in #14793
  • [#14588][fix] [AutoDeploy] Fix OOM of DeepSeek-R1 NVFP4 for tp=4 by @taylor-yb-lee in #14477
  • [https://nvbugs/6179761][fix] Save LTX-2 BF16 weights to speed up perf by @yibinl-nvidia in #14639
  • [TRTLLM-13028][doc] Add VisualGen API walkthrough example and docs page by @zhenhuaw-me in #14685
  • [None][chore] Update flashinfer-python from 0.6.12rc2 to 0.6.12 by @yihwang-nv in #14805
  • [None][fix] AutoDeploy: Unwaive llmc standalone tests by @bmarimuthu-nv in #14700
  • [TRTLLM-35882][feat] Add cute dsl gvr top-k decode kernel by @limin2021 in #14602
  • [https://nvbugs/6222480][test] fix stress test issue on H100 by @xinhe-nv in #14721
  • [None][test] Waive 6 failed cases for main in QA CI by @tensorrt-cicd in #14787
  • [None][test] Waive 1 failed case for main in QA CI by @tensorrt-cicd in #14783
  • [None][fix] synchronize MLA cache reuse fallback metadata by @DhineshPonnarasan in #14049
  • [None][feat] Add KV cache prefetch by @lowsfer in #14748
  • [https://nvbugs/6191524][fix] In MLA.forward_context, also call the warmup when has_cached_kv_for_mla_context by @tensorrt-cicd in #14536
  • [None][test] Waive 2 failed cases for main in QA CI by @tensorrt-cicd in #14839
  • [None][fix] Cherry-pick kv_cache_manager_v2 fixes to main by @lowsfer in #14725
  • [None][test] Waive 11 failed cases for main in post-merge by @tensorrt-cicd in #14854
  • [None][feat] Enable flashinfer gdn decoding kernel for qwen3.5 by @nv-guomingz in #13645
  • [https://nvbugs/5940460][fix] Harden FP8 quant fusion matching after PyTorch 26.02 update by @pcicotti in #14697
  • [https://nvbugs/6221450][fix] AutoDeploy: Qwen3.5 400B NVFP4 accuracy regression fix by @taylor-yb-lee in #14667
  • [TRTLLM-12648][test] implement disagg cancel stress metrics_thread by @chienchunhung in #14807
  • [None][chore] Update AD model list by @tcherckez-nvidia in #14686
  • [https://nvbugs/6226933][fix] canonicalize multimodal cache-key serialization to prevent hash collisions by @venkywonka in #14800
  • [https://nvbugs/6240561][fix] Unwaive DeepSeek R1 accuracy test by @taylor-yb-lee in #14870
  • [None][feat] Add Qwen image support by @pst2154 in #13449
  • [TRTLLM-12507][feat] Per-expert lora support with Cutlass backend by @brb-nv in #14801
  • [None][chore] Make submit.py able to run single-GPU tests and accept a customized config file by @HuiGao-NV in #14630
  • [None][test] Waive 9 failed cases for main in QA CI by @tensorrt-cicd in #14792
  • [None][test] Update DSV32 32k4k config to avoid timeout issue by @chenfeiz0326 in #14856
  • [None][chore] Bump version to 1.3.0rc18 by @yuanjingx87 in #14872
  • [None][infra] Waive 5 failed cases for main in post-merge 2755 by @ZhanruiSunCh in #14883
  • [None][fix] LTX-2 audio PE pad: use token-axis seq_dim=1 for token-major rope by @luyiyun1021 in #14818
  • [#13082][fix] Fix-multimodal embedding mismatch by @aashirvad08 in #13240
  • [None][fix] Pipe stderr separately in subprocess calls to improve error reporting in Allure by @yufeiwu-nv in #14750
  • [None][fix] Use renamed get_param_count_and_checkpoint_size in hybrid configs by @yufeiwu-nv in #14855
  • [None][test] Remove duplicate test cases in llm_perf_core file by @yufeiwu-nv in #14749
  • [None][test] Remove 28 closed-bug waive entries for main by @tensorrt-cicd in #14545
  • [TRTLLM-13022][test] remove deprecated models from tests by @xinhe-nv in #14660
  • [None][feat] Reserve one more slot for attention_dp in mixed mamba cache manager by @Wanli-Jiang in #14853
  • [https://nvbugs/6195110][fix] Restore DeepSeek shared-weights vanilla MTP path by @zhaoyangwang-nvidia in #14457
  • [#12359][feat] AutoDeploy: Support SSM replay kernel for MTP with FlashInfer by @galagam in #13725
  • [None][test] Waive 1 failed case for main in QA CI by @tensorrt-cicd in #14857
  • [None][chore] add attention module owner for VisualGen by @zhenhuaw-me in #14814
  • [None][fix] release v1 KV blocks on MAX_UTILIZATION pause by @eopXD in #14723
  • [None][perf] Reduce OpenAI stream postprocess overhead by @2ez4bz in #14708
  • [None][fix] propagate chat prompt token ids (#14420) by @reasonsolo in #14859
  • [https://nvbugs/6211193][fix] etcd listen all interfaces by @reasonsolo in #14863
  • [#5247][fix] auto-detect local cnn_dailymail dataset by directory layout by @guan404ming in #13722
  • [https://nvbugs/6248987][fix] Made the slow-tokenizer swap lazy and idempotent. __init__ now just sets `_slo by @tensorrt-cicd in #14846
  • [None][chore] redact internal NVIDIA URLs from exec-slurm-compile skill by @ssam18 in #14538
  • [None][feat] VisualGen: Attention2D + Ulysses & Multi-GPU LPIPS Evals by @juney-nvidia in #13944
  • [TRTLLM-13077][feat] Decompose post_load_weights() by @chienchunhung in #14770
  • [None][fix] Fix config sharing issue for Qwen3-VL by @2ez4bz in #14766
  • [https://nvbugs/6104831][fix] Enforce request and buffer index lifecycle integrity by @chienchunhung in #14768
  • [None][feat] Add encoder CUDA graph support to llm.encode() by @tingyangk in #14326
  • [None][feat] Support Step-3.7-Flash model by @kaiyux in #14711
  • [https://nvbugs/6050489][chore] unwaive tests by @bo-nv in #14866
  • [None][test] Waive 1 failed case for main in QA CI by @tensorrt-cicd in #14896
  • [None][infra] Source code and container vulnerability fix by @yuanjingx87 in #14025
  • [None][infra] Waive 11 failed cases for main in post-merge 2757 by @ZhanruiSunCh in #14925
  • [https://nvbugs/5821415][test] update rtx6k test list by @xinhe-nv in #14929
  • [None][fix] Update dataset identifier for cnn_dailymail to use namespaced repo in quantization scripts by @yufeiwu-nv in #14930
  • [https://nvbugs/5979673][fix] Unwaive test_agent_multi_backends.py::test_run_with_different_env by @Shixiaowei02 in #14939
  • [None][fix] Add nemotron-v3 as the proper nemotron-h reasoning parser by @Wanli-Jiang in #14900
  • [https://nvbugs/6193836][test] Use EP=8 + attention DP for minimax_m2.5 8-GPU perf by @ruodil in #14613
  • [TRTLLM-8236][infra] fix platform tag for public wheel by @niukuo in #14616
  • [None][test] update bug ids in waives by @xinhe-nv in #14946
  • [https://nvbugs/6244474][fix] AutoDeploy: skip explicit shape-prop after MLIR elementwise fusion by @tensorrt-cicd in #14795
  • [None][infra] fix cbts json decode by @crazydemo in #14928
  • [https://nvbugs/6222480][fix] Fix stress by @xinhe-nv in #14949
  • [None][test] Decrease P1 models number and merge sanity test list into core by @yufeiwu-nv in #14952
  • [None][perf] Use a Triton kernel for Cpp mamba hybrid state update by @VALLIS-NERIA in #14869
  • [None][chore] Autodeploy unwaive 5888827, 6200112 by @galagam in #14894
  • [NVBUG-6248780][fix] Add --decoupled flag to benchmark_core_model in multi-instance test by @karljang in #14888
  • [TRTLLM-12870][feat] Support num_images_per_prompt for FLUX pipelines by @karljang in #14890
  • [TRTLLM-12214][perf] DeepGemmFusedMoE: fuse masked gather + finalize-scale into one Triton kernel by @xwang233 in #14592
  • [None][fix] Fix AutoDeploy accuracy tests by @bmarimuthu-nv in #13925
  • [TRTLLMINF-69][infra] Migrate A100X-FMHA-Post-Merge-1 and A100X-Triton-Post-Merge-[1,2] to SLURM by @mlefeb01 in #14921
  • [TRTLLM-11508][refactor] Merge Eagle3 and MTP-eagle one-model workers by @zhaoyangwang-nvidia in #12353
  • [https://nvbugs/6143787][fix] Add kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.6) to TestQwen3 by @tensorrt-cicd in #13852
  • [https://nvbugs/6248764][fix] Normalize non-sliding KV windows to full attention in AutoDeploy by @eopXD in #14906
  • [https://nvbugs/6240420][fix] Clamp KV pool window sizes to max_seq_len by @eopXD in #14905
  • [None][test] Fix the overly large scope of the CI disagg perf local submit test to avoid "HF model not found" errors by @fredricz-20070104 in #14989
  • [None][test] remove outdated model in perf test by @ruodil in #14992
  • [TRTLLM-12648][test] implement disagg cancellation injector thread by @chienchunhung in #14920
  • [None][feat] Add AutoDeploy support for StepFun Step-3.7-Flash by @bmarimuthu-nv in #14759
  • [None][waive] Waive the failed step3p7 test case due to ckpt update by @kaiyux in #14997
  • [https://nvbugs/6210714][fix] Fix mamba block calculation by @VALLIS-NERIA in #14524
  • [https://nvbugs/5546507][test] Remove obsolet… by @xinhe-nv in #14995
  • [None][fix] AutoDeploy: Move hf_id_to_local_model_dir to function for GLM4.7 Flash test by @bmarimuthu-nv in #14999
  • [TRTLLM-12893][infra] Parallelize post stages: Rerun Report, Test Coverage, and AI Failure Analysis by @ZhanruiSunCh in #14528
  • [None][infra] Waive 11 failed cases for main in post-merge 2760 by @ZhanruiSunCh in #15003
  • [None][fix] Uncomment Qwen3.5 and DSR1 from model registry so that they can run f… by @taylor-yb-lee in #15001
  • [TRTLLM-11410][feat] Cosmos3 Support by @NVShreyas in #14824
  • [https://nvbugs/5859886][fix] Remove the waiver by @ziyixiong-nv in #14948
  • [https://nvbugs/6248744][fix] Added trust_remote_code=True to the LLM(...) constructor and removed the… by @tensorrt-cicd in #14892
  • [https://nvbugs/6160629][fix] AutoDeploy: Fix manual seed setting for standalone tests by @galagam in #14954
  • [TRTLLM-12714][feat] KVCacheManagerV2: wire PyExecutor rebalance hook (single GPU, aggregated for now) by @thorjohnsen in #14578
  • [None][feat] add Wan I2V generation example by @o-stoner in #14981
  • [TRTLLM-12527][feat] Parallelize multi-shard visual-gen checkpoint loading and pre-fetch checking by @yibinl-nvidia in #14021
  • [https://nvbugs/6250866][fix] Fix deep ep partial warp sync for gptoss shapes by @dongfengy in #14977
  • [https://nvbugs/6272668][infra] Unwaive DSR1 and Qwen3.5 again by @taylor-yb-lee in #15010
  • [None][feat] Afmoe trinity support by @alyosha-swamy in #13148
  • [TRTLLM-13027][ci] Relocate under-using tests to right-sized stages by @QiJune in #14684
  • [None][feat] Add LTX-2 visual generation example by @yibinl-nvidia in #14976
  • [None][feat] AutoDeploy: Fix hardcoded configs by @taylor-yb-lee in #14943
  • [#13718][feat] AutoDeploy MoE all-to-all: cache + runtime max-tokens by @greg-kwasniewski1 in #13723
  • [TRTLLM-13177][doc] Add Nemotron 3 Ultra doc by @nv-guomingz in #14964
  • [#10710][feat] Make explicit CLI flags take precedence over --config / --extra_llm_api_options YAML by @marinayanov in #14812
  • [https://nvbugs/6260907][fix] unwaive test by @bo-nv in #15058
  • [None][chore] Increase GB200-4_GPUs-PyTorch shards by @tburt-nv in #14836
  • [TRTLLM-12648][test] implement disagg cancellation canary thread by @chienchunhung in #15015
  • [TRTLLM-12507][feat] Cudagraph support for routed-expert MoE LoRA with Cutlass backend - Part 1 by @brb-nv in #14923
  • [https://nvbugs/6245317][test] set Harmony tiktoken env for GPT-OSS disagg by @dongfengy in #14935
  • [https://nvbugs/6153955][test] unwaive GPT-OSS w4 DP4 CUTLASS by @dongfengy in #14884
  • [None][perf] kv_cache_manager_v2: batch block-key SHA-256 hashing by @lancelly in #14994
  • [TRTLLM-13259][ci] Merge DGX_H100 DeepSeek and GptOss stages by @QiJune in #15035
  • [None][infra] Waive 11 failed cases for main in post-merge 2765 by @ZhanruiSunCh in #15080
  • [None][infra] Waive 3 failed cases for main in post-merge 2765 by @ZhanruiSunCh in #15082
  • [None][test] waive weekly qa ci failure cases by @crazydemo in #15077
  • [None][feat] AutoDeploy: propagate layer_type hint across pattern-matcher rewrites by @greg-kwasniewski1 in #14835
  • [None][test] Waive 15 failed cases for main in QA CI by @tensorrt-cicd in #15056
  • [None][infra] Waive 1 failed case for main in pre-merge 41894 by @ZhanruiSunCh in #15089
  • [TRTLLM-13262][ci] Move non-default-feature tests to post merge by @QiJune in #15038
  • [None][feat] Enable disk cache config for KV cache v2 by @reasonsolo in #14845
  • [https://nvbugs/6185446][fix] Add warmup for trtllm-gen fmha JIT kernels by @pengbowang-nv in #14851
  • [https://nvbugs/6162940][chore] Unwaive fixed test by @longlee0622 in #15078
  • [None][perf] Support Gemma RMSNorm + interleaved mRoPE in fused_qk_no… by @nv-guomingz in #14898
  • [None][test] Halve K25 Agg Multi Round to solve timeout issue by @chenfeiz0326 in #15083
  • [None][infra] Reduce Docker image layer count in release stage by @tburt-nv in #14972
  • [#14828][feat] AutoDeploy: support multi KV cache memory pool in trtllm attention by @MrGeva in #14911
  • [None][doc] Refine Nemotron Ultra doc by @nv-guomingz in #15113
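
As general background for the cache-key fix in #14800, the sketch below shows why serialization that is not canonical can produce colliding or unstable hash keys, and how a canonical form avoids both. It does not reproduce the TensorRT-LLM code; `naive_key`, `canonical_key`, and the field layout are hypothetical, and JSON with sorted keys is just one possible canonical encoding.

```python
import hashlib
import json


def naive_key(fields: dict[str, str]) -> str:
    """Hash fields by plain concatenation; field boundaries and order leak in."""
    blob = "".join(f"{k}{v}" for k, v in fields.items())
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def canonical_key(fields: dict[str, str]) -> str:
    """Hash a canonical JSON form: sorted keys, fixed separators, ASCII only."""
    blob = json.dumps(fields, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


a = {"id": "1", "tag": "x"}
b = {"id1": "tagx"}          # different content, same concatenation "id1tagx"
c = {"tag": "x", "id": "1"}  # same content as a, different insertion order

# Plain concatenation collides for different inputs and splits equal inputs.
print(naive_key(a) == naive_key(b))  # True
print(naive_key(a) == naive_key(c))  # False

# The canonical form separates different inputs and unifies equal ones.
print(canonical_key(a) == canonical_key(b))  # False
print(canonical_key(a) == canonical_key(c))  # True
```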

Full Changelog: v1.3.0rc17...v1.3.0rc18