github NVIDIA/TensorRT-LLM v1.3.0rc15

pre-release

Highlights

  • Model Support

    • Add Gemma4 multimodal model support with text, vision, audio, and chunked prefill capabilities (#12932, #14134)
    • Add Kimi K2.5 multimodal vision support and reasoning parser integration (#12788, #13801)
    • Add GPT-OSS, Ministral3, Nemotron-H, Nemotron Nano, and DeepSeek model enablement and compatibility updates (#12743, #12884, #13844, #13977)
    • Improve DeepSeek V4 and DeepSeek V3.2 support with new attention kernels, routing updates, tokenizer loading, and AutoConfig registration (#13652, #13186, #14261, #14293)
  • API

    • Add a typed exception hierarchy, shared classifier, retry-consumer migration, and typed Slurm infra failures (#13732, #13780, #13863, #13809, #14147)
    • Add VisualGen public output APIs, serving batch inference, and benchmark timing decomposition (#13635, #12350)
    • Add per-request media_io_kwargs support for chat completions (#13779)
    • Add per-rank iteration statistics and Attention-DP metrics to serving endpoints (#13221, #13649)
    • Add cache_salt_id support to the KV cache v2 manager (#13793)
    • Breaking change: limit the sampling logprobs that a request can ask for (#13520)
  • Feature

    • Improve MoE and fused-kernel performance with MegaMoE DeepGEMM, CUTEDSL MoE, shared-expert SwiGLU quantization, GDN fusion, bf16 FlashInfer MoE, and refreshed MoE cubins (#13384, #12884, #11897, #12966, #13689, #12440)
    • Add FP4 and FP8 decode kernels, FP4 DSA indexing, DeepSeek V4 attention kernels, FMHA head_dim 80 cubins, and multi-K and multi-dtype GVR Top-K support (#13929, #13219, #13340, #13652, #13808, #13948)
    • Improve VisualGen and diffusion pipelines with SageAttention for Wan/FLUX, fused cross-head QK Norm plus RoPE for Wan, LTX2 refactoring, and parallel VAE scaling (#13570, #13052, #13285, #13873)
    • Improve KV reuse, disaggregated serving, and transfer paths with transceiver v2 KV reuse, multi-threaded KV transfer, internal TRTLLM-Gen routing, additional conversation headers, and LoRA request-broadcast reduction (#13115, #13075, #13997, #13656, #12959)
    • Improve speculative decoding and hybrid-model execution with fractional synthetic acceptance rates, MTP block reuse, EAGLE3 rejection sampling, MTP max_draft_len decoupling, and mamba SSD prefill optimizations (#13569, #12896, #12588, #12341, #12731)
    • Improve performance tooling and runtime throughput with DFlash optimizations, host-profiler utilities, batch-full benchmark metrics, model-init NVLink caching, scheduling overhead reductions, beam-search overlap scheduling, and FC2 DenseGEMM autotuning (#13996, #11741, #13638, #14070, #13843, #14061, #13833)
    • Add CMake third-party cache support for clean builds (#13942)
  • Fix

    • Fix CUDA graph, profiling, and scheduling correctness issues including YAML CudaGraphConfig validation, profiler scoping, piecewise capture, Eagle3 hidden-state reuse, and guided decoding GIL handling (#13397, #12432, #13574, #13920, #13251)
    • Fix KV cache and scheduler behavior for FlashMLA token block overrides, mamba slot memory, delayed batching page release, adaptive ratio sampling, zero-layer mamba ranks, stale Scheduler V2 state, stale attention metadata, and chunked prefill EVS merging (#13752, #13489, #13805, #13857, #13999, #13592, #13696, #13754)
    • Fix model loading and quantization issues for GPT-OSS MXFP4, dummy weights, Mixtral modelopt export, DeepSeek V3 Lite FP8 MTP weights, composite HF configs, GLM-5 router GEMM, INT4 AWQ on SM120/121, and Qwen3 FP4 CUTLASS MoE OOM (#13708, #13879, #14179, #12530, #14068, #13740, #11561, #13349)
    • Fix serving and benchmark clients with hardened media URL loading, split SSE chunk parsing, aiohttp 3.13 streaming handling, /metrics tee-buffer serving, bounded gRPC payloads, router tokenizer skipping, unset attention_dp_relax handling, and clear GPT-OSS backend errors (#12748, #13686, #13952, #13405, #13519, #14030, #14276, #13166)
    • Fix distributed and disaggregated runtime stability for mamba disaggregation, worker preparation, PP executor shutdown, SM120 all-reduce launch, guided-decoding PP warmup barriers, Torch process-group teardown, Triton MoE memory freeing, and GB300 UCX settings (#13274, #13755, #13267, #13169, #13132, #12993, #14069, #14168)
    • Fix accuracy and memory regressions in DeepSeek, Nemotron, Qwen3, MTP, beam search, FMHA workspace sizing, and FP8 block-scaling autotuner cache growth (#13924, #13968, #13782, #14063, #13799, #13880, #14165)
    • Fix package, license, and compliance issues in llm-c standalone generation, SPDX headers, OSS headers, diffusers pinning, and broken documentation URLs (#14011, #14106, #14193, #14281, #13242, #13422)
  • Documentation

    • Add and update technical blogs for Helix Parallelism, Scaffolding, Gemma4, MoE as Dense GEMM on Blackwell, and VisualGen-related content (#13547, #11841, #13947, #13834, #14171)
    • Add DFlash quickstart updates, custom PyTorch backend kernel integration guidance, Gemma4 usage examples, spec-decoding support matrices, and layer-wise benchmark doc fixes (#13545, #13917, #14303, #14195, #13979)
    • Refresh image links and broken URLs in documentation and blog content (#13838, #13422)
  • Test & Infra

    • Add model and multimodal coverage for Wan 2.2 TI2V, nano v3 omni audio and video, Nemotron Ultra V3, Gemma4 CUDA graph registration, and W4A8_MXFP4_FP8 MoE unit tests (#13739, #13616, #13750, #13883, #13658, #14082, #13401)
    • Add and refresh performance coverage for VisualGen sanity, GB300 disaggregated NIXL, DSR1 disaggregated tests, trtllm-bench metrics, and Kimi K2.5 FP4 RCCA tests (#13144, #13594, #13882, #14178, #14172)
    • Improve change-based testing, CI triggers, GitHub checks, stage splitting, rerun handling, and LFS synchronization (#13382, #13899, #13993, #14022, #14064, #14035, #12406, #13826)
    • Improve build, dependency, and package infrastructure with FlashInfer updates, Transformers 5.x upgrades, compressed cubin archives, SBSA wheel image support, license scanning, and llm-c artifact cleanup (#13746, #13992, #14076, #12829, #13994, #13542, #12635, #13921, #13272)
    • Improve CI coverage organization by moving chunked-prefill cases, splitting long hardware-agnostic tests, adding feature-contract keys, and promoting DeepSeek-V4-Flash to the MoE CI subset (#14083, #13751, #13756, #13933, #13964)
    • Improve developer and CI operations with blossom-ci allowlist updates, skills naming enforcement, pre-commit validation, source-scan cleanup, and NFS temporary-file ignores (#13951, #14132, #14295, #14304, #14285, #13778, #14211)
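
One of the serving fixes above, split SSE chunk parsing in the benchmark client (#13686), concerns a common streaming pitfall: a single network read can end in the middle of a server-sent event. The following is a minimal sketch of the general buffering technique, not the project's actual code. `SSEBuffer` and `feed` are hypothetical names, and the sketch assumes events are separated by `\n\n` only (real SSE streams may also use `\r\n`).

```python
class SSEBuffer:
    """Accumulate raw bytes and emit only complete SSE events (hypothetical helper)."""

    def __init__(self) -> None:
        self._buf = b""

    def feed(self, chunk: bytes) -> list[str]:
        """Add one network chunk; return the data payloads of events completed so far."""
        self._buf += chunk
        payloads = []
        # An event is complete only once its terminating blank line has arrived.
        while b"\n\n" in self._buf:
            raw, self._buf = self._buf.split(b"\n\n", 1)
            data_lines = [
                line[len(b"data:"):].strip()
                for line in raw.split(b"\n")
                if line.startswith(b"data:")
            ]
            if data_lines:
                # Decode after reassembly so multi-byte characters split
                # across chunks are not corrupted.
                payloads.append(b"\n".join(data_lines).decode("utf-8"))
        return payloads


buf = SSEBuffer()
first = buf.feed(b'data: {"tok')                        # event is still incomplete
second = buf.feed(b'en": "hi"}\n\ndata: [DONE]\n\n')    # completes two events
```

Parsing each chunk on its own would fail on the first read here, since `{"tok` is not valid JSON. Holding the partial bytes until the blank-line terminator arrives avoids that.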

What's Changed

  • [https://nvbugs/6001694][fix] Add CUDA profiler API scoping for visual gen nsys profiling by @chang-l in #12432
  • [https://nvbugs/6080024][fix] Fix CudaGraphConfig validation conflict from YAML deep merge by @nvchenghaoz in #13397
  • [None][perf] AutoDeploy: reduce C++ dispatch overhead in decode scheduling loop by @nvchenghaoz in #13012
  • [None][doc] Blogpost for Helix Parallelism by @brb-nv in #13547
  • [None][chore] Fix indexing conflict in blogposts by @brb-nv in #13772
  • [#12713][feat] AutoDeploy Model Onboarding Sprint 03/19 - Part 1 (Remove Patches) by @govind-ramnarayan in #13247
  • [https://nvbugs/5911304][fix] Add URL validation and request hardening for media input loading by @yibinl-nvidia in #12748
  • [None][infra] Remove PULSE_REPO_BRANCH when running source code scanning by @yuanjingx87 in #13778
  • [TRTLLMINF-54][feat] Add typed exception hierarchy + unified classifier by @dpitman-nvda in #13732
  • [https://nvbugs/6094072][fix] swizzle GPT-OSS dummy MXFP4 weights by @dongfengy in #13708
  • [https://nvbugs/6094224][fix] Fix mamba disagg issues when conc > mbs by @bo-nv in #13274
  • Add log for raw model weights memory consumption by @HuiGao-NV in #13760
  • [None][perf] Drop cubin and Eliminate ~6s FMHA JIT recompile in eager generation by aligning kernel selection with CUDA graph warmup by @yunruis in #13505
  • [https://nvbugs/5615248][fix] Reduce beam-search prefill->decode handoff cost by @brb-nv in #13748
  • [None][chore] Update flashinfer-python from 0.6.9 to 0.6.10 by @yihwang-nv in #13746
  • [None][feat] Fuse GDN elementwise ops and split/transpose kernels by @Wong4j in #12966
  • [None][infra] Waive 3 failed cases for main in post-merge by @xinhe-nv in #13797
  • [None][chore] Update nvidia-cutlass-dsl version in visual_gen pyproject.toml by @yihwang-nv in #13642
  • [None][infra] Waive 3 failed cases for main in post-merge by @xinhe-nv in #13789
  • [None][feat] Update TRTLLM MoE cubins by @rosenrodt in #12440
  • [None][fix] Fix Autodeploy standalone package builder script tests by @bmarimuthu-nv in #13794
  • [#13320][fix] Propagate FlashMLA tokens_per_block override onto kv_cache_config by @eopXD in #13752
  • [None][test] Unset MPI related Env in local Perf Test Script by @chenfeiz0326 in #13795
  • [https://nvbugs/5615248][fix] Broader capture of piecewise cudagraph by @brb-nv in #13574
  • [TRTLLM-13271][feat] Artifact cleanup by @greg-kwasniewski1 in #13272
  • [TRTLLM-12023][infra] Create a basic perf sanity test suite for visua… by @taianz-nv in #13144
  • [https://nvbugs/5969216][fix] Ministral3 loading fix by @evezhier in #12743
  • [TRTLLMINF-54][feat] Migrate retry consumers to classify() + isFinalAttempt fix by @dpitman-nvda in #13780
  • [https://nvbugs/6082303][fix] Treat <tool_call> as implicit end-of-reasoning in nano-v3 parser by @tijyojwad in #13684
  • [None][fix] Add support for context multiCtaKv sparse fmha by @heyuhhh in #13410
  • [None][feat] Improve memory calculation for mamba hybrid models when block reuse is off by @VALLIS-NERIA in #13549
  • [None][chore] update CI allowlist 2026-05-06 by @tburt-nv in #13812
  • [TRTLLM-11851][feat] Add MX-only P2P checkpoint loading support for TRTLLM by @chienchunhung in #13531
  • [None][test] Add Wan 2.2 5B TI2V pipeline test in CI by @chang-l in #13739
  • [TRTLLM-12390][feat] Support fractional synthetic acceptance rates by @mikeiovine in #13569
  • [TRTLLM-12297][fix] Multimodal hash for VideoData ignores extracted audio, breaks KV reuse by @moraxu in #13585
  • [https://nvbugs/5972889][fix] Tighten allowed type by @yibinl-nvidia in #12850
  • [None][test] Waive 1 failed cases for main in QA CI by @xinhe-nv in #13790
  • [TRTLLM-12089][test] split thop/parallel into hw-agnostic siblings by @QiJune in #13751
  • [None][chore] Refactor attention forward context by @yuxianq in #13662
  • [TRTLLM-10990][feat] Fuse SwiGLU and quant into shared expert by @peaceh-nv in #11897
  • [TRTLLM-12457][test] split _torch/speculative into hw-agnostic subdir by @QiJune in #13756
  • [None][fix] Register AutoDeploy accuracy tests in CI by @bmarimuthu-nv in #12864
  • [#4674][feat] optimize llama8B decode: trtllm silu_mul backend, quant+silu_mul, QKV passthrough to attention by @MrGeva in #12507
  • [None][chore] Bump version to 1.3.0rc15 by @VALLIS-NERIA in #13839
  • [None][infra] Waive 1 failed cases for main in post-merge by @xinhe-nv in #13823
  • [None][infra] Waive flaky openai-server tests + super_ad/qwen3 perf-r… by @Barry-Delaney in #13836
  • [None][doc] Update the image links in tech blog 19/20/21. by @nv-guomingz in #13838
  • [https://nvbugs/6075345][fix] test_llmapi_launch_multiple_tasks ignored the task_script parameter and always by @Superjomn in #13588
  • [None][infra] Waive 2 failed cases for main in post-merge 2709 by @ZhanruiSunCh in #13840
  • [None][feat] Integrate FP4 indexer for DSA on Blackwell by @lfr-0531 in #13340
  • [None][perf] Offload KvCacheAwareRouter tokenize+block-hash to a thread by @lishicheng1996-nv in #13377
  • [https://nvbugs/6115290][fix] Fix GPT OSS 120B GB200 Test Regression by @yijingl-nvidia in #13743
  • [TRTLLM-12173][tests] Add E2E accuracy test for nano v3 omni by @2ez4bz in #13616
  • [TRTLLM-12188][feat] Implement SWA prefill memory reuse (scratch slots) by @lowsfer in #13368
  • [TRTLLM-12258][chore] add visual gen dir in getMultiGpuFileChanged by @NVShreyas in #13512
  • [None][infra] Waive 1 failed cases for main in pre-merge 37091 by @ZhanruiSunCh in #13865
  • [None][feat] Upgrade transformers dependency to 5.3.0 by @longlee0622 in #12829
  • [https://nvbugs/6143879][fix] add a fixed random seed for the test by @nvchenghaoz in #13861
  • [None][test] Waive 1 failed cases for main in QA CI by @xinhe-nv in #13829
  • [None][feat] Update the deepseek routing by @ChristinaZ in #13186
  • [None][fix] Always sync local ranks after prefetch in HfWeightLoader by @lancelly in #13556
  • [None][feat] Add MegaMoEDeepGemmFusedMoE backend wrapping DeepGEMM fp8_fp4_mega_moe by @Barry-Delaney in #13384
  • [TRTLLM-12515][feat] Update logic for quant exclusion for nemotron-h by @Wanli-Jiang in #13844
  • [None][chore] Convert cubins in repository to compressed archives by @tongyuantongyu in #13542
  • [https://nvbugs/6157131][ci] Waive flaky AutoDeploy accuracy test by @2ez4bz in #13881
  • [TRTLLM-11508][refactor] decouple MTP num_nextn_predict_layers from max_draft_len by @zhaoyangwang-nvidia in #12341
  • [https://nvbugs/6115562][fix] wait for disagg worker preparation by @reasonsolo in #13755
  • [TRTLLM-12152][infra] Init Rule Based Change Based Test Selection by @crazydemo in #13382
  • [TRTLLM-11319][feat] VisualGen public output API + bench timing decomposition by @zhenhuaw-me in #13635
  • [None][test] Unwaive DSR1 V32 Agg TEP tests by @chenfeiz0326 in #13550
  • [TRTLLM-12015][feat] Introduce KV reuse in transceiver v2 by @Shixiaowei02 in #13115
  • [None][infra] Waive 1 failed cases for main in post-merge 2710 by @ZhanruiSunCh in #13894
  • [TRTLLM-12508][infra] Enable pre-merge stage for gb300 by @EmmaQiaoCh in #13827
  • [None][infra] Waive 3 failed AutoDeploy accuracy tests for main by @Hudayday in #13906
  • [None][perf] Improve TRTLLM MoE autotune in DEP by @rosenrodt in #13667
  • [None][feat] Add per-rank iteration stats to /metrics endpoint by @lishicheng1996-nv in #13221
  • [None][infra] Waive 1 failed cases for main in pre-merge 37342 by @ZhanruiSunCh in #13912
  • [https://nvbugs/6108995][fix] Fix workspace size calculation for fmha_bmm1_scale_size with FP8ContextMLA by @pengbowang-nv in #13880
  • [#12784][feat] AutoDeploy: Optimize DeepSeek-R1 model performance by @taylor-yb-lee in #12946
  • [TRTLLM-12429][tests] Add audio E2E test for nano v3 omni by @2ez4bz in #13750
  • [https://nvbugs/6115271][fix] Fix stale TRTLLM attention backend metadata reuse across request shape transitions by @yibinl-nvidia in #13696
  • [TRTLLM-12287][feat] support per-request media_io_kwargs in chat completions by @aswinvisva in #13779
  • [TRTLLMINF-54][infra] Migrate typed-exception classifier to shared library by @dpitman-nvda in #13863
  • [TRTLLM-34871][feat] Add cute dsl FP8 paged MQA logits decode kernel by @limin2021 in #13219
  • [None][fix] Gate cudaProfilerStart/Stop on iter_counter, not loop counter by @Tabrizian in #13744
  • [https://nvbugs/6160248][ci] Waive broken test by @2ez4bz in #13931
  • [https://nvbugs/6115832][fix] Fix SSE stream parsing in benchmark client to handle split chunks by @tensorrt-cicd in #13686
  • [None][feat] Add DeepSeekV4 attention kernels by @heyuhhh in #13652
  • [None][fix] Skip calibration scalars in initialize_dummy_weights by @shikicloud in #13879
  • [None][infra] Waive 3 failed cases for main in pre-merge 37379 by @ZhanruiSunCh in #13941
  • [https://nvbugs/6017720][fix] Fix moe backend mismatch on Blackwell in perf test. by @dominicshanshan in #13470
  • [TRTLLM-12502][infra] Add GitHub Action to sync LFS objects from fork… by @niukuo in #13826
  • [None][feat] Add Gemma4 multimodal model support (text + vision + audio) by @lfr-0531 in #12932
  • [None][infra] Fix BASE_SHA to use merge commit parent instead of stal… by @niukuo in #13946
  • [None][feat] Add more disagg conversation ID headers support by @reasonsolo in #13656
  • [None][feat] Enable joint optimization of agent applications and TensorRT-LLM with Scaffolding by @Boreas618 in #11173
  • [TRTLLM-11585][feat] Add CUTEDSL moe backend for nemotron-h by @Wanli-Jiang in #12884
  • [None][fix] Use dynamic port in sharded_rmsnorm tests to avoid EADDRINUSE by @MrGeva in #13835
  • [https://nvbugs/6095421][fix] fix PP>=3 executor shutdown hang in broadcast sample state loop by @yihwang-nv in #13267
  • [https://nvbugs/6140411][fix] Fix index error of shared expert when loading weights by @shuyixiong in #13856
  • [None][chore] Unwaive stale autodeploy waives by @galagam in #13862
  • [None][chore] Remove the waiver by @ziyixiong-nv in #13647
  • [None][fix] Fix accuracy regression in DeepSeek models by @taylor-yb-lee in #13924
  • [TRTLLM-12026][feat] Support MTP with block reuse enabled for hybrid models by @VALLIS-NERIA in #12896
  • [https://nvbugs/6087632][fix] fix test def to use local model by @bo-nv in #13555
  • [None][refactor] MoEScheduler split + MegaMoE EPLB / multi-chunk / CI integration by @xxi-nv in #13908
  • [TRTLLM-11228][feat] Update quickstart for DFlash by @ziyixiong-nv in #13545
  • [None][doc] Gemma 4 support & eval task updates by @Hudayday in #13947
  • [None][doc] Refactor blog18 by @bobboli in #13956
  • [None][test] Add more models into Pre-merge Perf Test by @chenfeiz0326 in #13884
  • [None][fix] Use one mamba slot sentinel to save memory by @Wanli-Jiang in #13489
  • [https://nvbugs/6114711][fix] add reasoning parser for kimi-k2.5 and enable the auto flow by @JadoTu in #13801
  • [TRTLLM-12466][refactor] Refactor hashing with new container types by @2ez4bz in #13800
  • [https://nvbugs/6100102][fix] Fix cutlass grouped gemm launcher EpilogueScalars construction by @yifeizhang-c in #13945
  • [None][test] Waive 2 failed cases for main in QA CI by @xinhe-nv in #13927
  • [None][infra] Waive 3 failed cases for main in post-merge 2714 by @ZhanruiSunCh in #13973
  • [None][chore] Remove glm_moe_dsa tokenizer WAR after Transformers 5.x upgrade by @longlee0622 in #13901
  • [#13560][fix] AutoDeploy: cut per-iter host overhead by @MrGeva in #13810
  • [None][test] Waive 5 failed cases for main in QA CI by @xinhe-nv in #13959
  • [https://nvbugs/6162853][chore] unwaive test by @galagam in #13976
  • [https://nvbugs/6084764][fix] unwaive DeepSeek-R1 fp8 blockscale throughput by @bobboli in #13591
  • [None][infra] Add users to blossom-ci allowlist by @yuanjingx87 in #13951
  • [TRTLLM-12453][fix] Accommodate chunked prefill in Nemotron's EVS merging logic by @moraxu in #13754
  • [TRTLLM-12430][tests] Add video E2E test for nano v3 omni by @2ez4bz in #13883
  • [None][feat] Remove LoRA weights from request broadcast by @achartier in #12959
  • [TRTLLM-12624][ci] Drop tensorrt_llm/llmapi/ from multi-GPU trigger list by @QiJune in #13993
  • [None][infra] Waive 12 failed cases for main in post-merge by @xinhe-nv in #13982
  • [None][infra] Waive 19 failed cases for main in post-merge by @xinhe-nv in #13981
  • [None][infra] Waive 13 failed cases for main in post-merge by @xinhe-nv in #13980
  • [None][infra] Waive 13 failed cases for main in post-merge by @xinhe-nv in #13986
  • [None][infra] Waive 7 failed cases for main in post-merge by @xinhe-nv in #13987
  • [None][feat] Optimize mamba SSD prefill and extend flashinfer dispatch by @Wanli-Jiang in #12731
  • [TRTLLM-12128][feat] enable SageAttention for Wan/FLUX (new commits) by @xrq-phys in #13570
  • [None][infra] Waive 4 failed cases for main in post-merge by @xinhe-nv in #13984
  • [None][feat] Multi-K (512/1024/2048) and Multi-dtype (fp32/bf16/fp16) GVR Top-K by @longcheng-nv in #13948
  • [None][infra] Waive 4 failed cases for main in post-merge by @xinhe-nv in #13990
  • [None][refactor] Decouple cached prefix from KVSlice token_range by @Shixiaowei02 in #13937
  • [None][infra] Waive 20 failed cases for main in post-merge by @xinhe-nv in #13989
  • [TRTLLM-11579][feat] VisualGen batch inference support in serve module by @JunyiXu-nv in #12350
  • [None][infra] Waive 10 failed cases for main in post-merge 2715 by @ZhanruiSunCh in #14032
  • [None][infra] Waive 1 failed cases for main in pre-merge 37568 by @ZhanruiSunCh in #14034
  • [None][perf] Follow-up patch for "Improve TRTLLM MoE autotune in DEP (#13667)" by @rosenrodt in #13971
  • [https://nvbugs/6162940][fix] Added a SentencePieceTokenizer wrapper in examples/utils.py that drives `sen by @tensorrt-cicd in #13983
  • [None][infra] Waive 1 failed cases for main in pre-merge 37674 by @ZhanruiSunCh in #14007
  • [https://nvbugs/6115832][fix] Fix aiohttp 3.13 streaming ValueError in benchmark client by @chenfeiz0326 in #13952
  • [None][fix] Release deferred ctx KV pages in V2 delay batching by @lancelly in #13805
  • [TRTLLM-12399][fix] Fix KV cache adaptive ratio sampling by @lowsfer in #13857
  • [None][test] Waive 2 failed cases for main in QA CI on H100 by @xinhe-nv in #14001
  • [None][test] unwaive cases for main in QA CI by @xinhe-nv in #13991
  • [https://nvbugs/6163147][fix] unfuse transformers 5.x fused Mixtral MoE for modelopt quant path by @longlee0622 in #14027
  • [https://nvbugs/6072125][fix] harden allreduce rmsnorm fusion multigpu test by @tcherckez-nvidia in #13606
  • [None][infra] Waive 1 failed cases for main in pre-merge 37652 by @ZhanruiSunCh in #14045
  • [None][fix] only configure gc thresholds once by @ixlmar in #13910
  • [None][fix] CppMambaHybridCacheManager: handle ranks with zero local mamba layers by @VALLIS-NERIA in #13999
  • [TRTLLM-12631][infra] Split some long stages by @EmmaQiaoCh in #14035
  • [#13909][fix] Reuse hidden_states buffer across CUDA graph captures in Eagle3 by @ml-inference in #13920
  • [TRTLLM-7081][refactor] Add MultimodalModelMixin by @2ez4bz in #13866
  • [None][fix] Qwen3.5 DFlash by @amukkara in #13782
  • [None][infra] Check license with both isPermissive and isProprietary flags by @yuanjingx87 in #13921
  • [None][fix] Unwaive standalone llm-c package generation test by @bmarimuthu-nv in #14011
  • [None][test] add Nemotron Ultra V3 AutoDeploy accuracy test by @tcherckez-nvidia in #13658
  • [https://nvbugs/6162128][ci] Skip nano v3 E2E test for GB300 by @2ez4bz in #14036
  • [https://nvbugs/5945047][fix] Fix cluster launch enablement for SM120 GPUs in allReduce fusion by @ziyixiong-nv in #13169
  • [None][fix] fix warm up number in disagg benchmark by @chuangz0 in #14041
  • [None][test] Add func and perf case of nemotron-3-Nano-Omni model on DGX-Spark by @JennyLiu-nv in #13837
  • [https://nvbugs/6050483][fix] pin diffusers version by @o-stoner in #13242
  • [None][fix] Fix GIL management for guided decoding host func by @Tabrizian in #13251
  • [None][feat] Support cache_salt_id in KV cache v2 manager by @eopXD in #13793
  • [TRTLLM-9651][infra] Enhance test rerun logic with unfinished and not-run test handling by @yiqingy0 in #12406
  • [None][fix] Fix replay iter flag names in layer-wise benchmarks docs by @kaiyux in #13979
  • [None][feat] Update FMHA cubins for head_dim 80 by @yuxianq in #13808
  • [None][test] Waive 1 failed cases for main in QA CI by @xinhe-nv in #14062
  • [None][feat] Add --use-3rdparty-cache to accelerate cmake configuration of clean build by @yuantailing in #13942
  • [None][test] Add GB300 DISAGG NIXL CI Perf Test Back by @chenfeiz0326 in #13594
  • [None][feat] use multi thread for kv transfer by @chuangz0 in #13075
  • [None][chore] Update flashinfer-python from 0.6.10 to 0.6.11 by @yihwang-nv in #13992
  • [#8542][feat] AutoDeploy: add Llama-3.1-8B FP8 perf-sanity test on H100 by @MrGeva in #14039
  • [https://nvbugs/6160248][fix] AutoDeploy: fixed broken pattern matching of fuse_rope_into_trtllm_attention transform by @MrGeva in #14038
  • [https://nvbugs/6141806][test] unwaive disaggregated overlap gen-first tests by @reasonsolo in #13807
  • [https://nvbugs/6084568][fix] Fix workspace_size error when running qwen3 CI by @byshiue in #14067
  • [None][chore] Make poetry.lock update opt-in in flashinfer-upgrade skill by @yihwang-nv in #14077
  • [TRTLLM-12659][ci] move 3 python_scheduler chunked_prefill cases to post merge by @QiJune in #14083
  • [None][fix] AutoDeploy: Cleanup CUDA graph memory in shutdown by @galagam in #14050
  • [None][feat] enable TRTLLM-Gen internal routing by @tcherckez-nvidia in #13997
  • [https://nvbugs/6160248][fix] AutoDeploy: unwaived test_fuse_qkv_passthrough_with_rope by @MrGeva in #14081
  • [https://nvbugs/6102381][fix] serve /metrics from tee buffer to avoid racing iter stats collector by @JunyiXu-nv in #13405
  • [None][infra] Waive 1 failed cases for main in pre-merge 37998 by @ZhanruiSunCh in #14094
  • [#12716][feat] Fused cross-head QK Norm + RoPE kernel for WAN by @anikaj-eng in #13052
  • [https://nvbugs/6078431][fix] Unwaive the test_llm_disagg_streaming_gen_cancelled test. by @zheyuf in #14058
  • [TRTLLM-11767][feat] LTX2 pipeline refactor part1 - auto upgrade to two stage pipeline by @yibinl-nvidia in #13285
  • [None][feat] Emit per-rank Attention-DP iteration stats by @indrajit96 in #13649
  • [None][infra] Add explicit llmapi-compatibility label gh check when api touched by @venkywonka in #14064
  • [TRTLLM-10804][infra] add LLM_SBSA_WHEEL_DOCKER_IMAGE by @niukuo in #12635
  • [None][test] Add checkpoint_format / load_format keys to test_features_contract by @chienchunhung in #13933
  • [None][test] promote DeepSeek-V4-Flash to MoE CI config subset by @xxi-nv in #13964
  • [TRTLLM-12503][feat] Parallel VAE independent scaling and fix arg passing by @NVShreyas in #13873
  • [None][feat] add batch-full benchmark throughput metric by @zhaoyangwang-nvidia in #13638
  • [https://nvbugs/6076767][fix] Add barrier before warmup to prevent PP hang with guided decoding by @ziyixiong-nv in #13132
  • [None][test] Waive 2 failed cases for main in QA CI by @xinhe-nv in #14086
  • [None][test] Waive 1 failed cases for main in QA CI by @xinhe-nv in #14093
  • [TRTLLM-11375][feat] Add Kimi K2.5 multimodal vision support by @tianyuxbear in #12788
  • [None][chore] Update flashinfer-python from 0.6.11 to 0.6.11.post1 by @yihwang-nv in #14076
  • [None][fix] Raise clear error when GPT-OSS is used with non-TRTLLM attention backend by @ssam18 in #13166
  • [TRTLLM-35237][feat] Add cute dsl FP4 paged MQA logits decode kernel by @limin2021 in #13929
  • [None][chore] Remove the waiver by @ziyixiong-nv in #14114
  • [TRTLLM-12627][ci] Narrow tensorrt_llm/serve/ MGPU trigger to disagg-only files by @QiJune in #14022
  • [None][fix] Fix misleading skills that use the -ccache option by @yuantailing in #13970
  • [None][fix] skip tokenizer in kvcache router when there is only one server by @reasonsolo in #14030
  • [https://nvbugs/6058251][fix] Resolve top-level model_type for composite HF configs by @JunyiXu-nv in #14068
  • [https://nvbugs/6109719][fix] Update all broken URLs to their new locations and remove the expired event entry by @tensorrt-cicd in #13422
  • [None][infra] Waive 3 failed cases for main in post-merge 2717 by @ZhanruiSunCh in #14119
  • [None][infra] Waive 1 failed cases for main in post-merge 2718 by @ZhanruiSunCh in #14131
  • [None][fix] Move drain inside pause_generation() for async RL by @hchings in #13784
  • [https://nvbugs/6143811][fix] AutoDeploy gate quantization tests by @galagam in #13846
  • [None][fix] Gemma4 CUDA-graph test KV pre-alloc + L0 registration by @Hudayday in #14082
  • [https://nvbugs/6140422][fix] Unwaive K25 Disagg DEP Test by @chenfeiz0326 in #14139
  • [None][fix] Reuse prior-attempt passes when infra retry fires by @dpitman-nvda in #14002
  • [https://nvbugs/6095421][chore] Unwaive 1 failed test by @heyuhhh in #13877
  • [TRTLLMINF-54][feat] SlurmConfig boundary throws typed InfraFailure by @dpitman-nvda in #13809
  • [https://nvbugs/6108841][fix] add hidden_dim=6144 router GEMM instantiation for GLM-5 by @yijingl-nvidia in #13740
  • [None][feat] AutoDeploy re-onboard GPT_OSS by @nvchenghaoz in #14004
  • [https://nvbugs/6080024][fix] autodeploy unwaive test by @nvchenghaoz in #14103
  • [#13534][chore] AutoDeploy: Remove Two Model Speculative Decoding Support by @govind-ramnarayan in #13532
  • [None][fix] Add SPDX Apache-2.0 headers and fix license compliance for llm-c standalone repo by @bmarimuthu-nv in #14106
  • [TRTLLM-11540][feat] Support rejection sampling in EAGLE3 dynamic tree by @zhaoyangwang-nvidia in #12588
  • [TRTLLM-12152][infra] change based testing rules on tests by @crazydemo in #13899
  • [None][doc] scaffolding tech blog part two by @Boreas618 in #11841
  • [TRTLLMINF-54][feat] Delete legacy classifyInfraFailure + four pattern lists by @dpitman-nvda in #14147
  • [https://nvbugs/5981122][fix] Lower KV cache fraction for python_scheduler MTP combo on H100 by @lancelly in #14063
  • [https://nvbugs/5955792][fix] Fix DeepSeekV32 test_fp8_blockscale[baseline_mtp1] OOM on Blackwell by @sunnyqgg in #12823
  • [None][doc] Add tech blog23: MoE as Dense GEMM on Blackwell by @zongfeijing in #13834
  • [None][infra] Add mingyangHao to blossom-ci allowlist by @ZhanruiSunCh in #14132
  • [https://nvbugs/6168136][fix] Unwaive GPT-OSS test_w4_4gpus dp4-trtllm-fp8 by @xwang233 in #14118
  • [https://nvbugs/6084825][fix] Unwaive testcase by @YihuiLu512 in #13641
  • [None] [docs] rename blog23_MoE_as_Dense_GEMM to blog24_MoE_as_Dense_GEMM by @Kefeng-Duan in #14171
  • [None][test] Waive 1 failed cases for main in QA CI by @xinhe-nv in #14136
  • [None][fix] Clear stale Scheduler V2 request state by @jiaganc in #13592
  • [None][test] Waive 1 failed cases for main in QA CI by @xinhe-nv in #14163
  • [https://nvbugs/5805494][fix] Limit maximum warmup token count to prevent crash in autotuner by @dbari in #13758
  • [None][feat] Upgrade transformers dependency to 5.5.3 by @Hudayday in #13994
  • [https://nvbugs/5923456][fix] GRPC bound request payloads by @yibinl-nvidia in #13519
  • [https://nvbugs/5944731][fix] BREAKING: Limit sampling requested logprobs by @yibinl-nvidia in #13520
  • [None][perf] Speed up model init: cache support_nvlink() by @yuantailing in #14070
  • [https://nvbugs/6162323][fix] Make mxfp4 H20 swizzle WAR more robust by @dongfengy in #14054
  • [None][fix] Make SleepConfig picklable by replacing closure lambda in defaultdict by @hhzhang16 in #13918
  • [#13321][fix] disable multi_stream on piecewise path instead of persistent buffer by @suyoggupta in #13396
  • [None][tests] Speed up EPD disagg tests by @2ez4bz in #14101
  • [TRTLLM-11950][perf] Audio feature extractor optimizations by @2ez4bz in #14031
  • [None][infra] Waive 1 failed cases for main in pre-merge 38383 by @ZhanruiSunCh in #14192
  • [None][perf] Enable in-flight batching for Nemotron3 Nano Omni multimodal encoder by @yechank-nvidia in #13977
  • [TRTLLM-12533][refactor] Move Media IO modality loading into MediaIO Interfaces by @aswinvisva in #14010
  • [None][infra] Waive 1 failed cases for main in pre-merge 38428 by @ZhanruiSunCh in #14204
  • [TRTLLM-11228][feat] Perf optimizations for DFlash by @ziyixiong-nv in #13996
  • [#13446][feat] AutoDeploy: Add Remaining Models From Model Onboarding Sprint Part 1 (03/19) by @govind-ramnarayan in #13787
  • [https://nvbugs/6184143][chore] AutoDeploy Waive DeciLM and GraniteMoEHybrid failures by @galagam in #14215
  • [https://nvbugs/6152892][fix] Fix Triton MOE memory free when no swizzling enabled by @dongfengy in #14069
  • [https://nvbugs/6163147][fix] swap layer.mlp in place for Mixtral modelopt export by @longlee0622 in #14179
  • [https://nvbugs/6094100][fix] set UCX_TLS for gb300 to enable disagg test by @chuangz0 in #14168
  • [https://nvbugs/5879577][fix] Fix KeyError in DeepSeekV3Lite FP8 MTP weight loading by @sunnyqgg in #12530
  • [None][feat] Add bf16 trtllm moe through flashinfer. by @nv-guomingz in #13689
  • [TRTLLM-10851][feat] Further doc and utility features for host profiler. by @hyukn in #11741
  • [https://nvbugs/6163030][fix] Unwaive testcase by @leslie-fang25 in #14169
  • [None][infra] Waive 13 failed cases for main in post-merge 2722 by @ZhanruiSunCh in #14235
  • [None][test] add metric for trtllm-bench by @ruodil in #14178
  • [None][perf] FC2 DenseGEMM autotune: split-K, swap_ab, fine-grained tuning buckets by @JacobHu-NV in #13833
  • [None][test] Add DSR1 B200 DISAGG to CI Perf Test by @chenfeiz0326 in #13882
  • [https://nvbugs/6025177][test] rcca tests using kimi k2.5 fp4 by @xinhe-nv in #14172
  • [https://nvbugs/5615248][fix] Beam history copies only on terminal steps by @brb-nv in #13799
  • [None][chore] Refactor salting support for KVCacheManagerV2 by @lowsfer in #14140
  • [None][fix] Fix bugs related with nemotron-nas model by @Wanli-Jiang in #13968
  • [None][chore] gitignore NFS system temporary files by @zhenhuaw-me in #14211
  • [#14173][chore] AutoDeploy: Removed perf tests from L0 by @MrGeva in #14220
  • [None][fix] Add SPDX Apache-2.0 headers to auto_deploy test files by @bmarimuthu-nv in #14193
  • [#8542][feat] AutoDeploy: add DeepSeek-R1 FP8 perf test on 8x B200 post merge, remove super perf test from premerge by @MrGeva in #14144
  • [https://nvbugs/6094208][fix] AutoDeploy: skip bf16 Nemotron-Nano-V3 accuracy test on <80GB GPUs by @galagam in #14243
  • [None][test] Waive 1 failed cases for main in QA CI by @xinhe-nv in #14221
  • [https://nvbugs/6185713][fix] Revert PR13758's code changes on Limiting maximum warmup token count by @chenfeiz0326 in #14252
  • [https://nvbugs/6059036][fix] Unwaive test_autodeploy_from_registry[google_gemma-3-1b] and test_encode_matches_huggingface[gemma-3-1b] by @marinayanov in #14228
  • [None][doc] Update spec dec support matrices by @mikeiovine in #14195
  • [None][chore] Rename .claude skills with trtllm- prefix and drop ci-failure-retrieval by @kaiyux in #14234
  • [TRTLLM-12462][fix] Fix FP8 block scaling GEMM autotuner cache growth by @cascade812 in #14165
  • [#13076][fix] Destroy torch distributed process groups on PyExecutor shutdown by @janbernloehr in #12993
  • [None][doc] Add guide for integrating custom kernels in PyTorch backend by @chang-l in #13917
  • [None][test] Waive 4 failed cases for main in QA test list by @xinhe-nv in #14233
  • [None][infra] Waive 2 failed cases for main in post-merge 2723 by @ZhanruiSunCh in #14294
  • [None][infra] Update blossom-ci allowlist: fix Mingyang, add brnguyen2 and yongzhiz by @ZhanruiSunCh in #14295
  • [None][test] Waive 2 failed cases for main in QA CI by @xinhe-nv in #14260
  • [None][test] Waive 1 failed cases for main in QA CI by @xinhe-nv in #14259
  • [https://nvbugs/6185234][fix] DeepSeek-V3.2 tokenizer load on transformers 5.x by @Hudayday in #14261
  • [None][feat] Add chunked prefill support for Gemma4 (text + vision multimodal) by @Hudayday in #14134
  • [https://nvbugs/6162128][tests] Skip nano v3 E2E tests entirely on G/B300 by @2ez4bz in #14185
  • [None][test] Waive 5 failed cases for main in QA CI by @xinhe-nv in #14283
  • [TRTLLM-11127][feat] add W4A8_MXFP4_FP8 MoE unit test support by @xxi-nv in #13401
  • [None][chore] Enforce Claude Code skill and agent naming convention via pre-commit by @kaiyux in #14285
  • [None][fix] Prevent SLURM dispatcher retry duplicate-upload error by @dpitman-nvda in #14269
  • [https://nvbugs/6069543][fix] Lower accuracy threshold for H20 qwen3.5 test by @rosenrodt in #13895
  • [https://nvbugs/6168859][chore] Waive test_openai_chat_guided_decoding on all GPUs by @tburt-nv in #14313
  • [None][infra] Waive 1 failed cases for main in pre-merge 38844 by @ZhanruiSunCh in #14316
  • [https://nvbugs/6189918][chore] Waive test_auto_dtype_with_helix[fifo-cudagraph:with_padding-pp1tp1cp4] by @chienchunhung in #14319
  • [#13561][fix] AutoDeploy: forward garbage_collection_gen0_threshold to PyExecutor by @MrGeva in #14218
  • [https://nvbugs/6117814][fix] AutoDeploy: Fix Eagle cu_seqlen data race by @govind-ramnarayan in #14008
  • [https://nvbugs/6185234][fix] register deepseek_v32 / kimi_k2 with transformers AutoConfig by @longlee0622 in #14293
  • [None][infra] Waive 3 failed cases for main in pre-merge 38724 by @ZhanruiSunCh in #14324
  • [None][fix] Update the OSS headers in derived FLA ops and AD modeling code by @bmarimuthu-nv in #14281
  • [#13580][fix] AutoDeploy: Support Gemma3n/4 E2B variants by @bmarimuthu-nv in #13630
  • [None][chore] Remove trailing spaces from module name in logger output by @nv-guomingz in #14122
  • [None][fix] Handle unset attention_dp_relax in ADP routers by @peihu-nv in #14276
  • [None][fix] Fix int4 awq for sm120/121 by @pamelap-nvidia in #11561
  • [None][infra] Update blossom-ci allowlist: fix jdebache by @yiqingy0 in #14304
  • [None][test] Waive 8 failed cases for main in QA CI by @xinhe-nv in #14309
  • [https://nvbugs/5615248][perf] Early emission of first token with overlap scheduling by @brb-nv in #14061
  • [https://nvbugs/6095421][fix] Update resolve_moe_backend by @heyuhhh in #14282
  • [https://nvbugs/6094108][fix] Fix Qwen3-30B-A3B NVFP4 tep4 CUTLASS MoE test OOM on B300 by @tensorrt-cicd in #13349
  • [TRTLLM-12520][perf] Reduce host overhead during scheduling and sampling by @tongyuantongyu in #13843
  • [https://nvbugs/6162624][test] Unwaive passing test by @dongfengy in #14158
  • [TRTLLM-10362][fix] Fix trtllm-bench for Nemotron models by @pamelap-nvidia in #12364
  • [None][doc] Gemma 4: usage examples by @Hudayday in #14303
  • [None][refactor] clean up AttentionForwardArgs by @yuxianq in #14244
  • [None][test] Waive 1 failed cases for main in QA CI by @xinhe-nv in #14332

New Contributors

  • @taianz-nv made their first contribution in #13144
  • @aswinvisva made their first contribution in #13779
  • @shikicloud made their first contribution in #13879
  • @ml-inference made their first contribution in #13920
  • @anikaj-eng made their first contribution in #13052
  • @JacobHu-NV made their first contribution in #13833
  • @janbernloehr made their first contribution in #12993

Full Changelog: v1.3.0rc14...v1.3.0rc15
