Known Issues
- DSV3.2 will crash with an illegal memory access (IMA) in various long-running perf tests on GB200/GB300 when the CuteDSL MoE backend is used. To work around this issue, use a different MoE backend.
Model Support
API
Feature
- Upgrade NIXL to v1.0.1 and UCX to 1.21 (#14436)
- Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL (#14453)
- Enable FlashInfer GDN decoding kernel for Qwen3.5 (#13645)
- Add per-expert LoRA support with Cutlass backend (#14801)
- Reduce OpenAI stream postprocess overhead (#14708)
- Add encoder CUDA graph support to `llm.encode()` (#14326)
- Use a Triton kernel for C++ mamba hybrid state update (#14869)
- Fuse masked gather + finalize-scale into one Triton kernel in DeepGemmFusedMoE (#14592)
- Support KVCacheManagerV2 `adjust()` in single GPU + agg PyExecutor loop (#14578)
- Add disk cache config for KVCacheManagerV2 (#14845)
- Add Wan I2V generation example (#14981)
- Add LTX-2 visual generation example (#14976)
- Update flashinfer-python from 0.6.12rc2 to 0.6.12 (#14805)
Fix
- Fix `mamba-out-of-block` error with ADP + BS=1 + disagg (#14853)
- Fix XQA IMA for invalid pages with sliding window (#14459)
- Propagate event loop errors to `await_responses` callers (#12735)
- Fix Mamba replay mode accuracy issues (#14509)
- Fix PyExecutor hang in disagg TP prefill (#14020)
- Fix stale runtime metadata issues during MLA fallback transitions (#14049)
- Fix KVCacheManagerV2 block counting correctness issues (#14725)
- Canonicalize multimodal cache-key serialization to prevent hash collisions (#14800)
- Fix LTX-2 audio PE padding issues (#14818)
- Release KVCacheManagerV1 blocks on MAX_UTILIZATION pause (#14723)
- Fix config sharing issue for Qwen3-VL (#14766)
- Enforce request and buffer index lifecycle integrity (#14768)
- Add nemotron-v3 as the proper nemotron-h reasoning parser (#14900)
- Clamp KV pool window sizes to `max_seq_len` (#14905)
- Fix mamba block calculation (#14524)
- Add `trust_remote_code=True` to the `LLM(...)` constructor to fix various model loading issues (#14892)
- Fix deep EP partial warp sync for GPT-OSS shapes (#14977)
- Add warmup for trtllm-gen fmha JIT kernels (#14851)
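
The cache-key fix (#14800) concerns a general hazard: if a key is serialized in a way that loses field boundaries or depends on field order, distinct inputs can hash to the same value, or equal inputs to different values. The sketch below illustrates that idea only. It is not the TensorRT-LLM implementation, and the function names and field names (`modality`, `id`) are hypothetical.

```python
import hashlib
import json


def naive_key(parts):
    """Hash fields by plain concatenation (ambiguous: field boundaries are lost)."""
    return hashlib.sha256("".join(parts).encode("utf-8")).hexdigest()


def canonical_key(fields):
    """Hash a dict of fields through one canonical byte representation.

    sort_keys removes dependence on insertion order, and the fixed
    separators remove whitespace variation, so equal inputs always
    serialize to the same bytes and field boundaries stay explicit.
    """
    payload = json.dumps(
        fields, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical multimodal items used only for illustration.
item_a = {"modality": "ab", "id": "c"}
item_b = {"modality": "a", "id": "bc"}

# Concatenation maps both items to the bytes "abc", so they collide.
collides = naive_key(["ab", "c"]) == naive_key(["a", "bc"])

# The canonical form keeps the two items distinct.
distinct = canonical_key(item_a) != canonical_key(item_b)
```

JSON is used here only because it is in the standard library; any encoding that is order-independent and preserves field boundaries (for example, length-prefixed fields) would serve the same purpose.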
Documentation
Test & Infra
- Pipe stderr separately in subprocess calls to improve error reporting in Allure (#14750)
- Remove obsolete tests (#14995, #14660, #14992, #14952, #14749)
- Parallelize post stages: Rerun Report, Test Coverage, and AI Failure Analysis (#14528)
- Relocate tests to right-sized stages (#14684)
- Move non-default-feature tests to post merge (#15038)
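
For readers unfamiliar with the stderr change (#14750), the general pattern is to capture stdout and stderr on separate pipes so a report can show the error stream on its own instead of interleaved with regular output. The following is a minimal sketch of that pattern with the standard library. The helper name `run_and_capture` is hypothetical, and the actual test harness code may differ.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run a command and return (returncode, stdout, stderr) as separate strings."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,  # regular output on its own pipe
        stderr=subprocess.PIPE,  # error output kept separate, not merged
        text=True,               # decode bytes to str
        check=False,             # let the caller inspect the return code
    )
    return result.returncode, result.stdout, result.stderr


# Example child process that writes to both streams and exits non-zero.
child = [
    sys.executable,
    "-c",
    "import sys; print('out'); print('err', file=sys.stderr); sys.exit(3)",
]
returncode, out, err = run_and_capture(child)
```

With the streams separated, a failure report can attach `err` as its own artifact while still keeping `out` for context.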
What's Changed
- [None][test] Update datasets path by @JennyLiu-nv in #14671
- [None][infra] Update new .test_durations by @EmmaQiaoCh in #14661
- [TRTLLM-13015][feat] drop complex visual_gen CLI example scripts by @zhenhuaw-me in #14632
- [https://nvbugs/6117811][fix] Fix XQA IMA for invalid pages with sliding window by @pengbowang-nv in #14459
- [None][feat] Tune mamba config by env variables by @Wanli-Jiang in #14730
- [None][test] Update moe backend for ctx and acceptance length env by @fredricz-20070104 in #14803
- [None][test] Update precision of previous device step time by @fredricz-20070104 in #14809
- [None][infra] Waive 12 failed cases for main in post-merge 2749 by @ZhanruiSunCh in #14802
- [TRTLLM-12971][infra] Fix parse classname logic in timeout result by @yiqingy0 in #14559
- [https://nvbugs/6038228][fix] Propagate event loop errors to await_responses callers by @JunyiXu-nv in #12735
- [TRTLLM-12288][feat] Support Nemotron-H nvfp4 ckpt on Hopper by @JadoTu in #14775
- [TRTLLM-12596][feat] Support simple logprob format by @tongyuantongyu in #13972
- [None][fix] Stabilize Mamba replay state update by @sunnyqgg in #14509
- [None][feat] Upgrade NIXL to v1.0.1 and UCX to 1.21 by @chuangz0 in #14436
- [None][feat] Refactor DWDP from CUDA IPC to CUDA VMM + MNNVL composite VA by @tianyuz-nv in #14453
- [TRTLLM-10947][perf] eagle3: use cudaMemcpy2DAsync custom op for hidden-state capture by @pcicotti in #14479
- [None][fix] PyExecutor Hang in Disagg TP Prefill by @jthomson04 in #14020
- [https://nvbugs/6240561][fix] Autodeploy fix the deepseek accuracy drop by @nvchenghaoz in #14774
- [#12702][feat] Autodeploy deprecate the legacy triton attention by @nvchenghaoz in #14194
- [None][test] Waive 5 failed cases for main in QA CI by @tensorrt-cicd in #14789
- [None][test] Waive 7 failed cases for main in QA CI by @tensorrt-cicd in #14791
- [https://nvbugs/6240561][fix] Fix AutoDeploy DeepSeek-R1 accuracy drop by @taylor-yb-lee in #14793
- [#14588][fix] [AutoDeploy] Fix OOM of DeepSeek-R1 NVFP4 for tp=4 by @taylor-yb-lee in #14477
- [https://nvbugs/6179761][fix] Save LTX-2 BF16 weights to speed up perf by @yibinl-nvidia in #14639
- [TRTLLM-13028][doc] Add VisualGen API walkthrough example and docs page by @zhenhuaw-me in #14685
- [None][chore] Update flashinfer-python from 0.6.12rc2 to 0.6.12 by @yihwang-nv in #14805
- [None][fix] AutoDeploy: Unwaive llmc standalone tests by @bmarimuthu-nv in #14700
- [TRTLLM-35882][feat] Add cute dsl gvr top-k decode kernel by @limin2021 in #14602
- [https://nvbugs/6222480][test] fix stress test issue on H100 by @xinhe-nv in #14721
- [None][test] Waive 6 failed cases for main in QA CI by @tensorrt-cicd in #14787
- [None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #14783
- [None][fix] synchronize MLA cache reuse fallback metadata by @DhineshPonnarasan in #14049
- [None][feat] Add KV cache prefetch by @lowsfer in #14748
- [https://nvbugs/6191524][fix] In MLA.forward_context, also call the warmup when has_cached_kv_for_mla_context by @tensorrt-cicd in #14536
- [None][test] Waive 2 failed cases for main in QA CI by @tensorrt-cicd in #14839
- [None][fix] Cherry-pick kv_cache_manager_v2 fixes to main by @lowsfer in #14725
- [None][test] Waive 11 failed cases for main in post-merge by @tensorrt-cicd in #14854
- [None][feat] Enable flashinfer gdn decoding kernel for qwen3.5 by @nv-guomingz in #13645
- [https://nvbugs/5940460][fix] Harden FP8 quant fusion matching after PyTorch 26.02 update by @pcicotti in #14697
- [https://nvbugs/6221450][fix] AutoDeploy: Qwen3.5 400B NVFP4 accuracy regression fix by @taylor-yb-lee in #14667
- [TRTLLM-12648][test] implement disagg cancel stress metrics_thread by @chienchunhung in #14807
- [None][chore] Update AD model list by @tcherckez-nvidia in #14686
- [https://nvbugs/6226933][fix] canonicalize multimodal cache-key serialization to prevent hash collisions by @venkywonka in #14800
- [https://nvbugs/6240561][fix] Unwaive DeepSeek R1 accuracy test by @taylor-yb-lee in #14870
- [None][feat] Add Qwen image support by @pst2154 in #13449
- [TRTLLM-12507][feat] Per-expert lora support with Cutlass backend by @brb-nv in #14801
- [None][chore] Make submit.py can run single GPU test and accept customized config file by @HuiGao-NV in #14630
- [None][test] Waive 9 failed cases for main in QA CI by @tensorrt-cicd in #14792
- [None][test] Update DSV32 32k4k config to avoid timeout issue by @chenfeiz0326 in #14856
- [None][chore] Bump version to 1.3.0rc18 by @yuanjingx87 in #14872
- [None][infra] Waive 5 failed cases for main in post-merge 2755 by @ZhanruiSunCh in #14883
- [None][fix] LTX-2 audio PE pad: use token-axis seq_dim=1 for token-major rope by @luyiyun1021 in #14818
- [#13082][fix] Fix-multimodal embedding mismatch by @aashirvad08 in #13240
- [None][fix] Pipe stderr separately in subprocess calls to improve error reporting in Allure (#14750) by @yufeiwu-nv in #14750
- [None][fix] Use renamed get_param_count_and_checkpoint_size in hybrid configs by @yufeiwu-nv in #14855
- [None][test] Remove duplicate test cases in llm_perf_core file by @yufeiwu-nv in #14749
- [None][test] Remove 28 closed-bug waive entries for main by @tensorrt-cicd in #14545
- [TRTLLM-13022][test] remove deprecated models from tests by @xinhe-nv in #14660
- [None][feat] Reserve one more slots for attention_dp in mixed mamba cache manager by @Wanli-Jiang in #14853
- [https://nvbugs/6195110][fix] Restore DeepSeek shared-weights vanilla MTP path by @zhaoyangwang-nvidia in #14457
- [#12359][feat] AutoDeploy: Support SSM replay kernel for MTP with FlashInfer by @galagam in #13725
- [None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #14857
- [None][chore] add attention module owner for VisualGen by @zhenhuaw-me in #14814
- [None][fix] release v1 KV blocks on MAX_UTILIZATION pause by @eopXD in #14723
- [None][perf] Reduce OpenAI stream postprocess overhead by @2ez4bz in #14708
- [None][fix] propagate chat prompt token ids (#14420) by @reasonsolo in #14859
- [https://nvbugs/6211193][fix] etcd listen all interfaces by @reasonsolo in #14863
- [#5247][fix] auto-detect local cnn_dailymail dataset by directory layout by @guan404ming in #13722
- [https://nvbugs/6248987][fix] Made the slow-tokenizer swap lazy and idempotent. `__init__` now just sets `_slo by @tensorrt-cicd in #14846
- [None][chore] redact internal NVIDIA URLs from exec-slurm-compile skill by @ssam18 in #14538
- [None][feat] VisualGen: Attention2D + Ulysses & Multi-GPU LPIPS Evals by @juney-nvidia in #13944
- [TRTLLM-13077][feat] Decompose post_load_weights() by @chienchunhung in #14770
- [None][fix] Fix config sharing issue for Qwen3-VL by @2ez4bz in #14766
- [https://nvbugs/6104831][fix] Enforce request and buffer index lifecycle integrity by @chienchunhung in #14768
- [None][feat] Add encoder CUDA graph support to llm.encode() by @tingyangk in #14326
- [None][feat] Support Step-3.7-Flash model by @kaiyux in #14711
- [https://nvbugs/6050489][chore] unwaive tests by @bo-nv in #14866
- [None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #14896
- [None][infra] Source code and container vulnerability fix by @yuanjingx87 in #14025
- [None][infra] Waive 11 failed cases for main in post-merge 2757 by @ZhanruiSunCh in #14925
- [https://nvbugs/5821415][test] update rtx6k test list by @xinhe-nv in #14929
- [None][fix] Update dataset identifier for cnn_dailymail to use namespaced repo in quantization scripts by @yufeiwu-nv in #14930
- [https://nvbugs/5979673][fix] Unwaive test_agent_multi_backends.py::test_run_with_different_env by @Shixiaowei02 in #14939
- [None][fix] Add nemotron-v3 as the proper nemotron-h reasoning parser by @Wanli-Jiang in #14900
- [https://nvbugs/6193836][test] Use EP=8 + attention DP for minimax_m2.5 8-GPU perf by @ruodil in #14613
- [TRTLLM-8236][infra] fix platform tag for public wheel by @niukuo in #14616
- [None][test] update bug ids in waives by @xinhe-nv in #14946
- [https://nvbugs/6244474][fix] AutoDeploy: skip explicit shape-prop after MLIR elementwise fusion by @tensorrt-cicd in #14795
- [None][infra] fix cbts json decode by @crazydemo in #14928
- [https://nvbugs/6222480][fix] Fix stress by @xinhe-nv in #14949
- [None][test] Decrease P1 models number and merge sanity test list into core by @yufeiwu-nv in #14952
- [None][perf] Use a Triton kernel for Cpp mamba hybrid state update by @VALLIS-NERIA in #14869
- [None][chore] Autodeploy unwaive 5888827, 6200112 by @galagam in #14894
- [NVBUG-6248780][fix] Add --decoupled flag to benchmark_core_model in multi-instance test by @karljang in #14888
- [TRTLLM-12870][feat] Support num_images_per_prompt for FLUX pipelines by @karljang in #14890
- [TRTLLM-12214][perf] DeepGemmFusedMoE: fuse masked gather + finalize-scale into one Triton kernel by @xwang233 in #14592
- [None][fix] Fix AutoDeploy accuracy tests by @bmarimuthu-nv in #13925
- [TRTLLMINF-69][infra] Migrate A100X-FMHA-Post-Merge-1 and A100X-Triton-Post-Merge-[1,2] to SLURM by @mlefeb01 in #14921
- [TRTLLM-11508][refactor] Merge Eagle3 and MTP-eagle one-model workers by @zhaoyangwang-nvidia in #12353
- [https://nvbugs/6143787][fix] Add `kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.6)` to TestQwen3 by @tensorrt-cicd in #13852
- [https://nvbugs/6248764][fix] Normalize non-sliding KV windows to full attention in AutoDeploy by @eopXD in #14906
- [https://nvbugs/6240420][fix] Clamp KV pool window sizes to max_seq_len by @eopXD in #14905
- [None][test] Fix the ci disagg perf local submit test scope too large issue to avoid HF Model not found by @fredricz-20070104 in #14989
- [None][test] remove outdated model in perf test by @ruodil in #14992
- [TRTLLM-12648][test] implement disagg cancellation injector thread by @chienchunhung in #14920
- [None][feat] Add AutoDeploy support for StepFun Step-3.7-Flash by @bmarimuthu-nv in #14759
- [None] [waive] Waive the failed step3p7 test case due to ckpt update by @kaiyux in #14997
- [https://nvbugs/6210714][fix] Fix mamba block calculation by @VALLIS-NERIA in #14524
- [https://nvbugs/5546507][test] Remove obsolet… by @xinhe-nv in #14995
- [None][fix] AutoDeploy: Move hf_id_to_local_model_dir to function for GLM4.7 Flash test by @bmarimuthu-nv in #14999
- [TRTLLM-12893][infra] Parallelize post stages: Rerun Report, Test Coverage, and AI Failure Analysis by @ZhanruiSunCh in #14528
- [None][infra] Waive 11 failed cases for main in post-merge 2760 by @ZhanruiSunCh in #15003
- [None][fix] Uncomment Qwen3.5 and DSR1 from model registry so that they can run f… by @taylor-yb-lee in #15001
- [TRTLLM-11410][feat] Cosmos3 Support by @NVShreyas in #14824
- [https://nvbugs/5859886][fix] Remove the waiver by @ziyixiong-nv in #14948
- [https://nvbugs/6248744][fix] Added `trust_remote_code=True` to the `LLM(...)` constructor and removed the… by @tensorrt-cicd in #14892
- [https://nvbugs/6160629][fix] AutoDeploy: Fix manual seed setting for standalone tests by @galagam in #14954
- [TRTLLM-12714][feat] KVCacheManagerV2: wire PyExecutor rebalance hook (single GPU, aggregated for now) by @thorjohnsen in #14578
- [None][feat] add Wan I2V generation example by @o-stoner in #14981
- [TRTLLM-12527][feat] Parallelize multi-shard visual-gen checkpoint loading and pre-fetch checking by @yibinl-nvidia in #14021
- [https://nvbugs/6250866][fix] Fix deep ep partial warp sync for gptoss shapes by @dongfengy in #14977
- [https://nvbugs/6272668][infra] Unwaive DSR1 and Qwen3.5 again by @taylor-yb-lee in #15010
- [None][feat] Afmoe trinity support by @alyosha-swamy in #13148
- [TRTLLM-13027][ci] Relocate under-using tests to right-sized stages by @QiJune in #14684
- [None][feat] Add LTX-2 visual generation example by @yibinl-nvidia in #14976
- [None][feat] AutoDeploy: Fix hardcoded configs by @taylor-yb-lee in #14943
- [#13718][feat] AutoDeploy MoE all-to-all: cache + runtime max-tokens by @greg-kwasniewski1 in #13723
- [TRTLLM-13177][doc] Add Nemotron 3 Ultra doc by @nv-guomingz in #14964
- [#10710][feat] Make explicit CLI flags take precedence over --config / --extra_llm_api_options YAML by @marinayanov in #14812
- [https://nvbugs/6260907][fix] unwaive test by @bo-nv in #15058
- [None][chore] Increase GB200-4_GPUs-PyTorch shards by @tburt-nv in #14836
- [TRTLLM-12648][test] implement disagg cancellation canary thread by @chienchunhung in #15015
- [TRTLLM-12507][feat] Cudagraph support for routed-expert MoE LoRA with Cutlass backend - Part 1 by @brb-nv in #14923
- [https://nvbugs/6245317][test] set Harmony tiktoken env for GPT-OSS disagg by @dongfengy in #14935
- [https://nvbugs/6153955][test] unwaive GPT-OSS w4 DP4 CUTLASS by @dongfengy in #14884
- [None][perf] kv_cache_manager_v2: batch block-key SHA-256 hashing by @lancelly in #14994
- [TRTLLM-13259][ci] Merge DGX_H100 DeepSeek and GptOss stages by @QiJune in #15035
- [None][infra] Waive 11 failed cases for main in post-merge 2765 by @ZhanruiSunCh in #15080
- [None][infra] Waive 3 failed cases for main in post-merge 2765 by @ZhanruiSunCh in #15082
- [None][test] waive weekly qa ci failure cases by @crazydemo in #15077
- [None][feat] AutoDeploy: propagate layer_type hint across pattern-matcher rewrites by @greg-kwasniewski1 in #14835
- [None][test] Waive 15 failed cases for main in QA CI by @tensorrt-cicd in #15056
- [None][infra] Waive 1 failed cases for main in pre-merge 41894 by @ZhanruiSunCh in #15089
- [TRTLLM-13262][ci] Move non-default-feature tests to post merge by @QiJune in #15038
- [None][feat] Enable disk cache config for KV cache v2 by @reasonsolo in #14845
- [https://nvbugs/6185446][fix] Add warmup for trtllm-gen fmha JIT kernels by @pengbowang-nv in #14851
- [https://nvbugs/6162940][chore] Unwaive fixed test by @longlee0622 in #15078
- [None][perf] Support Gemma RMSNorm + interleaved mRoPE in fused_qk_no… by @nv-guomingz in #14898
- [None][test] Half K25 Agg Multi Round to Solve Timeout Issue by @chenfeiz0326 in #15083
- [None][infra] Reduce Docker image layer count in release stage by @tburt-nv in #14972
- [#14828][feat] AutoDeploy: support multi KV cache memory pool in trtllm attention by @MrGeva in #14911
- [None][doc] Refine Nemotron Ultra doc by @nv-guomingz in #15113
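
One entry above (#14812) makes explicit CLI flags take precedence over values from `--config` / `--extra_llm_api_options` YAML. A common way to get that ordering is to record only the flags the user actually passed and merge them last. The sketch below assumes that approach. The flag names, the defaults, and the `resolve_options` helper are hypothetical and are not taken from the TensorRT-LLM CLI.

```python
import argparse

# Hypothetical built-in defaults, lowest priority.
DEFAULTS = {"max_batch_size": 8, "tp_size": 1}


def resolve_options(argv, yaml_options):
    """Merge options with priority: defaults < YAML file < explicit CLI flags."""
    # SUPPRESS keeps unset flags out of the namespace entirely, so only
    # flags the user typed on the command line can override YAML values.
    parser = argparse.ArgumentParser(argument_default=argparse.SUPPRESS)
    parser.add_argument("--max_batch_size", type=int)
    parser.add_argument("--tp_size", type=int)
    explicit = vars(parser.parse_args(argv))
    # Later dicts win on key conflicts.
    return {**DEFAULTS, **yaml_options, **explicit}


# The YAML file sets both values; the command line overrides only tp_size.
resolved = resolve_options(
    ["--tp_size", "4"], {"tp_size": 2, "max_batch_size": 16}
)
```

The key design point is that a parser default must not be mistaken for an explicit choice; otherwise every default would silently override the YAML file.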
New Contributors
- @DhineshPonnarasan made their first contribution in #14049
- @pst2154 made their first contribution in #13449
- @aashirvad08 made their first contribution in #13240
- @guan404ming made their first contribution in #13722
- @alyosha-swamy made their first contribution in #13148
Full Changelog: v1.3.0rc17...v1.3.0rc18