Highlights
Model Support
API
Feature
- Add exact multimodal KV block hashing and KV cache reuse probing (#13815, #14333)
- Add KV cache manager v2 with Python transceiver updates (#12928)
- Add disaggregated serving support with block reuse enabled for hybrid models (#14060)
- Add FlashInfer MLA attention backend support and SkipSoftmax sparse attention support for visual generation (#13428, #12947)
- Add Ring Attention and unified context parallelism for VisualGen (#13821)
- Add legacy and TensorRT-LLM 1.x ModelOpt quantization config support (#14088)
- Add debugging environment variables for Mamba modules (#14170)
- Add single-rank MPI sleep/wakeup support and a rank-0 collective_rpc shim (#14052)
- Add OpenTelemetry metrics for disaggregated serving with multiple postprocessing workers (#12637)
- Support SWA scratch reuse rewind (#14412)
- Improve FMHA, FlashInfer TRTLLM-Gen, and KV cache buffer calculation paths (#14291, #12525)
- Improve fused-kernel and attention performance with shared-expert combine fusion, paged MQA logits decode tuning, LTX2 fused RMSNorm/RoPE, EAGLE3 dynamic tree kernel optimizations, and cu_seqlens conversion updates (#14306, #14133, #13985, #13426, #13566)
- Optimize beam search candidate reconstruction by skipping prompt-prefix copies (#14197)
- Update cubins to resolve the FMHA PDL issue (#14462)
- Use CUDA 13 CUTLASS DSL package (#14354)
Fix
- Fix disaggregated benchmark, usage propagation, and worker registration stability issues (#13347, #14177, #14289)
- Fix DeepSeek-V3 OOM handling and artifacts paths (#14232)
- Fix missing get_draft_token_length import in py_executor (#14366)
- Fix LoRA load failure handling (#13517)
- Fix Kimi K2.5 speculative decoding behavior (#14379)
- Fix Qwen3HybridConfig layer_types derivation and route load_hf_model_config through AutoConfig (#13832, #14410)
- Fix CppMambaHybridCacheManager functional and performance issues (#14003)
- Fix MTP disaggregated speculative_config coverage (#14391)
- Fix KVCacheTransfer divide-by-zero and KV cache grain slot refinement issues (#13618, #14442)
- Fix memory usage during refit and EPLB config model loading (#14331, #11962)
- Fix MPI worker allocator configuration and GB300 cluster environment setup (#14152, #14460)
- Fix profiler runner exception handling with synchronized CUDA cleanup (#13469)
- Disable Mamba replay by default (#14471)
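
For the MPI worker allocator fix (#14152), the PR title refers to setting `PYTORCH_ALLOC_CONF=expandable_segments:True` on MPI workers. A minimal sketch of the general idea follows. The helper name `worker_env` and the choice to preserve a user-provided value are assumptions made for illustration, not the TensorRT-LLM implementation:

```python
import os


def worker_env(base=None):
    """Return a copy of the environment with the allocator setting applied.

    Hypothetical helper for illustration only. A PYTORCH_ALLOC_CONF value
    that is already present is left untouched (an assumed policy).
    """
    env = dict(os.environ if base is None else base)
    # setdefault only writes the key when it is missing.
    env.setdefault("PYTORCH_ALLOC_CONF", "expandable_segments:True")
    return env
```

The returned dictionary could then be passed as the `env` argument when spawning a worker process, for example with `subprocess.Popen`.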
Documentation
Benchmark
- Add LPIPS scoring for visual generation model regression tests (#13567)
- Add a bench_moe microbenchmark (#14507)
- Update visual generation and accuracy thresholds for Wan 2.2, Qwen3.5-4B DFlash, and Nano V3 (#14372, #14411, #14078)
- Disable ignore-eos when using speculative decoding in performance tests (#14347)
Test & Infra
- Split verl tests into fine-grained per-case wrappers (#14037)
- Add new stress cases (#14390)
- Clean outdated test duration entries and remove deprecated disaggregated sampler and Spark test cases (#14340, #14335, #14380)
- Isolate Ray tests to avoid GCS timeout in a single pytest session (#14342)
- Improve L0 retry timeout budgeting and cap infra retry attempts (#14323, #14415)
- Handle sacct errors when checking Slurm job status (#14367)
- Fix B300 MegaMoE and MoE test selection (#14362, #14401)
- Fix container scanning according to the latest security team guidance (#14430)
- Deduplicate miscellaneous unit tests on B200 (#14525)
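
The L0 retry changes (#14323, #14415) make retries timeout-budget aware and cap infra retries at 2 attempts total. A rough sketch of that pattern is shown here. The function `run_with_budget`, its parameters, and the injected clock are hypothetical names chosen for illustration, and the actual CI logic may differ:

```python
import time


def run_with_budget(task, budget_s, est_run_s, max_attempts=2, clock=time.monotonic):
    """Run `task` until it succeeds, attempts run out, or the budget is too small.

    `task` is a callable returning True on success. A retry is started only
    if the remaining budget covers the estimated run time `est_run_s`.
    Returns a tuple (succeeded, attempts_used).
    """
    start = clock()
    attempts = 0
    while attempts < max_attempts:
        remaining = budget_s - (clock() - start)
        if attempts > 0 and remaining < est_run_s:
            break  # not enough time left for another full attempt
        attempts += 1
        if task():
            return True, attempts
    return False, attempts
```

Passing the clock as a parameter keeps the sketch easy to test without real waiting.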
What's Changed
- [None][chore] Update Claude Code agents and skills by @kaiyux in #14344
- [None][perf] Fuse sigmoid+mul+add shared-expert combine into one Trit… by @nv-guomingz in #14306
- [None][infra] Waive 1 failed cases for main in pre-merge 38925 by @ZhanruiSunCh in #14346
- [None][infra] Revert Mingyang back to mingyangHao in allowlist by @ZhanruiSunCh in #14349
- [None][cleanup] MistralSmall related cleanups by @2ez4bz in #14271
- [None][chore] Clean test_durations file by removing outdated items. by @nv-guomingz in #14340
- [None][infra] Waive 2 failed cases for main in post-merge 2725 by @ZhanruiSunCh in #14357
- [None][feat] Exact multimodal KV blockhashing by @venkywonka in #13815
- [None][infra] Waive 1 failed cases for main in pre-merge 38987 by @ZhanruiSunCh in #14350
- [None][feat] Update the logic of FMHA JIT path by @heyuhhh in #14291
- [None][feat] opentelemetry metrics for num_postproc_workers > 0 disagg by @karen-sy in #12637
- [TRTLLM-12385][feat] Use LPIPS score for visual gen model regression test by @yibinl-nvidia in #13567
- [None][chore] Remove closed bugs by @xinhe-nv in #14217
- [https://nvbugs/6133201][fix] Bump GEN max_num_tokens in disagg perf YAMLs by @xwang233 in #14191
- [None][feat] add single-rank MPI sleep/wakeup and rank-0 collective_rpc shim by @hhzhang16 in #14052
- [https://nvbugs/6093911][fix] Fix disagg gen-only benchmark hang under ADP router imbalance by @chienchunhung in #13347
- [None][fix] Import missing get_draft_token_length in py_executor by @nv-guomingz in #14366
- [TRTLLM-12342][feat] Ring Attention, Unified Context Parallel for VisualGen by @NVShreyas in #13821
- [None][test] Split verl tests into 19 fine-grained per-case wrappers by @Superjomn in #14037
- [TRTLLM-11547][feat] Add Qwen3.5 MTP support. by @nv-guomingz in #12646
- [https://nvbugs/6143599][fix] DeepSeek-V3 OOM and artifacts path by @dominicshanshan in #14232
- [https://nvbugs/6114141][test] Remove deprecated disagg trtllm_sampler test by @Shixiaowei02 in #14335
- [None][doc] Add Claude skill for multimodal model onboarding by @yechank-nvidia in #13842
- [https://nvbugs/6141803][fix] Skip Qwen3.5-4B tests pre-hopper by @amukkara in #14055
- [None][fix] ADP router crashes on serve when scheduling_params.attent… by @nv-guomingz in #14267
- [https://nvbugs/6185190][doc] fix invalid links in doc by @nv-guomingz in #14337
- [None][feat] Refactor to support legacy and 1.x modelopt quant config format by @Wanli-Jiang in #14088
- [None][feature] Add env variables to help debugging mamba modules. by @Wanli-Jiang in #14170
- [None][infra] Handle sacct error when checking slurm job status by @yuanjingx87 in #14367
- [https://nvbugs/6027594][fix] Unwaive testcase by @YihuiLu512 in #14383
- [None][chore] Remove unnecessary buffer to save memory during refit by @shuyixiong in #14331
- [https://nvbugs/6153638][fix] unwaive tests for testing the flaky issue by @JunyiXu-nv in #14284
- [https://nvbugs/6171743][fix] Set `PYTORCH_ALLOC_CONF=expandable_segments:True` on MPI workers via `patch_mpi_ by @tensorrt-cicd in #14152
- [None][test] Add new stress cases by @fredricz-20070104 in #14390
- [None][feat] Gemma4 MM: native vision + audio towers by @Hudayday in #14300
- [TRTLLM-12719][cbts] Add core code related rule by @crazydemo in #14266
- [None][test] Update bug ID for test_all_optimizations_combined waiver by @mzweilz in #14402
- [None][infra] Waive 6 failed cases for main in post-merge 2726 by @ZhanruiSunCh in #14405
- [None][test] Disable ignore-eos when Spec Decoding in Perf Test by @chenfeiz0326 in #14347
- [None][fix] Isolate ray tests to avoid GCS timeout in one pytest session by @shuyixiong in #14342
- [https://nvbugs/6110638][fix] Mark AutoDeploy attention DP world sizes by GPU count by @galagam in #14148
- [None][feat] EXAONE-4.5 Support by @yechank-nvidia in #12873
- [https://nvbugs/6004530][fix] Unwaive Qwen3.5 35B A3B FP8 test case by @nv-guomingz in #14406
- [None][feat] add KV cache reuse probe by @lowsfer in #14333
- [TRTLLM-12706][perf] Optimize beam search candidate reconstruction by skipping prompt-prefix copies by @xuanzic in #14197
- [#12359][feat] AutoDeploy: MTP performance: Integrate FI kernel for extend path by @galagam in #13711
- [TRTLLMINF-89][feat] Make L0 retries timeout-budget aware by @dpitman-nvda in #14323
- [TRTLLM-12500][feat] Add support for Qwen3.5 VL MoE by @moraxu in #14164
- [None][chore] Bump version to 1.3.0rc16 by @VALLIS-NERIA in #14422
- [None][feat] Add Laguna model support (Poolside Laguna-XS.2) by @DomBrown in #13559
- [None][fix] Reduce host memory usage during EPLB config model loading. by @jthomson04 in #11962
- [https://nvbugs/6079440][test] Unwaive MTP speculative decoding test by @sunnyqgg in #14341
- [https://nvbugs/5911709][fix] Wrap lora load failures by @yibinl-nvidia in #13517
- [https://nvbugs/6175060][fix] Fix B300 MegaMoE test selection by @xxi-nv in #14362
- [None][fix] Reject incompatible KV connector configurations at construction time by @jthomson04 in #13577
- [https://nvbugs/6115562][fix] defer worker registration until HTTP server is accepting by @reasonsolo in #14289
- [None][chore] Fix Kimi_k25 with spec dec by @ziyixiong-nv in #14379
- [None][feat] KV cache manager v2 + python transceiver bug fix by @chuangz0 in #12928
- [https://nvbugs/6141606][fix] Move the `layer_types` derivation into `Qwen3HybridConfig.from_hf` (where `pretr by @tensorrt-cicd in #13832
- [None][fix] Fix CppMambaHybridCacheManager functional and perf issues by @VALLIS-NERIA in #14003
- [TRTLLM-35237][feat] Tune cute dsl paged MQA logits decode kernel by @limin2021 in #14133
- [https://nvbugs/6185182][fix] Adjust H20 accuracy for Qwen3.5-4B DFlash by @tensorrt-cicd in #14411
- [None][test] Remove the testcases for spark by @JennyLiu-nv in #14380
- [None][fix] Add MTP speculative_config to wideep dep48 mtp3 disagg yaml by @yingguo-trt in #14391
- [https://nvbugs/6185234][fix] route load_hf_model_config via AutoConfig by @Hudayday in #14410
- [https://nvbugs/6110074][fix] Add torch.cuda.synchronize() wrapped in try/except in the _profile_runners excep by @tensorrt-cicd in #13469
- [None][feat] Disable shared paged index in flashinfer trtllm-gen fmha kernel and unify kv cache buffer calculation with thop.attention by @yihwang-nv in #12525
- [https://nvbugs/6112497][test] Unwaive passing test by @yihwang-nv in #14387
- [None][fix] cold-start warmup for KV-aware ADP router by @lancelly in #14307
- [https://nvbugs/6185212][fix] Fix B300 MoE test list ids by @xxi-nv in #14401
- [None][infra] Waive 1 failed cases for main in pre-merge 39395 by @ZhanruiSunCh in #14455
- [https://nvbugs/6156492][fix] Fix disaggregated usage propagation by @reasonsolo in #14177
- [https://nvbugs/6106659][fix] Add Qwen3.6-27B-FP8 support. by @nv-guomingz in #14359
- [None][chore] Use CUDA 13 CUTLASS DSL package by @xxi-nv in #14354
- [None][doc] Update Gemma 4 entries in supported-models.md by @Hudayday in #14463
- [None][infra] Waive 2 failed cases for main in post-merge by @xinhe-nv in #14450
- [None][fix] Cap infra-retry budget at 2 attempts total by @dpitman-nvda in #14415
- [None][infra] Fix container scanning according to security teams latest update by @yuanjingx87 in #14430
- [https://nvbugs/6184143][fix] AutoDeploy: Fix newly added unit tests for Transformers 5.5.3 by @govind-ramnarayan in #14273
- [https://nvbugs/6185192][fix] raise Wan 2.2 VBench threshold by @o-stoner in #14372
- [TRTLLM-11320][refactor] Refactor VisualGenArgs API and registry by @zhenhuaw-me in #14175
- [None][chore] bump transformers to 5.5.4 by @longlee0622 in #14456
- [https://nvbugs/5914391][fix] Add OpenAI chat logit bias validation by @yibinl-nvidia in #13518
- [None][feat] Revert Add support for Qwen3.5 VL MoE (#14164) by @nv-guomingz in #14465
- [None][feat] Disable mamba replay by default by @tijyojwad in #14471
- [None][feat] Update cubins to resolve FMHA PDL issue by @heyuhhh in #14462
- [None][infra] Waive 1 failed cases for main in pre-merge 39582 by @ZhanruiSunCh in #14485
- [TRTLLM-12580][perf] ltx2: fused RMSNorm+RoPE across all attention paths + PE pre-shard by @luyiyun1021 in #13985
- [None][perf] EAGLE3 dynamic tree kernel optimizations by @sunnyqgg in #13426
- [https://nvbugs/6079901][fix] Avoid divide-by-zero in KVCacheTransfer… by @farazkh80 in #13618
- [None][feat] support SWA scratch reuse rewind by @lowsfer in #14412
- [#14173][tests] move autodeploy accuracy tests to post merge and use model registry by @MrGeva in #14352
- [TRTLLM-12027][feat] Disagg serving support with block reuse ON for hybrid models by @bo-nv in #14060
- [None][chore] Drop sink_token_length from PyTorch attention surface by @yuxianq in #14275
- [https://nvbugs/6185248][test] Unwaive K2.5 thinking MTP3 perf sanity test by @tianyuxbear in #14461
- [None][feat] Add FlashInfer MLA attention backend support by @Tracin in #13428
- [None][feat] Add SkipSoftmax sparse attention support for visual generation by @karljang in #12947
- [https://nvbugs/6190759][fix] set env on some gb300 cluster by @chuangz0 in #14460
- [https://nvbugs/6157131][fix] lower the GSM8K accuracy grade for Nano V3 by @tcherckez-nvidia in #14078
- [None][fix] Fix KV cache grain slot refinement by @lowsfer in #14442
- [None][test] Waive 7 failed cases for main in QA CI by @xinhe-nv in #14504
- [None][test] Waive 1 failed cases for main in QA CI by @xinhe-nv in #14503
- [None][infra] Waive 10 failed cases for main in post-merge 2733 by @ZhanruiSunCh in #14514
- [https://nvbugs/6094070][fix] Skip ray-marked integration tests when --run-ray is not set by @shikicloud in #14226
- [TRTLLM-12635][feat] add bench_moe microbenchmark by @xxi-nv in #14507
- [None][fix] Unwaive Qwen3.5 bf16 mtp on case by @nv-guomingz in #14511
- [None][fix] fix typo by @bo-nv in #14510
- [https://nvbugs/6215684][fix] Fix invalid links in deployment guide by @nv-guomingz in #14522
- [https://nvbugs/6120981][fix] Switch to cu_seqlens_to_chunk_indices_offsets_triton with total_seqlens/extra_ch by @tensorrt-cicd in #13566
- [None][infra] Waive 5 failed cases for main in post-merge 2733 by @ZhanruiSunCh in #14516
- [TRTLLM-13429][feat] Switch DeepSeek/NemotronH/Qwen3/Qwen3.5-MoE to sharding-IR canonical models by @greg-kwasniewski1 in #13478
- [TRTLLM-12942][ci] Dedup misc unit tests on B200 by @YihuiLu512 in #14525
- [None][infra] Waive 9 failed cases for main in post-merge by @xinhe-nv in #14515
New Contributors
Full Changelog: v1.3.0rc15...v1.3.0rc16