Highlights
Model Support
API
Feature
- Refactor time breakdown tool (visualization, generation breakdown, etc.) (#11340)
- Improve TorchSampler performance by reducing host overhead (#11315)
- Use UE8M0 FP8 quant kernel for DeepGemm blockwise GEMM (#11607)
- Implement dynamic quota resize for KVCacheManager v2 (#11503)
- Add KVCache v2 MTP support (#11346)
- Enhance performance dashboard (#11506)
- Add E2E Python KV transceiver for current KV manager (step 5) (#11136)
- Refactor KV connector (#11078)
- Add GPU energy monitoring to trtllm-bench (#11397)
- Support PEFT-saved safetensors file loading (#11339)
- Improve FP8 (per-tensor) quant kernel with vectorized load/store (#11662)
- Remove non-flash-attention-style fmha_v2 kernel for Hopper (#11381)
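The FP8 per-tensor quantization improvement (#11662) is about vectorized load/store in the kernel, but the numerics it accelerates are simple to state. A minimal plain-Python sketch of per-tensor FP8-style quantization; the helper names are hypothetical, not TensorRT-LLM's actual API, and only the math is shown, not the memory-access vectorization:

```python
# Illustrative per-tensor FP8-style quantization (E4M3 finite max = 448).
# Hypothetical helpers; this shows only the numerics the kernel implements.
FP8_E4M3_MAX = 448.0

def quantize_per_tensor(values):
    """Map all values through one shared scale (per-tensor granularity)."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    # Divide by the scale and clamp into the representable FP8 range.
    quantized = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale))
                 for v in values]
    return quantized, scale

def dequantize_per_tensor(quantized, scale):
    """Recover approximate original values from the shared scale."""
    return [q * scale for q in quantized]
```

For inputs `[1.0, -2.0, 0.5, 4.0]` the shared scale is `4.0 / 448`, so the largest-magnitude element lands exactly at the FP8 limit and round-tripping recovers the inputs up to floating-point error.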
Fix
- Fix missing sync before `cuMemUnmap` (#11641)
- Fix message truncation in Helix CP cache transmission (#11252)
- Fix GPT-OSS with non-`paged_context_fmha` (#11309)
- Fix multi-node `trust_remote_code` hang in disaggregated serving (#11383)
- Fix kwargs name (#11496)
- Accept `**kwargs` in `DynamicYamlWithDeepMergeSettingsSource` (#11621)
- Fix FP8 + skip-softmax attention accuracy issue on `fmha_v2` (#11448)
- Handle `None` priority in `KVCacheEventSerializer._event_diff_to_json` (#11576)
- Fix WideEP gen-only benchmark hang in disaggregated serving (#11521)
- Fix cancelled disaggregated requests getting stuck in gen server (#11695)
- Fix DeepEP low-latency with DeepGEMM (#11700)
- Recover from CUTLASS MoE doActivation perf regression for MXFP4/NVFP4 dtype (#11165)
- Work around `F.linear` perf regression for GPT-OSS (#11668)
- Fix illegal memory access when `max_seq_len` > `max_position_embeddings` (#11598)
- Prevent drift accumulation on `kv_lens_cuda` (#11696)
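The `None`-priority fix (#11576) is the classic optional-field serialization guard. A minimal sketch with a hypothetical event shape that only mirrors the changelog entry, not TensorRT-LLM's actual `KVCacheEventSerializer`:

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class CacheEventDiff:
    # Hypothetical event shape for illustration only.
    block_hash: int
    priority: Optional[int] = None  # an unset priority is legal

def event_diff_to_json(event: CacheEventDiff) -> str:
    # Pass the optional field through as-is: json.dumps maps Python None
    # to JSON null instead of failing on an assumed-int priority.
    return json.dumps({"block_hash": event.block_hash,
                       "priority": event.priority})
```

For example, `event_diff_to_json(CacheEventDiff(block_hash=7))` yields `{"block_hash": 7, "priority": null}`.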
Documentation
- Resolve conflicts in markdown documentation (#11255)
- Move kimi-k2-thinking deployment guide configs into config files (#11645)
- Rename `svd-nvfp4` to `trtllm-nvfp4` in visual generation examples (#11664)
- Fix 60+ broken links across docs, blogs, and examples (#11676)
- Update Qwen3-Next README server argument docs (#11682)
- Update speculative decoding docs (#11604)
- Update PR template (#11735)
- Add Qwen3.5 cookbook (#11728)
Test & Infra
- Enable Nemotron NVFP4 tests (#11172)
- Prepare for NumPy v2 (#11389)
- Add Python builds tests to CI pre-merge pipeline (#9943)
- Disable warmup steps for some WAN unit tests (#11616)
- Use the correct config for GPTOSS perf test (#11046)
- Disable release Spark stage during Spark cloud migration (#11402)
- Re-enable release Spark stage after Spark cloud migration (#11408)
- Fix test prefix generation for per-SM waives (#11519)
- Fix GPU memory requirement in stress test (#11404)
- Do not create timeout XML if the stage is aborted (#9777)
- Fix TritonMoE test for Qwen3_30B_A3B (#11495)
- Refactor MoE unit tests with unified ConfigurableMoE framework (#11648)
- Add comparison operators for perf regression triage (#11675)
- Add WideEP DS-R1 NVFP4 test with `attn_dp` and `kv_cache_reuse` (#11670)
- Add concurrency override and fix for 128k/8k cases (#11669)
- Support short test case matcher in disaggregated test (#11707)
- Fix multi-GPU tests (#11615)
- Export `HF_TOKEN` in tests (#9382)
- Automatically generate attributions file (#11323)
- Update TRTLLM PLC pipeline (#11684)
- Add timeout 14400 for SeedOSS (#11269)
- Remove A100 test cases from QA perf scope (#11712)
What's Changed
- [None][chore] Enable Nemotron Super nvfp4 tests by @tcherckez-nvidia in #11172
- [#11529][perf] Replace Python-traced FP8 quantization with optimized CUDA op in AD MoE by @MrGeva in #11626
- [TRTLLM-10514][feat] Refactor time breakdown tool (visualization, generation breakdown, etc.) by @luyiyun1021 in #11340
- [None][infra] Waive failed cases for main branch on 2/23 by @EmmaQiaoCh in #11635
- [#11529][perf] AD NemotronH topk router to use the model default dtype by @MrGeva in #11623
- [None][fix] numpy v2 preparations by @Funatiq in #11389
- [#9907][infra] Add Python builds tests to CI pre-merge pipeline by @jieli-matrix in #9943
- [https://nvbugs/5921273][fix] Fix an issue where sync is missing before cuMemUnmap by @lowsfer in #11641
- [#11398][feat] AutoDeploy: flashinfer rope for GLM4.7-Flash by @taylor-yb-lee in #11524
- [None][infra] Waive failed cases for main for post-merge 2550 by @EmmaQiaoCh in #11650
- [TRTLLM-11567][feat] Added GatedDeltaNet sharding from config by @greg-kwasniewski1 in #11599
- [None][fix] Nemotron H fp4 and MTP by @NVShreyas in #11601
- [https://nvbugs/5919025][fix] Disable warmup steps for some WAN unit tests by @chang-l in #11616
- [TRTLLM-10616][feat] Add FLUX.1 and FLUX.2 text-to-image pipeline support by @karljang in #11556
- [#10243][chore] switched the default AD attention backend to trtllm by @MrGeva in #11627
- [None][chore] Mass integration of release/1.2 - 5th by @dominicshanshan in #11636
- [None][chore] Align LlmArgs with some Pydantic best practices by @anish-shanbhag in #11158
- [None][perf] Use UE8M0 FP8 quant kernel for DeepGemm blockwise GEMM by @chang-l in #11607
- [None][infra] Waive failed cases for main on 02/24 by @EmmaQiaoCh in #11665
- [https://nvbugs/5846489][perf] Apply TE's FP8 per-tensor quantization by @yumin066 in #11057
- [None][fix] Fix test prefix generation for per-sm waives by @tburt-nv in #11519
- [None][chore] Weekly mass integration of release/1.2 by @mikeiovine in #11572
- [TRTLLM-9781][infra] Don't create timeout xml if the stage is aborted by @yiqingy0 in #9777
- [None][fix] Accept **kwargs in DynamicYamlWithDeepMergeSettingsSource… by @tcherckez-nvidia in #11621
- [https://nvbugs/5606178][fix] unwaive mamba2 two tests by @JadoTu in #11479
- [TRTLLM-9108][feat] refactor MoE unit tests: add unified ConfigurableMoE test framework by @xxi-nv in #11648
- [None][fix] Add comparison operators for perf regression triage by @chenfeiz0326 in #11675
- [None][test] Add wideep DS-R1 nvfp4 test with attn_dp and kv_cache_reuse by @StanleySun639 in #11670
- [None][chore] Moving kimi-k2-thinking deployment guide configs to config files. by @fsaady in #11645
- [TRTINFRA-7367][infra] Automatically generate attributions file by @tburt-nv in #11323
- [None][fix] rename svd-nvfp4 to trtllm-nvfp4 in visual gen examples by @karljang in #11664
- [None] [fix] Restructure kv cache memory ratio parameters in curated .yaml config files by @xd-nv in #11511
- [None][chore] Bump version to 1.3.0rc6 by @yuanjingx87 in #11688
- [None][fix] Fix FP8 + Skip Softmax Attention accuracy issue on fmha_v2. by @bobboli in #11448
- [TRTLLM-7836][feat] Implement dynamic quota resize for KVCacheManager v2 by @lowsfer in #11503
- [#4666][fix] Handle None priority in KVCacheEventSerializer._event_diff_to_json by @wojciech-wais in #11576
- [None][test] add concurrency override and fix for 128k8k cases by @ruodil in #11669
- [TRTLLM-9904][feat] KVCache V2 MTP support by @liji-nv in #11346
- [None][test] support short test case matcher in disagg test by @ruodil in #11707
- [TRTLLM-11614][feat] Fixing multigpu tests by @greg-kwasniewski1 in #11615
- [None][docs] Fix 60+ broken links across docs, blogs, and examples by @kaiyux in #11676
- [TRTLLM-8828][infra] export HF_TOKEN in tests by @niukuo in #9382
- [None][chore] Add feature for enhance perf dashboard by @fredricz-20070104 in #11506
- [TRTLLM-11106][chore] Abstract ADPRouter interface and RankState by @lancelly in #11633
- [TRTLLM-9527][feat] E2E Python KV transceiver for current KV manager (step 5) by @chuangz0 in #11136
- [None][chore] KV Connector Refactor by @jthomson04 in #11078
- [https://nvbugs/5875514][fix] Fix WideEP gen-only benchmark hang in disaggregated serving by @peihu-nv in #11521
- [TRTLLM-10948][feat] Add GPU energy monitoring to trtllm-bench by @inciaf in #11397
- [https://nvbugs/5734983][doc] update Qwen3-Next readme of server arg by @JadoTu in #11682
- [None][infra] Waive failed cases for main on 02/25 by @EmmaQiaoCh in #11719
- [https://nvbugs/5866619][fix] Support PEFT-saved safetensors file loading by @Wanli-Jiang in #11339
- [None][fix] Quantized Eagle3 support: quantizing self.fc by @h-guo18 in #11699
- [https://nvbugs/5822983][fix] Update waives.txt to remove skipped tests for TestDeepSeekV3Lite in accuracy module by @chienchunhung in #11591
- [https://nvbugs/5845901][fix] Fix cancelled disagg requests stuck in gen server by @Tabrizian in #11695
- [TRTLLM-11087][doc] Update speculative decoding docs by @mikeiovine in #11604
- [#11529][perf] AD host time attention MD optimization for large context by @MrGeva in #11624
- [TRTLLM-11090][perf] Improve fp8 (per-tensor) quant kernel by vectorized load/store by @chang-l in #11662
- [None][infra] Update TRTLLM PLC pipeline by @yuanjingx87 in #11684
- [https://nvbugs/5884735][fix] fix deepeplowlatency with DeepGEMM by @leslie-fang25 in #11700
- [None][feat] Remove non-flash-attention-style fmha_v2 kernel for Hopper by @pengbowang-nv in #11381
- [https://nvbugs/5799917][fix] Recover from CUTLASS MoE doActivation perf regression for MXFP4/NVFP4 dtype by @rosenrodt in #11165
- [https://nvbugs/5914691][fix] WAR F.linear perf regression for GPTOSS by @dongfengy in #11668
- [None][docs] Update PR template by @chzblych in #11735
- [None][doc] Added Qwen3.5 Cookbook by @bmarimuthu-nv in #11728
- [https://nvbugs/5915550][fix] Fix illegal memory access when max_seq_len > max_position_embeddings by @brb-nv in #11598
- [https://nvbugs/5612438][fix] add timeout 14400 for SeedOSS by @zhhuang-nv in #11269
- [https://nvbugs/5821053][fix] Preventing drift accumulation on kv_lens_cuda by @ziyixiong-nv in #11696
- [None][test] Remove A100 test cases from QA perf scope by @yufeiwu-nv in #11712
New Contributors
- @xd-nv made their first contribution in #11511
- @wojciech-wais made their first contribution in #11576
- @inciaf made their first contribution in #11397
- @chienchunhung made their first contribution in #11591
Full Changelog: v1.3.0rc5...v1.3.0rc6