NVIDIA/TensorRT-LLM v1.3.0rc6

Pre-release · 12 hours ago

Highlights

  • Model Support

    • Add FLUX.1 and FLUX.2 text-to-image pipeline support (#11556)
    • Add GatedDeltaNet sharding from config (#11599)
    • Add B300 (sm103) support on VLMs (#11274)
    • Fix Nemotron H FP4 and MTP support (#11601)
    • Add quantized Eagle3 support by quantizing self.fc (#11699)
  • API

    • Add skip_pre_hopper flag for NVILA and Nano V2 VLMs (#11275)
    • Align LlmArgs with Pydantic best practices (#11158)
    • Restructure KV cache memory ratio parameters in curated YAML config files (#11511)
  • Feature

    • Refactor time breakdown tool (visualization, generation breakdown, etc.) (#11340)
    • Improve TorchSampler performance by reducing host overhead (#11315)
    • Use UE8M0 FP8 quant kernel for DeepGemm blockwise GEMM (#11607)
    • Implement dynamic quota resize for KVCacheManager v2 (#11503)
    • Add KVCache v2 MTP support (#11346)
    • Enhance performance dashboard (#11506)
    • Add E2E Python KV transceiver for current KV manager (step 5) (#11136)
    • Refactor KV connector (#11078)
    • Add GPU energy monitoring to trtllm-bench (#11397)
    • Support PEFT-saved safetensors file loading (#11339)
    • Improve FP8 (per-tensor) quant kernel with vectorized load/store (#11662)
    • Remove non-flash-attention-style fmha_v2 kernel for Hopper (#11381)
  • Fix

    • Fix missing sync before cuMemUnmap (#11641)
    • Fix message truncation in Helix CP cache transmission (#11252)
    • Fix GPT-OSS with non-paged_context_fmha (#11309)
    • Fix multi-node trust_remote_code hang in disaggregated serving (#11383)
    • Fix kwargs name (#11496)
    • Accept **kwargs in DynamicYamlWithDeepMergeSettingsSource (#11621)
    • Fix FP8 + skip-softmax attention accuracy issue on fmha_v2 (#11448)
    • Handle None priority in KVCacheEventSerializer._event_diff_to_json (#11576)
    • Fix WideEP gen-only benchmark hang in disaggregated serving (#11521)
    • Fix cancelled disaggregated requests getting stuck in gen server (#11695)
    • Fix DeepEP low-latency with DeepGEMM (#11700)
    • Recover from CUTLASS MoE doActivation perf regression for MXFP4/NVFP4 dtype (#11165)
    • Work around F.linear perf regression for GPTOSS (#11668)
    • Fix illegal memory access when max_seq_len > max_position_embeddings (#11598)
    • Prevent drift accumulation on kv_lens_cuda (#11696)
  • Documentation

    • Resolve conflicts in markdown documentation (#11255)
    • Move kimi-k2-thinking deployment guide configs into config files (#11645)
    • Rename svd-nvfp4 to trtllm-nvfp4 in visual generation examples (#11664)
    • Fix 60+ broken links across docs, blogs, and examples (#11676)
    • Update Qwen3-Next README server argument docs (#11682)
    • Update speculative decoding docs (#11604)
    • Update PR template (#11735)
    • Add Qwen3.5 cookbook (#11728)
  • Test & Infra

    • Enable Nemotron NVFP4 tests (#11172)
    • Prepare for NumPy v2 (#11389)
    • Add Python builds tests to CI pre-merge pipeline (#9943)
    • Disable warmup steps for some WAN unit tests (#11616)
    • Use the correct config for GPTOSS perf test (#11046)
    • Disable release Spark stage during Spark cloud migration (#11402)
    • Re-enable release Spark stage after Spark cloud migration (#11408)
    • Fix test prefix generation for per-SM waives (#11519)
    • Fix GPU memory requirement in stress test (#11404)
    • Do not create timeout XML if the stage is aborted (#9777)
    • Fix TritonMoE test for Qwen3_30B_A3B (#11495)
    • Refactor MoE unit tests with unified ConfigurableMoE framework (#11648)
    • Add comparison operators for perf regression triage (#11675)
    • Add WideEP DS-R1 NVFP4 test with attn_dp and kv_cache_reuse (#11670)
    • Add concurrency override and fix for 128k/8k cases (#11669)
    • Support short test case matcher in disaggregated test (#11707)
    • Fix multi-GPU tests (#11615)
    • Export HF_TOKEN in tests (#9382)
    • Automatically generate attributions file (#11323)
    • Update TRTLLM PLC pipeline (#11684)
    • Add timeout 14400 for SeedOSS (#11269)
    • Remove A100 test cases from QA perf scope (#11712)
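
Several items above touch layered YAML configuration, notably the deep-merge settings source fix (#11621) and the restructuring of KV cache memory ratio parameters in curated config files (#11511). As background, here is a minimal, hypothetical sketch of the deep-merge strategy such a settings source typically applies: nested mappings are merged key by key so a later config file overrides individual settings rather than replacing whole sections. The function and config keys below are illustrative assumptions, not TensorRT-LLM's actual implementation.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`.

    Nested dicts are merged key by key; any other value in `override`
    (including lists) replaces the base value wholesale.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Example: a curated base config overridden by a user config that only
# changes one nested key (keys here are hypothetical).
base = {"kv_cache": {"free_gpu_memory_fraction": 0.9, "enable_block_reuse": True}}
user = {"kv_cache": {"free_gpu_memory_fraction": 0.8}}
print(deep_merge(base, user))
# {'kv_cache': {'free_gpu_memory_fraction': 0.8, 'enable_block_reuse': True}}
```

With a plain `dict.update`, the user file would wipe out `enable_block_reuse`; the recursive merge preserves it while still applying the override.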

What's Changed

  • [None][chore] Enable Nemotron Super nvfp4 tests by @tcherckez-nvidia in #11172
  • [#11529][perf] Replace Python-traced FP8 quantization with optimized CUDA op in AD MoE by @MrGeva in #11626
  • [TRTLLM-10514][feat] Refactor time breakdown tool (visualization, generation breakdown, etc.) by @luyiyun1021 in #11340
  • [None][infra] Waive failed cases for main branch on 2/23 by @EmmaQiaoCh in #11635
  • [#11529][perf] AD NemotronH topk router to use the model default dtype by @MrGeva in #11623
  • [None][fix] numpy v2 preparations by @Funatiq in #11389
  • [#9907][infra] Add Python builds tests to CI pre-merge pipeline by @jieli-matrix in #9943
  • [https://nvbugs/5921273][fix] Fix an issue where sync is missing before cuMemUnmap by @lowsfer in #11641
  • [#11398][feat] AutoDeploy: flashinfer rope for GLM4.7-Flash by @taylor-yb-lee in #11524
  • [None][infra] Waive failed cases for main for post-merge 2550 by @EmmaQiaoCh in #11650
  • [TRTLLM-11567][feat] Added GatedDeltaNet sharding from config by @greg-kwasniewski1 in #11599
  • [None][fix] Nemotron H fp4 and MTP by @NVShreyas in #11601
  • [https://nvbugs/5919025][fix] Disable warmup steps for some WAN unit tests by @chang-l in #11616
  • [TRTLLM-10616][feat] Add FLUX.1 and FLUX.2 text-to-image pipeline support by @karljang in #11556
  • [#10243][chore] switched the default AD attention backend to trtllm by @MrGeva in #11627
  • [None][chore] Mass integration of release/1.2 - 5th by @dominicshanshan in #11636
  • [None][chore] Align LlmArgs with some Pydantic best practices by @anish-shanbhag in #11158
  • [None][perf] Use UE8M0 FP8 quant kernel for DeepGemm blockwise GEMM by @chang-l in #11607
  • [None][infra] Waive failed cases for main on 02/24 by @EmmaQiaoCh in #11665
  • [https://nvbugs/5846489][perf] Apply TE's FP8 per-tensor quantization by @yumin066 in #11057
  • [None][fix] Fix test prefix generation for per-sm waives by @tburt-nv in #11519
  • [None][chore] Weekly mass integration of release/1.2 by @mikeiovine in #11572
  • [TRTLLM-9781][infra] Don't create timeout xml if the stage is aborted by @yiqingy0 in #9777
  • [None][fix] Accept **kwargs in DynamicYamlWithDeepMergeSettingsSource… by @tcherckez-nvidia in #11621
  • [https://nvbugs/5606178][fix] unwaive two mamba2 tests by @JadoTu in #11479
  • [TRTLLM-9108][feat] refactor MoE unit tests: add unified ConfigurableMoE test framework by @xxi-nv in #11648
  • [None][fix] Add comparison operators for perf regression triage by @chenfeiz0326 in #11675
  • [None][test] Add wideep DS-R1 nvfp4 test with attn_dp and kv_cache_reuse by @StanleySun639 in #11670
  • [None][chore] Moving kimi-k2-thinking deployment guide configs to config files. by @fsaady in #11645
  • [TRTINFRA-7367][infra] Automatically generate attributions file by @tburt-nv in #11323
  • [None][fix] rename svd-nvfp4 to trtllm-nvfp4 in visual gen examples by @karljang in #11664
  • [None][fix] Restructure KV cache memory ratio parameters in curated .yaml config files by @xd-nv in #11511
  • [None][chore] Bump version to 1.3.0rc6 by @yuanjingx87 in #11688
  • [None][fix] Fix FP8 + Skip Softmax Attention accuracy issue on fmha_v2. by @bobboli in #11448
  • [TRTLLM-7836][feat] Implement dynamic quota resize for KVCacheManager v2 by @lowsfer in #11503
  • [#4666][fix] Handle None priority in KVCacheEventSerializer._event_diff_to_json by @wojciech-wais in #11576
  • [None][test] add concurrency override and fix for 128k/8k cases by @ruodil in #11669
  • [TRTLLM-9904][feat] KVCache V2 MTP support by @liji-nv in #11346
  • [None][test] support short test case matcher in disagg test by @ruodil in #11707
  • [TRTLLM-11614][feat] Fixing multigpu tests by @greg-kwasniewski1 in #11615
  • [None][docs] Fix 60+ broken links across docs, blogs, and examples by @kaiyux in #11676
  • [TRTLLM-8828][infra] export HF_TOKEN in tests by @niukuo in #9382
  • [None][chore] Add feature for enhance perf dashboard by @fredricz-20070104 in #11506
  • [TRTLLM-11106][chore] Abstract ADPRouter interface and RankState by @lancelly in #11633
  • [TRTLLM-9527][feat] E2E Python KV transceiver for current KV manager (step 5) by @chuangz0 in #11136
  • [None][chore] KV Connector Refactor by @jthomson04 in #11078
  • [https://nvbugs/5875514][fix] Fix WideEP gen-only benchmark hang in disaggregated serving by @peihu-nv in #11521
  • [TRTLLM-10948][feat] Add GPU energy monitoring to trtllm-bench by @inciaf in #11397
  • [https://nvbugs/5734983][doc] update Qwen3-Next readme of server arg by @JadoTu in #11682
  • [None][infra] Waive failed cases for main on 02/25 by @EmmaQiaoCh in #11719
  • [https://nvbugs/5866619][fix] Support PEFT-saved safetensors file loading by @Wanli-Jiang in #11339
  • [None][fix] Quantized Eagle3 support: quantizing self.fc by @h-guo18 in #11699
  • [https://nvbugs/5822983][fix] Update waives.txt to remove skipped tests for TestDeepSeekV3Lite in accuracy module by @chienchunhung in #11591
  • [https://nvbugs/5845901][fix] Fix cancelled disagg requests stuck in gen server by @Tabrizian in #11695
  • [TRTLLM-11087][doc] Update speculative decoding docs by @mikeiovine in #11604
  • [#11529][perf] AD host time attention MD optimization for large context by @MrGeva in #11624
  • [TRTLLM-11090][perf] Improve fp8 (per-tensor) quant kernel by vectorized load/store by @chang-l in #11662
  • [None][infra] Update TRTLLM PLC pipeline by @yuanjingx87 in #11684
  • [https://nvbugs/5884735][fix] fix DeepEP low-latency with DeepGEMM by @leslie-fang25 in #11700
  • [None][feat] Remove non-flash-attention-style fmha_v2 kernel for Hopper by @pengbowang-nv in #11381
  • [https://nvbugs/5799917][fix] Recover from CUTLASS MoE doActivation perf regression for MXFP4/NVFP4 dtype by @rosenrodt in #11165
  • [https://nvbugs/5914691][fix] WAR F.linear perf regression for GPTOSS by @dongfengy in #11668
  • [None][docs] Update PR template by @chzblych in #11735
  • [None][doc] Added Qwen3.5 Cookbook by @bmarimuthu-nv in #11728
  • [https://nvbugs/5915550][fix] Fix illegal memory access when max_seq_len > max_position_embeddings by @brb-nv in #11598
  • [https://nvbugs/5612438][fix] add timeout 14400 for SeedOSS by @zhhuang-nv in #11269
  • [https://nvbugs/5821053][fix] Preventing drift accumulation on kv_lens_cuda by @ziyixiong-nv in #11696
  • [None][test] Remove A100 test cases from QA perf scope by @yufeiwu-nv in #11712
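
The GPU energy monitoring added to trtllm-bench (#11397) amounts to integrating sampled power draw over the benchmark's wall-clock window. Below is a hedged, hardware-free sketch of that idea: an injectable `read_power_watts` callback stands in for an NVML power query, and samples are integrated with the trapezoidal rule. The function name, callback, and sampling scheme are illustrative assumptions, not the tool's actual code.

```python
import threading
import time


def measure_energy_joules(workload, read_power_watts, interval_s=0.05):
    """Run `workload()` while sampling power in a background thread.

    Collects (timestamp, watts) samples, then integrates them with the
    trapezoidal rule. Returns (workload_result, energy_joules).
    """
    samples = []
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            samples.append((time.monotonic(), read_power_watts()))
            stop.wait(interval_s)

    t = threading.Thread(target=sampler)
    t.start()
    try:
        result = workload()
    finally:
        stop.set()
        t.join()
        samples.append((time.monotonic(), read_power_watts()))

    # Trapezoidal rule: average power of each adjacent sample pair times
    # the time step between them.
    energy = sum(
        0.5 * (p0 + p1) * (t1 - t0)
        for (t0, p0), (t1, p1) in zip(samples, samples[1:])
    )
    return result, energy


# A constant 100 W reading over a ~0.2 s workload should integrate to
# roughly 20 J (timing jitter makes the exact value vary slightly).
_, joules = measure_energy_joules(lambda: time.sleep(0.2), lambda: 100.0)
print(f"{joules:.1f} J")
```

In a real tool the callback would query the GPU (e.g. via NVML power or energy counters) instead of returning a constant, and the reported joules would be paired with token counts to derive energy per token.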

New Contributors

Full Changelog: v1.3.0rc5...v1.3.0rc6
