NVIDIA/TensorRT-LLM v1.3.0rc6

Pre-release · 12 hours ago

Highlights

  • Model Support

    • Add FLUX.1 and FLUX.2 text-to-image pipeline support (#11556)
    • Add GatedDeltaNet sharding from config (#11599)
    • Add B300 (sm103) support on VLMs (#11274)
    • Fix Nemotron H FP4 and MTP support (#11601)
    • Add quantized Eagle3 support by quantizing self.fc (#11699)
  • API

    • Add skip_pre_hopper flag for NVILA and Nano V2 VLMs (#11275)
    • Align LlmArgs with Pydantic best practices (#11158)
    • Restructure KV cache memory ratio parameters in curated YAML config files (#11511)
  • Feature

    • Refactor time breakdown tool (visualization, generation breakdown, etc.) (#11340)
    • Improve TorchSampler performance by reducing host overhead (#11315)
    • Use UE8M0 FP8 quant kernel for DeepGemm blockwise GEMM (#11607)
    • Implement dynamic quota resize for KVCacheManager v2 (#11503)
    • Add KVCache v2 MTP support (#11346)
    • Enhance performance dashboard (#11506)
    • Add E2E Python KV transceiver for current KV manager (step 5) (#11136)
    • Refactor KV connector (#11078)
    • Add GPU energy monitoring to trtllm-bench (#11397)
    • Support PEFT-saved safetensors file loading (#11339)
    • Improve FP8 (per-tensor) quant kernel with vectorized load/store (#11662)
    • Remove non-flash-attention-style fmha_v2 kernel for Hopper (#11381)
  • Fix

    • Fix missing sync before cuMemUnmap (#11641)
    • Fix message truncation in Helix CP cache transmission (#11252)
    • Fix GPT-OSS with non-paged_context_fmha (#11309)
    • Fix multi-node trust_remote_code hang in disaggregated serving (#11383)
    • Fix kwargs name (#11496)
    • Accept **kwargs in DynamicYamlWithDeepMergeSettingsSource (#11621)
    • Fix FP8 + skip-softmax attention accuracy issue on fmha_v2 (#11448)
    • Handle None priority in KVCacheEventSerializer._event_diff_to_json (#11576)
    • Fix WideEP gen-only benchmark hang in disaggregated serving (#11521)
    • Fix cancelled disaggregated requests getting stuck in gen server (#11695)
    • Fix DeepEP low-latency with DeepGEMM (#11700)
    • Recover from CUTLASS MoE doActivation perf regression for MXFP4/NVFP4 dtype (#11165)
    • Work around F.linear perf regression for GPTOSS (#11668)
    • Fix illegal memory access when max_seq_len > max_position_embeddings (#11598)
    • Prevent drift accumulation on kv_lens_cuda (#11696)
  • Documentation

    • Resolve conflicts in markdown documentation (#11255)
    • Move kimi-k2-thinking deployment guide configs into config files (#11645)
    • Rename svd-nvfp4 to trtllm-nvfp4 in visual generation examples (#11664)
    • Fix 60+ broken links across docs, blogs, and examples (#11676)
    • Update Qwen3-Next README server argument docs (#11682)
    • Update speculative decoding docs (#11604)
    • Update PR template (#11735)
    • Add Qwen3.5 cookbook (#11728)
  • Test & Infra

    • Enable Nemotron NVFP4 tests (#11172)
    • Prepare for NumPy v2 (#11389)
    • Add Python builds tests to CI pre-merge pipeline (#9943)
    • Disable warmup steps for some WAN unit tests (#11616)
    • Use the correct config for GPTOSS perf test (#11046)
    • Disable release Spark stage during Spark cloud migration (#11402)
    • Re-enable release Spark stage after Spark cloud migration (#11408)
    • Fix test prefix generation for per-SM waives (#11519)
    • Fix GPU memory requirement in stress test (#11404)
    • Do not create timeout XML if the stage is aborted (#9777)
    • Fix TritonMoE test for Qwen3_30B_A3B (#11495)
    • Refactor MoE unit tests with unified ConfigurableMoE framework (#11648)
    • Add comparison operators for perf regression triage (#11675)
    • Add WideEP DS-R1 NVFP4 test with attn_dp and kv_cache_reuse (#11670)
    • Add concurrency override and fix for 128k/8k cases (#11669)
    • Support short test case matcher in disaggregated test (#11707)
    • Fix multi-GPU tests (#11615)
    • Export HF_TOKEN in tests (#9382)
    • Automatically generate attributions file (#11323)
    • Update TRTLLM PLC pipeline (#11684)
    • Add timeout 14400 for SeedOSS (#11269)
    • Remove A100 test cases from QA perf scope (#11712)
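
Several items above touch layered YAML configuration, notably the deep-merge settings source fix (#11621) and the restructuring of KV cache memory ratio parameters in curated config files (#11511). As background, here is a minimal, hypothetical sketch of the deep-merge strategy such a settings source typically applies: nested mappings are merged key by key so a later config file overrides individual settings rather than replacing whole sections. The function and config keys below are illustrative assumptions, not TensorRT-LLM's actual implementation.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`.

    Nested dicts are merged key by key; any other value in `override`
    (including lists) replaces the base value wholesale.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Example: a curated base config overridden by a user config that only
# changes one nested key (keys here are hypothetical).
base = {"kv_cache": {"free_gpu_memory_fraction": 0.9, "enable_block_reuse": True}}
user = {"kv_cache": {"free_gpu_memory_fraction": 0.8}}
print(deep_merge(base, user))
# {'kv_cache': {'free_gpu_memory_fraction': 0.8, 'enable_block_reuse': True}}
```

With a plain `dict.update`, the user file would wipe out `enable_block_reuse`; the recursive merge preserves it while still applying the override.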

What's Changed

  • [None][chore] Enable Nemotron Super nvfp4 tests by @tcherckez-nvidia in #11172
  • [#11529][perf] Replace Python-traced FP8 quantization with optimized CUDA op in AD MoE by @MrGeva in #11626
  • [TRTLLM-10514][feat] Refactor time breakdown tool (visualization, generation breakdown, etc.) by @luyiyun1021 in #11340
  • [None][infra] Waive failed cases for main branch on 2/23 by @EmmaQiaoCh in #11635
  • [#11529][perf] AD NemotronH topk router to use the model default dtype by @MrGeva in #11623
  • [None][fix] numpy v2 preparations by @Funatiq in #11389
  • [#9907][infra] Add Python builds tests to CI pre-merge pipeline by @jieli-matrix in #9943
  • [https://nvbugs/5921273][fix] Fix an issue where sync is missing before cuMemUnmap by @lowsfer in #11641
  • [#11398][feat] AutoDeploy: flashinfer rope for GLM4.7-Flash by @taylor-yb-lee in #11524
  • [None][infra] Waive failed cases for main for post-merge 2550 by @EmmaQiaoCh in #11650
  • [TRTLLM-11567][feat] Added GatedDeltaNet sharding from config by @greg-kwasniewski1 in #11599
  • [None][fix] Nemotron H fp4 and MTP by @NVShreyas in #11601
  • [https://nvbugs/5919025][fix] Disable warmup steps for some WAN unit tests by @chang-l in #11616
  • [TRTLLM-10616][feat] Add FLUX.1 and FLUX.2 text-to-image pipeline support by @karljang in #11556
  • [#10243][chore] switched the default AD attention backend to trtllm by @MrGeva in #11627
  • [None][chore] Mass integration of release/1.2 - 5th by @dominicshanshan in #11636
  • [None][chore] Align LlmArgs with some Pydantic best practices by @anish-shanbhag in #11158
  • [None][perf] Use UE8M0 FP8 quant kernel for DeepGemm blockwise GEMM by @chang-l in #11607
  • [None][infra] Waive failed cases for main on 02/24 by @EmmaQiaoCh in #11665
  • [https://nvbugs/5846489][perf] Apply TE's FP8 per-tensor quantization by @yumin066 in #11057
  • [None][fix] Fix test prefix generation for per-sm waives by @tburt-nv in #11519
  • [None][chore] Weekly mass integration of release/1.2 by @mikeiovine in #11572
  • [TRTLLM-9781][infra] Don't create timeout xml if the stage is aborted by @yiqingy0 in #9777
  • [None][fix] Accept **kwargs in DynamicYamlWithDeepMergeSettingsSource… by @tcherckez-nvidia in #11621
  • [https://nvbugs/5606178][fix] unwaive two mamba2 tests by @JadoTu in #11479
  • [TRTLLM-9108][feat] refactor MoE unit tests: add unified ConfigurableMoE test framework by @xxi-nv in #11648
  • [None][fix] Add comparison operators for perf regression triage by @chenfeiz0326 in #11675
  • [None][test] Add wideep DS-R1 nvfp4 test with attn_dp and kv_cache_reuse by @StanleySun639 in #11670
  • [None][chore] Moving kimi-k2-thinking deployment guide configs to config files. by @fsaady in #11645
  • [TRTINFRA-7367][infra] Automatically generate attributions file by @tburt-nv in #11323
  • [None][fix] rename svd-nvfp4 to trtllm-nvfp4 in visual gen examples by @karljang in #11664
  • [None][fix] Restructure KV cache memory ratio parameters in curated .yaml config files by @xd-nv in #11511
  • [None][chore] Bump version to 1.3.0rc6 by @yuanjingx87 in #11688
  • [None][fix] Fix FP8 + Skip Softmax Attention accuracy issue on fmha_v2. by @bobboli in #11448
  • [TRTLLM-7836][feat] Implement dynamic quota resize for KVCacheManager v2 by @lowsfer in #11503
  • [#4666][fix] Handle None priority in KVCacheEventSerializer._event_diff_to_json by @wojciech-wais in #11576
  • [None][test] add concurrency override and fix for 128k/8k cases by @ruodil in #11669
  • [TRTLLM-9904][feat] KVCache V2 MTP support by @liji-nv in #11346
  • [None][test] support short test case matcher in disagg test by @ruodil in #11707
  • [TRTLLM-11614][feat] Fixing multigpu tests by @greg-kwasniewski1 in #11615
  • [None][docs] Fix 60+ broken links across docs, blogs, and examples by @kaiyux in #11676
  • [TRTLLM-8828][infra] export HF_TOKEN in tests by @niukuo in #9382
  • [None][chore] Add feature for enhance perf dashboard by @fredricz-20070104 in #11506
  • [TRTLLM-11106][chore] Abstract ADPRouter interface and RankState by @lancelly in #11633
  • [TRTLLM-9527][feat] E2E Python KV transceiver for current KV manager (step 5) by @chuangz0 in #11136
  • [None][chore] KV Connector Refactor by @jthomson04 in #11078
  • [https://nvbugs/5875514][fix] Fix WideEP gen-only benchmark hang in disaggregated serving by @peihu-nv in #11521
  • [TRTLLM-10948][feat] Add GPU energy monitoring to trtllm-bench by @inciaf in #11397
  • [https://nvbugs/5734983][doc] update Qwen3-Next readme of server arg by @JadoTu in #11682
  • [None][infra] Waive failed cases for main on 02/25 by @EmmaQiaoCh in #11719
  • [https://nvbugs/5866619][fix] Support PEFT-saved safetensors file loading by @Wanli-Jiang in #11339
  • [None][fix] Quantized Eagle3 support: quantizing self.fc by @h-guo18 in #11699
  • [https://nvbugs/5822983][fix] Update waives.txt to remove skipped tests for TestDeepSeekV3Lite in accuracy module by @chienchunhung in #11591
  • [https://nvbugs/5845901][fix] Fix cancelled disagg requests stuck in gen server by @Tabrizian in #11695
  • [TRTLLM-11087][doc] Update speculative decoding docs by @mikeiovine in #11604
  • [#11529][perf] AD host time attention MD optimization for large context by @MrGeva in #11624
  • [TRTLLM-11090][perf] Improve fp8 (per-tensor) quant kernel by vectorized load/store by @chang-l in #11662
  • [None][infra] Update TRTLLM PLC pipeline by @yuanjingx87 in #11684
  • [https://nvbugs/5884735][fix] fix DeepEP low-latency with DeepGEMM by @leslie-fang25 in #11700
  • [None][feat] Remove non-flash-attention-style fmha_v2 kernel for Hopper by @pengbowang-nv in #11381
  • [https://nvbugs/5799917][fix] Recover from CUTLASS MoE doActivation perf regression for MXFP4/NVFP4 dtype by @rosenrodt in #11165
  • [https://nvbugs/5914691][fix] WAR F.linear perf regression for GPTOSS by @dongfengy in #11668
  • [None][docs] Update PR template by @chzblych in #11735
  • [None][doc] Added Qwen3.5 Cookbook by @bmarimuthu-nv in #11728
  • [https://nvbugs/5915550][fix] Fix illegal memory access when max_seq_len > max_position_embeddings by @brb-nv in #11598
  • [https://nvbugs/5612438][fix] add timeout 14400 for SeedOSS by @zhhuang-nv in #11269
  • [https://nvbugs/5821053][fix] Preventing drift accumulation on kv_lens_cuda by @ziyixiong-nv in #11696
  • [None][test] Remove A100 test cases from QA perf scope by @yufeiwu-nv in #11712
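
The GPU energy monitoring added to trtllm-bench (#11397) amounts to integrating sampled power draw over the benchmark's wall-clock window. Below is a hedged, hardware-free sketch of that idea: an injectable `read_power_watts` callback stands in for an NVML power query, and samples are integrated with the trapezoidal rule. The function name, callback, and sampling scheme are illustrative assumptions, not the tool's actual code.

```python
import threading
import time


def measure_energy_joules(workload, read_power_watts, interval_s=0.05):
    """Run `workload()` while sampling power in a background thread.

    Collects (timestamp, watts) samples, then integrates them with the
    trapezoidal rule. Returns (workload_result, energy_joules).
    """
    samples = []
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            samples.append((time.monotonic(), read_power_watts()))
            stop.wait(interval_s)

    t = threading.Thread(target=sampler)
    t.start()
    try:
        result = workload()
    finally:
        stop.set()
        t.join()
        samples.append((time.monotonic(), read_power_watts()))

    # Trapezoidal rule: average power of each adjacent sample pair times
    # the time step between them.
    energy = sum(
        0.5 * (p0 + p1) * (t1 - t0)
        for (t0, p0), (t1, p1) in zip(samples, samples[1:])
    )
    return result, energy


# A constant 100 W reading over a ~0.2 s workload should integrate to
# roughly 20 J (timing jitter makes the exact value vary slightly).
_, joules = measure_energy_joules(lambda: time.sleep(0.2), lambda: 100.0)
print(f"{joules:.1f} J")
```

In a real tool the callback would query the GPU (e.g. via NVML power or energy counters) instead of returning a constant, and the reported joules would be paired with token counts to derive energy per token.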

New Contributors

Full Changelog: v1.3.0rc5...v1.3.0rc6
