NVIDIA/TensorRT-LLM v1.3.0rc10

Pre-release

Highlights

  • Model Support

    • Add Qwen 3.5 NVFP4 support (#12302)
    • Fuse all-reduce with norm for Nemotron-H models (#12410)
  • API

    • Add request priority support to the LLM API (#12362)
    • Change log prob behavior to stop normalizing by default (BREAKING) (#12366)
  • Feature

    • Add CuTe DSL single-pass multi-CTA cluster top-k (#12354)
    • Account for reusable KV cache blocks in micro-batch scheduler capacity scheduling (#11637)
    • Add raster-along-M/N support for blockscaled contiguous backbone kernels in CuteDSL MoE (#12079)
    • Add stride support for conv1d and fused_sigmoid_gating_delta_rule_update (#12442)
    • Add a safe allgather implementation with chunking (#12174)
    • Add dynamic SMEM block routing in MoE (#12456)
    • Optimize mamba_mixer2.py decode performance (#11843)
    • Add PDL support to CuTE DSL top-k kernels (#12506)
    • Add FlexKV support (#12512)
    • Add a KV cache-aware ADP router for prefix-affinity request routing (#12315)
  • Fix

    • Fix KV token estimation when ADP is enabled (#12099)
    • Fix Eagle MLA target with GQA draft support (#12171)
    • Fix Qwen 3.5 3D position ID handling (#12114)
    • Switch tests to TorchSampler and fix related bugs (#12200)
    • Use ceil_div for head and size sharding (#12441)
    • Remove redundant D2H synchronization to improve performance (#12445)
    • Fix parallel WAN VAE when return_dict=True (#12460)
    • Fix Triton resmooth kernel crashes on SM100f for large MoE grids (#12397)
    • Use a model-level warmup cache key for visual generation pipelines (#12516)
    • Add NVTX annotations in sampler.py (#12459)
    • Use extra_visual_gen_options to improve visual generation routing (#12487)
  • Documentation

    • Fix outdated code references in tech blogs 2, 3, 4, 8, 9, and 11 (#12338)
    • Document temperature-adjusted logprobs in the TRT backend (#12514)
    • Update Python coding guidelines (#12439)
  • Test & Infra

    • Save unittest subtest results periodically (#11850)
    • Fix the B200 aggregated CI perf test MPI issue (#12347)
    • Fix LoRA config handling when the provided config count is below requirements (#12409)
    • Add a unit test for load_state_dict safetensors fallback (#12408)
    • Replace the skipped TRTLLM NVFP4 test in the B300 CI list (#12454)
    • Fix the ltx-2 model checkpoint issue in VBench eval tests (#12463)
    • Fix the concurrent write issue in perf tests (#12484)
    • Update dependencies to align with the NGC PyTorch 26.02 stack (#12102)
    • Consolidate PyTransceiver code (#12342)
    • Add Eagle coverage with different input/output cases on Spark (#12520)
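The breaking log-prob change (#12366) means values returned by default are no longer normalized, so downstream code that compared or thresholded log probs may need updating. As a minimal illustration of the difference, assuming "normalization" here means converting raw scores into a proper log-probability distribution via log-softmax (this interpretation and the helper name are ours, not TensorRT-LLM's API; see PR #12366 for the actual semantics):

```python
import math

def log_softmax(scores):
    """Convert raw scores into log-probabilities (numerically stable)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

raw = [2.0, 1.0, 0.1]          # hypothetical raw per-token scores
normalized = log_softmax(raw)  # what a "normalized" setting would return

# Normalized values exponentiate to a distribution summing to 1;
# raw values generally do not.
assert abs(sum(math.exp(x) for x in normalized) - 1.0) < 1e-9
assert abs(sum(math.exp(x) for x in raw) - 1.0) > 1e-9
```

Either representation preserves the ranking of candidates; only absolute values shift, which is why code with hard-coded log-prob thresholds is the main thing to audit after upgrading.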

What's Changed

  • [None][infra] Waive 4 failed cases for main in post-merge 2611 by @ZhanruiSunCh in #12433
  • [None][test] Fix lora config less than required config number by @yufeiwu-nv in #12409
  • [https://nvbugs/5916151][fix] Unwaive test_fused_moe_w4a8_nvfp4_fp8[TRTLLM] by @xxi-nv in #12400
  • [https://nvbugs/5963423][fix] Fix kv token estimation when ADP is on. by @dominicshanshan in #12099
  • [TRTLLM-11229][infra] Save unittest subtest results periodically by @yiqingy0 in #11850
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12426
  • [https://nvbugs/5997090][fix] Fix B200 Aggregated CI Perf Test MPI Issue by @chenfeiz0326 in #12347
  • [TRTLLM-10407][perf] Add cute dsl single pass multi cta cluster topk by @limin2021 in #12354
  • [TRTLLM-11070][feat] Account for reusable KV cache blocks in micro batch scheduler capacity scheduling. by @SimengLiu-nv in #11637
  • [None][chore] Fixing guardword check by @pcastonguay in #12455
  • [None][infra] Waive 1 failed cases for main in post-merge 2610 by @ZhanruiSunCh in #12434
  • [None][feat] CuteDSL MOE: Add raster along M/N support for blockscaled contiguous backbone kernel by @liyuhannnnn in #12079
  • [None][fix] Switch tests to TorchSampler and fix bugs by @Funatiq in #12200
  • [TRTLLM-10061][fix] Use ceil_div for head/size calculations by @VALLIS-NERIA in #12441
  • [TRTLLM-10061][feat] Add stride support for conv1d and fused_sigmoid_gating_delta_rule_update by @VALLIS-NERIA in #12442
  • [None][fix] Eagle: MLA Target + GQA Draft by @IzzyPutterman in #12171
  • [None][doc] fix outdated code references in tech blogs 2, 3, 4, 8, 9, 11 by @schetlur-nv in #12338
  • [TRTLLM-11471][feat] Add safe version of allgather with chunking by @chienchunhung in #12174
  • [None][perf] add Dynamic SMEM block routing in MOE by @jiahanc in #12456
  • [TRTLLM-11544][feat] Add Qwen 3.5 supporting(NVFP4). by @nv-guomingz in #12302
  • [https://nvbugs/5997090][fix] Add Disagg Perf Test back as MPI Issue has been fixed by @chenfeiz0326 in #12458
  • [https://nvbugs/5841976][fix] Remove test_fused_moe_alltoall_fp4[DeepEP] from waives by @xxi-nv in #12405
  • [None][infra] Waive 2 failed cases for main in post-merge 2613 by @ZhanruiSunCh in #12473
  • [https://nvbugs/5866619][test] Add unit test for load_state_dict safetensors fallback by @crazydemo in #12408
  • [None][feat] Fuse all_reduce with norm for nemotron_h models by @Wanli-Jiang in #12410
  • [None][infra] Update CI allowed list by @yuanjingx87 in #12488
  • [https://nvbugs/6013562][test] Update waive by @xinhe-nv in #12492
  • [None][feat] Small optimizations for mamba_mixer2.py decode by @hnover-nv in #11843
  • [None][infra] Waive flaky DeepSeekV3Lite disagg serving test by @hyukn in #12494
  • [#11526][chore] AutoDeploy accuracy tests: Use Llama3.1-8B-Instruct official checkpoints by @galagam in #12285
  • [https://nvbugs/6007285][fix] Replace skipped TRTLLM NVFP4 test in B300 CI list by @xxi-nv in #12454
  • [https://nvbugs/5983390][fix] Remove redundant D2H sync to optimize perf by @hyukn in #12445
  • [https://nvbugs/5987470][fix] BREAKING: Do not normalize log probs by default by @achartier in #12366
  • [TRTLLM-11622][fix] fix parallel WAN vae when return_dict=True by @NVShreyas in #12460
  • [None][infra] Waive pre-merge failed 5090 test by @yuanjingx87 in #12486
  • [None][infra] Waive flaky DeepSeekV3Lite disagg serving test by @bo-nv in #12518
  • [None][chore] Fix ltx-2 Model Checkpoint Issue in VBench Eval Tests by @yibinl-nvidia in #12463
  • [https://nvbugs/5962591][fix] Fix Triton resmooth kernel crash on SM100f for large MoE grids by @Barry-Delaney in #12397
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12495
  • [None][doc] Document temperature-adjusted logprobs in TRT backend by @achartier in #12514
  • [None][feat] Add PDL support to CuTE DSL top-k kernels by @limin2021 in #12506
  • [None][infra] Waive 4 failed cases for main in post-merge 2617 by @ZhanruiSunCh in #12536
  • [None][doc] Update Python coding guidelines. by @hnover-nv in #12439
  • [#12290][fix] Qwen 3.5 fix 3d position ID handling by @bmarimuthu-nv in #12114
  • [TRTLLM-10820][infra] Update dependencies to align with NGC PyTorch 26.02 stack by @EmmaQiaoCh in #12102
  • [https://nvbugs/6015329][fix] Use model-level warmup cache key for visual gen pipelines by @karljang in #12516
  • [TRTLLM-9523][chore] PyTransceiver code consolidation by @Shixiaowei02 in #12342
  • [None][test] Add different input-output of eagle cases on Spark by @JennyLiu-nv in #12520
  • [https://nvbugs/6011086][fix] Fix Perf Test's Concurrent Write Issue by @chenfeiz0326 in #12484
  • [None][fix] NVTX annotation in sampler.py by @ixlmar in #12459
  • [https://nvbugs/5998489][feat] Adding support for request priority in LLM API by @pcastonguay in #12362
  • [None][feat] Add support for FlexKV by @pcastonguay in #12512
  • [None][feat] KV cache-aware ADP router for prefix-affinity request routing by @lancelly in #12315
  • [https://nvbugs/6008183][fix] Use extra_visual_gen_options to help de… by @JunyiXu-nv in #12487
  • [None][test] Waive a flaky test case on Dis-agg serving with Nemotron… by @nv-guomingz in #12578
  • [None][chore] Bump version to 1.3.0rc10 by @yuanjingx87 in #12511
  • [None][chore] Fixing guardword check by @VALLIS-NERIA in #12579
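The new request-priority support in the LLM API (#12362) lets high-priority requests be served ahead of earlier-arriving low-priority ones. A minimal sketch of that scheduling idea, using a plain heap rather than TensorRT-LLM's actual scheduler (the class, parameter names, and tie-breaking rule are illustrative assumptions, not the library's implementation):

```python
import heapq
from itertools import count

class PriorityScheduler:
    """Pop requests by priority (lower number = higher priority),
    breaking ties by arrival order so equal-priority requests stay FIFO."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # monotonic counter for FIFO tie-breaking

    def submit(self, request, priority=0):
        heapq.heappush(self._heap, (priority, next(self._arrival), request))

    def next_request(self):
        return heapq.heappop(self._heap)[2]

sched = PriorityScheduler()
sched.submit("background batch job", priority=10)
sched.submit("interactive chat turn", priority=0)
sched.submit("another batch job", priority=10)

assert sched.next_request() == "interactive chat turn"
assert sched.next_request() == "background batch job"  # FIFO within a priority
```

The arrival counter matters: without it, a heap of `(priority, request)` tuples would fall back to comparing the requests themselves on ties, which is both slow and order-unstable.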

Full Changelog: v1.3.0rc9...v1.3.0rc10
