NVIDIA/TensorRT-LLM v1.3.0rc10

Pre-release

Highlights

  • Model Support

    • Add Qwen 3.5 NVFP4 support (#12302)
    • Fuse all-reduce with norm for Nemotron-H models (#12410)
  • API

    • Add request priority support to the LLM API (#12362)
    • Change log prob behavior to stop normalizing by default (BREAKING) (#12366)
  • Feature

    • Add CuTe DSL single-pass multi-CTA cluster top-k (#12354)
    • Account for reusable KV cache blocks in micro-batch scheduler capacity scheduling (#11637)
    • Add raster-along-M/N support for blockscaled contiguous backbone kernels in CuteDSL MoE (#12079)
    • Add stride support for conv1d and fused_sigmoid_gating_delta_rule_update (#12442)
    • Add a safe allgather implementation with chunking (#12174)
    • Add dynamic SMEM block routing in MoE (#12456)
    • Optimize mamba_mixer2.py decode performance (#11843)
    • Add PDL support to CuTE DSL top-k kernels (#12506)
    • Add FlexKV support (#12512)
    • Add a KV cache-aware ADP router for prefix-affinity request routing (#12315)
  • Fix

    • Fix KV token estimation when ADP is enabled (#12099)
    • Fix Eagle MLA target with GQA draft support (#12171)
    • Fix Qwen 3.5 3D position ID handling (#12114)
    • Switch tests to TorchSampler and fix related bugs (#12200)
    • Use ceil_div for head and size sharding (#12441)
    • Remove redundant D2H synchronization to improve performance (#12445)
    • Fix parallel WAN VAE when return_dict=True (#12460)
    • Fix Triton resmooth kernel crashes on SM100f for large MoE grids (#12397)
    • Use a model-level warmup cache key for visual generation pipelines (#12516)
    • Add NVTX annotations in sampler.py (#12459)
    • Use extra_visual_gen_options to improve visual generation routing (#12487)
  • Documentation

    • Fix outdated code references in tech blogs 2, 3, 4, 8, 9, and 11 (#12338)
    • Document temperature-adjusted logprobs in the TRT backend (#12514)
    • Update Python coding guidelines (#12439)
  • Test & Infra

    • Save unittest subtest results periodically (#11850)
    • Fix the B200 aggregated CI perf test MPI issue (#12347)
    • Fix LoRA config handling when the provided config count is below requirements (#12409)
    • Add a unit test for load_state_dict safetensors fallback (#12408)
    • Replace the skipped TRTLLM NVFP4 test in the B300 CI list (#12454)
    • Fix the ltx-2 model checkpoint issue in VBench eval tests (#12463)
    • Fix the concurrent write issue in perf tests (#12484)
    • Update dependencies to align with the NGC PyTorch 26.02 stack (#12102)
    • Consolidate PyTransceiver code (#12342)
    • Add Eagle coverage with different input/output cases on Spark (#12520)
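The breaking log-prob change (#12366) means values returned by default are no longer normalized, so downstream code that compared or thresholded log probs may need updating. As a minimal illustration of the difference, assuming "normalization" here means converting raw scores into a proper log-probability distribution via log-softmax (this interpretation and the helper name are ours, not TensorRT-LLM's API; see PR #12366 for the actual semantics):

```python
import math

def log_softmax(scores):
    """Convert raw scores into log-probabilities (numerically stable)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

raw = [2.0, 1.0, 0.1]          # hypothetical raw per-token scores
normalized = log_softmax(raw)  # what a "normalized" setting would return

# Normalized values exponentiate to a distribution summing to 1;
# raw values generally do not.
assert abs(sum(math.exp(x) for x in normalized) - 1.0) < 1e-9
assert abs(sum(math.exp(x) for x in raw) - 1.0) > 1e-9
```

Either representation preserves the ranking of candidates; only absolute values shift, which is why code with hard-coded log-prob thresholds is the main thing to audit after upgrading.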

What's Changed

  • [None][infra] Waive 4 failed cases for main in post-merge 2611 by @ZhanruiSunCh in #12433
  • [None][test] Fix lora config less than required config number by @yufeiwu-nv in #12409
  • [https://nvbugs/5916151][fix] Unwaive test_fused_moe_w4a8_nvfp4_fp8[TRTLLM] by @xxi-nv in #12400
  • [https://nvbugs/5963423][fix] Fix kv token estimation when ADP is on. by @dominicshanshan in #12099
  • [TRTLLM-11229][infra] Save unittest subtest results periodically by @yiqingy0 in #11850
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12426
  • [https://nvbugs/5997090][fix] Fix B200 Aggregated CI Perf Test MPI Issue by @chenfeiz0326 in #12347
  • [TRTLLM-10407][perf] Add cute dsl single pass multi cta cluster topk by @limin2021 in #12354
  • [TRTLLM-11070][feat] Account for reusable KV cache blocks in micro batch scheduler capacity scheduling. by @SimengLiu-nv in #11637
  • [None][chore] Fixing guardword check by @pcastonguay in #12455
  • [None][infra] Waive 1 failed cases for main in post-merge 2610 by @ZhanruiSunCh in #12434
  • [None][feat] CuteDSL MOE: Add raster along M/N support for blockscaled contiguous backbone kernel by @liyuhannnnn in #12079
  • [None][fix] Switch tests to TorchSampler and fix bugs by @Funatiq in #12200
  • [TRTLLM-10061][fix] Use ceil_div for head/size calculations by @VALLIS-NERIA in #12441
  • [TRTLLM-10061][feat] Add stride support for conv1d and fused_sigmoid_gating_delta_rule_update by @VALLIS-NERIA in #12442
  • [None][fix] Eagle: MLA Target + GQA Draft by @IzzyPutterman in #12171
  • [None][doc] fix outdated code references in tech blogs 2, 3, 4, 8, 9, 11 by @schetlur-nv in #12338
  • [TRTLLM-11471][feat] Add safe version of allgather with chunking by @chienchunhung in #12174
  • [None][perf] add Dynamic SMEM block routing in MOE by @jiahanc in #12456
  • [TRTLLM-11544][feat] Add Qwen 3.5 supporting(NVFP4). by @nv-guomingz in #12302
  • [https://nvbugs/5997090][fix] Add Disagg Perf Test back as MPI Issue has been fixed by @chenfeiz0326 in #12458
  • [https://nvbugs/5841976][fix] Remove test_fused_moe_alltoall_fp4[DeepEP] from waives by @xxi-nv in #12405
  • [None][infra] Waive 2 failed cases for main in post-merge 2613 by @ZhanruiSunCh in #12473
  • [https://nvbugs/5866619][test] Add unit test for load_state_dict safetensors fallback by @crazydemo in #12408
  • [None][feat] Fuse all_reduce with norm for nemotron_h models by @Wanli-Jiang in #12410
  • [None][infra] Update CI allowed list by @yuanjingx87 in #12488
  • [https://nvbugs/6013562][test] Update waive by @xinhe-nv in #12492
  • [None][feat] Small optimizations for mamba_mixer2.py decode by @hnover-nv in #11843
  • [None][infra] Waive flaky DeepSeekV3Lite disagg serving test by @hyukn in #12494
  • [#11526][chore] AutoDeploy accuracy tests: Use Llama3.1-8B-Instruct official checkpoints by @galagam in #12285
  • [https://nvbugs/6007285][fix] Replace skipped TRTLLM NVFP4 test in B300 CI list by @xxi-nv in #12454
  • [https://nvbugs/5983390][fix] Remove redundant D2H sync to optimize perf by @hyukn in #12445
  • [https://nvbugs/5987470][fix] BREAKING: Do not normalize log probs by default by @achartier in #12366
  • [TRTLLM-11622][fix] fix parallel WAN vae when return_dict=True by @NVShreyas in #12460
  • [None][infra] Waive pre-merge failed 5090 test by @yuanjingx87 in #12486
  • [None][infra] Waive flaky DeepSeekV3Lite disagg serving test by @bo-nv in #12518
  • [None][chore] Fix ltx-2 Model Checkpoint Issue in VBench Eval Tests by @yibinl-nvidia in #12463
  • [https://nvbugs/5962591][fix] Fix Triton resmooth kernel crash on SM100f for large MoE grids by @Barry-Delaney in #12397
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12495
  • [None][doc] Document temperature-adjusted logprobs in TRT backend by @achartier in #12514
  • [None][feat] Add PDL support to CuTE DSL top-k kernels by @limin2021 in #12506
  • [None][infra] Waive 4 failed cases for main in post-merge 2617 by @ZhanruiSunCh in #12536
  • [None][doc] Update Python coding guidelines. by @hnover-nv in #12439
  • [#12290][fix] Qwen 3.5 fix 3d position ID handling by @bmarimuthu-nv in #12114
  • [TRTLLM-10820][infra] Update dependencies to align with NGC PyTorch 26.02 stack by @EmmaQiaoCh in #12102
  • [https://nvbugs/6015329][fix] Use model-level warmup cache key for visual gen pipelines by @karljang in #12516
  • [TRTLLM-9523][chore] PyTransceiver code consolidation by @Shixiaowei02 in #12342
  • [None][test] Add different input-output of eagle cases on Spark by @JennyLiu-nv in #12520
  • [https://nvbugs/6011086][fix] Fix Perf Test's Concurrent Write Issue by @chenfeiz0326 in #12484
  • [None][fix] NVTX annotation in sampler.py by @ixlmar in #12459
  • [https://nvbugs/5998489][feat] Adding support for request priority in LLM API by @pcastonguay in #12362
  • [None][feat] Add support for FlexKV by @pcastonguay in #12512
  • [None][feat] KV cache-aware ADP router for prefix-affinity request routing by @lancelly in #12315
  • [https://nvbugs/6008183][fix] Use extra_visual_gen_options to help de… by @JunyiXu-nv in #12487
  • [None][test] Waive a flaky test case on Dis-agg serving with Nemotron… by @nv-guomingz in #12578
  • [None][chore] Bump version to 1.3.0rc10 by @yuanjingx87 in #12511
  • [None][chore] Fixing guardword check by @VALLIS-NERIA in #12579
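The new request-priority support in the LLM API (#12362) lets high-priority requests be served ahead of earlier-arriving low-priority ones. A minimal sketch of that scheduling idea, using a plain heap rather than TensorRT-LLM's actual scheduler (the class, parameter names, and tie-breaking rule are illustrative assumptions, not the library's implementation):

```python
import heapq
from itertools import count

class PriorityScheduler:
    """Pop requests by priority (lower number = higher priority),
    breaking ties by arrival order so equal-priority requests stay FIFO."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # monotonic counter for FIFO tie-breaking

    def submit(self, request, priority=0):
        heapq.heappush(self._heap, (priority, next(self._arrival), request))

    def next_request(self):
        return heapq.heappop(self._heap)[2]

sched = PriorityScheduler()
sched.submit("background batch job", priority=10)
sched.submit("interactive chat turn", priority=0)
sched.submit("another batch job", priority=10)

assert sched.next_request() == "interactive chat turn"
assert sched.next_request() == "background batch job"  # FIFO within a priority
```

The arrival counter matters: without it, a heap of `(priority, request)` tuples would fall back to comparing the requests themselves on ties, which is both slow and order-unstable.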

Full Changelog: v1.3.0rc9...v1.3.0rc10
