NVIDIA/TensorRT-LLM v1.3.0rc5

Pre-release · 10 hours ago

Highlights

  • Model Support

    • Add support for Qwen3.5 with AutoDeploy (#11394)
    • Read mamba_ssm_cache_dtype from HF config when set to auto (#11582)
    • Add NVFP4 dynamic quantization support for visual_gen models (#11563)
  • API

    • Use new index API; add block scale support; fix max sequence length estimation; add flash MLA support (#11334)
    • Add dynamic LLMAPI defaults system (#11035)
    • Use smg-grpc-proto package for gRPC proto definitions (#11578)
    • Move SaveHiddenStates spec-dec mode to one model (#11241)
  • Feature

    • Add cache transfer setup for Mamba states (#10934)
    • Optimize MoE export by tracing with reduced experts and expanding graph (#11504)
    • Add new Helix kernels for MNNVL-based codepath (#11433)
    • Add line_profiler tool for host overhead analysis (#11232)
    • Enable multi-stream MoE; add multi-stream MLA attention (#11520)
    • Add MoE all-to-all paradigm (#10985)
    • Add support for multiple instances in the Triton backend with the PyTorch backend (#11153)
    • Add KV cache metrics to MetricsCollector for more Prometheus metrics (#11243)
    • Account for reusable KV cache blocks in capacity calculation (#11490)
    • Add CUDA graphs, torch compile, NVTX, and warmup for Visual Gen (#11554)
    • Make preprocessing async (#11459)
    • Split up TorchSampler.Store (#11566)
  • Fix

    • Fix multimodal placeholder counts (#11461)
    • Add cacheSaltID property to BlockKey serialization (#11457)
    • Fix cache transceiver (#11409)
    • Declare the variable in the correct scope (#11066)
    • Fix spec-dec mode flag and related C++ requirements (#10996)
    • Fix Qwen3-VL-Dense/MoE accuracy drop (#11134)
    • Complete the workaround for popen in the QA environment (#11214)
    • Improve error message for mismatched MPI world size (#11294)
    • Use the torch_dtype set by ModelOpt (#11525)
    • Fix silent MPI failures on models with custom tokenizers (#11399)
    • Fix Nemotron issues (#11425)
    • Fix pipeline parallelism + disaggregated serving (#11509)
    • Fix broken LLMAPI config (#11571)
    • Fix illegal memory access with Helix CP=64 (#11593)
    • Validate requests outside sampling loop (#11584)
    • Correct chunked prefill handling in TorchSampler (#11544)
    • Fix SpecDec sampling seed (#11081)
    • Prevent NIXL agent name collision in containerized disaggregated serving (#11552)
  • Documentation

    • Add doc for TRTLLM AIGV initial release (#11489)
    • Update hardware support (#10719)
    • Add documentation on configuring CPU affinity in TRT-LLM (#10678)
    • Add warning about 2-model MTP deprecation (#11043)
    • Update media file paths in Skip Softmax blog (#11540)
    • Update TAVA architecture diagrams for visual gen flow and auto deploy flow (#11523)
    • Add Qwen3.5 and GLM 4.7 Flash to support matrix (#11594)
  • Benchmark

    • Add ctx-only and gen-only disaggregated perf tests (#11361)
  • Test & Infra

    • Add CUTEDSL MoE backend for DeepSeek R1 NVFP4 checkpoint in stress test (#10920)
    • Update MIG tests (#11014)
    • Fix Slurm job name (#11265)
    • Ensure TorchSampler does not sync (#11508)
    • Revert MoE unit tests refactor: add unified ConfigurableMoE test framework (#11532)
    • Re-upgrade GHA for blossom-ci workflow (#11483)
    • Stop using remotes in the Conan install build step (#11516)
    • Update PLC pipeline (#11547, #11597)
    • Fix testdb file for l0_b200_multi_gpus_perf_sanity (#11603)
    • Add visual_gen CODEOWNERS paths (#11606)
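
Among the features above, #11490 accounts for reusable KV cache blocks in the capacity calculation. As an illustration of the general idea only, here is a minimal sketch; the function name, parameters, and formula are hypothetical and do not reflect TensorRT-LLM's actual implementation:

```python
# Hypothetical sketch: crediting reusable KV-cache blocks when estimating
# how many requests fit in the cache. Illustrative only, not the actual
# TensorRT-LLM logic.

def estimate_capacity(total_blocks: int,
                      blocks_per_request: int,
                      reusable_blocks: int) -> int:
    """Estimate how many requests fit in a paged KV cache.

    Blocks that can be reused (e.g. a shared prompt prefix already in the
    cache) do not need to be allocated again, so they are credited back to
    the effective pool before dividing by the per-request requirement.
    """
    if blocks_per_request <= 0:
        raise ValueError("blocks_per_request must be positive")
    # Cap the credit at the pool size so the estimate stays bounded.
    effective_pool = total_blocks + min(reusable_blocks, total_blocks)
    return effective_pool // blocks_per_request

# Example: a 1024-block pool with 64 blocks needed per request.
# Ignoring reuse gives 1024 // 64 = 16 requests; crediting 256 reusable
# blocks raises the estimate to (1024 + 256) // 64 = 20.
print(estimate_capacity(1024, 64, 0))    # → 16
print(estimate_capacity(1024, 64, 256))  # → 20
```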

Full Changelog: v1.3.0rc4...v1.3.0rc5
