Highlights
- Model Support
- API
- Feature
  - Add cache transfer setup for Mamba states (#10934)
  - Optimize MoE export by tracing with reduced experts and expanding graph (#11504)
  - Add new Helix kernels for MNNVL-based codepath (#11433)
  - Add `line_profiler` tool for host overhead analysis (#11232)
  - Enable multi-stream MoE; add multi-stream MLA attention (#11520)
  - Add MoE all-to-all paradigm (#10985)
  - Add support for multiple instances in the Triton backend with the PyTorch backend (#11153)
  - Add KV cache metrics to `MetricsCollector` for more Prometheus metrics (#11243)
  - Account for reusable KV cache blocks in capacity calculation (#11490)
  - Add CUDA graphs, torch compile, NVTX, and warmup for Visual Gen (#11554)
  - Make preprocessing async (#11459)
  - Split up `TorchSampler.Store` (#11566)
- Fix
  - Fix multimodal placeholder counts (#11461)
  - Add `cacheSaltID` property to `BlockKey` serialization (#11457)
  - Fix cache transceiver (#11409)
  - Declare the variable in the correct scope (#11066)
  - Fix spec-dec mode flag and related C++ requirements (#10996)
  - Fix Qwen3-VL-Dense/MoE accuracy drop (#11134)
  - Complete WAR for `popen` in QA env (#11214)
  - Improve error message for mismatched MPI world size (#11294)
  - Use the `torch_dtype` set by ModelOpt (#11525)
  - Fix silent MPI failures on models with custom tokenizers (#11399)
  - Fix Nemotron issues (#11425)
  - Fix pipeline parallelism + disaggregated serving (#11509)
  - Fix broken LLMAPI config (#11571)
  - Fix illegal memory access with Helix CP=64 (#11593)
  - Validate requests outside sampling loop (#11584)
  - Correct chunked prefill handling in `TorchSampler` (#11544)
  - Fix SpecDec sampling seed (#11081)
  - Prevent NIXL agent name collision in containerized disaggregated serving (#11552)
- Documentation
  - Add doc for TRTLLM AIGV initial release (#11489)
  - Update hardware support (#10719)
  - Add documentation on configuring CPU affinity in TRT-LLM (#10678)
  - Add warning about 2-model MTP deprecation (#11043)
  - Update media file paths in Skip Softmax blog (#11540)
  - Update TAVA architecture diagrams for visual gen flow and auto deploy flow (#11523)
  - Add Qwen3.5 and GLM 4.7 Flash to support matrix (#11594)
- Benchmark
  - Add ctx-only and gen-only disaggregated perf tests (#11361)
- Test & Infra
  - Add CUTEDSL MoE backend for DeepSeek R1 NVFP4 checkpoint in stress test (#10920)
  - Update MIG tests (#11014)
  - Fix Slurm job name (#11265)
  - Ensure `TorchSampler` does not sync (#11508)
  - Revert MoE unit tests refactor (unified `ConfigurableMoE` test framework) (#11532)
  - Re-upgrade GHA for blossom-ci workflow (#11483)
  - Stop using remotes in the Conan install build step (#11516)
  - Update PLC pipeline (#11547, #11597)
  - Fix testdb file for `l0_b200_multi_gpus_perf_sanity` (#11603)
  - Add `visual_gen` CODEOWNERS paths (#11606)
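One feature above adds a `line_profiler`-based tool for host overhead analysis (#11232). The tool's actual interface is not described in these notes; as a minimal, self-contained sketch of the same idea using only the standard library's `cProfile`, where `prepare_inputs` is a hypothetical stand-in for host-side preprocessing work:

```python
import cProfile
import io
import pstats


def prepare_inputs(n):
    # Hypothetical stand-in for host-side preprocessing
    # (tokenization, batching, tensor setup, etc.).
    return [i * i for i in range(n)]


# Profile only the host-side region of interest.
profiler = cProfile.Profile()
profiler.enable()
batch = prepare_inputs(10_000)
profiler.disable()

# Report the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(next(line for line in stream.getvalue().splitlines() if line.strip()))
```

The `line_profiler` package referenced by the PR goes one step further, attributing time to individual source lines rather than whole functions, which is what makes it useful for pinpointing host overhead.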
What's Changed
- [None][chore] Adjust waive to avoid sm parsing by @tburt-nv in #11518
- [None][chore] Optimize MOE export by tracing with reduced experts and expanding graph by @suyoggupta in #11504
- [#11170][fix] Fix for mm placeholder counts by @2ez4bz in #11461
- [None][feat] Add new helix kernels for MNNVL-based codepath by @brb-nv in #11433
- [TRTLLM-11016][fix] Add cacheSaltID property to BlockKey serialization code by @thorjohnsen in #11457
- [https://nvbugs/5880261][fix] fix cacheTransceiver by @chuangz0 in #11409
- [None][doc] Add doc for TRTLLM AIGV initial release by @chang-l in #11489
- [TRTLLM-10851][feat] Add line_profiler tool for host overhead analysis. by @hyukn in #11232
- [None][chore] Mass integration of release/1.2 - 4th by @dominicshanshan in #11500
- [None][feat] Use new index api, add block scale support, fix max_seq_len estimation, add flash mla support by @yizhang-nv in #11334
- [#11455][bug] Use the torch_dtype set by ModelOpt by @tcherckez-nvidia in #11525
- [#10345][perf] Enable multi-stream MOE for super. Also adds multi-stream MLA attn by @suyoggupta in #11520
- [TRTLLM-10030][test] ensure that TorchSampler does not sync by @ixlmar in #11508
- [None][revert] - Revert "[TRTLLM-9108][feat] refactor MoE unit tests: add unified ConfigurableMoE test framework" by @chzblych in #11532
- [None][fix] Better error message for mismatched MPI world size by @jthomson04 in #11294
- [#11109][feat] AutoDeploy: GLM 4.7 Flash Improvements by @bmarimuthu-nv in #11414
- [None][doc] Update media files path in Skip Softmax blog. by @bobboli in #11540
- [#11318][infra] AutoDeploy: Add fused rope kernel - triton_rope_on_interleaved_qk_inputs by @bmarimuthu-nv in #11327
- [None][chore] Waive failing pre-merge test by @brb-nv in #11551
- [None][chore] Waive moe fp4 test by @brb-nv in #11558
- [None][chore] Bump version to 1.3.0rc5 by @yuanjingx87 in #11557
- [TRTLLM-10845][feat] Add dynamic llmapi defaults system by @venkywonka in #11035
- [https://nvbugs/5888464][fix] Stop using remotes in the Conan install build step by @tburt-nv in #11516
- [None][chore] TAVA architecture diagram updates for visual gen flow and auto deploy flow by @yibinl-nvidia in #11523
- [TRTLLM-10064][feat] MoE all-to-all paradigm by @greg-kwasniewski1 in #10985
- [TRTLLM-8263][feat] Add ctx-only and gen-only Disagg Perf Tests by @chenfeiz0326 in #11361
- [TRTLLM-10037][chore] Re-upgrade GHA for blossom-ci workflow by @dpitman-nvda in #11483
- [None][feat] Add support for multi instances in Triton backend with pytorch backend by @achartier in #11153
- [None][fix] Fix silent MPI failures on models with custom tokenizers by @jthomson04 in #11399
- [None][infra] PLC pipeline update by @yuanjingx87 in #11547
- [TRTLLM-10827][feat] Add KV Cache metrics to MetricsCollector for more Prometheus metrics by @yijingl-nvidia in #11243
- [https://nvbugs/5880313][fix] Fix pp + disagg by @Tabrizian in #11509
- [None][infra] Waive unittest that consistently timed out by @yuanjingx87 in #11580
- [TRTLLM-1543][feat] Account for reusable KV cache blocks in capacity … by @SimengLiu-nv in #11490
- [None][feat] Visual Gen: add cuda graphs; torch compile; nvtx; warmup by @NVShreyas in #11554
- [TRTLLM-9040][perf] Make preprocessing async by @2ez4bz in #11459
- [#11440] [feat] AutoDeploy : Support Qwen3.5 by @bmarimuthu-nv in #11394
- [#11292][feat] use smg-grpc-proto package for gRPC proto definitions by @CatherineSue in #11578
- [None][doc] Add Qwen3.5, GLM 4.7 Flash to support matrix by @bmarimuthu-nv in #11594
- [None][feat] AutoDeploy: Add nemotron v2 acc test by @nvchenghaoz in #11429
- [#11569][fix] Fix broken LLMAPI config by @2ez4bz in #11571
- [None][chore] split up TorchSampler.Store by @ixlmar in #11566
- [None][fix] Read mamba_ssm_cache_dtype from HF config when set to auto by @tomeras91 in #11582
- [https://nvbugs/5914959][fix] Fix illegal memory access with Helix CP=64 by @brb-nv in #11593
- [#10243][feat] Add TRT-LLM attention backend to AutoDeploy by @MrGeva in #11430
- [TRTLLM-10857][chore] Move SaveHiddenStates spec dec mode to 1 model by @mikeiovine in #11241
- [TRTLLM-10197][feat] Cache Transfer Setup for Mamba States by @NVShreyas in #10934
- [TRTLLM-11069][fix] validate requests outside sampling loop by @ixlmar in #11584
- [None][fix] correct chunked prefill handling in TorchSampler by @ixlmar in #11544
- [None][feat] Add NVFP4 dynamic quantization support for visual_gen models by @chang-l in #11563
- [None][fix] Nemotron Super fix by @IzzyPutterman in #11425
- [None][fix] SpecDec: Sampling seed fix by @IzzyPutterman in #11081
- [None][chore] Waive failing post merge by @pcastonguay in #11600
- [None][fix] fix testdb file for l0_b200_multi_gpus_perf_sanity by @yuanjingx87 in #11603
- [https://nvbugs/5896216][fix] Prevent NIXL agent name collision in containerized disaggregated serving by @nv-yna in #11552
- [None][infra] add visual_gen codeowners paths by @venkywonka in #11606
- [None][infra] Waive unittest that timed out by @yuanjingx87 in #11605
- [None][infra] PLC pipeline update by @yuanjingx87 in #11597
- [None][chore] Waive failing WAN tests due to missing warmup_steps by @chang-l in #11617
- [None][chore] Waiving more tests by @pcastonguay in #11613
- [None][chore] Multi gpu sbsa waives by @pcastonguay in #11629
New Contributors
- @yijingl-nvidia made their first contribution in #11243
Full Changelog: v1.3.0rc4...v1.3.0rc5