NVIDIA/TensorRT-LLM v0.21.0rc2


Announcement Highlights:

  • Model Support
  • Features
    • MoE trtllm backend kernel update (#5183)
    • Add w4a8_mxfp4_fp8 quantization recipe. (#4867)
    • Enable overlap scheduler between draft forwards (#4802)
    • Enable EPLB to existing MoE models (#5203)
    • Add support for FP8 rowwise quantization for the TRT workflow (#4876)
    • Large-scale EP (part 7: DeepEP integration) (#4792)
    • Add multi-node support for Triton with pytorch backend (#5172)
    • Implement model-agnostic one-engine eagle3 (#4778)
    • Support headDim 256 for Blackwell FMHA kernels (#5164)
    • Port customized kernels with public cutlass version (#5027)
    • Optimize KV Cache Reuse for MLA (#4869)
    • Add attention dp support to MTP relaxed acceptance (#5119)
    • trtllmGen MoE routing: added support for top groups and top K bounds (#4063)
    • Add TRTLLM Sampler log probs support (#4836)
    • Enable Finalize + Allreduce + add + rmsnorm fusion (#4756)
    • Port MakeDecodingBatchInputOutput to Python in TRTLLMSampler (#4828)
    • Add multimodal hashing support (image hashing) (#4145)
    • Include the executor's max batch size in CUDA graph batch size list (#4843)
  • API
    • Remove decoder request from decoder interface (#5129)
    • Enhance the LLM args PyTorch config, part 3 (torch_compile_config) (#5032)
    • Remove the redundant use_kv_cache field from PytorchConfig (#5031)
    • Move allreduce_strategy from committed api to reference (#5147)
  • Bug Fixes
    • Stop drafting when we hit the draft model's max seq len (#4879)
    • Unload XQA cubins early to avoid static lifetime (#5133)
    • Fix OOM because of unnecessary mha workspace (#5056)
    • Remove duplicated trust_remote_code knob from trtllm-serve (#5143)
    • Fix Llama-3_3-Nemotron-Super-49B-v1 FP8 accuracy threshold configs (#4961)
    • Fix AutoDeploy for PyTorch 25.05 dependency upgrade (#5106)
    • Fix XQA not being enabled when history_length < kMinHistoryTokensPerBlock (#4264)
    • Fix cuda driver link issue with driver version less than 12.3 (#5025)
    • Fix W4A8 weight loading error in WInt4AFP8FusedMoEMethod (#5026)
    • Fix warmup phase batch size out of range. (#4986)
  • Benchmark
    • Enable trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA in PyT flow (#5130)
    • Support post_proc for bench (#5122)
  • Performance
    • Avoid dynamic import overhead in is_llm_response with duck typing (#5110)
    • Remove zero-initialization of ptuning buffers (#4915)
  • Infrastructure
    • Move all test cases of TensorRT backend into post merge (#5186)
    • Upload imageTag info to Artifactory and add ngc_staging to save the NGC image (#4764)
    • Add a bot run option for detailed logs (#4390)
    • Add timeout and retry for wget in docker image build (#5035)
    • Change cutlass version back to 4.0 (#5041)
  • Documentation
    • Fix invalid links for trtllm-serve doc (#5145)
    • Added documentation for enable_trtllm_sampler (#4990); see the usage sketch after this highlights list
    • Add disaggregated serving section to models doc (#4877)
  • Known Issues
    • Multi-GPU model support on RTX Pro 6000
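
Several of the highlighted items touch the PyTorch-workflow LLM API (the enable_trtllm_sampler documentation, TRTLLM Sampler log-probs support, CUDA graph batch sizes). For orientation, here is a minimal generation sketch using the public LLM API. The enable_trtllm_sampler line is an assumption and is left commented out, since the exact placement of that knob has moved between releases; refer to the documentation added in #4990 for the current form.

```python
# Minimal sketch of the PyTorch-workflow LLM API referenced by several items above.
# The generate() flow follows the standard quickstart; enable_trtllm_sampler is an
# assumption here (its exact placement has differed across releases).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # any Hugging Face checkpoint or local path
    # enable_trtllm_sampler=True,  # assumption: knob documented in #4990; check current docs
)

params = SamplingParams(max_tokens=32, temperature=0.8, top_p=0.95)
outputs = llm.generate(["TensorRT-LLM is"], params)
for output in outputs:
    print(output.outputs[0].text)
```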

What's Changed

  • ci: [nvbugs/5280806] Unwaive unittests/_torch. by @yuxianq in #4951
  • fix: Fix warmup phase batch size out of range. by @hyukn in #4986
  • [nvbug/5314469][feat] Include the executor's max batch size in CUDA g… by @mikeiovine in #4843
  • [nvbug 5283506] fix: Fix spec decode triton test by @pcastonguay in #4845
  • chore: Refine weight prefetching. by @yuxianq in #4893
  • chore: Change cutlass version back to 4.0 by @hyukn in #5041
  • [TRTLLM-5007][feat] Add multimodal hashing support (image hashing) by @chang-l in #4145
  • [https://nvbugs/5332927] Waive new tests by @tburt-nv in #5051
  • [TRTLLM-5518] doc: Adding disaggregated serving section to models doc by @pcastonguay in #4877
  • feat: port MakeDecodingBatchInputOutput to python in TRTLLMSampler by @dcampora in #4828
  • test: add more disaggregated serving tests into QA testlist by @StanleySun639 in #5036
  • perf: Removing initializing ptuning buffers to zero by @pcastonguay in #4915
  • chore: Waive CI failure. by @SimengLiu-nv in #5069
  • CI: waive test_ad_build_small_multi by @QiJune in #5071
  • [fix] Fix W4A8 weight loading error in WInt4AFP8FusedMoEMethod by @xiaoweiw-nv in #5026
  • fix cuda driver link issue with driver version less than 12.3 by @dongxuy04 in #5025
  • Waive L0 test by @yiqingy0 in #5067
  • [nvbug 5325284][fix] Increase Nemotron-H warmup request robustness by @tomeras91 in #4954
  • Waive L0 test by @yiqingy0 in #5077
  • chore: cleanup GDS Cmake interface by @achartier in #4928
  • fix: pytorch_backend_config is deprecated in update_llm_args_with_extra_dict. by @yuxianq in #4890
  • [TRTLLM-3927] [feat] Finalize + Allreduce + add + rmsnorm fusion by @zongfeijing in #4756
  • [CI] waive failing L0 test by @liji-nv in #5089
  • Mxfp8xmxfp4 by @Tracin in #4978
  • [AutoDeploy] Merge Feature Branch Week 3 by @lucaslie in #5054
  • test: add unit tests for Llama4 min_latency code by @nvpohanh in #4980
  • Doc: Add info about stop words appearing in output by @Linda-Stadter in #4956
  • [fix] Fix test_attention_mla by @jinyangyuan-nvidia in #5084
  • CI: Allow run by @IzzyPutterman in #5101
  • [fix] Unwaive test_llama_eagle3 by @mikeiovine in #5042
  • fix: XQA is not enabled when history_length < kMinHistoryTokensPerBlock. by @bobboli in #4264
  • test: conditional disagg and cache aware balancing for deepseek v3 by @zhengd-nv in #4522
  • [https://nvbugspro.nvidia.com/bug/5332927][fix] Fix the bug in the routing unit test by @ChristinaZ in #5065
  • infra: Add timeout and retry for wget in docker image build by @ZhanruiSunCh in #5035
  • Waive L0 tests by @yiqingy0 in #5111
  • chore: Merge remaining changes from feat/large-ep branch to main by @syuoni in #5039
  • [TRTLLM-4995][feat] TRTLLM Sampler log probs support by @dcampora in #4836
  • chore: bump version to 0.21.0rc2 by @ZhanruiSunCh in #5112
  • test: add more llama_v3.3_70b cases in perf test by @ruodil in #4979
  • [fix] Fix llama4 min latency by @liji-nv in #5117
  • [TRTLLM-5082] - Add a bot run option for detailed logs by @yiqingy0 in #4390
  • test: skip disaggregated tests on arm by @xinhe-nv in #5070
  • [chore] 2025-06-10 update allowlist by @tburt-nv in #5102
  • [TRTLLM-5581][infra] Update Module Owners by @poweiw in #5052
  • chore: rename IOFormatter to BaseCacheFormatter by @zhengd-nv in #5068
  • fix: limit process pool size when prefetching by @zhengd-nv in #5088
  • Use backend to replace macro to control enablement of MNNVL all reduce by @HuiGao-NV in #4635
  • Solve underallocation in VSWA+/VGQA by @netanel-haber in #4667
  • [nvbugs/5331013] fix AutoDeploy for PyTorch 25.05 dependency upgrade by @lucaslie in #5106
  • test(perf): Add remaining Llama-Nemotron perftests (nano, super, ultra) + extras ✨ by @venkywonka in #5066
  • Fix Llama-3_3-Nemotron-Super-49B-v1 FP8 accuracy threshold configs by @moraxu in #4961
  • update the free_gpu_mem_fraction for H100 qwen3 qa test by @byshiue in #5114
  • [test] Use LLM API for Nemotron-H correctness test by @tomeras91 in #5097
  • test: set enable_attention_dp to False for non-deepseek models and add more cases for llama_v3.1/3.3 70b fp8 models by @ruodil in #5149
  • [TRTLLM-4932] Add Llama-3.1-Nemotron-Nano-8B-v1-FP8 accuracy tests by @moraxu in #4933
  • Fix logprobs issues. by @dcampora in #5136
  • chore: fix typo in tests by @lfr-0531 in #5092
  • [fix] Do not reuse dummy request KVCache by @liji-nv in #4804
  • infra: upload imageTag info to artifactory and add ngc_staging to save ngc image by @ZhanruiSunCh in #4764
  • doc: fix invalid links for trtllm-serve doc by @nv-guomingz in #5145
  • test: waive the NIXL related tests by @Shixiaowei02 in #5153
  • enh(doc): Add ci-overview in docs/source/reference/ by @venkywonka in #5137
  • doc: Added documentation for enable_trtllm_sampler. by @dcampora in #4990
  • fix: remove duplicated trust_remote_code knob from trtllm-serve by @nv-guomingz in #5143
  • fix: https://nvbugs/5298661 by @nv-guomingz in #5022
  • fix: Updates to yarn implementation by @brb-nv in #5105
  • Move allreduce_strategy from committed api to reference by @HuiGao-NV in #5147
  • [nvbug/5334370][fix] Fix one model EAGLE3 by @mikeiovine in #5134
  • [fix][test] report individual unittests results to jenkins by @omera-nv in #5116
  • chore: Include prompt_token_ids only for context-only disagg requests by @pcastonguay in #5055
  • None: fix OOM because of unnecessary mha workspace by @ttyio in #5056
  • [feat] trtllmGen MoE routing: added support for top groups and top K bounds by @MatthiasKohl in #4063
  • [TRTLLM-5278][feat] Add attention dp support to MTP relaxed acceptance by @lfr-0531 in #5119
  • fix: [nvbugs/5324229] Fix broken WInt4AFP8FusedMoEMethod since FusedMoE refactor. by @yuxianq in #4930
  • [feat] Optimize KV Cache Reuse for MLA by @zhhuang-nv in #4869
  • test: add more cases for rtx_pro_6000_se and add option kv_cache_dtype in perf test by @ruodil in #5083
  • tests: update tests for b200 by @xinhe-nv in #5180
  • [fix]: Fall back to HMAC to Avoid IPC Serialization Churn by @yibinl-nvidia in #5074
  • [fix] Reenable test return logits by @dcampora in #5160
  • Add two MTP disaggregated test by @Tabrizian in #4546
  • [test] Update timeout params in QA test list by @crazydemo in #5124
  • chore: gracefully exit disagg process in tests; better startup and logging by @zhengd-nv in #5109
  • [fix] Fix comment to pass guardwords check by @MatthiasKohl in #5191
  • [nvbug 5333996 ][fix] Unload XQA cubins early to avoid static lifetime by @lowsfer in #5133
  • refactoring: port customized kernels with public cutlass version by @yunruis in #5027
  • refactor [BREAKING CHANGE]: remove the redundant use_kv_cache field from PytorchConfig by @nv-guomingz in #5031
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5178
  • feat: Basic skeleton for Gemma3 VLM by @brb-nv in #5108
  • add doc for open-sourced cutlass kernels by @yunruis in #5194
  • fix: fix license bug by @yunruis in #5200
  • UCXX uses only the ucp_feature_tag to avoid certain issues on specific platforms by @chuangz0 in #4994
  • CI: move all test cases of TensorRT backend into post merge by @QiJune in #5186
  • [https://nvbugspro.nvidia.com/bug/5295470] support headDim 256 for blackwell fmha kernels by @PerkzZheng in #5164
  • [nvbug/5319281][fix] Stop drafting when we hit the draft model's max seq len by @mikeiovine in #4879
  • [feat] Implement model-agnostic one-engine eagle3 by @nv-yilinf in #4778
  • fix: Fix waive list by @syuoni in #5205
  • feat: add multi-node support for Triton with pytorch backend by @achartier in #5172
  • optimize memset before alltoall communication by @dongxuy04 in #5188
  • refactor [BREAKING CHANGE]: enhance the llm args pytorch config part 3(torch_compile_config) by @nv-guomingz in #5032
  • Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL by @yunruis in #4560
  • refactor: Speculative decoding buffers by @Funatiq in #5091
  • feat: large-scale EP(part 7: DeepEP integration) by @yuantailing in #4792
  • linting(python): Enable ruff on more files (wave 1/N) by @2ez4bz in #5140
  • feat: Add support for fp8 rowwise quantization by @achartier in #4876
  • chore: improve disagg test failure detection by @ixlmar in #4738
  • perf: avoid dynamic import overhead in is_llm_response with duck typing by @tongyuantongyu in #5110
  • feat: Enable EPLB to existing MoE models by @syuoni in #5203
  • feat: Support post_proc for bench by @kaiyux in #5122
  • fix: fix cuda graph max batch size for spec decoding cases. by @lfr-0531 in #5076
  • [fix][test] Speedup Nemotron NAS unittests by @omera-nv in #5202
  • use cu for fmha_v2 by @qsang-nv in #4694
  • [TRTLLM-4983] feat: enable overlap scheduler between draft forwards by @lfr-0531 in #4802
  • Enable trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA in PyT flow by @amitz-nv in #5130
  • fix: build_config in TorchLlmArgs and avoid arbitrary args by @Superjomn in #4972
  • [fix] Fix Llama4 min-latency import error by @nv-yilinf in #5209
  • test: Add json_mode_eval for guided decoding evaluation by @syuoni in #5179
  • test: add more cases for llama_v3.3/3.1 70b fp8 and set enable_attention_dp to false to non-deepseek models by @ruodil in #5155
  • test: add llama4 models for perf test by @ruodil in #5187
  • test: Add fixture to skip tests based on MPI world size by @yizhang-nv in #5028
  • feat: Add w4a8_mxfp4_fp8 quantization recipe. by @Tracin in #4867
  • [Stress test] Add DeepSeek-R1 stress test by @Wanli-Jiang in #5033
  • refactor: Scheduling based on KV cache state by @Funatiq in #4865
  • use file lock to avoid port conflict by @chuangz0 in #5123
  • feat: MoE trtllm backend kernel update by @rosenrodt in #5183
  • refactor: remove decoder request from decoder interface by @Funatiq in #5129
  • Waive L0 tests by @yiqingy0 in #5233

New Contributors

Full Changelog: v0.21.0rc1...v0.21.0rc2
