Announcement Highlights:
- Model Support
- Features
- MoE trtllm backend kernel update (#5183)
- Add w4a8_mxfp4_fp8 quantization recipe (#4867)
- Enable overlap scheduler between draft forwards (#4802)
- Enable EPLB for existing MoE models (#5203)
- Add support for FP8 rowwise quantization for the TRT workflow (#4876)
- Large-scale EP (part 7: DeepEP integration) (#4792)
- Add multi-node support for Triton with pytorch backend (#5172)
- Implement model-agnostic one-engine eagle3 (#4778)
- Support headDim 256 for blackwell fmha kernels (#5164)
- Port customized kernels with public cutlass version (#5027)
- Optimize KV Cache Reuse for MLA (#4869)
- Add attention dp support to MTP relaxed acceptance (#5119)
- Add support for top groups and top-K bounds to trtllmGen MoE routing (#4063)
- Add TRTLLM Sampler log probs support (#4836); a usage sketch follows this list
- Enable Finalize + Allreduce + add + rmsnorm fusion (#4756)
- Port MakeDecodingBatchInputOutput to Python in TRTLLMSampler (#4828)
- Add multimodal hashing support (image hashing) (#4145)
- Include the executor's max batch size in CUDA graph batch size list (#4843)
- API
- Bug Fixes
- Stop drafting when we hit the draft model's max seq len (#4879)
- Unload XQA cubins early to avoid static lifetime (#5133)
- Fix OOM because of unnecessary mha workspace (#5056)
- Remove duplicated trust_remote_code knob from trtllm-serve (#5143)
- Fix Llama-3_3-Nemotron-Super-49B-v1 FP8 accuracy threshold configs (#4961)
- Fix AutoDeploy for PyTorch 25.05 dependency upgrade (#5106)
- Fix XQA not being enabled when history_length < kMinHistoryTokensPerBlock (#4264)
- Fix cuda driver link issue with driver version less than 12.3 (#5025)
- Fix W4A8 weight loading error in WInt4AFP8FusedMoEMethod (#5026)
- Fix warmup phase batch size out of range (#4986)
- Benchmark
- Performance
- Infrastructure
- Documentation
- Known Issues
- Multi-GPU model support on RTX Pro 6000
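
The TRTLLM Sampler log probs support (#4836) highlighted above is exercised through the LLM API's sampling options. Below is a minimal sketch, assuming the `logprobs` field on `SamplingParams` and the `logprobs` attribute on the completion output behave as in earlier LLM API releases; these names are assumptions, not confirmed by this changelog.

```python
from tensorrt_llm import LLM, SamplingParams

# Assumed usage: logprobs=N asks the sampler to return the top-N
# per-token log probabilities alongside the generated text.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=32, logprobs=2)

for output in llm.generate(["The capital of France is"], params):
    completion = output.outputs[0]
    print(completion.text)
    print(completion.logprobs)  # per-token log-prob entries, if the backend returns them
```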
What's Changed
- ci: [nvbugs/5280806] Unwaive unittests/_torch. by @yuxianq in #4951
- fix: Fix warmup phase batch size out of range. by @hyukn in #4986
- [nvbug/5314469][feat] Include the executor's max batch size in CUDA g… by @mikeiovine in #4843
- [nvbug 5283506] fix: Fix spec decode triton test by @pcastonguay in #4845
- chore: Refine weight prefetching. by @yuxianq in #4893
- chore: Change cutlass version back to 4.0 by @hyukn in #5041
- [TRTLLM-5007][feat] Add multimodal hashing support (image hashing) by @chang-l in #4145
- [https://nvbugs/5332927] Waive new tests by @tburt-nv in #5051
- [TRTLLM-5518] doc: Adding disaggregated serving section to models doc by @pcastonguay in #4877
- feat: port MakeDecodingBatchInputOutput to python in TRTLLMSampler by @dcampora in #4828
- test: add more disaggregated serving tests into QA testlist by @StanleySun639 in #5036
- perf: Removing initializing ptuning buffers to zero by @pcastonguay in #4915
- chore: Waive CI failure. by @SimengLiu-nv in #5069
- CI: waive test_ad_build_small_multi by @QiJune in #5071
- [fix] Fix W4A8 weight loading error in WInt4AFP8FusedMoEMethod by @xiaoweiw-nv in #5026
- fix cuda driver link issue with driver version less than 12.3 by @dongxuy04 in #5025
- Waive L0 test by @yiqingy0 in #5067
- [nvbug 5325284][fix] Increase Nemotron-H warmup request robustness by @tomeras91 in #4954
- Waive L0 test by @yiqingy0 in #5077
- chore: cleanup GDS Cmake interface by @achartier in #4928
- fix: pytorch_backend_config is deprecated in update_llm_args_with_extra_dict. by @yuxianq in #4890
- [TRTLLM-3927] [feat] Finalize + Allreduce + add + rmsnorm fusion by @zongfeijing in #4756
- [CI] waive failing L0 test by @liji-nv in #5089
- Mxfp8xmxfp4 by @Tracin in #4978
- [AutoDeploy] Merge Feature Branch Week 3 by @lucaslie in #5054
- test: add unit tests for Llama4 min_latency code by @nvpohanh in #4980
- Doc: Add info about stop words appearing in output by @Linda-Stadter in #4956
- [fix] Fix test_attention_mla by @jinyangyuan-nvidia in #5084
- CI: Allow run by @IzzyPutterman in #5101
- [fix] Unwaive test_llama_eagle3 by @mikeiovine in #5042
- fix: XQA is not enabled when history_length < kMinHistoryTokensPerBlock. by @bobboli in #4264
- test: conditional disagg and cache aware balancing for deepseek v3 by @zhengd-nv in #4522
- [https://nvbugspro.nvidia.com/bug/5332927][fix] Fix the bug in the routing unit test by @ChristinaZ in #5065
- infra: Add timeout and retry for wget in docker image build by @ZhanruiSunCh in #5035
- Waive L0 tests by @yiqingy0 in #5111
- chore: Merge remaining changes from feat/large-ep branch to main by @syuoni in #5039
- [TRTLLM-4995][feat] TRTLLM Sampler log probs support by @dcampora in #4836
- chore: bump version to 0.21.0rc2 by @ZhanruiSunCh in #5112
- test: add more llama_v3.3_70b cases in perf test by @ruodil in #4979
- [fix] Fix llama4 min latency by @liji-nv in #5117
- [TRTLLM-5082] - Add a bot run option for detailed logs by @yiqingy0 in #4390
- test: skip disaggregated tests on arm by @xinhe-nv in #5070
- [chore] 2025-06-10 update allowlist by @tburt-nv in #5102
- [TRTLLM-5581][infra] Update Module Owners by @poweiw in #5052
- chore: rename IOFormatter to BaseCacheFormatter by @zhengd-nv in #5068
- fix: limit process pool size when prefetching by @zhengd-nv in #5088
- Use backend to replace macro to control enablement of MNNVL all reduce by @HuiGao-NV in #4635
- Solve underallocation in VSWA+/VGQA by @netanel-haber in #4667
- [nvbugs/5331013] fix AutoDeploy for PyTorch 25.05 dependency upgrade by @lucaslie in #5106
- test(perf): Add remaining Llama-Nemotron perftests (nano, super, ultra) + extras ✨ by @venkywonka in #5066
- Fix Llama-3_3-Nemotron-Super-49B-v1 FP8 accuracy threshold configs by @moraxu in #4961
- update the free_gpu_mem_fraction for H100 qwen3 qa test by @byshiue in #5114
- [test] Use LLM API for Nemotron-H correctness test by @tomeras91 in #5097
- test: set enable_attention_dp to False for non-deepseek models and add more cases for llama_v3.1/3.3 70b fp8 models by @ruodil in #5149
- [TRTLLM-4932] Add Llama-3.1-Nemotron-Nano-8B-v1-FP8 accuracy tests by @moraxu in #4933
- Fix logprobs issues. by @dcampora in #5136
- chore: fix typo in tests by @lfr-0531 in #5092
- [fix] Do not reuse dummy request KVCache by @liji-nv in #4804
- infra: upload imageTag info to artifactory and add ngc_staging to save ngc image by @ZhanruiSunCh in #4764
- doc: fix invalid links for trtllm-serve doc by @nv-guomingz in #5145
- test: waive the NIXL related tests by @Shixiaowei02 in #5153
- enh(doc): Add `ci-overview` in `docs/source/reference/` by @venkywonka in #5137
- doc: Added documentation for enable_trtllm_sampler by @dcampora in #4990 (see the config sketch after this list)
- fix: remove duplicated trust_remote_code knob from trtllm-serve by @nv-guomingz in #5143
- fix: https://nvbugs/5298661 by @nv-guomingz in #5022
- fix: Updates to yarn implementation by @brb-nv in #5105
- Move allreduce_strategy from committed api to reference by @HuiGao-NV in #5147
- [nvbug/5334370][fix] Fix one model EAGLE3 by @mikeiovine in #5134
- [fix][test] report individual unittests results to jenkins by @omera-nv in #5116
- chore: Include prompt_token_ids only for context-only disagg requests by @pcastonguay in #5055
- None: fix OOM because of unnecessary mha workspace by @ttyio in #5056
- [feat] trtllmGen MoE routing: added support for top groups and top K bounds by @MatthiasKohl in #4063
- [TRTLLM-5278][feat] Add attention dp support to MTP relaxed acceptance by @lfr-0531 in #5119
- fix: [nvbugs/5324229] Fix broken WInt4AFP8FusedMoEMethod since FusedMoE refactor. by @yuxianq in #4930
- [feat] Optimize KV Cache Reuse for MLA by @zhhuang-nv in #4869
- test: add more cases for rtx_pro_6000_se and add option kv_cache_dtype in perf test by @ruodil in #5083
- tests: update tests for b200 by @xinhe-nv in #5180
- [fix]: Fall back to HMAC to Avoid IPC Serialization Churn by @yibinl-nvidia in #5074
- [fix] Reenable test return logits by @dcampora in #5160
- Add two MTP disaggregated test by @Tabrizian in #4546
- [test] Update timeout params in QA test list by @crazydemo in #5124
- chore: gracefully exit disagg process in tests; better startup and logging by @zhengd-nv in #5109
- [fix] Fix comment to pass guardwords check by @MatthiasKohl in #5191
- [nvbug 5333996 ][fix] Unload XQA cubins early to avoid static lifetime by @lowsfer in #5133
- refactoring: port customized kernels with public cutlass version by @yunruis in #5027
- refactor [BREAKING CHANGE]: remove the redundant use_kv_cache field from PytorchConfig by @nv-guomingz in #5031
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5178
- feat: Basic skeleton for Gemma3 VLM by @brb-nv in #5108
- add doc for open-sourced cutlass kernels by @yunruis in #5194
- fix: fix license bug by @yunruis in #5200
- UCXX uses only the ucp_feature_tag to avoid certain issues on specific platforms by @chuangz0 in #4994
- CI: move all test cases of TensorRT backend into post merge by @QiJune in #5186
- [https://nvbugspro.nvidia.com/bug/5295470] support headDim 256 for blackwell fmha kernels by @PerkzZheng in #5164
- [nvbug/5319281][fix] Stop drafting when we hit the draft model's max seq len by @mikeiovine in #4879
- [feat] Implement model-agnostic one-engine eagle3 by @nv-yilinf in #4778
- fix: Fix waive list by @syuoni in #5205
- feat: add multi-node support for Triton with pytorch backend by @achartier in #5172
- optimize memset before alltoall communication by @dongxuy04 in #5188
- refactor [BREAKING CHANGE]: enhance the llm args pytorch config part 3(torch_compile_config) by @nv-guomingz in #5032
- feat: DeepSeek R1 min-latency optimization round 3: add router GEMM, fused-A GEMM, PDL by @yunruis in #4560
- refactor: Speculative decoding buffers by @Funatiq in #5091
- feat: large-scale EP(part 7: DeepEP integration) by @yuantailing in #4792
- linting(python): Enable ruff on more files (wave 1/N) by @2ez4bz in #5140
- feat: Add support for fp8 rowwise quantization by @achartier in #4876
- chore: improve disagg test failure detection by @ixlmar in #4738
- perf: avoid dynamic import overhead in is_llm_response with duck typing by @tongyuantongyu in #5110
- feat: Enable EPLB to existing MoE models by @syuoni in #5203
- feat: Support post_proc for bench by @kaiyux in #5122
- fix: fix cuda graph max batch size for spec decoding cases. by @lfr-0531 in #5076
- [fix][test] Speedup Nemotron NAS unittests by @omera-nv in #5202
- use cu for fmha_v2 by @qsang-nv in #4694
- [TRTLLM-4983] feat: enable overlap scheduler between draft forwards by @lfr-0531 in #4802
- Enable trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA in PyT flow by @amitz-nv in #5130
- fix: build_config in TorchLlmArgs and avoid arbitrary args by @Superjomn in #4972
- [fix] Fix Llama4 min-latency import error by @nv-yilinf in #5209
- test: Add json_mode_eval for guided decoding evaluation by @syuoni in #5179
- test: add more cases for llama_v3.3/3.1 70b fp8 and set enable_attention_dp to false to non-deepseek models by @ruodil in #5155
- test: add llama4 models for perf test by @ruodil in #5187
- test: Add fixture to skip tests based on MPI world size by @yizhang-nv in #5028
- feat: Add w4a8_mxfp4_fp8 quantization recipe. by @Tracin in #4867
- [Stress test] Add DeepSeek-R1 stress test by @Wanli-Jiang in #5033
- refactor: Scheduling based on KV cache state by @Funatiq in #4865
- use file lock to avoid port conflict by @chuangz0 in #5123
- feat: MoE trtllm backend kernel update by @rosenrodt in #5183
- refactor: remove decoder request from decoder interface by @Funatiq in #5129
- Waive L0 tests by @yiqingy0 in #5233
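
The enable_trtllm_sampler documentation entry (#4990) above concerns the PyTorch-backend sampler. The sketch below is hypothetical: it assumes the flag is accepted as a top-level LLM argument in this release (the older pytorch_backend_config path is noted as deprecated in #4890), so check the documentation added in #4990 for the exact interface.

```python
from tensorrt_llm import LLM

# Hypothetical: enable the TRTLLM sampler for the PyTorch backend.
# Verify the exact argument name against the enable_trtllm_sampler docs (#4990).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_trtllm_sampler=True,
)
print(llm.generate(["Hello, world"])[0].outputs[0].text)
```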
New Contributors
- @Linda-Stadter made their first contribution in #4956
- @ttyio made their first contribution in #5056
- @MatthiasKohl made their first contribution in #4063
- @yuantailing made their first contribution in #4792
- @2ez4bz made their first contribution in #5140
Full Changelog: v0.21.0rc1...v0.21.0rc2