NVIDIA/TensorRT-LLM v0.21.0rc2


Announcement Highlights:

  • Model Support
  • Features
    • MoE trtllm backend kernel update (#5183)
    • Add w4a8_mxfp4_fp8 quantization recipe. (#4867)
    • Enable overlap scheduler between draft forwards (#4802)
    • Enable EPLB to existing MoE models (#5203)
    • Add support for FP8 rowwise quantization for the TRT workflow (#4876)
    • Large-scale EP (part 7: DeepEP integration) (#4792)
    • Add multi-node support for Triton with pytorch backend (#5172)
    • Implement model-agnostic one-engine eagle3 (#4778)
    • Support headDim 256 for Blackwell FMHA kernels (#5164)
    • Port customized kernels with public cutlass version (#5027)
    • Optimize KV Cache Reuse for MLA (#4869)
    • Add attention dp support to MTP relaxed acceptance (#5119)
    • trtllmGen MoE routing: added support for top groups and top K bounds (#4063)
    • Add TRTLLM Sampler log probs support (#4836)
    • Enable Finalize + Allreduce + add + rmsnorm fusion (#4756)
    • Port MakeDecodingBatchInputOutput to Python in TRTLLMSampler (#4828)
    • Add multimodal hashing support (image hashing) (#4145)
    • Include the executor's max batch size in CUDA graph batch size list (#4843)
  • API
    • Remove decoder request from decoder interface (#5129)
    • Enhance the LLM args PyTorch config, part 3 (torch_compile_config) (#5032)
    • Remove the redundant use_kv_cache field from PytorchConfig (#5031)
    • Move allreduce_strategy from committed api to reference (#5147)
  • Bug Fixes
    • Stop drafting when we hit the draft model's max seq len (#4879)
    • Unload XQA cubins early to avoid static lifetime (#5133)
    • Fix OOM because of unnecessary mha workspace (#5056)
    • Remove duplicated trust_remote_code knob from trtllm-serve (#5143)
    • Fix Llama-3_3-Nemotron-Super-49B-v1 FP8 accuracy threshold configs (#4961)
    • Fix AutoDeploy for PyTorch 25.05 dependency upgrade (#5106)
    • Fix XQA not being enabled when history_length < kMinHistoryTokensPerBlock (#4264)
    • Fix cuda driver link issue with driver version less than 12.3 (#5025)
    • Fix W4A8 weight loading error in WInt4AFP8FusedMoEMethod (#5026)
    • Fix warmup phase batch size out of range. (#4986)
  • Benchmark
    • Enable trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA in PyT flow (#5130)
    • Support post_proc for bench (#5122)
  • Performance
    • Avoid dynamic import overhead in is_llm_response with duck typing (#5110)
    • Remove zero-initialization of ptuning buffers (#4915)
  • Infrastructure
    • Move all test cases of TensorRT backend into post merge (#5186)
    • Upload imageTag info to Artifactory and add ngc_staging to save the NGC image (#4764)
    • Add a bot run option for detailed logs (#4390)
    • Add timeout and retry for wget in docker image build (#5035)
    • Change cutlass version back to 4.0 (#5041)
  • Documentation
    • Fix invalid links for trtllm-serve doc (#5145)
    • Added documentation for enable_trtllm_sampler (#4990); see the usage sketch after this highlights list
    • Add disaggregated serving section to models doc (#4877)
  • Known Issues
    • Multi-GPU model support on RTX Pro 6000
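
Several of the highlighted items touch the PyTorch-workflow LLM API (the enable_trtllm_sampler documentation, TRTLLM Sampler log-probs support, CUDA graph batch sizes). For orientation, here is a minimal generation sketch using the public LLM API. The enable_trtllm_sampler line is an assumption and is left commented out, since the exact placement of that knob has moved between releases; refer to the documentation added in #4990 for the current form.

```python
# Minimal sketch of the PyTorch-workflow LLM API referenced by several items above.
# The generate() flow follows the standard quickstart; enable_trtllm_sampler is an
# assumption here (its exact placement has differed across releases).
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # any Hugging Face checkpoint or local path
    # enable_trtllm_sampler=True,  # assumption: knob documented in #4990; check current docs
)

params = SamplingParams(max_tokens=32, temperature=0.8, top_p=0.95)
outputs = llm.generate(["TensorRT-LLM is"], params)
for output in outputs:
    print(output.outputs[0].text)
```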

What's Changed

  • ci: [nvbugs/5280806] Unwaive unittests/_torch. by @yuxianq in #4951
  • fix: Fix warmup phase batch size out of range. by @hyukn in #4986
  • [nvbug/5314469][feat] Include the executor's max batch size in CUDA g… by @mikeiovine in #4843
  • [nvbug 5283506] fix: Fix spec decode triton test by @pcastonguay in #4845
  • chore: Refine weight prefetching. by @yuxianq in #4893
  • chore: Change cutlass version back to 4.0 by @hyukn in #5041
  • [TRTLLM-5007][feat] Add multimodal hashing support (image hashing) by @chang-l in #4145
  • [https://nvbugs/5332927] Waive new tests by @tburt-nv in #5051
  • [TRTLLM-5518] doc: Adding disaggregated serving section to models doc by @pcastonguay in #4877
  • feat: port MakeDecodingBatchInputOutput to python in TRTLLMSampler by @dcampora in #4828
  • test: add more disaggregated serving tests into QA testlist by @StanleySun639 in #5036
  • perf: Removing initializing ptuning buffers to zero by @pcastonguay in #4915
  • chore: Waive CI failure. by @SimengLiu-nv in #5069
  • CI: waive test_ad_build_small_multi by @QiJune in #5071
  • [fix] Fix W4A8 weight loading error in WInt4AFP8FusedMoEMethod by @xiaoweiw-nv in #5026
  • fix cuda driver link issue with driver version less than 12.3 by @dongxuy04 in #5025
  • Waive L0 test by @yiqingy0 in #5067
  • [nvbug 5325284][fix] Increase Nemotron-H warmup request robustness by @tomeras91 in #4954
  • Waive L0 test by @yiqingy0 in #5077
  • chore: cleanup GDS Cmake interface by @achartier in #4928
  • fix: pytorch_backend_config is deprecated in update_llm_args_with_extra_dict. by @yuxianq in #4890
  • [TRTLLM-3927] [feat] Finalize + Allreduce + add + rmsnorm fusion by @zongfeijing in #4756
  • [CI] waive failing L0 test by @liji-nv in #5089
  • Mxfp8xmxfp4 by @Tracin in #4978
  • [AutoDeploy] Merge Feature Branch Week 3 by @lucaslie in #5054
  • test: add unit tests for Llama4 min_latency code by @nvpohanh in #4980
  • Doc: Add info about stop words appearing in output by @Linda-Stadter in #4956
  • [fix] Fix test_attention_mla by @jinyangyuan-nvidia in #5084
  • CI: Allow run by @IzzyPutterman in #5101
  • [fix] Unwaive test_llama_eagle3 by @mikeiovine in #5042
  • fix: XQA is not enabled when history_length < kMinHistoryTokensPerBlock. by @bobboli in #4264
  • test: conditional disagg and cache aware balancing for deepseek v3 by @zhengd-nv in #4522
  • [https://nvbugspro.nvidia.com/bug/5332927][fix] Fix the bug in the routing unit test by @ChristinaZ in #5065
  • infra: Add timeout and retry for wget in docker image build by @ZhanruiSunCh in #5035
  • Waive L0 tests by @yiqingy0 in #5111
  • chore: Merge remaining changes from feat/large-ep branch to main by @syuoni in #5039
  • [TRTLLM-4995][feat] TRTLLM Sampler log probs support by @dcampora in #4836
  • chore: bump version to 0.21.0rc2 by @ZhanruiSunCh in #5112
  • test: add more llama_v3.3_70b cases in perf test by @ruodil in #4979
  • [fix] Fix llama4 min latency by @liji-nv in #5117
  • [TRTLLM-5082] - Add a bot run option for detailed logs by @yiqingy0 in #4390
  • test: skip disaggregated tests on arm by @xinhe-nv in #5070
  • [chore] 2025-06-10 update allowlist by @tburt-nv in #5102
  • [TRTLLM-5581][infra] Update Module Owners by @poweiw in #5052
  • chore: rename IOFormatter to BaseCacheFormatter by @zhengd-nv in #5068
  • fix: limit process pool size when prefetching by @zhengd-nv in #5088
  • Use backend to replace macro to control enablement of MNNVL all reduce by @HuiGao-NV in #4635
  • Solve underallocation in VSWA+/VGQA by @netanel-haber in #4667
  • [nvbugs/5331013] fix AutoDeploy for PyTorch 25.05 dependency upgrade by @lucaslie in #5106
  • test(perf): Add remaining Llama-Nemotron perftests (nano, super, ultra) + extras ✨ by @venkywonka in #5066
  • Fix Llama-3_3-Nemotron-Super-49B-v1 FP8 accuracy threshold configs by @moraxu in #4961
  • update the free_gpu_mem_fraction for H100 qwen3 qa test by @byshiue in #5114
  • [test] Use LLM API for Nemotron-H correctness test by @tomeras91 in #5097
  • test: set enable_attention_dp to False for non-deepseek models and add more cases for llama_v3.1/3.3 70b fp8 models by @ruodil in #5149
  • [TRTLLM-4932] Add Llama-3.1-Nemotron-Nano-8B-v1-FP8 accuracy tests by @moraxu in #4933
  • Fix logprobs issues. by @dcampora in #5136
  • chore: fix typo in tests by @lfr-0531 in #5092
  • [fix] Do not reuse dummy request KVCache by @liji-nv in #4804
  • infra: upload imageTag info to artifactory and add ngc_staging to save ngc image by @ZhanruiSunCh in #4764
  • doc: fix invalid links for trtllm-serve doc by @nv-guomingz in #5145
  • test: waive the NIXL related tests by @Shixiaowei02 in #5153
  • enh(doc): Add ci-overview in docs/source/reference/ by @venkywonka in #5137
  • doc: Added documentation for enable_trtllm_sampler. by @dcampora in #4990
  • fix: remove duplicated trust_remote_code knob from trtllm-serve by @nv-guomingz in #5143
  • fix: https://nvbugs/5298661 by @nv-guomingz in #5022
  • fix: Updates to yarn implementation by @brb-nv in #5105
  • Move allreduce_strategy from committed api to reference by @HuiGao-NV in #5147
  • [nvbug/5334370][fix] Fix one model EAGLE3 by @mikeiovine in #5134
  • [fix][test] report individual unittests results to jenkins by @omera-nv in #5116
  • chore: Include prompt_token_ids only for context-only disagg requests by @pcastonguay in #5055
  • None: fix OOM because of unnecessary mha workspace by @ttyio in #5056
  • [feat] trtllmGen MoE routing: added support for top groups and top K bounds by @MatthiasKohl in #4063
  • [TRTLLM-5278][feat] Add attention dp support to MTP relaxed acceptance by @lfr-0531 in #5119
  • fix: [nvbugs/5324229] Fix broken WInt4AFP8FusedMoEMethod since FusedMoE refactor. by @yuxianq in #4930
  • [feat] Optimize KV Cache Reuse for MLA by @zhhuang-nv in #4869
  • test: add more cases for rtx_pro_6000_se and add option kv_cache_dtype in perf test by @ruodil in #5083
  • tests: update tests for b200 by @xinhe-nv in #5180
  • [fix]: Fall back to HMAC to Avoid IPC Serialization Churn by @yibinl-nvidia in #5074
  • [fix] Reenable test return logits by @dcampora in #5160
  • Add two MTP disaggregated test by @Tabrizian in #4546
  • [test] Update timeout params in QA test list by @crazydemo in #5124
  • chore: gracefully exit disagg process in tests; better startup and logging by @zhengd-nv in #5109
  • [fix] Fix comment to pass guardwords check by @MatthiasKohl in #5191
  • [nvbug 5333996 ][fix] Unload XQA cubins early to avoid static lifetime by @lowsfer in #5133
  • refactoring: port customized kernels with public cutlass version by @yunruis in #5027
  • refactor [BREAKING CHANGE]: remove the redundant use_kv_cache field from PytorchConfig by @nv-guomingz in #5031
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5178
  • feat: Basic skeleton for Gemma3 VLM by @brb-nv in #5108
  • add doc for open-sourced cutlass kernels by @yunruis in #5194
  • fix: fix license bug by @yunruis in #5200
  • UCXX uses only the ucp_feature_tag to avoid certain issues on specific platforms by @chuangz0 in #4994
  • CI: move all test cases of TensorRT backend into post merge by @QiJune in #5186
  • [https://nvbugspro.nvidia.com/bug/5295470] support headDim 256 for blackwell fmha kernels by @PerkzZheng in #5164
  • [nvbug/5319281][fix] Stop drafting when we hit the draft model's max seq len by @mikeiovine in #4879
  • [feat] Implement model-agnostic one-engine eagle3 by @nv-yilinf in #4778
  • fix: Fix waive list by @syuoni in #5205
  • feat: add multi-node support for Triton with pytorch backend by @achartier in #5172
  • optimize memset before alltoall communication by @dongxuy04 in #5188
  • refactor [BREAKING CHANGE]: enhance the llm args pytorch config part 3(torch_compile_config) by @nv-guomingz in #5032
  • Feat/ds r1 min latency opt round3, add router gemm, fused a gemm, PDL by @yunruis in #4560
  • refactor: Speculative decoding buffers by @Funatiq in #5091
  • feat: large-scale EP(part 7: DeepEP integration) by @yuantailing in #4792
  • linting(python): Enable ruff on more files (wave 1/N) by @2ez4bz in #5140
  • feat: Add support for fp8 rowwise quantization by @achartier in #4876
  • chore: improve disagg test failure detection by @ixlmar in #4738
  • perf: avoid dynamic import overhead in is_llm_response with duck typing by @tongyuantongyu in #5110
  • feat: Enable EPLB to existing MoE models by @syuoni in #5203
  • feat: Support post_proc for bench by @kaiyux in #5122
  • fix: fix cuda graph max batch size for spec decoding cases. by @lfr-0531 in #5076
  • [fix][test] Speedup Nemotron NAS unittests by @omera-nv in #5202
  • use cu for fmha_v2 by @qsang-nv in #4694
  • [TRTLLM-4983] feat: enable overlap scheduler between draft forwards by @lfr-0531 in #4802
  • Enable trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA in PyT flow by @amitz-nv in #5130
  • fix: build_config in TorchLlmArgs and avoid arbitrary args by @Superjomn in #4972
  • [fix] Fix Llama4 min-latency import error by @nv-yilinf in #5209
  • test: Add json_mode_eval for guided decoding evaluation by @syuoni in #5179
  • test: add more cases for llama_v3.3/3.1 70b fp8 and set enable_attention_dp to false to non-deepseek models by @ruodil in #5155
  • test: add llama4 models for perf test by @ruodil in #5187
  • test: Add fixture to skip tests based on MPI world size by @yizhang-nv in #5028
  • feat: Add w4a8_mxfp4_fp8 quantization recipe. by @Tracin in #4867
  • [Stress test] Add DeepSeek-R1 stress test by @Wanli-Jiang in #5033
  • refactor: Scheduling based on KV cache state by @Funatiq in #4865
  • use file lock to avoid port conflict by @chuangz0 in #5123
  • feat: MoE trtllm backend kernel update by @rosenrodt in #5183
  • refactor: remove decoder request from decoder interface by @Funatiq in #5129
  • Waive L0 tests by @yiqingy0 in #5233

New Contributors

Full Changelog: v0.21.0rc1...v0.21.0rc2
