NVIDIA/TensorRT-LLM v0.20.0rc3

Pre-release

Highlights

  • Model Support
    • Support Mistral Small 3.1 24B VLM in TRT workflow (#4183)
    • Support Gemma3-1b-it in PyTorch workflow (#3999)
  • Features
    • Adopt new logprob definition in PyTorch flow (#4057); see the sketch after this list
    • Support multiple LoRA adapters and TP (#3885)
    • Add Piecewise CUDA Graph support (#3804)
    • Add KV cache-aware router for disaggregated serving (#3831)
    • Enable per-request stats with PyTorch backend (#4156)
    • Support DeepSeek-R1 W4A8 on Hopper (#4123)
    • Enable chunked context for FlashInfer (#4132)
    • Support KV cache reuse for MLA (#3571)
  • API
    • Allow overriding CLI arguments with YAML file in trtllm-serve (#4164)
    • Remove deprecated GptSession/V1 from TRT workflow (#4092)
  • Bug Fixes
    • Fix attention DP bug on Qwen3 MoE model (#4141)
    • Fix illegal memory access when running LLaMA 4 with CUDA Graph enabled (#4101)
  • Benchmark
    • Remove deprecated Python runtime benchmark (#4171)
    • Add benchmark support for scaffolding (#4286)
  • Performance
  • Infrastructure
    • The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.04-py3 (#4049)
    • The dependent TensorRT version is updated to 10.10.0 (#4049)
    • The dependent CUDA version is updated to 12.9.0 (#4049)
    • The dependent public PyTorch version is updated to 2.7.0.
    • The pre-built TensorRT-LLM wheel on PyPI is linked against PyTorch 2.7.0 now, which uses the CXX11 ABI (#4235)
  • Documentation
  • Known Issues
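
As a rough illustration of the new logprob support in the PyTorch flow, the sketch below requests per-token log probabilities through the LLM API. The exact interface introduced by #4057 may differ: the `logprobs` field on `SamplingParams`, the `logprobs` attribute on the completion output, the model checkpoint, and the default backend selection are all assumptions here, so treat this as a minimal sketch rather than reference usage.

```python
# Minimal sketch: requesting per-token logprobs via the LLM API.
# Assumptions: `logprobs` on SamplingParams and `.logprobs` on the completion
# output follow the public LLM API; depending on your version, the PyTorch
# backend may need to be selected explicitly instead of being the default.
from tensorrt_llm import LLM, SamplingParams


def main():
    # Any supported Hugging Face checkpoint or local path (name is illustrative).
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

    # Ask for the top-1 logprob of each generated token (hypothetical value).
    sampling_params = SamplingParams(max_tokens=32, temperature=0.0, logprobs=1)

    outputs = llm.generate(["The capital of France is"], sampling_params)
    for output in outputs:
        completion = output.outputs[0]
        print(completion.text)
        print(completion.logprobs)  # per-token logprob info under the new definition


if __name__ == "__main__":
    main()
```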

What's Changed

  • feat: adopt new logprob definition in PyTorch flow by @tongyuantongyu in #4057
  • infra: Add NIXL into the Dockerfile by @Shixiaowei02 in #3981
  • feat: support multi lora adapters and TP by @shaharmor98 in #3885
  • feat: Fallback to NCCL for various patterns when input size is large. by @hyukn in #4080
  • Cherry-pick trtllm-gen from feat/llama4 to main by @chenfeiz0326 in #4086
  • [fix] [AutoDeploy] flashinfer usage on H100 by @lucaslie in #4162
  • fix: Fix incorrect conversion of Gen TPS/user by @FrankD412 in #4112
  • [fix] Fix llama4 + eagle3 by @mikeiovine in #3998
  • Support RingAttention in the BertAttention plugin and the DiT model by @ChunhuanLin in #3661
  • fix: alltoall padding for chunked MoE by @dongxuy04 in #4157
  • [feat] Allow overriding cli args with yaml file in trtllm-serve by @pcastonguay in #4164
  • [TRTLLM-5147][Qwen3] fix: fix bug of attention dp on qwen3_moe model by @byshiue in #4141
  • chore: Clean up the legacy DeepseekAllreduceFusionOp. by @hyukn in #4081
  • test: add qwen3 and disaggregated serving accuracy tests to qa test list by @StanleySun639 in #4083
  • [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support by @yizhang-nv in #3804
  • fix: change pp broadcast pattern for LPs by @hchings in #4130
  • [#4085][fix] Fix apply_per_channel_scale for extremely large input sequence length. by @StudyingShao in #4089
  • [nvbug/5262268][fix] Fix trtllm-bench for llama 4 by @mikeiovine in #4104
  • chore: Fix pipeline break caused by previous PR (#4081) rebase + pipeline reuse by @hyukn in #4169
  • [https://nvbugspro.nvidia.com/bug/5260676]test: skip fp8 quantization case for pre-ada by @crazydemo in #4095
  • test: move mistral / mixtral test cases in QA test list into the new accuracy test suite by @crazydemo in #3440
  • test: Add fp8kv to DS-v3-lite integration tests. by @bobboli in #3950
  • [fix] Fix relaxed acceptance to support enabling it in context phase by @lfr-0531 in #4126
  • test: skip tests on b200 by @xinhe-nv in #3913
  • infra: Fix pipeline step error in post merge by @ZhanruiSunCh in #3948
  • fix: library path of nixl by @Shixiaowei02 in #4184
  • test: amend default pytorch extra-llm-api-config.yml in perf test by @ruodil in #4176
  • [fix] Fix add_dummy_requests for spec decoding cases by @lfr-0531 in #4084
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4165
  • feat: support task collection to collect information (#3328) by @WeiHaocheng in #3824
  • Cherry-pick: Use multi-threading to load MoE expert weights by @chenfeiz0326 in #4137
  • test: amend regex match for perf throughput by @ruodil in #4186
  • chore: reduce size of the docker images by @MartinMarciniszyn in #3990
  • [fix] trtllm-gen mla kernel warnings by @zhhuang-nv in #4119
  • chore: Deprecate evaltool by @Tracin in #4173
  • [fix][nvbug/5244009] Fix llama 4 test lists/scout accuracy issue by @mikeiovine in #4069
  • perf: [TRTLLM-4717][perf] Set CUDA graph max batch size and padding in throughput benchmark. by @FrankD412 in #3875
  • Refactor: Restructure C++ tests for better modularisation of non-shared code by @DomBrown in #4027
  • Updating the multimodal models README to add steps for running phi-4-multimodal instruct by @mayani-nv in #3932
  • fix: draft target README and assertion for logits-based acceptance by @mayani-nv in #4167
  • Add initial list of CODEOWNERS by @kevinch-nv in #4105
  • chore: PR to fix the formatting errors by @mayani-nv in #4200
  • test: Remove CNN Dailymail tasks in favor of GSM8K by @syuoni in #4187
  • [CI] waive two multi-gpu test cases by @QiJune in #4206
  • [CI] update pytorch only file list by @QiJune in #4210
  • chore:update modelopt to 0.29 by @nv-guomingz in #4150
  • [Infra] Waive L0 test by @yiqingy0 in #4212
  • remove cache_transceiver_prealloc_size by @chuangz0 in #4153
  • [TRTQA-2802][fix]: add --host for mgmn serve examples script by @xinhe-nv in #4175
  • tests: https://nvbugs/5219534 remove failed tests from test list by @xinhe-nv in #4113
  • test: add llama_3.2_1B model and fix for test lora script issue by @ruodil in #4139
  • chore: Update CODEOWNERS by @Funatiq in #4221
  • [https://nvbugspro.nvidia.com/bug/5270564][test] skip per-hopper for llama4 by @crazydemo in #4211
  • [TRTLLM-4911] feat(scaffolding): make sampling_params only settable by controller by @dc3671 in #4151
  • Feat: support exporting softmax statistics and update the kernel-selection heuristic by @PerkzZheng in #4155
  • infra: [TRTLLM-325] Prepare for NGC release - multiplatform build by @MartinMarciniszyn in #4191
  • [feat] Support HyperCLOVAX-SEED-Text language part by @yechank-nvidia in #3902
  • feat: Support the Structural Tag in guided decoding by @Ubospica in #4066
  • feat: add kv cache aware router by @zhengd-nv in #3831
  • refactor: Allow models to override apply_qk_norm. by @yuxianq in #4078
  • [https://nvbugs/5214229] [fix] Unwaive lm_head quantization case by @syuoni in #4222
  • doc: update switcher.json config by @niukuo in #4220
  • Revert "Add initial list of CODEOWNERS (#4105)" by @Funatiq in #4234
  • [TRTLLM-5188] fix: [AutoDeploy] update output shape of prepare_fused_mha_metadata_fake by @Fridah-nv in #4199
  • fix: Reset planned states to avoid memory leak in TrtllmAttentionWrapper by @yuxianq in #4227
  • Feat: Variable-Beam-Width-Search (VBWS) part4 by @wili-65535 in #3979
  • [TRTLLM-5081] [test] Align parametrize_with_ids to the pytest behavior by @syuoni in #4090
  • fix: reshape token_ids for lp in torch backend by @hchings in #4239
  • feat: Add heuristic for GroupRMSNorm kernel selection. by @SimengLiu-nv in #4047
  • [TRTLLM-5050][feat] Enable per-request stats with PyT backend by @pcastonguay in #4156
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4203
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4205
  • test: fix for perf test script issue by @ruodil in #4230
  • doc: update qwen3 document by @byshiue in #4246
  • feat: Prefetch safetensors files before loading them by @nvpohanh in #4140
  • Fix Pipeline Parallelism in Llama4 by @v-shobhit in #4106
  • [https://nvbugspro.nvidia.com/bug/5238626] illegal memory address when running llama 4 with cuda graph enabled by @PerkzZheng in #4101
  • [Infra][TRTLLM-4374] Upgrade TRT 10.10.0 GA, CUDA 12.9 GA and DLFW 25.04 by @yiqingy0 in #4049
  • [https://nvbugs/5220763] [test] Unwaive Mixtral FP8 TP2 test by @syuoni in #4252
  • [nvbugs/5268808][fix] Fix the potential out-of-range-access issue of allreduce workspace. by @hyukn in #4159
  • Waive stress test. by @dominicshanshan in #4262
  • [TRTLLM-5233][feat]: Add chunking to PyT heuristic for trtllm-bench. by @FrankD412 in #4133
  • [Infra] Waive L0 test by @yiqingy0 in #4268
  • [Infra] Waive L0 test by @yiqingy0 in #4269
  • feat: Support Mistral Small 3.1 24B VLM in TRT workflow by @brb-nv in #4183
  • Waive disagg kv cache load balancer test by @Tabrizian in #4276
  • fix: Merge PP overlap and non-overlap executor loop by @amukkara in #3878
  • test: Validate FP8 and LoRA for Gemma3 by @brb-nv in #3670
  • chore: bump version to 0.20.0rc3 by @ZhanruiSunCh in #4261
  • [TRTLLM-5188] fix: [AutoDeploy] unwaive AD build test by @Fridah-nv in #4273
  • [chore] update CI allowlist 2025-05-13 by @tburt-nv in #4278
  • [fix] Enable pp tests by @yizhang-nv in #3978
  • feat: Support Gemma3-1b-it in PyTorch workflow by @brb-nv in #3999
  • CI: add fp8/fp4 ci on Qwen3-30B-A3B by @byshiue in #4266
  • test: Add UT for moe trtllmgen by @zongfeijing in #4258
  • [TRTLLM-3330][feat] Support DeepSeek-R1 W4A8 on Hopper by @Barry-Delaney in #4123
  • [Infra] Waive L0 test by @yiqingy0 in #4295
  • feat: Low Precision Allreduce for PCIe based GPU by @kanghui0204 in #3851
  • tests: PyTorch multimodal using keyword match by @amukkara in #4215
  • [bug/5247505] fix: CP accuracy on Blackwell by @DylanChen-NV in #4188
  • test: [CI] remove closed bugs by @xinhe-nv in #4207
  • Add test case for kv memory estimation by @HuiGao-NV in #4158
  • chore: Remove deprecated Python runtime benchmark by @kaiyux in #4171
  • fix: Eagle decoding in TRT flow by @Funatiq in #4229
  • [Infra] - Update the upstream PyTorch dependency to 2.7.0 by @chzblych in #4235
  • Added tests for Llama3.1-70B-BF16 on SM120 by @farazkh80 in #4198
  • feat: [AutoDeploy] DSV3 mla attn ref op by @sugunav14 in #4272
  • [TRTLLM-5171] chore: Remove GptSession/V1 from TRT workflow by @Funatiq in #4092
  • [fix] Remove stale cublas heuristics by @hlu1 in #4326
  • [doc] Add tensorrtllm_backend serving documentation in the Deepseek-V3 README by @SimengLiu-nv in #4338
  • Revert "feat: Low Precision Allreduce for PCIe based GPU" by @QiJune in #4340
  • infra: open source fmha v2 kernels by @qsang-nv in #4185
  • [feat] Enable chunked context for flashinfer by @mikeiovine in #4132
  • [TRTLLM-2795] feat: Add yarn support for other models in trt-flow by @uchihatmtkinu in #3840
  • infra: Down the gcc toolset version from 13 to 11 by @ZhanruiSunCh in #4114
  • fix:https://nvbugs/5234033 enable starcoder trt-flow with transforme… by @nv-guomingz in #3909
  • [test] Reorganize TestDeepSeekR1::test_nvfp4_8gpus by @hlu1 in #4346
  • [test] add qa test mentioned in docs by @crazydemo in #4248
  • feat:[AutoDeploy] Update MoE pattern matcher to drop expert selection logic by @Fridah-nv in #3283
  • [https://nvbugs/5277113][fix]genai-perf API change stress test by @dominicshanshan in #4300
  • Breaking change: perf: Enable scheduling overlap by default by @kaiyux in #4174
  • feat: support kv cache reuse for MLA by @zhhuang-nv in #3571
  • test: FIX test_ptp_quickstart_advanced_deepseek_v3_2nodes_8gpus by @xinhe-nv in #4283
  • Add allreduce and rmsnorm fusion for qwen3 by @zongfeijing in #4304
  • chore: reduce code duplication by @ixlmar in #4297
  • fix: better method to help torch find nvtx3 by @tongyuantongyu in #4110
  • [fix] test_no_kv_cache_reuse for overlap_scheduler by @zhhuang-nv in #4350
  • test: add qa test list for rtx5090 and rtx_pro_6000 by @StanleySun639 in #4254
  • Revert "[test] add qa test mentioned in docs" by @chzblych in #4355
  • refactor: use x is None instead of x == None. by @yuxianq in #4244
  • test(perf): Add Phi-4-mini-instruct to perf tests by @venkywonka in #4267
  • enh: Enable option in trtllm-bench build subcommand to avoid loading weights by @venkywonka in #4142
  • feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism by @yuxianq in #4034
  • fix: update checks that broke medusa tests when use_py_session=True by @hchings in #4339
  • Move Triton backend to TRT-LLM main by @Tabrizian in #3549
  • feat: enhance trtllm serve multimodal by @yechank-nvidia in #3757
  • [AutoDeploy] fix: disable overlap scheduler until supported by @lucaslie in #4365
  • [TRTLLM-5054][fix] Removing repeated loading of input processor by @rakib-hasan in #4161
  • [AutoDeploy]feat: Add an AutoDeploy compile backend that only calls torch.compile by @suyoggupta in #4240
  • [CI] update multi-gpu test triggering file list by @QiJune in #4378
  • doc: Add docstring for Attention and MLA module. by @yuxianq in #4354
  • Fix bias shape in weightOnlyGroupwiseQuantMatmulPlugin for TRT workflow by @StudyingShao in #4348
  • [CI] waive test_chunked_prefill test cases by @QiJune in #4380
  • update README version by @ZhanruiSunCh in #4381
  • feat: support benchmark on scaffolding (#3328) by @WeiHaocheng in #4286
  • test: add kv cache aware test cases to qa test list by @StanleySun639 in #4257
  • [TRTLLM 4571] Support dynamic per-tensor FP8 by @Tracin in #4250
  • [fix] Fixed incorrect mixed precision MoE conversion by @Barry-Delaney in #4351
  • test: [CI] remove closed bugs by @xinhe-nv in #4345
  • fix: support TensorRT 10.11+ in FindTensorRT.cmake by @tongyuantongyu in #4353
  • Change the method to calculate kv memory size in tests by @HuiGao-NV in #4332
  • chore: improve log-level setting UX by @ixlmar in #4352
  • chore: Mass Integration 0.19 by @dcampora in #4255
  • Fix test_fused_moe_w4afp8 by @StudyingShao in #4393
  • [TRTLLM-4886][infra]Try another timeout opt to exit test thread directly instead of gracefully by @EmmaQiaoCh in #4341
  • feat: TRT-LLM Gen integration for BMM and MoE refactoring by @nekorobov in #4280
  • [CI] waive accuracy/test_cli_flow.py::TestTinyLlama1_1BChat::test_pp4 by @liji-nv in #4397
  • doc: DS r1 min latency blog by @Kefeng-Duan in #4386
  • feat: [AutoDeploy] update rope matcher with minor variants (Deepseek) by @Fridah-nv in #3638
  • refactor: Copy sequence lengths once in decoder setup by @Funatiq in #4102
  • [AutoDeploy] configurable cache resize by @lucaslie in #4372
  • fix: Fix chat template kwargs bug. by @Tracin in #4387
  • fix: improve PyExecutor resource allocations by @ixlmar in #4299
  • API Breaking Change + Readability: "decoder"->"sampler" by @netanel-haber in #4121
  • [AutoDeploy] fix: proper process group clean up by @lucaslie in #4373
  • [AutoDeploy] eager pattern matcher new pattern by @lucaslie in #4370
  • [Deepseek] Add accuracy test references for fp8 kvcache by @hlu1 in #4374
  • perf: Eliminate the need for attention DP padding when possible by @jinyangyuan-nvidia in #3439
  • test: Waive tests for nvbugs/5286795. by @yuxianq in #4409
  • Extend the Llama-Nemotron-Nano-8B perf-integration-tests (cpp) by @venkywonka in #4195
  • infra: [TRTLLM-5072] Add SBSA release images by @ZhanruiSunCh in #4231
  • [Infra] - Terminate the Slurm job if node does not come online in 2 hours by @yuanjingx87 in #4334
  • Removing the outdated argument by @rakib-hasan in #4408
  • fix: Remove real size allocation by @kaiyux in #4396
  • add changes for fp8, nemotron-nas, API by @shaharmor98 in #4180
  • [Infra][Docs] - Some clean-up for the CI pipeline and docs by @chzblych in #4419
  • [https://nvbugspro.nvidia.com/bug/5243740][fix] deduce default max_tokens for trtllm-serve by @LinPoly in #4265

New Contributors

Full Changelog: v0.20.0rc2...v0.20.0rc3
