NVIDIA/TensorRT-LLM v0.20.0rc3

Pre-release

Highlights

  • Model Support
    • Support Mistral Small 3.1 24B VLM in TRT workflow (#4183)
    • Support Gemma3-1b-it in PyTorch workflow (#3999)
  • Features
    • Adopt new logprob definition in PyTorch flow (#4057); see the sketch after this list
    • Support multiple LoRA adapters and TP (#3885)
    • Add Piecewise CUDA Graph support (#3804)
    • Add KV cache-aware router for disaggregated serving (#3831)
    • Enable per-request stats with PyTorch backend (#4156)
    • Support DeepSeek-R1 W4A8 on Hopper (#4123)
    • Enable chunked context for FlashInfer (#4132)
    • Support KV cache reuse for MLA (#3571)
  • API
    • Allow overriding CLI arguments with YAML file in trtllm-serve (#4164)
    • Remove deprecated GptSession/V1 from TRT workflow (#4092)
  • Bug Fixes
    • Fix attention DP bug on Qwen3 MoE model (#4141)
    • Fix illegal memory access when running LLaMA 4 with CUDA Graph enabled (#4101)
  • Benchmark
    • Remove deprecated Python runtime benchmark (#4171)
    • Add benchmark support for scaffolding (#4286)
  • Performance
  • Infrastructure
    • The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.04-py3 (#4049)
    • The dependent TensorRT version is updated to 10.10.0 (#4049)
    • The dependent CUDA version is updated to 12.9.0 (#4049)
    • The dependent public PyTorch version is updated to 2.7.0.
    • The pre-built TensorRT-LLM wheel on PyPI is linked against PyTorch 2.7.0 now, which uses the CXX11 ABI (#4235)
  • Documentation
  • Known Issues
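
As a rough illustration of the new logprob support in the PyTorch flow, the sketch below requests per-token log probabilities through the LLM API. The exact interface introduced by #4057 may differ: the `logprobs` field on `SamplingParams`, the `logprobs` attribute on the completion output, the model checkpoint, and the default backend selection are all assumptions here, so treat this as a minimal sketch rather than reference usage.

```python
# Minimal sketch: requesting per-token logprobs via the LLM API.
# Assumptions: `logprobs` on SamplingParams and `.logprobs` on the completion
# output follow the public LLM API; depending on your version, the PyTorch
# backend may need to be selected explicitly instead of being the default.
from tensorrt_llm import LLM, SamplingParams


def main():
    # Any supported Hugging Face checkpoint or local path (name is illustrative).
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

    # Ask for the top-1 logprob of each generated token (hypothetical value).
    sampling_params = SamplingParams(max_tokens=32, temperature=0.0, logprobs=1)

    outputs = llm.generate(["The capital of France is"], sampling_params)
    for output in outputs:
        completion = output.outputs[0]
        print(completion.text)
        print(completion.logprobs)  # per-token logprob info under the new definition


if __name__ == "__main__":
    main()
```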

What's Changed

  • feat: adopt new logprob definition in PyTorch flow by @tongyuantongyu in #4057
  • infra: Add NIXL into the Dockerfile by @Shixiaowei02 in #3981
  • feat: support multi lora adapters and TP by @shaharmor98 in #3885
  • feat: Fallback to NCCL for various patterns when input size is large. by @hyukn in #4080
  • Cherry-pick trtllm-gen from feat/llama4 to main by @chenfeiz0326 in #4086
  • [fix] [AutoDeploy] flashinfer usage on H100 by @lucaslie in #4162
  • fix: Fix incorrect conversion of Gen TPS/user by @FrankD412 in #4112
  • [fix] Fix llama4 + eagle3 by @mikeiovine in #3998
  • Support RingAttention in the BertAttention plugin and the DiT model by @ChunhuanLin in #3661
  • fix: alltoall padding for chunked MoE by @dongxuy04 in #4157
  • [feat] Allow overriding cli args with yaml file in trtllm-serve by @pcastonguay in #4164
  • [TRTLLM-5147][Qwen3] fix: fix bug of attention dp on qwen3_moe model by @byshiue in #4141
  • chore: Clean up the legacy DeepseekAllreduceFusionOp. by @hyukn in #4081
  • test: add qwen3 and disaggregated serving accuracy tests to qa test list by @StanleySun639 in #4083
  • [TRTLLM-3105][feat] Add Piecewise CUDA Graph Support by @yizhang-nv in #3804
  • fix: change pp broadcast pattern for LPs by @hchings in #4130
  • [#4085][fix] Fix apply_per_channel_scale for extremely large input sequence length. by @StudyingShao in #4089
  • [nvbug/5262268][fix] Fix trtllm-bench for llama 4 by @mikeiovine in #4104
  • chore: Fix pipeline break caused by previous PR (#4081) rebase + pipeline reuse by @hyukn in #4169
  • [https://nvbugspro.nvidia.com/bug/5260676]test: skip fp8 quantization case for pre-ada by @crazydemo in #4095
  • test: move mistral / mixtral test cases in QA test list into the new accuracy test suite by @crazydemo in #3440
  • test: Add fp8kv to DS-v3-lite integration tests. by @bobboli in #3950
  • [fix] Fix relaxed acceptance to support enabling it in context phase by @lfr-0531 in #4126
  • test: skip tests on b200 by @xinhe-nv in #3913
  • infra: Fix pipeline step error in post merge by @ZhanruiSunCh in #3948
  • fix: library path of nixl by @Shixiaowei02 in #4184
  • test: amend default pytorch extra-llm-api-config.yml in perf test by @ruodil in #4176
  • [fix] Fix add_dummy_requests for spec decoding cases by @lfr-0531 in #4084
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4165
  • feat: support task collection to collect information (#3328) by @WeiHaocheng in #3824
  • Cherry-pick: Use multi-threading to load MoE expert weights by @chenfeiz0326 in #4137
  • test: amend regex match for perf throughput by @ruodil in #4186
  • chore: reduce size of the docker images by @MartinMarciniszyn in #3990
  • [fix] trtllm-gen mla kernel warnings by @zhhuang-nv in #4119
  • chore: Deprecate evaltool by @Tracin in #4173
  • [fix][nvbug/5244009] Fix llama 4 test lists/scout accuracy issue by @mikeiovine in #4069
  • perf: [TRTLLM-4717][perf] Set CUDA graph max batch size and padding in throughput benchmark. by @FrankD412 in #3875
  • Refactor: Restructure C++ tests for better modularisation of non-shared code by @DomBrown in #4027
  • Updating the multimodal models README to add steps for running phi-4-multimodal instruct by @mayani-nv in #3932
  • fix: draft target README and assertion for logits-based acceptance by @mayani-nv in #4167
  • Add initial list of CODEOWNERS by @kevinch-nv in #4105
  • chore: PR to fix the formatting errors by @mayani-nv in #4200
  • test: Remove CNN Dailymail tasks in favor of GSM8K by @syuoni in #4187
  • [CI] waive two multi-gpu test cases by @QiJune in #4206
  • [CI] update pytorch only file list by @QiJune in #4210
  • chore:update modelopt to 0.29 by @nv-guomingz in #4150
  • [Infra] Waive L0 test by @yiqingy0 in #4212
  • remove cache_transceiver_prealloc_size by @chuangz0 in #4153
  • [TRTQA-2802][fix]: add --host for mgmn serve examples script by @xinhe-nv in #4175
  • tests: https://nvbugs/5219534 remove failed tests from test list by @xinhe-nv in #4113
  • test: add llama_3.2_1B model and fix for test lora script issue by @ruodil in #4139
  • chore: Update CODEOWNERS by @Funatiq in #4221
  • [https://nvbugspro.nvidia.com/bug/5270564][test] skip per-hopper for llama4 by @crazydemo in #4211
  • [TRTLLM-4911] feat(scaffolding): make sampling_params only settable by controller by @dc3671 in #4151
  • Feat: support exporting softmax statistics and update the kernel-selection heuristic by @PerkzZheng in #4155
  • infra: [TRTLLM-325] Prepare for NGC release - multiplatform build by @MartinMarciniszyn in #4191
  • [feat] Support HyperCLOVAX-SEED-Text language part by @yechank-nvidia in #3902
  • feat: Support the Structural Tag in guided decoding by @Ubospica in #4066
  • feat: add kv cache aware router by @zhengd-nv in #3831
  • refactor: Allow models to override apply_qk_norm. by @yuxianq in #4078
  • [https://nvbugs/5214229] [fix] Unwaive lm_head quantization case by @syuoni in #4222
  • doc: update switcher.json config by @niukuo in #4220
  • Revert "Add initial list of CODEOWNERS (#4105)" by @Funatiq in #4234
  • [TRTLLM-5188] fix: [AutoDeploy] update output shape of prepare_fused_mha_metadata_fake by @Fridah-nv in #4199
  • fix: Reset planned states to avoid memory leak in TrtllmAttentionWrapper by @yuxianq in #4227
  • Feat: Variable-Beam-Width-Search (VBWS) part4 by @wili-65535 in #3979
  • [TRTLLM-5081] [test] Align parametrize_with_ids to the pytest behavior by @syuoni in #4090
  • fix: reshape token_ids for lp in torch backend by @hchings in #4239
  • feat: Add heuristic for GroupRMSNorm kernel selection. by @SimengLiu-nv in #4047
  • [TRTLLM-5050][feat] Enable per-request stats with PyT backend by @pcastonguay in #4156
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4203
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4205
  • test: fix for perf test script issue by @ruodil in #4230
  • doc: update qwen3 document by @byshiue in #4246
  • feat: Prefetch safetensors files before loading them by @nvpohanh in #4140
  • Fix Pipeline Parallelism in Llama4 by @v-shobhit in #4106
  • [https://nvbugspro.nvidia.com/bug/5238626] illegal memory address when running llama 4 with cuda graph enabled by @PerkzZheng in #4101
  • [Infra][TRTLLM-4374] Upgrade TRT 10.10.0 GA, CUDA 12.9 GA and DLFW 25.04 by @yiqingy0 in #4049
  • [https://nvbugs/5220763] [test] Unwaive Mixtral FP8 TP2 test by @syuoni in #4252
  • [nvbugs/5268808][fix] Fix the potential out-of-range-access issue of allreduce workspace. by @hyukn in #4159
  • Waive stress test. by @dominicshanshan in #4262
  • [TRTLLM-5233][feat]: Add chunking to PyT heuristic for trtllm-bench. by @FrankD412 in #4133
  • [Infra] Waive L0 test by @yiqingy0 in #4268
  • [Infra] Waive L0 test by @yiqingy0 in #4269
  • feat: Support Mistral Small 3.1 24B VLM in TRT workflow by @brb-nv in #4183
  • Waive disagg kv cache load balancer test by @Tabrizian in #4276
  • fix: Merge PP overlap and non-overlap executor loop by @amukkara in #3878
  • test: Validate FP8 and LoRA for Gemma3 by @brb-nv in #3670
  • chore: bump version to 0.20.0rc3 by @ZhanruiSunCh in #4261
  • [TRTLLM-5188] fix: [AutoDeploy] unwaive AD build test by @Fridah-nv in #4273
  • [chore] update CI allowlist 2025-05-13 by @tburt-nv in #4278
  • [fix] Enable pp tests by @yizhang-nv in #3978
  • feat: Support Gemma3-1b-it in PyTorch workflow by @brb-nv in #3999
  • CI: add fp8/fp4 ci on Qwen3-30B-A3B by @byshiue in #4266
  • test: Add UT for moe trtllmgen by @zongfeijing in #4258
  • [TRTLLM-3330][feat] Support DeepSeek-R1 W4A8 on Hopper by @Barry-Delaney in #4123
  • [Infra] Waive L0 test by @yiqingy0 in #4295
  • feat: Low Precision Allreduce for PCIe based GPU by @kanghui0204 in #3851
  • tests: PyTorch multimodal using keyword match by @amukkara in #4215
  • [bug/5247505] fix: CP accuracy on Blackwell by @DylanChen-NV in #4188
  • test: [CI] remove closed bugs by @xinhe-nv in #4207
  • Add test case for kv memory estimation by @HuiGao-NV in #4158
  • chore: Remove deprecated Python runtime benchmark by @kaiyux in #4171
  • fix: Eagle decoding in TRT flow by @Funatiq in #4229
  • [Infra] - Update the upstream PyTorch dependency to 2.7.0 by @chzblych in #4235
  • Added tests for Llama3.1-70B-BF16 on SM120 by @farazkh80 in #4198
  • feat: [AutoDeploy] DSV3 mla attn ref op by @sugunav14 in #4272
  • [TRTLLM-5171] chore: Remove GptSession/V1 from TRT workflow by @Funatiq in #4092
  • [fix] Remove stale cublas heuristics by @hlu1 in #4326
  • [doc] Add tensorrtllm_backend serving documentation in the Deepseek-V3 README by @SimengLiu-nv in #4338
  • Revert "feat: Low Precision Allreduce for PCIe based GPU" by @QiJune in #4340
  • infra: open source fmha v2 kernels by @qsang-nv in #4185
  • [feat] Enable chunked context for flashinfer by @mikeiovine in #4132
  • [TRTLLM-2795] feat: Add yarn support for other models in trt-flow by @uchihatmtkinu in #3840
  • infra: Down the gcc toolset version from 13 to 11 by @ZhanruiSunCh in #4114
  • fix:https://nvbugs/5234033 enable starcoder trt-flow with transforme… by @nv-guomingz in #3909
  • [test] Reorganize TestDeepSeekR1::test_nvfp4_8gpus by @hlu1 in #4346
  • [test] add qa test mentioned in docs by @crazydemo in #4248
  • feat:[AutoDeploy] Update MoE pattern matcher to drop expert selection logic by @Fridah-nv in #3283
  • [https://nvbugs/5277113][fix]genai-perf API change stress test by @dominicshanshan in #4300
  • Breaking change: perf: Enable scheduling overlap by default by @kaiyux in #4174
  • feat: support kv cache reuse for MLA by @zhhuang-nv in #3571
  • test: FIX test_ptp_quickstart_advanced_deepseek_v3_2nodes_8gpus by @xinhe-nv in #4283
  • Add allreduce and rmsnorm fusion for qwen3 by @zongfeijing in #4304
  • chore: reduce code duplication by @ixlmar in #4297
  • fix: better method to help torch find nvtx3 by @tongyuantongyu in #4110
  • [fix] test_no_kv_cache_reuse for overlap_scheduler by @zhhuang-nv in #4350
  • test: add qa test list for rtx5090 and rtx_pro_6000 by @StanleySun639 in #4254
  • Revert "[test] add qa test mentioned in docs" by @chzblych in #4355
  • refactor: use x is None instead of x == None. by @yuxianq in #4244
  • test(perf): Add Phi-4-mini-instruct to perf tests by @venkywonka in #4267
  • enh: Enable option in trtllm-bench build subcommand to avoid loading weights by @venkywonka in #4142
  • feat: [nvbugs/5261055][nvbugs/5170160] non-invasive pipeline parallelism by @yuxianq in #4034
  • fix: update checks that broke medusa tests when use_py_session=True by @hchings in #4339
  • Move Triton backend to TRT-LLM main by @Tabrizian in #3549
  • feat: enhance trtllm serve multimodal by @yechank-nvidia in #3757
  • [AutoDeploy] fix: disable overlap scheduler until supported by @lucaslie in #4365
  • [TRTLLM-5054][fix] Removing repeated loading of input processor by @rakib-hasan in #4161
  • [AutoDeploy]feat: Add an AutoDeploy compile backend that only calls torch.compile by @suyoggupta in #4240
  • [CI] update multi-gpu test triggering file list by @QiJune in #4378
  • doc: Add docstring for Attention and MLA module. by @yuxianq in #4354
  • Fix bias shape in weightOnlyGroupwiseQuantMatmulPlugin for TRT workflow by @StudyingShao in #4348
  • [CI] waive test_chunked_prefill test cases by @QiJune in #4380
  • update README version by @ZhanruiSunCh in #4381
  • feat: support benchmark on scaffolding (#3328) by @WeiHaocheng in #4286
  • test: add kv cache aware test cases to qa test list by @StanleySun639 in #4257
  • [TRTLLM 4571] Support dynamic per-tensor FP8 by @Tracin in #4250
  • [fix] Fixed incorrect mixed precision MoE conversion by @Barry-Delaney in #4351
  • test: [CI] remove closed bugs by @xinhe-nv in #4345
  • fix: support TensorRT 10.11+ in FindTensorRT.cmake by @tongyuantongyu in #4353
  • Change the method to calculate kv memory size in tests by @HuiGao-NV in #4332
  • chore: improve log-level setting UX by @ixlmar in #4352
  • chore: Mass Integration 0.19 by @dcampora in #4255
  • Fix test_fused_moe_w4afp8 by @StudyingShao in #4393
  • [TRTLLM-4886][infra]Try another timeout opt to exit test thread directly instead of gracefully by @EmmaQiaoCh in #4341
  • feat: TRT-LLM Gen integration for BMM and MoE refactoring by @nekorobov in #4280
  • [CI] waive accuracy/test_cli_flow.py::TestTinyLlama1_1BChat::test_pp4 by @liji-nv in #4397
  • doc: DS r1 min latency blog by @Kefeng-Duan in #4386
  • feat: [AutoDeploy] update rope matcher with minor variants (Deepseek) by @Fridah-nv in #3638
  • refactor: Copy sequence lengths once in decoder setup by @Funatiq in #4102
  • [AutoDeploy] configurable cache resize by @lucaslie in #4372
  • fix: Fix chat template kwargs bug. by @Tracin in #4387
  • fix: improve PyExecutor resource allocations by @ixlmar in #4299
  • API Breaking Change + Readability: "decoder"->"sampler" by @netanel-haber in #4121
  • [AutoDeploy] fix: proper process group clean up by @lucaslie in #4373
  • [AutoDeploy] eager pattern matcher new pattern by @lucaslie in #4370
  • [Deepseek] Add accuracy test references for fp8 kvcache by @hlu1 in #4374
  • perf: Eliminate the need for attention DP padding when possible by @jinyangyuan-nvidia in #3439
  • test: Waive tests for nvbugs/5286795. by @yuxianq in #4409
  • Extend the Llama-Nemotron-Nano-8B perf-integration-tests (cpp) by @venkywonka in #4195
  • infra: [TRTLLM-5072] Add SBSA release images by @ZhanruiSunCh in #4231
  • [Infra] - Terminate the Slurm job if node does not come online in 2 hours by @yuanjingx87 in #4334
  • Removing the outdated argument by @rakib-hasan in #4408
  • fix: Remove real size allocation by @kaiyux in #4396
  • add changes for fp8, nemotron-nas, API by @shaharmor98 in #4180
  • [Infra][Docs] - Some clean-up for the CI pipeline and docs by @chzblych in #4419
  • [https://nvbugspro.nvidia.com/bug/5243740][fix] deduce default max_tokens for trtllm-serve by @LinPoly in #4265

New Contributors

Full Changelog: v0.20.0rc2...v0.20.0rc3
