NVIDIA/TensorRT-LLM v0.21.0rc0

Highlights

  • Model Support
  • Features
    • Support for large-scale EP (#4384, #4495, #4615); see the usage sketch after the Highlights list
    • Added chunked attention kernels (#4291, #4394)
    • ScaffoldingLLM now supports MCP (Model Context Protocol) (#4410)
    • Integrated NIXL into the communication layer of the disaggregated service (#3934, #4125)
    • Integrated Hopper chunked attention kernels (#4330)
    • Enabled TRT backend for Python runtime in disaggregated service (#4243)
    • Added FP8 block-scale GEMM support on SM89 (#4481)
    • Added a low-latency TRTLLM FP4 MoE backend for Qwen3 (#4530)
    • Introduced sliding-window attention kernels for the generation phase on Blackwell (#4564)
    • Added vanilla MoE (#4682)
    • Fused QKNorm + RoPE integration (#4611)
    • Added Fabric Memory support for KV cache transfer (#4717)
  • API
  • Bug Fixes
    • Resolved Torch compile issue for DeepSeek V3 (#3952)
    • Fixed trtllm-llmapi-launch for single-node, single-GPU setups (#4428)
    • Removed duplicate tokenization in generation server (#4492)
    • Fixed cancel request handling for attentionDP (#4648)
    • Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
    • Fixed queued request statistics (#4806)
    • Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
    • Resolved accuracy and illegal memory access issues with MTP + attention DP (#4379)
  • Benchmark
    • Added all_reduce.py benchmark script for testing (#4537)
  • Performance
  • Infrastructure
    • Integrated NGC image into Makefile automation and documentation (#4400)
    • Built Triton for ARM architecture (#4456)
    • Added triton release container (#4455)
    • Refactored Docker build image (Groovy) and added NGC image support (#4294)
    • Upgraded Cutlass to version 4.0 (#4794)
  • Documentation
    • Updated descriptions for NGC Docker images (#4702, #4705)
  • Known Issues
    • Two important fixes are NOT included in this release but are already on the main branch:
      • Fixed a bug in setting attention_chunk_size, and enabled chunked attention in the generation phase by default (#4693)
      • Fixed an LLM API benchmark failure caused by a serialization issue (#4835)
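
The usage sketch below exercises two of the highlighted features, large-scale EP (#4384, #4495, #4615) and chunked attention (#4291, #4394), through the LLM API. It is a minimal sketch, assuming the public `tensorrt_llm.LLM` entry point; the model name is a placeholder, and the exact knob names (`moe_expert_parallel_size`, `enable_chunked_prefill`) should be verified against the v0.21 documentation before use.

```python
# Minimal sketch, assuming the public LLM API of the v0.21 release line.
# Model name and exact knob names are assumptions, not confirmed by the
# release notes above; check the v0.21 docs before relying on them.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-30B-A3B",      # placeholder MoE checkpoint
    tensor_parallel_size=4,
    moe_expert_parallel_size=4,      # exercises the large-scale EP work
    enable_chunked_prefill=True,     # pairs with the chunked attention kernels
)

outputs = llm.generate(
    ["Explain expert parallelism in one paragraph."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```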

What's Changed

  • Refine doc by @juney-nvidia in #4420
  • Refine doc by @juney-nvidia in #4421
  • refine doc by @juney-nvidia in #4422
  • Remove vila test by @Tabrizian in #4376
  • [TRTLLM-4618][feat] Add Nemotron Super 49B FP8 test on RTX6000 Pro (SM120) by @farazkh80 in #4363
  • tests: add qa test mentioned in docs by @crazydemo in #4357
  • [Infra] - Always push the release images in the post-merge job by @chzblych in #4426
  • tests: Add test cases for rcca cases by @crazydemo in #4347
  • chore: cleanup perf_evaluator code by @Superjomn in #3833
  • feat: Add pp support for hybrid attn/mamba model by @yuxianq in #4358
  • fix: wrong argument name enable_overlap_scheduler by @kaiyux in #4433
  • Update "Roadmap" link under README.md to the issues with Roadmap label by @AdamzNV in #4425
  • fix potential issues in allreduce fusion kernel and ut by @yilin-void in #4226
  • [TRTLLM-4638] feat(scaffolding): update Reward Controller to PRM specific controller with step split by @dc3671 in #4337
  • feat: NIXL interface integration by @Shixiaowei02 in #3934
  • Downgrade the logger level for fallback tactic warning. by @hyukn in #4440
  • Test: Improve model re-use in C++ DGX tests for CI stability by @DomBrown in #4263
  • fix: temp disable the problem test by @Shixiaowei02 in #4445
  • Add llama4 disagg accuracy tests by @Tabrizian in #4336
  • [https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 by @liji-nv in #3952
  • [Docs] - Reapply #4220 by @chzblych in #4434
  • [TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) by @farazkh80 in #4335
  • [Feat] add chunked-attention kernels on Hopper (for llama4) by @PerkzZheng in #4291
  • test(perf): Add some Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (TRT flow, trtllm-bench) by @venkywonka in #4128
  • fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. by @yuxianq in #4399
  • feat: Low Precision Allreduce for PCIe based GPU by @kanghui0204 in #4344
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4429
  • [TRTLLM-4932] Add CLI accuracy tests for Llama-3.3-70B-Instruct and LLM API BF16 variant by @moraxu in #4362
  • test: update test filter in perf test yml file to select cases by gpu name and add cases for RTX 6000 pro by @ruodil in #4282
  • [AutoDeploy] HF factory improvements by @lucaslie in #4371
  • chore: bump version to 0.21.0rc0 by @ZhanruiSunCh in #4465
  • doc: [TRTLLM-325]Integrate the NGC image in Makefile automation and document by @MartinMarciniszyn in #4400
  • infra: Add qwen3 235B tests into QA by @byshiue in #4483
  • feat: large-scale EP(part 2: MoE Load Balancer - core utilities) by @dongxuy04 in #4384
  • [TRTLLM-5085][fix] Nemotron H correctness test by @tomeras91 in #4444
  • [Docs] - Add date and commit info by @chzblych in #4448
  • fix[nvbug/5286515]: trtllm-llmapi-launch on single node single gpu by @Superjomn in #4428
  • fix: replace the image links in the blog by @Shixiaowei02 in #4489
  • fix: Fix TRTLLMSampler beam width bug. by @dcampora in #4473
  • refactor: Unify request order in TRT and PyTorch workflow by @Funatiq in #4096
  • [TRTLLM-5273] feat: Use full attention mask if Llama3 is used as encoder and fix EarlyStopDecoder unsqueeze bug by @nvrohanv in #4290
  • Build Triton for arm by @Tabrizian in #4456
  • test: [CI] remove closed bugs by @xinhe-nv in #4417
  • test(perf): Add remaining Phi-4-mini-instruct perf tests by @venkywonka in #4443
  • feat: conditional disaggregation in disagg server by @zhengd-nv in #3974
  • perf: Fuse gemm setup function for SM90/SM100 MOE plugin path by @djns99 in #4146
  • fix: skip weights defined in create_weights for pp. by @yuxianq in #4447
  • Feat: add chunked-attention kernels on Blackwell by @PerkzZheng in #4394
  • fix [nvbug/5220766]: llmapi-launch: add trtllm-bench test with engine building by @Superjomn in #4091
  • [TRTLLM-5000][feat] Pytorch implementation of ngram drafter by @thorjohnsen in #3936
  • test: NIXL single process test by @Shixiaowei02 in #4486
  • Chore: waive torch compile test cases of deepseek v3 lite by @QiJune in #4508
  • Feat: add deep_gemm swapab Kernel by @ruoqianguo in #4430
  • unwaive some disagg tests by @chuangz0 in #4476
  • Clean: fmha codes by @PerkzZheng in #4496
  • tests: add llama 3.3 70b 2 nodes tests by @xinhe-nv in #4391
  • CI: waive test_fp8_block_scales_4gpus of deepseek v3 lite by @QiJune in #4520
  • test: remove enable_overlap_schedule in pytorch config and set enable_chunked prefill to be true for isl>2048 cases by @ruodil in #4285
  • docs: update the introduction for scaffolding by @WeiHaocheng in #4360
  • test: add failed case in waive list and fix some test script issue for perf test by @ruodil in #4527
  • tests: add qwen fp4 tests into QA test list & update sanity test list by @xinhe-nv in #4478
  • feat: large-scale EP(part 3: refactor - FusedMoe for redundant expert) by @dongxuy04 in #4495
  • refactor: DisaggExecutorTest by @Funatiq in #4398
  • chore: clean ucx and nixl mirror. by @nv-guomingz in #4531
  • Add pytorch backend team by @kevinch-nv in #4405
  • test(perf): Pt.2 Add Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (cpp) by @venkywonka in #4499
  • Adding two-shot allreduce kernel and mnnvl multicasting buffer by @zongfeijing in #4216
  • test: Split test_simple into mpi_utils and cache transceiver tests for DGX by @DomBrown in #4451
  • fix: TRT-LLM Gen dtype declaration by @nekorobov in #4503
  • chore: remove extra PYTHONPATH by @achartier in #4453
  • Agent interface impl for NIXL by @chuangz0 in #4125
  • chore: Partition LlmArgs into TorchLlmArgs and TrtLlmArgs by @Superjomn in #3823
  • [TRTLLM-4932] Add CLI accuracy tests for Phi-4-mini-instruct by @moraxu in #4415
  • chore: Add all_reduce.py benchmark script to test by @kaiyux in #4537
  • feat: add dataset support for benchmark_core_model with LLMAPI by @achartier in #4457
  • fix[nvbug-5228840]: Remove test cases of feature not supported anymore by @HuiGao-NV in #3972
  • feat: add health_generate route to openai serving (Cherry-pick #3856) by @kaiyux in #4349
  • Add tritonrelease container by @Tabrizian in #4455
  • cache_transceiver_config by @chuangz0 in #4556
  • test: waive hanging cases for perf test by @ruodil in #4562
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4549
  • Chore: clean up _merge_dummy_request method of PyExecutor by @QiJune in #4438
  • fix sequence data race by @chuangz0 in #4565
  • fix: Move cv2 import to load_video function by @Funatiq in #4541
  • test(perf): Add Llama-3_1-Nemotron-Ultra-253B-v1 perf tests (cpp) by @venkywonka in #4446
  • [nvbug/5285881][fix] Fix chunked prefill + overlap scheduler by @mikeiovine in #4402
  • [feat] Integrate Hopper chunked attention kernels by @mikeiovine in #4330
  • chore: clean useless flag by @nv-guomingz in #4567
  • Chore: clean up _gather_dp_requests_num method of PyExecutor by @QiJune in #4571
  • fix[nvbug-5295425]: [TRTLLM-5385] fix race condition in MoeLoadBalancer by @dongxuy04 in #4573
  • ScaffoldingLLM supports MCP by @wu1du2 in #4410
  • [feat][TRTLLM-5018] Disaggregated serving Python runtime TRT backend by @pcastonguay in #4243
  • chore: clean-up for header file. by @nv-guomingz in #4540
  • [https://nvbugspro.nvidia.com/bug/5181262] [test] Unwaive Mistral Nemo test by @syuoni in #4515
  • [feat] support fp8 blockscale gemm on sm89 by @CarstyYou in #4481
  • fix: Fix moe_ep_groups/moe_cluster_groups in Mapping. by @yuxianq in #4555
  • [https://nvbugs/5297775] fix: Correct memory guard for large MOE tests to account for TP space by @djns99 in #4553
  • fix: [nvbugs/5066257] serialization improvements by @coldwaterq in #3869
  • [Fix][Qwen3] fix bug of qwen3 fp4 workflow with EP by @byshiue in #4575
  • [doc]: add mtp tech blog by @lfr-0531 in #4580
  • chore: fix bug of llama lora test by @byshiue in #4566
  • perf: Add fused q_norm/k_norm/RoPE for Qwen3. by @bobboli in #4482
  • Waive L0 test by @yiqingy0 in #4609
  • Update the GH main page to expose tech blogs by @juney-nvidia in #4610
  • Qwen3 supports TRTLLM FP4 MoE backend by @rosenrodt in #4530
  • [TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA by @zhhuang-nv in #4535
  • [nvbugs/5301492] ci: waive test_workers_kv_cache_aware_router by @Funatiq in #4617
  • Update CODEOWNERS for PyTorch backend - runtime component by @juney-nvidia in #4620
  • [nvbug/5028235][fix] pytest bindings tokens logits comparison by @dominicshanshan in #4424
  • refactor: CreateNewDecoderRequests by @Funatiq in #4452
  • fix: rename some terms by @lowsfer in #4534
  • Fix invalid testcase name by @chzblych in #4626
  • fix: datatype check in the cache transmission by @chuangz0 in #4606
  • [Fix][Deepseek] Fix bugs in TestDeepSeekR1 by @hlu1 in #4413
  • [TRTLLM-5327] - Add scan stage by @yiqingy0 in #4602
  • [#4633][doc] Fixed typo in scaffolding README.md by @amemov in #4634
  • Update main README.md with the LLaMA4 perf news by @juney-nvidia in #4636
  • Fix snake case format by @shaharmor98 in #4559
  • fix: Update approved list to fix pipeline tests after rebasing by @yibinl-nvidia in #4640
  • Feat: add sliding-window-attention generation-phase kernels on Blackwell by @PerkzZheng in #4564
  • feat: Skip sampler for intermediate pp stages. by @yuxianq in #4514
  • Waive L0 tests by @yiqingy0 in #4645
  • Chore: refine shutdown signal of PyExecutor by @QiJune in #4614
  • chore: sort llm request state enums in chronological order by @zhengd-nv in #4607
  • [TRTLLM-4535][infra]: Add marker TIMEOUT for test level by @EmmaQiaoCh in #3905
  • fix: Handle additional model outputs based on pipeline parallel rank by @Funatiq in #4498
  • [TRTLLM-5327] - Fix guardwords scan step by @yiqingy0 in #4654
  • fix: Remove duplicate tokenization in generation server by @Shunkangz in #4492
  • [nvbugs/5274894] fix: Sort requests for functional correctness and performance (adapted from #4608) by @Funatiq in #4621
  • Chore: introduce RequestQueueItem class instead of using tuple by @QiJune in #4649
  • feat: large-scale EP(part 4: Static EP load balancer integration) by @syuoni in #4615
  • Add files into scan ignoreList by @yiqingy0 in #4663
  • [Infra] - Multi-GPU testing support with Slurm by @yuanjingx87 in #4454
  • fix disagg config params by @chuangz0 in #4646
  • [Test] - Waive RTX Pro 6000 Slurm testing by @chzblych in #4672
  • fix fmha v2 tests by @qsang-nv in #4661
  • test: rcca https://nvbugs/5223130 by @xinhe-nv in #4510
  • [NVBUG 5301980] Fix fp4 gemm padding. by @Tracin in #4662
  • [Test] - Correctly waive the Slurm test stage by @chzblych in #4677
  • Chore: only pad one dummy request for attention dp scenario by @QiJune in #4664
  • Waive L0 tests by @yiqingy0 in #4686
  • feat: better build_wheel.py venv handling by @tongyuantongyu in #4525
  • [Infra][TRTLLM-3929] Rerun failure tests by @yiqingy0 in #3264
  • [AutoDeploy] Increased Model Coverage Mass Migration Week 1 by @lucaslie in #4468
  • fix: fmha_v2 compilation by @PerkzZheng in #4659
  • test: [CI] remove closed bugs by @xinhe-nv in #4638
  • refactor: extract and reuse filter_weights. by @yuxianq in #4681
  • fix: fix dsr1 min lat cga ar rate drop (0.2) by @yunruis in #4561
  • Update the description for NGC docker images (#4671) by @MartinMarciniszyn in #4702
  • feat: Add vanilla MOE. by @yuxianq in #4682
  • Fix handle cancel request for attentionDP by @Shunkangz in #4648
  • feat: Integration of Fused QKNorm+RoPE. by @bobboli in #4611
  • [TRTLLM-1658][feat] Enable multiple responses in trtllm-serve for TRT backend by @LinPoly in #4623 (see the client sketch after this list)
  • doc: Document the docker release image on NGC by @MartinMarciniszyn in #4705
  • Fix: hang on disagg when MNNVL two-shot AllReduce is enabled by @kaiyux in #4678
  • Mass-integration 0.20 to main by @amirkl94 in #4577
  • Add missing serialization classes by @Tabrizian in #4642
  • Fix rerun step by @yiqingy0 in #4715
  • feat: forward exceptions to Python and catch OOMs by @ixlmar in #4497
  • chore [BREAKING CHANGE]: Flatten PyTorchConfig knobs into TorchLlmArgs by @Superjomn in #4603
  • chore: remove extra paths to find binaries by @achartier in #4706
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4688
  • tests: [https://nvbugspro.nvidia.com/bug/5289908] run maverick bf16 on blackwell by @crazydemo in #4722
  • chore: Clean up cpp runtime by @Funatiq in #4449
  • chore: add -f to pkill calls by @achartier in #4711
  • feat: support packed weights in vanilla moe by @yuxianq in #4719
  • chore: [nvbug_5273941] unwaive test_llm_loading_from_ckpt_for_tp2 by @hchings in #4725
  • feature: KV Cache GPUDirect Storage by @arthurrasmusson in #3209
  • [fix] add back rtx6000pro tests by @yuanjingx87 in #4679
  • chore: rename ExecutorBindingsWorker/Proxy by @Superjomn in #4716
  • Waive L0 test by @yiqingy0 in #4748
  • CI: move post-merge multi GPU test of PyTorch backend to H200 by @QiJune in #4733
  • infra: [TRTLLM-5247][TRTLLM-5248][TRTLLM-5249] Refactor docker build image groovy and support NGC images by @ZhanruiSunCh in #4294
  • test: remove perf test l40s/l20 oom test cases and unwaive tests by @ruodil in #4755
  • fix: test trtllm-bench mgmn by @Superjomn in #4613
  • [feat] add b200 support via slurm by @yuanjingx87 in #4709
  • Chore: fuse _merge_requests method into _fetch_new_requests method by @QiJune in #4689
  • [fix] Eagle-2 LLMAPI pybind argument fix. by @jhaotingc in #3967
  • [feat] Support RULER + chunked prefill in lm-eval-harness by @mikeiovine in #4592
  • refactor: unique_ptr instead of shared_ptr by @Funatiq in #4697
  • Cherry pick feat/llama4 to main by @nv-yilinf in #4739
  • [Architecture] Redesign Linear module by @hlu1 in #4721
  • [perf] Reduce the workspace size of FP4 activation scales for MoE by @jinyangyuan-nvidia in #4303
  • Added code owners for AutoDeploy by @juney-nvidia in #4769
  • chore: fix llm_root when LLM_ROOT is not set by @achartier in #4741
  • [JIRA-5226219][fix] Fix bug in KV cache manager by @thorjohnsen in #4596
  • test: skip test_llm_hf_gemma_quantization_1gpu_vswa on A100 by @xinhe-nv in #4779
  • test: Waive test_llm_loading_from_ckpt_for_tp2 by @syuoni in #4797
  • Fabric Memory for KV Cache Transfer by @chuangz0 in #4717
  • fix: random fail of cache router test by @zhengd-nv in #4597
  • feat: estimate GPU mem. usage w/ minimal KV cache by @ixlmar in #4574
  • fix: iteration logging and typing in PyExecutor by @ixlmar in #4734
  • [TRTLLM-5516] perf: replicate dummy request for cuda graph padding by @QiJune in #4729
  • [feat] support sharegpt downloading in benchmark_serving by @LinPoly in #4578
  • fix: [nvbugs/5310520] disable embed_tokens's TP when DP enabled for llama model. by @yuxianq in #4758
  • DeepSeek R1 throughput optimization tech blog for Blackwell GPUs by @litaotju in #4791
  • Expose new tech blog about DSR1 throughput optimization to the main R… by @juney-nvidia in #4803
  • [fix] Fix Llama 3.3 70b EAGLE by @mikeiovine in #4772
  • [Infra]Remove some old keyword by @EmmaQiaoCh in #4552
  • opt: improve the performance of disagg streaming generation by @Superjomn in #4214
  • fix: re-enable tp/pp for quickstart_advanced.py. by @yuxianq in #4766
  • [nvbug 5305210] Resolve nvbug 5305210 by @DomBrown in #4759
  • fix: large-scale EP - EP load balancer with MTP layer and route offset by EP rank by @syuoni in #4767
  • [TRTLLM-4987][feat] Support context logits in TRTLLMSampler by @dcampora in #4538
  • [fix] Fix SamplingParams check on n and best_of by @syuoni in #4655
  • Check test names in waive list by @EmmaQiaoCh in #4292
  • [AutoDeploy] Increased Model Coverage Mass Migration Week 2 by @lucaslie in #4817
  • CI: Performance regression tests update by @amirkl94 in #3531
  • [TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H by @tomeras91 in #4494
  • 'entered copyBlock' format string expects %s, so pass a string rather than an int by @netanel-haber in #4820
  • fix: fix accuracy and illegal memory access issues when using mtp + attention dp by @lfr-0531 in #4379
  • feat: large-scale EP(part 5: Static EP load balancer with offline statistics) by @syuoni in #4695
  • [fix] Fix llama4 min-latency mode by @nv-yilinf in #4810
  • [Infra] - Minor clean-up and test Ubuntu mirrors by @chzblych in #4829
  • fix: [https://nvbugspro.nvidia.com/bug/5273945] Unwaive tests for bug-5273945 by @lfr-0531 in #4832
  • [fix] Fix Llama4 guardwords failures by @nv-yilinf in #4844
  • [TRTLLM-5502][infra] Add github action to identify if PR is from community by @poweiw in #4824
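
Among the changes above, #4623 enables multiple responses per request in trtllm-serve for the TRT backend. The client sketch below shows one way to exercise that from the OpenAI-compatible endpoint that trtllm-serve exposes; the base URL, API key, and model name are placeholders for a locally launched server, and per-request `n` support should be confirmed against the release documentation.

```python
# Minimal client sketch, assuming trtllm-serve is already running locally
# with an OpenAI-compatible endpoint. base_url, api_key, and model are
# placeholders; requesting n > 1 exercises the multiple-response support
# added for the TRT backend in #4623.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

resp = client.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # whatever model the server was launched with
    prompt="Write a haiku about GPUs.",
    n=2,            # ask for two candidate completions in one request
    max_tokens=64,
)

for i, choice in enumerate(resp.choices):
    print(f"--- candidate {i} ---")
    print(choice.text)
```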

New Contributors

Full Changelog: v0.20.0rc3...v0.21.0rc0
