NVIDIA/TensorRT-LLM v0.21.0rc0

Highlights

  • Model Support
  • Features
    • Support for large-scale EP (#4384, #4495, #4615); see the usage sketch after the Highlights list
    • Added chunked attention kernels (#4291, #4394)
    • ScaffoldingLLM now supports MCP (Model Context Protocol) (#4410)
    • Integrated NIXL into the communication layer of the disaggregated service (#3934, #4125)
    • Integrated Hopper chunked attention kernels (#4330)
    • Enabled TRT backend for Python runtime in disaggregated service (#4243)
    • Added FP8 block-scale GEMM support on SM89 (#4481)
    • Added a low-latency TRTLLM FP4 MoE backend for Qwen3 (#4530)
    • Introduced sliding-window attention kernels for the generation phase on Blackwell (#4564)
    • Added vanilla MoE (#4682)
    • Fused QKNorm + RoPE integration (#4611)
    • Added Fabric Memory support for KV cache transfer (#4717)
  • API
  • Bug Fixes
    • Resolved Torch compile issue for DeepSeek V3 (#3952)
    • Fixed trtllm-llmapi-launch for single-node, single-GPU setups (#4428)
    • Removed duplicate tokenization in generation server (#4492)
    • Fixed cancel request handling for attentionDP (#4648)
    • Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
    • Fixed queued request statistics (#4806)
    • Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
    • Resolved accuracy and illegal memory access issues with MTP + attention DP (#4379)
  • Benchmark
    • Added all_reduce.py benchmark script for testing (#4537)
  • Performance
  • Infrastructure
    • Integrated NGC image into Makefile automation and documentation (#4400)
    • Built Triton for ARM architecture (#4456)
    • Added triton release container (#4455)
    • Refactored Docker build image (Groovy) and added NGC image support (#4294)
    • Upgraded Cutlass to version 4.0 (#4794)
  • Documentation
    • Updated descriptions for NGC Docker images (#4702, #4705)
  • Known Issues
    • Two important fixes are NOT included in this release but are already on the main branch:
      • Fixed a bug in setting attention_chunk_size, and enabled chunked attention in the generation phase by default (#4693)
      • Fixed an LLM API benchmark failure caused by a serialization issue (#4835)
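
The usage sketch below exercises two of the highlighted features, large-scale EP (#4384, #4495, #4615) and chunked attention (#4291, #4394), through the LLM API. It is a minimal sketch, assuming the public `tensorrt_llm.LLM` entry point; the model name is a placeholder, and the exact knob names (`moe_expert_parallel_size`, `enable_chunked_prefill`) should be verified against the v0.21 documentation before use.

```python
# Minimal sketch, assuming the public LLM API of the v0.21 release line.
# Model name and exact knob names are assumptions, not confirmed by the
# release notes above; check the v0.21 docs before relying on them.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-30B-A3B",      # placeholder MoE checkpoint
    tensor_parallel_size=4,
    moe_expert_parallel_size=4,      # exercises the large-scale EP work
    enable_chunked_prefill=True,     # pairs with the chunked attention kernels
)

outputs = llm.generate(
    ["Explain expert parallelism in one paragraph."],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```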

What's Changed

  • Refine doc by @juney-nvidia in #4420
  • Refine doc by @juney-nvidia in #4421
  • refine doc by @juney-nvidia in #4422
  • Remove vila test by @Tabrizian in #4376
  • [TRTLLM-4618][feat] Add Nemotron Super 49B FP8 test on RTX6000 Pro (SM120) by @farazkh80 in #4363
  • tests: add qa test mentioned in docs by @crazydemo in #4357
  • [Infra] - Always push the release images in the post-merge job by @chzblych in #4426
  • tests: Add test cases for rcca cases by @crazydemo in #4347
  • chore: cleanup perf_evaluator code by @Superjomn in #3833
  • feat: Add pp support for hybrid attn/mamba model by @yuxianq in #4358
  • fix: wrong argument name enable_overlap_scheduler by @kaiyux in #4433
  • Update "Roadmap" link under README.md to the issues with Roadmap label by @AdamzNV in #4425
  • fix potential issues in allreduce fusion kernel and ut by @yilin-void in #4226
  • [TRTLLM-4638] feat(scaffolding): update Reward Controller to PRM specific controller with step split by @dc3671 in #4337
  • feat: NIXL interface integration by @Shixiaowei02 in #3934
  • Downgrade the logger level for fallback tactic warning. by @hyukn in #4440
  • Test: Improve model re-use in C++ DGX tests for CI stability by @DomBrown in #4263
  • fix: temp disable the problem test by @Shixiaowei02 in #4445
  • Add llama4 disagg accuracy tests by @Tabrizian in #4336
  • [https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 by @liji-nv in #3952
  • [Docs] - Reapply #4220 by @chzblych in #4434
  • [TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) by @farazkh80 in #4335
  • [Feat] add chunked-attention kernels on Hopper (for llama4) by @PerkzZheng in #4291
  • test(perf): Add some Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (TRT flow, trtllm-bench) by @venkywonka in #4128
  • fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. by @yuxianq in #4399
  • feat: Low Precision Allreduce for PCIe based GPU by @kanghui0204 in #4344
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4429
  • [TRTLLM-4932] Add CLI accuracy tests for Llama-3.3-70B-Instruct and LLM API BF16 variant by @moraxu in #4362
  • test: update test filter in perf test yml file to select cases by gpu name and add cases for RTX 6000 pro by @ruodil in #4282
  • [AutoDeploy] HF factory improvements by @lucaslie in #4371
  • chore: bump version to 0.21.0rc0 by @ZhanruiSunCh in #4465
  • doc: [TRTLLM-325]Integrate the NGC image in Makefile automation and document by @MartinMarciniszyn in #4400
  • infra: Add qwen3 235B tests into QA by @byshiue in #4483
  • feat: large-scale EP(part 2: MoE Load Balancer - core utilities) by @dongxuy04 in #4384
  • [TRTLLM-5085][fix] Nemotron H correctness test by @tomeras91 in #4444
  • [Docs] - Add date and commit info by @chzblych in #4448
  • fix[nvbug/5286515]: trtllm-llmapi-launch on single node single gpu by @Superjomn in #4428
  • fix: replace the image links in the blog by @Shixiaowei02 in #4489
  • fix: Fix TRTLLMSampler beam width bug. by @dcampora in #4473
  • refactor: Unify request order in TRT and PyTorch workflow by @Funatiq in #4096
  • [TRTLLM-5273] feat: Use full attention mask if Llama3 is used as encoder and fix EarlyStopDecoder unsqueeze bug by @nvrohanv in #4290
  • Build Triton for arm by @Tabrizian in #4456
  • test: [CI] remove closed bugs by @xinhe-nv in #4417
  • test(perf): Add remaining Phi-4-mini-instruct perf tests by @venkywonka in #4443
  • feat: conditional disaggregation in disagg server by @zhengd-nv in #3974
  • perf: Fuse gemm setup function for SM90/SM100 MOE plugin path by @djns99 in #4146
  • fix: skip weights defined in create_weights for pp. by @yuxianq in #4447
  • Feat: add chunked-attention kernels on Blackwell by @PerkzZheng in #4394
  • fix [nvbug/5220766]: llmapi-launch: add trtllm-bench test with engine building by @Superjomn in #4091
  • [TRTLLM-5000][feat] Pytorch implementation of ngram drafter by @thorjohnsen in #3936
  • test: NIXL single process test by @Shixiaowei02 in #4486
  • Chore: waive torch compile test cases of deepseek v3 lite by @QiJune in #4508
  • Feat: add deep_gemm swapab Kernel by @ruoqianguo in #4430
  • unwaive some disagg tests by @chuangz0 in #4476
  • Clean: fmha codes by @PerkzZheng in #4496
  • tests: add llama 3.3 70b 2 nodes tests by @xinhe-nv in #4391
  • CI: waive test_fp8_block_scales_4gpus of deepseek v3 lite by @QiJune in #4520
  • test: remove enable_overlap_schedule in pytorch config and set enable_chunked prefill to be true for isl>2048 cases by @ruodil in #4285
  • docs: update the introduction for scaffolding by @WeiHaocheng in #4360
  • test: add failed case in waive list and fix some test script issue for perf test by @ruodil in #4527
  • tests: add qwen fp4 tests into QA test list & update sanity test list by @xinhe-nv in #4478
  • feat: large-scale EP(part 3: refactor - FusedMoe for redundant expert) by @dongxuy04 in #4495
  • refactor: DisaggExecutorTest by @Funatiq in #4398
  • chore: clean ucx and nixl mirror. by @nv-guomingz in #4531
  • Add pytorch backend team by @kevinch-nv in #4405
  • test(perf): Pt.2 Add Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (cpp) by @venkywonka in #4499
  • Adding two-shot allreduce kernel and mnnvl multicasting buffer by @zongfeijing in #4216
  • test: Split test_simple into mpi_utils and cache transceiver tests for DGX by @DomBrown in #4451
  • fix: TRT-LLM Gen dtype declaration by @nekorobov in #4503
  • chore: remove extra PYTHONPATH by @achartier in #4453
  • Agent interface impl for NIXL by @chuangz0 in #4125
  • chore: Partition LlmArgs into TorchLlmArgs and TrtLlmArgs by @Superjomn in #3823
  • [TRTLLM-4932] Add CLI accuracy tests for Phi-4-mini-instruct by @moraxu in #4415
  • chore: Add all_reduce.py benchmark script to test by @kaiyux in #4537
  • feat: add dataset support for benchmark_core_model with LLMAPI by @achartier in #4457
  • fix[nvbug-5228840]: Remove test cases of feature not supported anymore by @HuiGao-NV in #3972
  • feat: add health_generate route to openai serving (Cherry-pick #3856) by @kaiyux in #4349
  • Add tritonrelease container by @Tabrizian in #4455
  • cache_transceiver_config by @chuangz0 in #4556
  • test: waive hanging cases for perf test by @ruodil in #4562
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4549
  • Chore: clean up _merge_dummy_request method of PyExecutor by @QiJune in #4438
  • fix sequence data race by @chuangz0 in #4565
  • fix: Move cv2 import to load_video function by @Funatiq in #4541
  • test(perf): Add Llama-3_1-Nemotron-Ultra-253B-v1 perf tests (cpp) by @venkywonka in #4446
  • [nvbug/5285881][fix] Fix chunked prefill + overlap scheduler by @mikeiovine in #4402
  • [feat] Integrate Hopper chunked attention kernels by @mikeiovine in #4330
  • chore: clean useless flag by @nv-guomingz in #4567
  • Chore: clean up _gather_dp_requests_num method of PyExecutor by @QiJune in #4571
  • fix[nvbug-5295425]: [TRTLLM-5385] fix race condition in MoeLoadBalancer by @dongxuy04 in #4573
  • ScaffoldingLLM supports MCP by @wu1du2 in #4410
  • [feat][TRTLLM-5018] Disaggregated serving Python runtime TRT backend by @pcastonguay in #4243
  • chore: clean-up for header file. by @nv-guomingz in #4540
  • [https://nvbugspro.nvidia.com/bug/5181262] [test] Unwaive Mistral Nemo test by @syuoni in #4515
  • [feat] support fp8 blockscale gemm on sm89 by @CarstyYou in #4481
  • fix: Fix moe_ep_groups/moe_cluster_groups in Mapping. by @yuxianq in #4555
  • [https://nvbugs/5297775] fix: Correct memory guard for large MOE tests to account for TP space by @djns99 in #4553
  • fix: [nvbugs/5066257] serialization improvements by @coldwaterq in #3869
  • [Fix][Qwen3] fix bug of qwen3 fp4 workflow with EP by @byshiue in #4575
  • [doc]: add mtp tech blog by @lfr-0531 in #4580
  • chore: fix bug of llama lora test by @byshiue in #4566
  • perf: Add fused q_norm/k_norm/RoPE for Qwen3. by @bobboli in #4482
  • Waive L0 test by @yiqingy0 in #4609
  • Update the GH main page to expose tech blogs by @juney-nvidia in #4610
  • Qwen3 supports TRTLLM FP4 MoE backend by @rosenrodt in #4530
  • [TRTLLM-5070][feat] Support FP8 KV Cache Reuse for MLA by @zhhuang-nv in #4535
  • [nvbugs/5301492] ci: waive test_workers_kv_cache_aware_router by @Funatiq in #4617
  • Update CODEOWNERS for PyTorch backend - runtime component by @juney-nvidia in #4620
  • [nvbug/5028235][fix] pytest bindings tokens logits comparison by @dominicshanshan in #4424
  • refactor: CreateNewDecoderRequests by @Funatiq in #4452
  • fix: rename some terms by @lowsfer in #4534
  • Fix invalid testcase name by @chzblych in #4626
  • fix: datatype check in the cache transmission by @chuangz0 in #4606
  • [Fix][Deepseek] Fix bugs in TestDeepSeekR1 by @hlu1 in #4413
  • [TRTLLM-5327] - Add scan stage by @yiqingy0 in #4602
  • [#4633][doc] Fixed typo in scaffolding README.md by @amemov in #4634
  • Update main README.md with the LLaMA4 perf news by @juney-nvidia in #4636
  • Fix snake case format by @shaharmor98 in #4559
  • fix: Update approved list to fix pipeline tests after rebasing by @yibinl-nvidia in #4640
  • Feat: add sliding-window-attention generation-phase kernels on Blackwell by @PerkzZheng in #4564
  • feat: Skip sampler for intermediate pp stages. by @yuxianq in #4514
  • Waive L0 tests by @yiqingy0 in #4645
  • Chore: refine shutdown signal of PyExecutor by @QiJune in #4614
  • chore: sort llm request state enums in chronological order by @zhengd-nv in #4607
  • [TRTLLM-4535][infra]: Add marker TIMEOUT for test level by @EmmaQiaoCh in #3905
  • fix: Handle additional model outputs based on pipeline parallel rank by @Funatiq in #4498
  • [TRTLLM-5327] - Fix guardwords scan step by @yiqingy0 in #4654
  • fix: Remove duplicate tokenization in generation server by @Shunkangz in #4492
  • [nvbugs/5274894] fix: Sort requests for functional correctness and performance (adapted from #4608) by @Funatiq in #4621
  • Chore: introduce RequestQueueItem class instead of using tuple by @QiJune in #4649
  • feat: large-scale EP(part 4: Static EP load balancer integration) by @syuoni in #4615
  • Add files into scan ignoreList by @yiqingy0 in #4663
  • [Infra] - Multi-GPU testing support with Slurm by @yuanjingx87 in #4454
  • fix disagg config params by @chuangz0 in #4646
  • [Test] - Waive RTX Pro 6000 Slurm testing by @chzblych in #4672
  • fix fmha v2 tests by @qsang-nv in #4661
  • test: rcca https://nvbugs/5223130 by @xinhe-nv in #4510
  • [NVBUG 5301980] Fix fp4 gemm padding. by @Tracin in #4662
  • [Test] - Correctly waive the Slurm test stage by @chzblych in #4677
  • Chore: only pad one dummy request for attention dp scenario by @QiJune in #4664
  • Waive L0 tests by @yiqingy0 in #4686
  • feat: better build_wheel.py venv handling by @tongyuantongyu in #4525
  • [Infra][TRTLLM-3929] Rerun failure tests by @yiqingy0 in #3264
  • [AutoDeploy] Increased Model Coverage Mass Migration Week 1 by @lucaslie in #4468
  • fix: fmha_v2 compilation by @PerkzZheng in #4659
  • test: [CI] remove closed bugs by @xinhe-nv in #4638
  • refactor: extract and reuse filter_weights. by @yuxianq in #4681
  • fix: fix dsr1 min lat cga ar rate drop (0.2) by @yunruis in #4561
  • Update the description for NGC docker images (#4671) by @MartinMarciniszyn in #4702
  • feat: Add vanilla MOE. by @yuxianq in #4682
  • Fix handle cancel request for attentionDP by @Shunkangz in #4648
  • feat: Integration of Fused QKNorm+RoPE. by @bobboli in #4611
  • [TRTLLM-1658][feat] Enable multiple responses in trtllm-serve for TRT backend by @LinPoly in #4623 (see the client sketch after this list)
  • doc: Document the docker release image on NGC by @MartinMarciniszyn in #4705
  • Fix: hang on disagg when MNNVL two-shot AllReduce is enabled by @kaiyux in #4678
  • Mass-integration 0.20 to main by @amirkl94 in #4577
  • Add missing serialization classes by @Tabrizian in #4642
  • Fix rerun step by @yiqingy0 in #4715
  • feat: forward exceptions to Python and catch OOMs by @ixlmar in #4497
  • chore [BREAKING CHANGE]: Flatten PyTorchConfig knobs into TorchLlmArgs by @Superjomn in #4603
  • chore: remove extra paths to find binaries by @achartier in #4706
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4688
  • tests: [https://nvbugspro.nvidia.com/bug/5289908] run maverick bf16 on blackwell by @crazydemo in #4722
  • chore: Clean up cpp runtime by @Funatiq in #4449
  • chore: add -f to pkill calls by @achartier in #4711
  • feat: support packed weights in vanilla moe by @yuxianq in #4719
  • chore: [nvbug_5273941] unwaive test_llm_loading_from_ckpt_for_tp2 by @hchings in #4725
  • feature: KV Cache GPUDirect Storage by @arthurrasmusson in #3209
  • [fix] add back rtx6000pro tests by @yuanjingx87 in #4679
  • chore: rename ExecutorBindingsWorker/Proxy by @Superjomn in #4716
  • Waive L0 test by @yiqingy0 in #4748
  • CI: move post-merge multi GPU test of PyTorch backend to H200 by @QiJune in #4733
  • infra: [TRTLLM-5247][TRTLLM-5248][TRTLLM-5249] Refactor docker build image groovy and support NGC images by @ZhanruiSunCh in #4294
  • test: remove perf test l40s/l20 oom test cases and unwaive tests by @ruodil in #4755
  • fix: test trtllm-bench mgmn by @Superjomn in #4613
  • [feat] add b200 support via slurm by @yuanjingx87 in #4709
  • Chore: fuse _merge_requests method into _fetch_new_requests method by @QiJune in #4689
  • [fix] Eagle-2 LLMAPI pybind argument fix. by @jhaotingc in #3967
  • [feat] Support RULER + chunked prefill in lm-eval-harness by @mikeiovine in #4592
  • refactor: unique_ptr instead of shared_ptr by @Funatiq in #4697
  • Cherry pick feat/llama4 to main by @nv-yilinf in #4739
  • [Architecture] Redesign Linear module by @hlu1 in #4721
  • [perf] Reduce the workspace size of FP4 activation scales for MoE by @jinyangyuan-nvidia in #4303
  • Added code owners for AutoDeploy by @juney-nvidia in #4769
  • chore: fix llm_root when LLM_ROOT is not set by @achartier in #4741
  • [JIRA-5226219][fix] Fix bug in KV cache manager by @thorjohnsen in #4596
  • test: skip test_llm_hf_gemma_quantization_1gpu_vswa on A100 by @xinhe-nv in #4779
  • test: Waive test_llm_loading_from_ckpt_for_tp2 by @syuoni in #4797
  • Fabric Memory for KV Cache Transfer by @chuangz0 in #4717
  • fix: random fail of cache router test by @zhengd-nv in #4597
  • feat: estimate GPU mem. usage w/ minimal KV cache by @ixlmar in #4574
  • fix: iteration logging and typing in PyExecutor by @ixlmar in #4734
  • [TRTLLM-5516] perf: replicate dummy request for cuda graph padding by @QiJune in #4729
  • [feat] support sharegpt downloading in benchmark_serving by @LinPoly in #4578
  • fix: [nvbugs/5310520] disable embed_tokens's TP when DP enabled for llama model. by @yuxianq in #4758
  • DeepSeek R1 throughput optimization tech blog for Blackwell GPUs by @litaotju in #4791
  • Expose new tech blog about DSR1 throughput optimization to the main R… by @juney-nvidia in #4803
  • [fix] Fix Llama 3.3 70b EAGLE by @mikeiovine in #4772
  • [Infra]Remove some old keyword by @EmmaQiaoCh in #4552
  • opt: improve the performance of disagg streaming generation by @Superjomn in #4214
  • fix: re-enable tp/pp for quickstart_advanced.py. by @yuxianq in #4766
  • [nvbug 5305210] Resolve nvbug 5305210 by @DomBrown in #4759
  • fix: large-scale EP - EP load balancer with MTP layer and route offset by EP rank by @syuoni in #4767
  • [TRTLLM-4987][feat] Support context logits in TRTLLMSampler by @dcampora in #4538
  • [fix] Fix SamplingParams check on n and best_of by @syuoni in #4655
  • Check test names in waive list by @EmmaQiaoCh in #4292
  • [AutoDeploy] Increased Model Coverage Mass Migration Week 2 by @lucaslie in #4817
  • CI: Performance regression tests update by @amirkl94 in #3531
  • [TRTLLM-4783][feat] Mamba2 kernel updates for Nemotron-H by @tomeras91 in #4494
  • 'entered copyBlock' format string expects %s, so pass a string rather than an int by @netanel-haber in #4820
  • fix: fix accuracy and illegal memory access issues when using mtp + attention dp by @lfr-0531 in #4379
  • feat: large-scale EP(part 5: Static EP load balancer with offline statistics) by @syuoni in #4695
  • [fix] Fix llama4 min-latency mode by @nv-yilinf in #4810
  • [Infra] - Minor clean-up and test Ubuntu mirrors by @chzblych in #4829
  • fix: [https://nvbugspro.nvidia.com/bug/5273945] Unwaive tests for bug-5273945 by @lfr-0531 in #4832
  • [fix] Fix Llama4 guardwords failures by @nv-yilinf in #4844
  • [TRTLLM-5502][infra] Add github action to identify if PR is from community by @poweiw in #4824
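
Among the changes above, #4623 enables multiple responses per request in trtllm-serve for the TRT backend. The client sketch below shows one way to exercise that from the OpenAI-compatible endpoint that trtllm-serve exposes; the base URL, API key, and model name are placeholders for a locally launched server, and per-request `n` support should be confirmed against the release documentation.

```python
# Minimal client sketch, assuming trtllm-serve is already running locally
# with an OpenAI-compatible endpoint. base_url, api_key, and model are
# placeholders; requesting n > 1 exercises the multiple-response support
# added for the TRT backend in #4623.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

resp = client.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # whatever model the server was launched with
    prompt="Write a haiku about GPUs.",
    n=2,            # ask for two candidate completions in one request
    max_tokens=64,
)

for i, choice in enumerate(resp.choices):
    print(f"--- candidate {i} ---")
    print(choice.text)
```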

New Contributors

Full Changelog: v0.20.0rc3...v0.21.0rc0
