Announcement Highlights:
- Model Support
- Features
  - Add EAGLE3 support for Qwen3 (#5206)
  - Add piecewise CUDA graph support for MLA (#4467)
  - Integrate TRT-LLM Gen FP8 block scale MoE with the PyTorch workflow kernel autotuner (#5207)
  - Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224)
  - Add no_kv_cache_reuse option and streaming support for trtllm serve bench (#4971)
  - Add LLGuidance support for the PyTorch backend (#5214)
  - Fuse finalize and allreduce for the Qwen MoE model (#5223)
  - Support stream_interval (#5284) (see the sketch after this list)
- API
- Bug Fixes
- Benchmark
- Performance
- Infrastructure
- Documentation
- Known Issues
  - Multi-GPU model support on RTX Pro 6000
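A minimal sketch of the new `stream_interval` option (#5284) via the LLM API. This assumes `stream_interval` is accepted as an `LLM` constructor keyword and uses a hypothetical model name; with `stream_interval=N`, streamed responses are emitted every N decoding iterations instead of every token, trading response granularity for lower host overhead:

```python
import asyncio

from tensorrt_llm import LLM, SamplingParams

# Assumed keyword: emit a streamed response every 4 decoding iterations
# rather than one response per generated token.
llm = LLM(model="Qwen/Qwen3-8B", stream_interval=4)  # model name is hypothetical

async def main():
    params = SamplingParams(max_tokens=64)
    async for output in llm.generate_async(
        "Hello, my name is", sampling_params=params, streaming=True
    ):
        # Each loop iteration now carries up to 4 new tokens of cumulative text.
        print(output.outputs[0].text)

asyncio.run(main())
```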
What's Changed
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5221
- [test] split nemotron test cases from examples_test_list by @crazydemo in #5238
- Update DeepSeek R1 perf numbers to latest release/0.20 results by @litaotju in #5235
- [feat] Add llm args to tune python gc threshold by @nv-yilinf in #5141
- [TRTLLM-5835][feat] Optimized Mamba2Mixer prefill by @tomeras91 in #5128
- [TRTLLM-3456] Speculation: Draft Target in new FW by @IzzyPutterman in #4558
- chore: Waive CI failure. by @SimengLiu-nv in #5252
- [infra] Make test_chunked_prefill faster by @mikeiovine in #5248
- Update internal cutlass commit. by @Tracin in #5228
- test: add more pytorch cases in perf test by @ruodil in #5237
- Fix: https://nvbugs/5345720 by @QiJune in #5259
- test: [CI] remove closed bugs by @xinhe-nv in #5218
- [TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP by @syuoni in #5215
- fix mla test by @qsang-nv in #5240
- doc: add document of benchmarking for Qwen3 by @byshiue in #5158
- update setup.py for special cases by @qsang-nv in #5227
- move some test cases of TensorRT backend back by @QiJune in #5232
- [feat] Add EAGLE3 support for Qwen3 by @nv-yilinf in #5206
- [TRTLLM-5786][https://nvbugspro.nvidia.com/bug/5310520][test] Add QA test cases by @crazydemo in #5073
- CI: move multi-gpu test cases of tensorrt backend to h200 by @QiJune in #5272
- refactor: Unify decoder test with e2e workflow by @Funatiq in #5239
- [feat] Piecewise CUDA graph support for MLA by @liji-nv in #4467
- chore: Mass integration of release/0.20 by @amirkl94 in #5082
- [TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner by @DomBrown in #5207
- Some clean-ups for the automation pipeline by @chzblych in #5245
- Re-implement LlmResponse in Python to reduce host overhead of pybind by @QiJune in #5224
- delete cubins by @qsang-nv in #5274
- [infra][TRTLLM-5635] Remove package stage in CI build by @niukuo in #5075
- [Infra] - Update dependencies with NGC PyTorch 25.05 and TRT 10.11 by @EmmaQiaoCh in #4885
- [chore] Remove BaseDraftTokenManager by @mikeiovine in #5251
- [infra] Report CI authorization errors to PR by @tburt-nv in #5175
- Revert "[infra] Report CI authorization errors to PR" by @tburt-nv in #5298
- refactor: Update decoder buffer and logits management by @Funatiq in #4450
- fix: only set _mpi_session if world_size is > 1 by @achartier in #5253
- update LlmRequest.is_dummy property by @QiJune in #5283
- test: update qa test list by @crazydemo in #5305
- CI: extend model weights load time for dsv3 in stress test. by @dominicshanshan in #5275
- [fix][test] move deepseek single gpu tests to post merge by @omera-nv in #5280
- Waive L0 tests by @yiqingy0 in #5308
- feat: Add non-streaming support for trtllm serve bench script and fixed prompt/output token lengths by @yizhang-nv in #4971
- chore: partition LLM class into TorchLLM and TrtLLM by @Superjomn in #4900
- [feat]: improve performance of XQA-MLA for sm120 by @lowsfer in #5087
- doc: update contributing md for internal developers by @nv-guomingz in #5250
- test: cherry-pick deepseek rcca cases in main branch by @ruodil in #5307
- [TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. by @hyukn in #5139
- CI: fix TensorRT H200 tests by @QiJune in #5301
- [TRTLLM-5758] test: Add Bielik-11B-v2.2 Model Support by @Wanli-Jiang in #5159
- chore: Refine printed info of CHECK_TYPE. by @bobboli in #5295
- refactor: Introduce ResourceManagerType enum for resource management by @Funatiq in #5246
- chore: bump version to 0.21.0rc3 by @ZhanruiSunCh in #5309
- test: correct unittest rerun behavior by @tongyuantongyu in #5273
- Fix rerun step by @yiqingy0 in #5319
- Waive L0 by @yizhang-nv in #5311
- tests: add multi nodes tests by @xinhe-nv in #5196
- feat: Add LLGuidance Support for PyTorch Backend by @jellysnack in #5214
- [Infra] Update 5080 and 5090 case condition since we will upgrade the driver by @EmmaQiaoCh in #5317
- chore: Update README.md to expose meet-up info by @juney-nvidia in #5329
- Remove duplicated test cases by @HuiGao-NV in #5323
- Add disagg slurm scripts by @qiaoxj07 in #5243
- Unwaive disaggregated serving accuracy tests by @Tabrizian in #5095
- [feat] Multi-node CI testing support via Slurm by @yuanjingx87 in #4771
- [fix][test] remove some cpp test cases from h100 by @omera-nv in #5335
- [fix][test] remove duplicate test runs by @omera-nv in #5241
- chore: skip test_llm_gpt2_medium_fp8 for fp8_pc_pt + quant_lm_head by @achartier in #5293
- [fix][test] clear cuda cache before unittests automatically by @omera-nv in #5121
- fix[nvbug5298640]: trtllm-llmapi-launch multiple LLM instances by @Superjomn in #4727
- ci: Split long running jobs into multiple jobs by @Funatiq in #5268
- [feat] Fuse finalize and allreduce for the Qwen MoE model by @zongfeijing in #5223
- chore: remove torch_compile prefix for TorchCompileConfig field members by @nv-guomingz in #5261
- [test] add nvfp4 DeepSeek-V3-Lite-mtp tests by @lfr-0531 in #5125
- Waive L0 test by @yiqingy0 in #5349
- chore: bump version to 1.0.0rc0 by @yiqingy0 in #5326
- tests: add ds r1 tp4 test by @xinhe-nv in #5197
- chore: enable moe_backend on Qwen3 test by @byshiue in #5230
- Fix CI build time increase by @yunruis in #5337
- Refactor test timeout for individual long cases by @EmmaQiaoCh in #4757
- [TRTLLM-5825][fix] Fix torch LoRA TP by @amitz-nv in #5338
- test: add qwen3 cases by @ruodil in #5302
- test: amend test case name in perf cluster test by @ruodil in #5356
- Refactor CutlassFusedMoE by @hlu1 in #5344
- [Infra] Fix l0_sanity_check.yml which also has gb202 and gb203 by @EmmaQiaoCh in #5360
- fix: Fix DS-R1 nvfp4 test case naming by @syuoni in #5361
- [WAR][nvbug/5321947] Add an async sleep to unblock event loop. by @FrankD412 in #5342
- blog: Disaggregated Serving in TensorRT-LLM by @Shixiaowei02 in #5353
- Fix: fix the determinism issue in the MTP Eagle path by @lfr-0531 in #5285
- doc: follow-up modifications to blog 5 by @Shixiaowei02 in #5366
- feat: Support stream_interval by @kaiyux in #5284
- Fix: missing clientId when serialize and deserialize response by @kaiyux in #5231
- [TRTLLM-5208][BREAKING CHANGE] chore: make PyTorch LLM the default by @Superjomn in #5312 (see the backend sketch after this list)
- Add Wechat_Group_QR_Code.png to docs/source/media and main page of TR… by @AdamzNV in #5142
- fix: refactor and fix mtp vanilla by @lfr-0531 in #4762
- feat: Misc Opt for large scale EP by @dongxuy04 in #5374
- refactor: remove TrtGptModelOptionalParams by @Funatiq in #5165
- [doc] update mtp documents by @lfr-0531 in #5387
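Given the breaking change in #5312 (the PyTorch LLM becomes the default) together with the TorchLLM/TrtLLM partition in #4900, here is a hedged sketch of what backend selection may look like after this release; the `_tensorrt_engine` import path is an assumption inferred from the class split, not confirmed by these notes:

```python
# After #5312, the top-level LLM is assumed to be the PyTorch-backend class.
from tensorrt_llm import LLM

llm = LLM(model="Qwen/Qwen3-8B")  # hypothetical model; runs on the PyTorch backend

# The TensorRT-engine flavor is assumed to remain importable separately, e.g.:
# from tensorrt_llm._tensorrt_engine import LLM as TrtLLM
```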
New Contributors
- @jellysnack made their first contribution in #5214
Full Changelog: v0.21.0rc2...v1.0.0rc0