Announcement Highlights:
- Model Support
- Features
  - Add EAGLE3 support for Qwen3 (#5206)
  - Add piecewise CUDA graph support for MLA (#4467)
  - Integrate TRT-LLM Gen FP8 block scale MoE with the PyTorch workflow kernel autotuner (#5207)
  - Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224)
  - Add no_kv_cache_reuse option and streaming support for trtllm serve bench (#4971)
  - Add LLGuidance support for the PyTorch backend (#5214)
  - Fuse finalize and allreduce for the Qwen MoE model (#5223)
  - Support stream_interval (#5284) (see the sketch after this list)
- API
- Bug Fixes
- Benchmark
- Performance
- Infrastructure
- Documentation
- Known Issues
  - Multi-GPU model support on RTX Pro 6000
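A minimal sketch of the new `stream_interval` option (#5284) via the LLM API. This assumes `stream_interval` is accepted as an `LLM` constructor keyword and uses a hypothetical model name; with `stream_interval=N`, streamed responses are emitted every N decoding iterations instead of every token, trading response granularity for lower host overhead:

```python
import asyncio

from tensorrt_llm import LLM, SamplingParams

# Assumed keyword: emit a streamed response every 4 decoding iterations
# rather than one response per generated token.
llm = LLM(model="Qwen/Qwen3-8B", stream_interval=4)  # model name is hypothetical

async def main():
    params = SamplingParams(max_tokens=64)
    async for output in llm.generate_async(
        "Hello, my name is", sampling_params=params, streaming=True
    ):
        # Each loop iteration now carries up to 4 new tokens of cumulative text.
        print(output.outputs[0].text)

asyncio.run(main())
```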
What's Changed
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5221
- [test] split nemotron test cases from examples_test_list by @crazydemo in #5238
- Update DeepSeek R1 perf numbers to latest release/0.20 results by @litaotju in #5235
- [feat] Add llm args to tune python gc threshold by @nv-yilinf in #5141
- [TRTLLM-5835][feat] Optimized Mamba2Mixer prefill by @tomeras91 in #5128
- [TRTLLM-3456] Speculation: Draft Target in new FW by @IzzyPutterman in #4558
- chore: Waive CI failure. by @SimengLiu-nv in #5252
- [infra] Make test_chunked_prefill faster by @mikeiovine in #5248
- Update internal cutlass commit. by @Tracin in #5228
- test: add more pytorch cases in perf test by @ruodil in #5237
- Fix: https://nvbugs/5345720 by @QiJune in #5259
- test: [CI] remove closed bugs by @xinhe-nv in #5218
- [TRTLLM-5330] perf: Optimize MoE supplementary kernels for large-scale EP by @syuoni in #5215
- fix mla test by @qsang-nv in #5240
- doc: add document of benchmarking for Qwen3 by @byshiue in #5158
- update setup.py for special cases by @qsang-nv in #5227
- move some test cases of TensorRT backend back by @QiJune in #5232
- [feat] Add EAGLE3 support for Qwen3 by @nv-yilinf in #5206
- [TRTLLM-5786][https://nvbugspro.nvidia.com/bug/5310520][test] Add QA test cases by @crazydemo in #5073
- CI: move multi-gpu test cases of tensorrt backend to h200 by @QiJune in #5272
- refactor: Unify decoder test with e2e workflow by @Funatiq in #5239
- [feat] Piecewise CUDA graph support for MLA by @liji-nv in #4467
- chore: Mass integration of release/0.20 by @amirkl94 in #5082
- [TRTLLM-5770] feat: Integrate TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner by @DomBrown in #5207
- Some clean-ups for the automation pipeline by @chzblych in #5245
- Re-implement LlmResponse in Python to reduce host overhead of pybind by @QiJune in #5224
- delete cubins by @qsang-nv in #5274
- [infra][TRTLLM-5635] Remove package stage in CI build by @niukuo in #5075
- [Infra] - Update dependencies with NGC PyTorch 25.05 and TRT 10.11 by @EmmaQiaoCh in #4885
- [chore] Remove BaseDraftTokenManager by @mikeiovine in #5251
- [infra] Report CI authorization errors to PR by @tburt-nv in #5175
- Revert "[infra] Report CI authorization errors to PR" by @tburt-nv in #5298
- refactor: Update decoder buffer and logits management by @Funatiq in #4450
- fix: only set _mpi_session if world_size is > 1 by @achartier in #5253
- update LlmRequest.is_dummy property by @QiJune in #5283
- test: update qa test list by @crazydemo in #5305
- CI: extend model weights load time for dsv3 in stress test. by @dominicshanshan in #5275
- [fix][test] move deepseek single gpu tests to post merge by @omera-nv in #5280
- Waive L0 tests by @yiqingy0 in #5308
- feat: Add non-streaming support for trtllm serve bench script and fixed prompt/output token lengths by @yizhang-nv in #4971
- chore: partition LLM class into TorchLLM and TrtLLM by @Superjomn in #4900
- [feat]: improve performance of XQA-MLA for sm120 by @lowsfer in #5087
- doc: update contributing md for internal developers by @nv-guomingz in #5250
- test: cherry-pick deepseek rcca cases in main branch by @ruodil in #5307
- [TRTLLM-5589] feat: Minor optimizations for tunable FP8 batched GEMM op. by @hyukn in #5139
- CI: fix TensorRT H200 tests by @QiJune in #5301
- [TRTLLM-5758] test: Add Bielik-11B-v2.2 Model Support by @Wanli-Jiang in #5159
- chore: Refine printed info of CHECK_TYPE. by @bobboli in #5295
- refactor: Introduce ResourceManagerType enum for resource management by @Funatiq in #5246
- chore: bump version to 0.21.0rc3 by @ZhanruiSunCh in #5309
- test: correct unittest rerun behavior by @tongyuantongyu in #5273
- Fix rerun step by @yiqingy0 in #5319
- Waive L0 by @yizhang-nv in #5311
- tests: add multi nodes tests by @xinhe-nv in #5196
- feat: Add LLGuidance Support for PyTorch Backend by @jellysnack in #5214
- [Infra] Update 5080 and 5090 case condition since we will upgrade the driver by @EmmaQiaoCh in #5317
- chore: Update README.md to expose meet-up info by @juney-nvidia in #5329
- Remove duplicated test cases by @HuiGao-NV in #5323
- Add disagg slurm scripts by @qiaoxj07 in #5243
- Unwaive disaggregated serving accuracy tests by @Tabrizian in #5095
- [feat] Multi-node CI testing support via Slurm by @yuanjingx87 in #4771
- [fix][test] remove some cpp test cases from h100 by @omera-nv in #5335
- [fix][test] remove duplicate test runs by @omera-nv in #5241
- chore: skip test_llm_gpt2_medium_fp8 for fp8_pc_pt + quant_lm_head by @achartier in #5293
- [fix][test] clear cuda cache before unittests automatically by @omera-nv in #5121
- fix[nvbug5298640]: trtllm-llmapi-launch multiple LLM instances by @Superjomn in #4727
- ci: Split long running jobs into multiple jobs by @Funatiq in #5268
- [feat] Fuse finalize and allreduce for the Qwen MoE model by @zongfeijing in #5223
- chore: remove torch_compile prefix for TorchCompileConfig field members by @nv-guomingz in #5261
- [test] add nvfp4 DeepSeek-V3-Lite-mtp tests by @lfr-0531 in #5125
- Waive L0 test by @yiqingy0 in #5349
- chore: bump version to 1.0.0rc0 by @yiqingy0 in #5326
- tests: add ds r1 tp4 test by @xinhe-nv in #5197
- chore: enable moe_backend on Qwen3 test by @byshiue in #5230
- Fix CI build time increase by @yunruis in #5337
- Refactor test timeout for individual long cases by @EmmaQiaoCh in #4757
- [TRTLLM-5825][fix] Fix torch LoRA TP by @amitz-nv in #5338
- test: add qwen3 cases by @ruodil in #5302
- test: amend test case name in perf cluster test by @ruodil in #5356
- Refactor CutlassFusedMoE by @hlu1 in #5344
- [Infra] Fix l0_sanity_check.yml which also has gb202 and gb203 by @EmmaQiaoCh in #5360
- fix: Fix DS-R1 nvfp4 test case naming by @syuoni in #5361
- [WAR][nvbug/5321947] Add an async sleep to unblock event loop. by @FrankD412 in #5342
- blog: Disaggregated Serving in TensorRT-LLM by @Shixiaowei02 in #5353
- Fix: fix the determinism issue in the MTP Eagle path by @lfr-0531 in #5285
- doc: follow-up modifications to blog 5 by @Shixiaowei02 in #5366
- feat: Support stream_interval by @kaiyux in #5284
- Fix: missing clientId when serialize and deserialize response by @kaiyux in #5231
- [TRTLLM-5208][BREAKING CHANGE] chore: make PyTorch LLM the default by @Superjomn in #5312 (see the backend sketch after this list)
- Add Wechat_Group_QR_Code.png to docs/source/media and main page of TR… by @AdamzNV in #5142
- fix: refactor and fix mtp vanilla by @lfr-0531 in #4762
- feat: Misc Opt for large scale EP by @dongxuy04 in #5374
- refactor: remove TrtGptModelOptionalParams by @Funatiq in #5165
- [doc] update mtp documents by @lfr-0531 in #5387
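Given the breaking change in #5312 (the PyTorch LLM becomes the default) together with the TorchLLM/TrtLLM partition in #4900, here is a hedged sketch of what backend selection may look like after this release; the `_tensorrt_engine` import path is an assumption inferred from the class split, not confirmed by these notes:

```python
# After #5312, the top-level LLM is assumed to be the PyTorch-backend class.
from tensorrt_llm import LLM

llm = LLM(model="Qwen/Qwen3-8B")  # hypothetical model; runs on the PyTorch backend

# The TensorRT-engine flavor is assumed to remain importable separately, e.g.:
# from tensorrt_llm._tensorrt_engine import LLM as TrtLLM
```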
New Contributors
- @jellysnack made their first contribution in #5214
Full Changelog: v0.21.0rc2...v1.0.0rc0