NVIDIA/TensorRT-LLM v1.0.0rc1

Pre-release

Announcement Highlights

  • Model Support
  • Features
    • Add support for YARN in NemotronNAS models (#4906)
    • Add support for per expert activation scaling factors (#5013)
    • Add ReDrafter support for Qwen (#4875)
    • Add NGrams V2 support (#4569)
    • Use inference mode in update_requests to improve perf of TRTLLM Sampler (#5538)
    • Expose bias and FP8_MXFP4 MoE CUTLASS backend features to PyTorch (#5410)
    • Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
    • Large-scale EP (part 8: Online EP load balancer integration for PCIe fp8) (#5226)
    • Prevent serialization of entire LoRA adapters in each request (#5080)
    • Remove cutlass min latency code from AutoTuner. (#5394)
    • Open-source the MoE MXFP8-MXFP4 implementation (#5222)
    • Add chunked prefill support for MLA (Blackwell) (#4651)
    • Support disaggregated serving in TRTLLM Sampler (#5328)
    • Support multiCtasKvMode for high-throughput MLA kernels (#5426)
    • Add MTP support for Online EPLB (#5213)
    • Add debug hook to support dump tensor data and add new debug functions easily (#5182)
  • API
    • Add request_perf_metrics to LLMAPI (#5497) (see the first sketch after this list)
    • Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384) (see the second sketch after this list)
  • Bug Fixes
    • Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
    • Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
    • Fix the issue where the MoE autotune fallback failed to query the default heuristic (#5520)
    • Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. (#5485)
    • Fix the unexpected keyword argument 'streaming' (#5436)
  • Benchmark
    • Update trtllm-bench to support the new PyTorch default (#5491)
    • Add support for TRTLLM CustomDataset (#5511)
    • Make benchmark_serving part of the library (#5428)
  • Performance
    • Improve XQA-MLA perf (#5468)
    • Optimize swizzle_sf, unswizzle_sf, reswizzle_sf (#5318)
  • Infrastructure
    • Allow configuring linking of NVRTC wrapper (#5189)
    • Add timeout setting for long tests found in post-merge (#5501)
  • Documentation
    • Fix benchmark cmd in disagg scripts (#5515)
  • Known Issues
    • multi-GPU model support on RTX Pro 6000
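The request_perf_metrics addition (#5497) is consumed through the Python LLM API. The snippet below is a minimal sketch, not the confirmed interface: the option and attribute names (return_perf_metrics, request_perf_metrics) are assumptions inferred from the feature name, and the checkpoint name is illustrative; consult the LLM API reference shipped with this release for the exact spelling.

```python
from tensorrt_llm import LLM, SamplingParams

# Load a model through the LLM API (checkpoint name is illustrative).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Assumption: per-request metrics are opted into via a sampling option
# and surfaced on the returned request output as request_perf_metrics.
sampling = SamplingParams(max_tokens=32, return_perf_metrics=True)

for output in llm.generate(["Hello, my name is"], sampling):
    print(output.outputs[0].text)
    # getattr keeps the sketch safe if the attribute lives elsewhere.
    print(getattr(output, "request_perf_metrics", None))
```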

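The KvCacheConfig consolidation (#5384) is a C++-side refactor onto executor::KvCacheConfig; for Python users, the corresponding user-facing knob is tensorrt_llm.llmapi.KvCacheConfig. Below is a minimal sketch of wiring it into the LLM API; the field values are illustrative and the checkpoint name is a placeholder.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Cap the KV cache at 90% of free GPU memory and keep block reuse enabled.
kv_cache_config = KvCacheConfig(
    free_gpu_memory_fraction=0.9,
    enable_block_reuse=True,
)

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative checkpoint
    kv_cache_config=kv_cache_config,
)

for output in llm.generate(["The capital of France is"]):
    print(output.outputs[0].text)
```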
What's Changed

  • feature: make trtllmsampler new_tokens format the universal format by @netanel-haber in #4401
  • [fix] Add 1 and draft_token_num to seq_len when overlap scheduling is enabled during memory estimation by @HuiGao-NV in #5343
  • test: [CI] remove closed bugs by @xinhe-nv in #5400
  • refactor: manage cache indirection in decoder state by @Funatiq in #5315
  • tests: update benchmark test lists by @xinhe-nv in #5365
  • chore: delete mamba hybrid, since it is now called NemotronH by @vegaluisjose in #5409
  • [Infra] - Waive failed tests in post-merge and increase some timeout setting by @EmmaQiaoCh in #5424
  • Add debug hook to support dump tensor data and add new debug functions easily by @HuiGao-NV in #5182
  • Chore: remove unused variables by @QiJune in #5314
  • Fix test Pytorch model engine by @Tabrizian in #5416
  • Add MTP support for Online EPLB by @dongxuy04 in #5213
  • waive test_moe.py::test_moe_fp8[autotune] by @QiJune in #5455
  • fix: fix bug of qwen3 + eagle3 + finalize_moe_fusion by @byshiue in #5369
  • [AutoDeploy] Merge feat/ad_2025_06_13 feature branch by @lucaslie in #5454
  • feat: Dynamically remove servers in PD by @Shunkangz in #5270
  • tests: Set kv cache free memory fraction in test case by @HuiGao-NV in #5433
  • fix (NvBug 5354925): Fix static EPLB by @syuoni in #5411
  • test: Add LLGuidance test and refine guided decoding by @syuoni in #5348
  • CI: update multi gpu test triggering file list by @QiJune in #5466
  • start OAIServer with max_beam_width=1 for TorchSampler by @netanel-haber in #5427
  • chore: bump version to 1.0.0rc1 by @yiqingy0 in #5460
  • [https://jirasw.nvidia.com/browse/TRTLLM-4645] support mutliCtasKvMode for high-throughput MLA kernels by @PerkzZheng in #5426
  • CI: waive test_ad_build_small_multi by @QiJune in #5471
  • feat: Remove not used padding_idx in models by @HuiGao-NV in #5385
  • [nvbug/5354956] fix: unexpected keyword argument 'streaming' by @kaiyux in #5436
  • Move 3 disaggregated cases from 4 GPUs devices to 1 GPU device by @HuiGao-NV in #5457
  • Fix: fix nvbug 5356427 by @HuiGao-NV in #5464
  • feat: Make benchmark_serving part of the library by @kaiyux in #5428
  • [TRTLLM-5974][feat] Support disaggregated serving in TRTLLM Sampler by @dcampora in #5328
  • [chore] Disable block reuse when draft model speculation is being used by @mikeiovine in #5448
  • chore: split _build_model method for TorchLlm and TrtLlm by @QiJune in #5418
  • [fix][test] remove test in global scope by @omera-nv in #5470
  • [fix][ci] dont build wheel for cpp tests by @omera-nv in #5443
  • CI: reduce BF16 test cases in B200 by @QiJune in #5482
  • Add sleep function for disagg gen-only benchmarking by @qiaoxj07 in #5398
  • CI: enable test cases on single device type by @HuiGao-NV in #5484
  • [5356427] fix: Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. by @hyukn in #5485
  • feat: chunked prefill for MLA (Blackwell) by @jmydurant in #4651
  • Add unit test for routing kernels by @ChristinaZ in #5405
  • [CI] Waive test_fp8_block_scales_4gpus[ep4-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] by @venkywonka in #5494
  • [Infra] - Add timeout setting for long tests found in post-merge by @EmmaQiaoCh in #5501
  • Revert "feature: unify new_tokens format sample state to trtllm samper new_tokens format (#4401)" by @netanel-haber in #5474
  • keep sm90 headsize 128 cubins by @qsang-nv in #5320
  • opensource: Opensource MOE MXFP8-MXFP4 implementation by @djns99 in #5222
  • [TRTLLM-6019] feat: Remove cutlass min latency code from AutoTuner. by @hyukn in #5394
  • [TRTLLM-5921][feat] Prevent serialization of entire LoRA adapters in each request by @amitz-nv in #5080
  • feat: large-scale EP(part 8: Online EP load balancer integration for PCIe fp8) by @dongxuy04 in #5226
  • [chore] Allow configuring linking of NVRTC wrapper by @AlessioNetti in #5189
  • perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf by @bobboli in #5318
  • [fix][ci] trigger multigpu tests for deepseek changes by @omera-nv in #5423
  • tests: waive tests by @xinhe-nv in #5458
  • doc: Fix benchmark cmd in disagg scripts by @kaiyux in #5515
  • [perf] improve XQA-MLA perf by @lowsfer in #5468
  • feat: Add support for TRTLLM CustomDataset by @kaiyux in #5511
  • [feat] Add progress bar to benchmark by @arekay in #5173
  • Add trtllm-bench reviewers. by @FrankD412 in #5452
  • [CI] move flashinfer llama tests to post merge by @omera-nv in #5506
  • [fix][ci] move torch tests to run under torch stage by @omera-nv in #5473
  • refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead by @Funatiq in #5384
  • [TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) by @jmydurant in #5475
  • fix: MoE autotune fallback failed to query default heuristic by @rosenrodt in #5520
  • Update allow list 2025_06_26 by @yuanjingx87 in #5526
  • fix: Mapping rank boundary check bug by @venkywonka in #4935
  • Update trtllm-bench to support new Pytorch default. by @FrankD412 in #5491
  • [TRTLLM-4971]: Use safe deserialization in ParallelConfig by @yibinl-nvidia in #4630
  • tests: waive failed tests on main by @xinhe-nv in #5512
  • fix: Fix block scale fp8 support for deepseek v3 on Blackwell. by @yuxianq in #5514
  • Add testing for trtllm-llmapi-launch with tritonserver by @Tabrizian in #5528
  • Fix execute_process: check results using EQUAL by @yuantailing in #5481
  • feat: Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch by @djns99 in #5410
  • [Infra] - Waive failed case in post-merge by @EmmaQiaoCh in #5536
  • feat: Use inference mode in update_requests to improve perf of TRTLLM Sampler by @dcampora in #5538
  • ci: waive flaky test test_llama_eagle3 by @syuoni in #5548
  • fix: [https://nvbugspro.nvidia.com/bug/5349343] Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) by @ChristinaZ in #5519
  • [fix][ci] correct unittests test prefix by @omera-nv in #5547
  • Fix : fix build for sm120 by @peaceh-nv in #5265
  • [TRTLLM-5000][feat] NGrams V2 by @wili-65535 in #4569
  • [TRTLLM-6104] feat: add request_perf_metrics to LLMAPI by @achartier in #5497
  • refactor: Speculative decoding buffers part 2 by @Funatiq in #5316
  • ReDrafter support for Qwen by @darraghdog in #4875
  • [nvbugs/5309940] Add support for input output token counts by @Tabrizian in #5445
  • feat: Add support for per expert activation scaling factors by @djns99 in #5013
  • Make moe permute and final as custom op by @limin2021 in #5412
  • [AutoDeploy] merge feat/ad-2025-06-24 by @lucaslie in #5556
  • [Infra] - Add import pytest by @EmmaQiaoCh in #5565
  • tests: Move stress tests to be Post-Merge only by @amirkl94 in #5166
  • feat: Add support for YARN in NemotronNAS models by @amirkl94 in #4906

New Contributors

Full Changelog: v1.0.0rc0...v1.0.0rc1
