Announcement Highlights
- Model Support
- Features
- Add support for YARN in NemotronNAS models (#4906)
- Add support for per expert activation scaling factors (#5013)
- Add ReDrafter support for Qwen (#4875)
- Add NGrams V2 support (#4569)
- Use inference mode in update_requests to improve performance of the TRTLLM Sampler (#5538)
- Expose bias and FP8_MXFP4 MoE CUTLASS backend features to PyTorch (#5410)
- Support nvfp4 models and fp8 KV cache for MLA chunked prefill (Blackwell) (#5475)
- Add large-scale EP support (part 8: Online EP load balancer integration for PCIe fp8) (#5226)
- Prevent serialization of entire LoRA adapters in each request (#5080)
- Remove cutlass min latency code from AutoTuner (#5394)
- Open-source MoE MXFP8-MXFP4 implementation (#5222)
- Add chunked prefill support for MLA (Blackwell) (#4651)
- Support disaggregated serving in TRTLLM Sampler (#5328)
- Support multiCtasKvMode for high-throughput MLA kernels (#5426)
- Add MTP support for Online EPLB (#5213)
- Add debug hook to support dump tensor data and add new debug functions easily (#5182)
- API
- Bug Fixes
- Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
- Fix block scale fp8 support for DeepSeek V3 on Blackwell (#5514)
- Fix an issue where the MoE autotune fallback failed to query the default heuristic (#5520)
- Remove the seq_len of 4096 from FP8 block scale MoE tuning configs (#5485)
- Fix the unexpected keyword argument 'streaming' error (#5436); see the sketch after this list
- Benchmark
- Performance
- Infrastructure
- Documentation
- Fix the benchmark command in the disaggregated serving scripts (#5515)
- Known Issues
- Multi-GPU model support on RTX Pro 6000
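
Several of the sampler and serving changes above land on the standard LLM API path (for example, the `streaming` keyword fix in #5436 and the TRTLLM Sampler work in #5328 and #5538). Below is a minimal smoke-test sketch of that path, assuming the documented quickstart API; the model name and sampling values are placeholders, not taken from these notes:

```python
import asyncio

from tensorrt_llm import LLM, SamplingParams

# Placeholder model; any supported HF checkpoint or TRT-LLM engine works here.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# Synchronous path: batched generation.
for output in llm.generate(["Hello, my name is"], params):
    print(output.outputs[0].text)

# Streaming path: the code path touched by the #5436 'streaming' kwarg fix.
async def stream():
    async for partial in llm.generate_async(
        "Hello, my name is", params, streaming=True
    ):
        print(partial.outputs[0].text)

asyncio.run(stream())
```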
What's Changed
- feature: make trtllmsampler new_tokens format the universal format by @netanel-haber in #4401
- [fix] Add 1 and draft_token_num to seq_len when overlap scheduling is enabled during memory estimation by @HuiGao-NV in #5343
- test: [CI] remove closed bugs by @xinhe-nv in #5400
- refactor: manage cache indirection in decoder state by @Funatiq in #5315
- tests: update benchmark test lists by @xinhe-nv in #5365
- chore: delete mamba hybrid, since it is now called NemotronH by @vegaluisjose in #5409
- [Infra] - Waive failed tests in post-merge and increase some timeout setting by @EmmaQiaoCh in #5424
- Add debug hook to support dump tensor data and add new debug functions easily by @HuiGao-NV in #5182
- Chore: remove unused variables by @QiJune in #5314
- Fix test Pytorch model engine by @Tabrizian in #5416
- Add MTP support for Online EPLB by @dongxuy04 in #5213
- waive test_moe.py::test_moe_fp8[autotune] by @QiJune in #5455
- fix: fix bug of qwen3 + eagle3 + finalize_moe_fusion by @byshiue in #5369
- [AutoDeploy] Merge feat/ad_2025_06_13 feature branch by @lucaslie in #5454
- feat: Dynamically remove servers in PD by @Shunkangz in #5270
- tests: Set kv cache free memory fraction in test case by @HuiGao-NV in #5433
- fix (NvBug 5354925): Fix static EPLB by @syuoni in #5411
- test: Add LLGuidance test and refine guided decoding by @syuoni in #5348
- CI: update multi gpu test triggering file list by @QiJune in #5466
- start OAIServer with `max_beam_width=1` for TorchSampler by @netanel-haber in #5427
- chore: bump version to 1.0.0rc1 by @yiqingy0 in #5460
- [TRTLLM-4645] support multiCtasKvMode for high-throughput MLA kernels by @PerkzZheng in #5426
- CI: waive test_ad_build_small_multi by @QiJune in #5471
- feat: Remove not used padding_idx in models by @HuiGao-NV in #5385
- [nvbug/5354956] fix: unexpected keyword argument 'streaming' by @kaiyux in #5436
- Move 3 disaggregated cases from 4 GPUs devices to 1 GPU device by @HuiGao-NV in #5457
- Fix: fix nvbug 5356427 by @HuiGao-NV in #5464
- feat: Make benchmark_serving part of the library by @kaiyux in #5428
- [TRTLLM-5974][feat] Support disaggregated serving in TRTLLM Sampler by @dcampora in #5328
- [chore] Disable block reuse when draft model speculation is being used by @mikeiovine in #5448
- chore: split _build_model method for TorchLlm and TrtLlm by @QiJune in #5418
- [fix][test] remove test in global scope by @omera-nv in #5470
- [fix][ci] dont build wheel for cpp tests by @omera-nv in #5443
- CI: reduce BF16 test cases in B200 by @QiJune in #5482
- Add sleep function for disagg gen-only benchmarking by @qiaoxj07 in #5398
- CI: enable test cases on single device type by @HuiGao-NV in #5484
- [5356427] fix: Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. by @hyukn in #5485
- feat: chunked prefill for MLA (Blackwell) by @jmydurant in #4651
- Add unit test for routing kernels by @ChristinaZ in #5405
- [CI] Waive `test_fp8_block_scales_4gpus[ep4-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False]` by @venkywonka in #5494
- [Infra] - Add timeout setting for long tests found in post-merge by @EmmaQiaoCh in #5501
- Revert "feature: unify new_tokens format sample state to trtllm sampler new_tokens format (#4401)" by @netanel-haber in #5474
- keep sm90 headsize 128 cubins by @qsang-nv in #5320
- opensource: Opensource MOE MXFP8-MXFP4 implementation by @djns99 in #5222
- [TRTLLM-6019] feat: Remove cutlass min latency code from AutoTuner. by @hyukn in #5394
- [TRTLLM-5921][feat] Prevent serialization of entire LoRA adapters in each request by @amitz-nv in #5080
- feat: large-scale EP(part 8: Online EP load balancer integration for PCIe fp8) by @dongxuy04 in #5226
- [chore] Allow configuring linking of NVRTC wrapper by @AlessioNetti in #5189
- perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf by @bobboli in #5318
- [fix][ci] trigger multigpu tests for deepseek changes by @omera-nv in #5423
- tests: waive tests by @xinhe-nv in #5458
- doc: Fix benchmark cmd in disagg scripts by @kaiyux in #5515
- [perf] improve XQA-MLA perf by @lowsfer in #5468
- feat: Add support for TRTLLM CustomDataset by @kaiyux in #5511
- [feat] Add progress bar to benchmark by @arekay in #5173
- Add trtllm-bench reviewers. by @FrankD412 in #5452
- [CI] move flashinfer llama tests to post merge by @omera-nv in #5506
- [fix][ci] move torch tests to run under torch stage by @omera-nv in #5473
- refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead by @Funatiq in #5384
- [TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) by @jmydurant in #5475
- fix: MoE autotune fallback failed to query default heuristic by @rosenrodt in #5520
- Update allow list 2025_06_26 by @yuanjingx87 in #5526
- fix: Mapping rank boundary check bug by @venkywonka in #4935
- Update trtllm-bench to support new Pytorch default. by @FrankD412 in #5491
- [TRTLLM-4971]: Use safe deserialization in ParallelConfig by @yibinl-nvidia in #4630
- tests: waive failed tests on main by @xinhe-nv in #5512
- fix: Fix block scale fp8 support for deepseek v3 on Blackwell. by @yuxianq in #5514
- Add testing for trtllm-llmapi-launch with tritonserver by @Tabrizian in #5528
- Fix execute_process: check results using EQUAL by @yuantailing in #5481
- feat: Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch by @djns99 in #5410
- [Infra] - Waive failed case in post-merge by @EmmaQiaoCh in #5536
- feat: Use inference mode in update_requests to improve perf of TRTLLM Sampler by @dcampora in #5538
- ci: waive flaky test test_llama_eagle3 by @syuoni in #5548
- fix: [nvbug/5349343] Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) by @ChristinaZ in #5519
- [fix][ci] correct unittests test prefix by @omera-nv in #5547
- Fix : fix build for sm120 by @peaceh-nv in #5265
- [TRTLLM-5000][feat] NGrams V2 by @wili-65535 in #4569
- [TRTLLM-6104] feat: add request_perf_metrics to LLMAPI by @achartier in #5497 (see the sketch below)
- refactor: Speculative decoding buffers part 2 by @Funatiq in #5316
- ReDrafter support for Qwen by @darraghdog in #4875
- [nvbugs/5309940] Add support for input output token counts by @Tabrizian in #5445
- feat: Add support for per expert activation scaling factors by @djns99 in #5013
- Make moe permute and final as custom op by @limin2021 in #5412
- [AutoDeploy] merge feat/ad-2025-06-24 by @lucaslie in #5556
- [Infra] - Add import pytest by @EmmaQiaoCh in #5565
- tests: Move stress tests to be Post-Merge only by @amirkl94 in #5166
- feat: Add support for YARN in NemotronNAS models by @amirkl94 in #4906
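
PR #5497 above adds `request_perf_metrics` to the LLM API. A hedged sketch of how per-request metrics might be requested; the `return_perf_metrics` flag and the attribute names here are assumptions inferred from the PR title, not confirmed by these notes:

```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model

# Assumption: metrics are opted into per request via SamplingParams;
# consult the #5497 diff for the exact flag name.
params = SamplingParams(max_tokens=16, return_perf_metrics=True)

output = llm.generate(["Hello"], params)[0]
# Assumption: timing/KV-cache metrics hang off each completion.
print(output.outputs[0].request_perf_metrics)
```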
New Contributors
- @jmydurant made their first contribution in #4651
- @darraghdog made their first contribution in #4875
Full Changelog: v1.0.0rc0...v1.0.0rc1