NVIDIA/TensorRT-LLM v1.0.0rc1

Pre-release

Announcement Highlights

  • Model Support
  • Features
    • Add support for YARN in NemotronNAS models (#4906)
    • Add support for per expert activation scaling factors (#5013)
    • Add ReDrafter support for Qwen (#4875)
    • Add NGrams V2 support (#4569)
    • Use inference mode in update_requests to improve perf of TRTLLM Sampler (#5538)
    • Expose bias and FP8_MXFP4 MoE CUTLASS backend features to PyTorch (#5410)
    • Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
    • Large-scale EP (part 8: Online EP load balancer integration for PCIe fp8) (#5226)
    • Prevent serialization of entire LoRA adapters in each request (#5080)
    • Remove cutlass min latency code from AutoTuner. (#5394)
    • Open-source the MoE MXFP8-MXFP4 implementation (#5222)
    • Add chunked prefill support for MLA (Blackwell) (#4651)
    • Support disaggregated serving in TRTLLM Sampler (#5328)
    • Support multiCtasKvMode for high-throughput MLA kernels (#5426)
    • Add MTP support for Online EPLB (#5213)
    • Add debug hook to support dump tensor data and add new debug functions easily (#5182)
  • API
    • Add request_perf_metrics to LLMAPI (#5497) (see the first sketch after this list)
    • Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead (#5384) (see the second sketch after this list)
  • Bug Fixes
    • Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
    • Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
    • Fix the issue where the MoE autotune fallback failed to query the default heuristic (#5520)
    • Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. (#5485)
    • Fix the unexpected keyword argument 'streaming' (#5436)
  • Benchmark
    • Update trtllm-bench to support the new PyTorch default (#5491)
    • Add support for TRTLLM CustomDataset (#5511)
    • Make benchmark_serving part of the library (#5428)
  • Performance
    • Improve XQA-MLA perf (#5468)
    • Optimize swizzle_sf, unswizzle_sf, reswizzle_sf (#5318)
  • Infrastructure
    • Allow configuring linking of NVRTC wrapper (#5189)
    • Add timeout setting for long tests found in post-merge (#5501)
  • Documentation
    • Fix benchmark cmd in disagg scripts (#5515)
  • Known Issues
    • multi-GPU model support on RTX Pro 6000
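The request_perf_metrics addition (#5497) is consumed through the Python LLM API. The snippet below is a minimal sketch, not the confirmed interface: the option and attribute names (return_perf_metrics, request_perf_metrics) are assumptions inferred from the feature name, and the checkpoint name is illustrative; consult the LLM API reference shipped with this release for the exact spelling.

```python
from tensorrt_llm import LLM, SamplingParams

# Load a model through the LLM API (checkpoint name is illustrative).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Assumption: per-request metrics are opted into via a sampling option
# and surfaced on the returned request output as request_perf_metrics.
sampling = SamplingParams(max_tokens=32, return_perf_metrics=True)

for output in llm.generate(["Hello, my name is"], sampling):
    print(output.outputs[0].text)
    # getattr keeps the sketch safe if the attribute lives elsewhere.
    print(getattr(output, "request_perf_metrics", None))
```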

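The KvCacheConfig consolidation (#5384) is a C++-side refactor onto executor::KvCacheConfig; for Python users, the corresponding user-facing knob is tensorrt_llm.llmapi.KvCacheConfig. Below is a minimal sketch of wiring it into the LLM API; the field values are illustrative and the checkpoint name is a placeholder.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Cap the KV cache at 90% of free GPU memory and keep block reuse enabled.
kv_cache_config = KvCacheConfig(
    free_gpu_memory_fraction=0.9,
    enable_block_reuse=True,
)

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative checkpoint
    kv_cache_config=kv_cache_config,
)

for output in llm.generate(["The capital of France is"]):
    print(output.outputs[0].text)
```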
What's Changed

  • feature: make trtllmsampler new_tokens format the universal format by @netanel-haber in #4401
  • [fix] Add 1 and draft_token_num to seq_len when overlap scheduling is enabled during memory estimation by @HuiGao-NV in #5343
  • test: [CI] remove closed bugs by @xinhe-nv in #5400
  • refactor: manage cache indirection in decoder state by @Funatiq in #5315
  • tests: update benchmark test lists by @xinhe-nv in #5365
  • chore: delete mamba hybrid, since it is now called NemotronH by @vegaluisjose in #5409
  • [Infra] - Waive failed tests in post-merge and increase some timeout setting by @EmmaQiaoCh in #5424
  • Add debug hook to support dump tensor data and add new debug functions easily by @HuiGao-NV in #5182
  • Chore: remove unused variables by @QiJune in #5314
  • Fix test Pytorch model engine by @Tabrizian in #5416
  • Add MTP support for Online EPLB by @dongxuy04 in #5213
  • waive test_moe.py::test_moe_fp8[autotune] by @QiJune in #5455
  • fix: fix bug of qwen3 + eagle3 + finalize_moe_fusion by @byshiue in #5369
  • [AutoDeploy] Merge feat/ad_2025_06_13 feature branch by @lucaslie in #5454
  • feat: Dynamically remove servers in PD by @Shunkangz in #5270
  • tests: Set kv cache free memory fraction in test case by @HuiGao-NV in #5433
  • fix (NvBug 5354925): Fix static EPLB by @syuoni in #5411
  • test: Add LLGuidance test and refine guided decoding by @syuoni in #5348
  • CI: update multi gpu test triggering file list by @QiJune in #5466
  • start OAIServer with max_beam_width=1 for TorchSampler by @netanel-haber in #5427
  • chore: bump version to 1.0.0rc1 by @yiqingy0 in #5460
  • [https://jirasw.nvidia.com/browse/TRTLLM-4645] support mutliCtasKvMode for high-throughput MLA kernels by @PerkzZheng in #5426
  • CI: waive test_ad_build_small_multi by @QiJune in #5471
  • feat: Remove not used padding_idx in models by @HuiGao-NV in #5385
  • [nvbug/5354956] fix: unexpected keyword argument 'streaming' by @kaiyux in #5436
  • Move 3 disaggregated cases from 4 GPUs devices to 1 GPU device by @HuiGao-NV in #5457
  • Fix: fix nvbug 5356427 by @HuiGao-NV in #5464
  • feat: Make benchmark_serving part of the library by @kaiyux in #5428
  • [TRTLLM-5974][feat] Support disaggregated serving in TRTLLM Sampler by @dcampora in #5328
  • [chore] Disable block reuse when draft model speculation is being used by @mikeiovine in #5448
  • chore: split _build_model method for TorchLlm and TrtLlm by @QiJune in #5418
  • [fix][test] remove test in global scope by @omera-nv in #5470
  • [fix][ci] dont build wheel for cpp tests by @omera-nv in #5443
  • CI: reduce BF16 test cases in B200 by @QiJune in #5482
  • Add sleep function for disagg gen-only benchmarking by @qiaoxj07 in #5398
  • CI: enable test cases on single device type by @HuiGao-NV in #5484
  • [5356427] fix: Remove the seq_len of 4096 from FP8 block scale MoE tuning configs. by @hyukn in #5485
  • feat: chunked prefill for MLA (Blackwell) by @jmydurant in #4651
  • Add unit test for routing kernels by @ChristinaZ in #5405
  • [CI] Waive test_fp8_block_scales_4gpus[ep4-mtp_nextn=0-fp8kv=True-attention_dp=True-cuda_graph=True-overlap_scheduler=True-torch_compile=False] by @venkywonka in #5494
  • [Infra] - Add timeout setting for long tests found in post-merge by @EmmaQiaoCh in #5501
  • Revert "feature: unify new_tokens format sample state to trtllm samper new_tokens format (#4401)" by @netanel-haber in #5474
  • keep sm90 headsize 128 cubins by @qsang-nv in #5320
  • opensource: Opensource MOE MXFP8-MXFP4 implementation by @djns99 in #5222
  • [TRTLLM-6019] feat: Remove cutlass min latency code from AutoTuner. by @hyukn in #5394
  • [TRTLLM-5921][feat] Prevent serialization of entire LoRA adapters in each request by @amitz-nv in #5080
  • feat: large-scale EP(part 8: Online EP load balancer integration for PCIe fp8) by @dongxuy04 in #5226
  • [chore] Allow configuring linking of NVRTC wrapper by @AlessioNetti in #5189
  • perf: Optimize swizzle_sf, unswizzle_sf, reswizzle_sf by @bobboli in #5318
  • [fix][ci] trigger multigpu tests for deepseek changes by @omera-nv in #5423
  • tests: waive tests by @xinhe-nv in #5458
  • doc: Fix benchmark cmd in disagg scripts by @kaiyux in #5515
  • [perf] improve XQA-MLA perf by @lowsfer in #5468
  • feat: Add support for TRTLLM CustomDataset by @kaiyux in #5511
  • [feat] Add progress bar to benchmark by @arekay in #5173
  • Add trtllm-bench reviewers. by @FrankD412 in #5452
  • [CI] move flashinfer llama tests to post merge by @omera-nv in #5506
  • [fix][ci] move torch tests to run under torch stage by @omera-nv in #5473
  • refactor: remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead by @Funatiq in #5384
  • [TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) by @jmydurant in #5475
  • fix: MoE autotune fallback failed to query default heuristic by @rosenrodt in #5520
  • Update allow list 2025_06_26 by @yuanjingx87 in #5526
  • fix: Mapping rank boundary check bug by @venkywonka in #4935
  • Update trtllm-bench to support new Pytorch default. by @FrankD412 in #5491
  • [TRTLLM-4971]: Use safe deserialization in ParallelConfig by @yibinl-nvidia in #4630
  • tests: waive failed tests on main by @xinhe-nv in #5512
  • fix: Fix block scale fp8 support for deepseek v3 on Blackwell. by @yuxianq in #5514
  • Add testing for trtllm-llmapi-launch with tritonserver by @Tabrizian in #5528
  • Fix execute_process: check results using EQUAL by @yuantailing in #5481
  • feat: Expose bias and FP8_MXFP4 MOE CUTLASS backend features to pytorch by @djns99 in #5410
  • [Infra] - Waive failed case in post-merge by @EmmaQiaoCh in #5536
  • feat: Use inference mode in update_requests to improve perf of TRTLLM Sampler by @dcampora in #5538
  • ci: waive flaky test test_llama_eagle3 by @syuoni in #5548
  • fix: [https://nvbugspro.nvidia.com/bug/5349343] Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) by @ChristinaZ in #5519
  • [fix][ci] correct unittests test prefix by @omera-nv in #5547
  • Fix : fix build for sm120 by @peaceh-nv in #5265
  • [TRTLLM-5000][feat] NGrams V2 by @wili-65535 in #4569
  • [TRTLLM-6104] feat: add request_perf_metrics to LLMAPI by @achartier in #5497
  • refactor: Speculative decoding buffers part 2 by @Funatiq in #5316
  • ReDrafter support for Qwen by @darraghdog in #4875
  • [nvbugs/5309940] Add support for input output token counts by @Tabrizian in #5445
  • feat: Add support for per expert activation scaling factors by @djns99 in #5013
  • Make moe permute and final as custom op by @limin2021 in #5412
  • [AutoDeploy] merge feat/ad-2025-06-24 by @lucaslie in #5556
  • [Infra] - Add import pytest by @EmmaQiaoCh in #5565
  • tests: Move stress tests to be Post-Merge only by @amirkl94 in #5166
  • feat: Add support for YARN in NemotronNAS models by @amirkl94 in #4906

New Contributors

Full Changelog: v1.0.0rc0...v1.0.0rc1
