NVIDIA/TensorRT-LLM v1.0.0rc2

Pre-release · one month ago

Announcement Highlights:

  • Model Support
  • Feature
    • Add KV events support for sliding window attention (#5580)
    • Add beam search support to the PyTorch Workflow (#5333)
    • Support more parameters in openai worker of scaffolding (#5115)
    • Enable CUDA graphs for Nemotron-H (#5646)
    • Add spec dec param to attention op for pytorch workflow (#5146)
    • Fuse w4a8 moe pre-quant scale on Hopper (#5613)
    • Support torch compile for attention dp (#5086)
    • Add W4A16 GEMM support for pytorch workflow (#4232)
    • Add request_perf_metrics to triton LLMAPI backend (#5554)
    • Add AutoDeploy fp8 quantization support for bmm (#3849)
    • Refactor moe permute and finalize op by removing duplicated code (#5557)
    • Support duplicate_kv_weight for qwen3 blockwise scale (#5459)
    • Add LoRA support for pytorch backend in trtllm-serve (#5376)
  • API
    • Enhance YAML loading of arbitrary options in LlmArgs (#5610); see the YAML sketch below this list
    • Add back allreduce_strategy parameter into TorchLlmArgs (#5637)
    • Add LlmArgs option to force using dynamic quantization (#5346)
    • Remove ptuning knobs from TorchLlmArgs (#5595)
    • BREAKING CHANGE: Enhance the LLM args PyTorch config, part 1 (cuda_graph_config) (#5014); a usage sketch follows this list
  • Bug Fixes
    • Fix missing arg to alltoall_prepare_maybe_dispatch (#5669)
    • Fix attention DP not working with embedding TP (#5642)
    • Fix broken cyclic reference detection (#5417)
    • Fix permission issues for local users in the NGC Docker container (#5373)
    • Fix mtp vanilla draft inputs (#5568)
  • Benchmark
    • Add wide-ep benchmarking scripts (#5760)
  • Performance
    • Reduce DeepEPLowLatency memory and time (#5712)
    • Use tokenizers API to optimize incremental detokenization perf (#5574)
    • Conditionally enable SWAP AB for speculative decoding (#5404)
    • Unify new_tokens format of sample state to the TRT-LLM sampler tokens format (#5513)
    • Replace allgather with AllToAllPrepare (#5570)
    • Optimizations on weight-only batched gemv kernel (#5420)
    • Optimize MoE sort kernels for large-scale EP (#5435)
    • Avoid reswizzle_sf after allgather. (#5504)
  • Infrastructure
    • Always use the x86 image for the Jenkins agent and a few clean-ups (#5753)
    • Reduce unnecessary kernel generation (#5476)
    • Update the auto-community label action to be triggered every hour (#5658)
    • Improve dev container tagging (#5551)
    • Update the community action to a more appropriate API (#4883)
    • Update nccl to 2.27.5 (#5539)
    • Upgrade xgrammar to 0.1.18 (#5364)
  • Documentation
    • Fix outdated config in DeepSeek best perf practice doc (#5638)
    • Add PD dynamic scaling README (#5540)
    • Add feature support matrix for PyTorch backend (#5037)
    • 1.0 LLM API doc updates (#5629)
    • Update container instructions (#5490)
  • Known Issues
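
The cuda_graph_config breaking change in the API section above (#5014) folds the standalone CUDA-graph knobs into a single config object. Below is a minimal sketch of the new style, assuming CudaGraphConfig is exposed from tensorrt_llm.llmapi and that its fields drop the cuda_graph_ prefix (see #5585); consult the LLM API reference for the exact names.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import CudaGraphConfig  # assumed import path

# Old style (pre-#5014): use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 4], ...
# New style: a single cuda_graph_config object.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any supported checkpoint; illustrative
    cuda_graph_config=CudaGraphConfig(
        batch_sizes=[1, 2, 4],   # batch sizes to capture graphs for (assumed field name)
        enable_padding=True,     # pad requests up to a captured size (assumed field name)
    ),
)

for output in llm.generate(["Hello, my name is"]):
    print(output.outputs[0].text)
```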
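Related to the YAML-loading enhancement (#5610): arbitrary LlmArgs fields can be carried in the extra-options YAML accepted by trtllm-serve and trtllm-bench. The sketch below shows the same idea through the Python API, assuming the pydantic-based LlmArgs coerces nested dicts into the corresponding config objects; the specific fields and values are illustrative.

```python
import yaml

from tensorrt_llm import LLM

# Options as they would appear in an --extra_llm_api_options YAML file;
# the keys simply mirror LlmArgs fields (values below are illustrative).
extra_options = yaml.safe_load("""
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8]
kv_cache_config:
  free_gpu_memory_fraction: 0.85
""")

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", **extra_options)
```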

What's Changed

  • [TRTLLM-5831][feat] Add LoRA support for pytorch backend in trtllm-serve by @talorabr in #5376
  • [CI] reduce mamba2 ssm test parameterization by @tomeras91 in #5571
  • perf: Avoid reswizzle_sf after allgather. by @bobboli in #5504
  • [feat][test] reuse MPI pool executor across tests by @omera-nv in #5566
  • [TRTLLM-5965] perf: Optimize MoE sort kernels for large-scale EP by @syuoni in #5435
  • [feat] Optimizations on weight-only batched gemv kernel by @Njuapp in #5420
  • [ci] remove MMLU if followed by GSM8K by @omera-nv in #5578
  • [TRTLLM-5530][BREAKING CHANGE]: enhance the llm args pytorch config part 1 (cuda_graph_config) by @nv-guomingz in #5014
  • Deduplicate waive list by @yiqingy0 in #5546
  • [fix] speedup modeling unittests by @omera-nv in #5579
  • feat: support duplicate_kv_weight for qwen3 blockwise scale by @dongjiyingdjy in #5459
  • [TRTLLM-5331] large-scale EP: perf - Replace allgather with AllToAllPrepare by @WeiHaocheng in #5570
  • doc: Minor update to DeepSeek R1 best practice by @kaiyux in #5600
  • [nvbug/5354946][fix] Fix mtp vanilla draft inputs by @lfr-0531 in #5568
  • refactor: decoder state setup by @Funatiq in #5093
  • [Infra][main] Cherry-pick from release/0.21: Update nccl to 2.27.5 (#5539) by @EmmaQiaoCh in #5587
  • [TRTLLM-5989, TRTLLM-5991, TRTLLM-5993] doc: Update container instructions (#5490) by @ixlmar in #5605
  • [ci] move eagle1 and medusa tests to post-merge by @omera-nv in #5604
  • chore [TRTLLM-6009]: remove ptuning knobs from TorchLlmArgs by @Superjomn in #5595
  • [fix][ci] missing class names in post-merge test reports by @omera-nv in #5603
  • refactor: [TRTLLM-6150] Refactor moe permute and finalize op by removing duplicated code by @limin2021 in #5557
  • chore: remove cuda_graph_ prefix from cuda_graph_config field members. by @nv-guomingz in #5585
  • feat: AutoDeploy fp8 quantization support for bmm by @meenchen in #3849
  • feature: unify new_tokens format sample state to trtllm sampler tokens format by @netanel-haber in #5513
  • [fix]: Fix main test skip issue by @yizhang-nv in #5503
  • chores: [TRTLLM-6072] 1.0 LLMAPI doc updates by @hchings in #5629
  • add feature support matrix for PyTorch backend by @QiJune in #5037
  • test: [CI] remove closed bugs by @xinhe-nv in #5572
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5569
  • rcca: test default kv_cache_reuse option for pytorch multimodal by @StanleySun639 in #5544
  • [TRTLLM-6104] feat: add request_perf_metrics to triton LLMAPI backend by @xuanzic in #5554
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5582
  • feat: W4A16 GEMM by @danielafrimi in #4232
  • test: Reduce number of C++ test cases by @Funatiq in #5437
  • [https://nvbugs/5318059][test] Unwaive test by @pamelap-nvidia in #5624
  • [Infra] - Add some timeout and unwaive a test which dev fixed by @EmmaQiaoCh in #5631
  • [#5403][perf] Conditionally enable SWAP AB for speculative decoding by @zoheth in #5404
  • [TRTLLM-5277] chore: refine llmapi examples for 1.0 (part1) by @Superjomn in #5431
  • chore: Mass integration of release/0.21 by @dc3671 in #5507
  • refactor: Clean up DecodingInput and DecodingOutput by @Funatiq in #5617
  • perf: Use tokenizers API to optimize incremental detokenization perf by @kaiyux in #5574
  • [feat] Support torch compile for attention dp by @liji-nv in #5086
  • feat: add LLmArgs option to force using dynamic quantization by @achartier in #5346
  • [TRTLLM-5644][infra] Update the community action to more appropriate api by @poweiw in #4883
  • fix: add missing self. from PR #5346 by @achartier in #5653
  • [Bug] attention DP doesn't work with embedding TP by @PerkzZheng in #5642
  • fix: Add back allreduce_strategy parameter into TorchLlmArgs by @HuiGao-NV in #5637
  • perf: better heuristic for allreduce by @yilin-void in #5432
  • feat: fuse w4a8 moe pre-quant scale on Hopper by @xiaoweiw-nv in #5613
  • [chore] 2025-07-02 update github CI allowlist by @niukuo in #5661
  • doc: Add pd dynamic scaling readme by @Shunkangz in #5540
  • chore: enhance yaml loading arbitrary options in LlmArgs by @Superjomn in #5610
  • Feat/pytorch vswa kvcachemanager by @qixiang-99 in #5151
  • [TRTLLM-1316] refactor: Remove unnecessary pipeline parallelism logic from postProcessRequest by @Funatiq in #5489
  • [https://nvbugspro.nvidia.com/bug/5329655] [feat] Pytorch path add spec dec param to attention op by @jhaotingc in #5146
  • [Infra] - Set default timeout to 1hr and remove some specific settings by @EmmaQiaoCh in #5667
  • [TRTLLM-6143] feat: Improve dev container tagging by @ixlmar in #5551
  • feat:[AutoDeploy] E2E build example for llama4 VLM by @Fridah-nv in #3922
  • fix: Fix missing arg to alltoall_prepare_maybe_dispatch by @syuoni in #5669
  • [Infra] - Waive failed tests for main 0702 by @EmmaQiaoCh in #5671
  • chore: bump version to 1.0.0rc2 by @yiqingy0 in #5645
  • [TRTLLM-4923][feat] Enable CUDA graphs for Nemotron-H by @tomeras91 in #5646
  • [Infra] - Fix test stage check for the package sanity check stage by @yiqingy0 in #5694
  • [Infra] - Waive a failed case on main by @EmmaQiaoCh in #5702
  • fix: Set init value for moe expert id by @WeiHaocheng in #5660
  • [ci] small multigpu speedups by @omera-nv in #5643
  • delete duplicate eagle3 and ngram tests by @netanel-haber in #5711
  • chore: Remove unused isFullContextRequest method by @Funatiq in #5666
  • chore: refine the default value by using pydantic default instead of … by @nv-guomingz in #5695
  • [ModelLoad] Concurrent load model by @arekay in #5291
  • [https://nvbugs/5365714] fix(scaffolding): use default LLM rather than trt backend LLM by @dc3671 in #5705
  • [None][infra] Update the auto-community label action to be triggered every hour by @poweiw in #5658
  • MTP and derivatives: Align sample state with trtllm sampler sample state by @netanel-haber in #5675
  • [AutoDeploy] merge feat/ad-2025-06-29 by @lucaslie in #5737
  • feat: support more parameters in openai worker of scaffolding by @ccs96307 in #5115
  • Waive tests: test_openai_lora, test_trtllm_serve_lora_example and test_openai_chat_structural_tag_example by @venkywonka in #5740
  • tests: waive failures on main by @xinhe-nv in #5704
  • chore: Mass integration of release/0.21 by @dc3671 in #5701
  • [fix: nvbugs/5355493] Correctly clamp max sequence len to max attention window by @netanel-haber in #5720
  • Fix none response in PD by @Shunkangz in #5422
  • feat: reduce unnecessary kernel generation by @tongyuantongyu in #5476
  • chore: update doc by replacing use_cuda_graph with cuda_graph_config by @nv-guomingz in #5680
  • Perf: reduce DeepEPLowLatency memory and time by @yuantailing in #5712
  • [Infra] - Waive L0 test by @yiqingy0 in #5748
  • fix: Improve chunking test and skip empty kernel calls by @Funatiq in #5710
  • Cherry pick "[NVBUG:5355009] Modify check for fuse_fp4_quant on SM120" by @farazkh80 in #5724
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5718
  • fix: check file exists in dev container script by @ixlmar in #5755
  • Raise shut down error for each request by @Shunkangz in #4936
  • [Infra] - Waive L0 flaky test by @yiqingy0 in #5759
  • Fix: pass allreduce strategy to pytorchConfig by @HuiGao-NV in #5746
  • Cache transceiver support VSWA by @chuangz0 in #5505
  • [TRTLLM-3442] feat: added beam search support to the PyTorch Workflow by @stnie in #5333
  • [fix] Update to properly set cuda graphs in trtllm-bench overrides. by @FrankD412 in #5634
  • feat: KV events for sliding window attention by @jthomson04 in #5580
  • Add dummy all_reduce for kernel breakdown by @qiaoxj07 in #5745
  • Add wide-ep benchmarking scripts by @qiaoxj07 in #5760
  • Improve documentation of Kv_block_array by @hypdeb in #5765
  • [Infra] - Always use x86 image for the Jenkins agent and few clean-ups by @chzblych in #5753
  • refactor: decoding inputs by @Funatiq in #5679
  • [Test] - Waive or fix few known test failures by @chzblych in #5769
  • [TRTLLM-5878] add stage for image registration to nspect by @niukuo in #5699

New Contributors

Full Changelog: v1.0.0rc1...v1.0.0rc2
