NVIDIA/TensorRT-LLM v1.0.0rc2

Pre-release · one month ago

Announcement Highlights:

  • Model Support
  • Feature
    • Add KV events support for sliding window attention (#5580)
    • Add beam search support to the PyTorch Workflow (#5333)
    • Support more parameters in openai worker of scaffolding (#5115)
    • Enable CUDA graphs for Nemotron-H (#5646)
    • Add spec dec param to attention op for pytorch workflow (#5146)
    • Fuse w4a8 moe pre-quant scale on Hopper (#5613)
    • Support torch compile for attention dp (#5086)
    • Add W4A16 GEMM support for pytorch workflow (#4232)
    • Add request_perf_metrics to triton LLMAPI backend (#5554)
    • Add AutoDeploy fp8 quantization support for bmm (#3849)
    • Refactor moe permute and finalize op by removing duplicated code (#5557)
    • Support duplicate_kv_weight for qwen3 blockwise scale (#5459)
    • Add LoRA support for pytorch backend in trtllm-serve (#5376)
  • API
    • Enhance YAML loading of arbitrary options in LlmArgs (#5610); see the YAML sketch below this list
    • Add back allreduce_strategy parameter into TorchLlmArgs (#5637)
    • Add LlmArgs option to force using dynamic quantization (#5346)
    • Remove ptuning knobs from TorchLlmArgs (#5595)
    • BREAKING CHANGE: Enhance the LLM args PyTorch config, part 1 (cuda_graph_config) (#5014); a usage sketch follows this list
  • Bug Fixes
    • Fix missing arg to alltoall_prepare_maybe_dispatch (#5669)
    • Fix attention DP not working with embedding TP (#5642)
    • Fix broken cyclic reference detection (#5417)
    • Fix permission issues for local users in the NGC Docker container (#5373)
    • Fix mtp vanilla draft inputs (#5568)
  • Benchmark
    • Add wide-ep benchmarking scripts (#5760)
  • Performance
    • Reduce DeepEPLowLatency memory and time (#5712)
    • Use tokenizers API to optimize incremental detokenization perf (#5574)
    • Conditionally enable SWAP AB for speculative decoding (#5404)
    • Unify new_tokens format of sample state to the TRT-LLM sampler tokens format (#5513)
    • Replace allgather with AllToAllPrepare (#5570)
    • Optimizations on weight-only batched gemv kernel (#5420)
    • Optimize MoE sort kernels for large-scale EP (#5435)
    • Avoid reswizzle_sf after allgather. (#5504)
  • Infrastructure
    • Always use the x86 image for the Jenkins agent and a few clean-ups (#5753)
    • Reduce unnecessary kernel generation (#5476)
    • Update the auto-community label action to be triggered every hour (#5658)
    • Improve dev container tagging (#5551)
    • Update the community action to a more appropriate API (#4883)
    • Update nccl to 2.27.5 (#5539)
    • Upgrade xgrammar to 0.1.18 (#5364)
  • Documentation
    • Fix outdated config in DeepSeek best perf practice doc (#5638)
    • Add PD dynamic scaling README (#5540)
    • Add feature support matrix for PyTorch backend (#5037)
    • 1.0 LLM API doc updates (#5629)
    • Update container instructions (#5490)
  • Known Issues
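
The cuda_graph_config breaking change in the API section above (#5014) folds the standalone CUDA-graph knobs into a single config object. Below is a minimal sketch of the new style, assuming CudaGraphConfig is exposed from tensorrt_llm.llmapi and that its fields drop the cuda_graph_ prefix (see #5585); consult the LLM API reference for the exact names.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import CudaGraphConfig  # assumed import path

# Old style (pre-#5014): use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 4], ...
# New style: a single cuda_graph_config object.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any supported checkpoint; illustrative
    cuda_graph_config=CudaGraphConfig(
        batch_sizes=[1, 2, 4],   # batch sizes to capture graphs for (assumed field name)
        enable_padding=True,     # pad requests up to a captured size (assumed field name)
    ),
)

for output in llm.generate(["Hello, my name is"]):
    print(output.outputs[0].text)
```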
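Related to the YAML-loading enhancement (#5610): arbitrary LlmArgs fields can be carried in the extra-options YAML accepted by trtllm-serve and trtllm-bench. The sketch below shows the same idea through the Python API, assuming the pydantic-based LlmArgs coerces nested dicts into the corresponding config objects; the specific fields and values are illustrative.

```python
import yaml

from tensorrt_llm import LLM

# Options as they would appear in an --extra_llm_api_options YAML file;
# the keys simply mirror LlmArgs fields (values below are illustrative).
extra_options = yaml.safe_load("""
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8]
kv_cache_config:
  free_gpu_memory_fraction: 0.85
""")

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", **extra_options)
```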

What's Changed

  • [TRTLLM-5831][feat] Add LoRA support for pytorch backend in trtllm-serve by @talorabr in #5376
  • [CI] reduce mamba2 ssm test parameterization by @tomeras91 in #5571
  • perf: Avoid reswizzle_sf after allgather. by @bobboli in #5504
  • [feat][test] reuse MPI pool executor across tests by @omera-nv in #5566
  • [TRTLLM-5965] perf: Optimize MoE sort kernels for large-scale EP by @syuoni in #5435
  • [feat] Optimizations on weight-only batched gemv kernel by @Njuapp in #5420
  • [ci] remove MMLU if followed by GSM8K by @omera-nv in #5578
  • [TRTLLM-5530][BREAKING CHANGE]: enhance the llm args pytorch config part 1 (cuda_graph_config) by @nv-guomingz in #5014
  • Deduplicate waive list by @yiqingy0 in #5546
  • [fix] speedup modeling unittests by @omera-nv in #5579
  • feat: support duplicate_kv_weight for qwen3 blockwise scale by @dongjiyingdjy in #5459
  • [TRTLLM-5331] large-scale EP: perf - Replace allgather with AllToAllPrepare by @WeiHaocheng in #5570
  • doc: Minor update to DeepSeek R1 best practice by @kaiyux in #5600
  • [nvbug/5354946][fix] Fix mtp vanilla draft inputs by @lfr-0531 in #5568
  • refactor: decoder state setup by @Funatiq in #5093
  • [Infra][main] Cherry-pick from release/0.21: Update nccl to 2.27.5 (#5539) by @EmmaQiaoCh in #5587
  • [TRTLLM-5989, TRTLLM-5991, TRTLLM-5993] doc: Update container instructions (#5490) by @ixlmar in #5605
  • [ci] move eagle1 and medusa tests to post-merge by @omera-nv in #5604
  • chore [TRTLLM-6009]: remove ptuning knobs from TorchLlmArgs by @Superjomn in #5595
  • [fix][ci] missing class names in post-merge test reports by @omera-nv in #5603
  • refactor: [TRTLLM-6150] Refactor moe permute and finalize op by removing duplicated code by @limin2021 in #5557
  • chore: remove cuda_graph_ prefix from cuda_graph_config field members. by @nv-guomingz in #5585
  • feat: AutoDeploy fp8 quantization support for bmm by @meenchen in #3849
  • feature: unify new_tokens format sample state to trtllm sampler tokens format by @netanel-haber in #5513
  • [fix]: Fix main test skip issue by @yizhang-nv in #5503
  • chores: [TRTLLM-6072] 1.0 LLMAPI doc updates by @hchings in #5629
  • add feature support matrix for PyTorch backend by @QiJune in #5037
  • test: [CI] remove closed bugs by @xinhe-nv in #5572
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5569
  • rcca: test default kv_cache_reuse option for pytorch multimodal by @StanleySun639 in #5544
  • [TRTLLM-6104] feat: add request_perf_metrics to triton LLMAPI backend by @xuanzic in #5554
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5582
  • feat: W4A16 GEMM by @danielafrimi in #4232
  • test: Reduce number of C++ test cases by @Funatiq in #5437
  • [https://nvbugs/5318059][test] Unwaive test by @pamelap-nvidia in #5624
  • [Infra] - Add some timeout and unwaive a test which dev fixed by @EmmaQiaoCh in #5631
  • [#5403][perf] Conditionally enable SWAP AB for speculative decoding by @zoheth in #5404
  • [TRTLLM-5277] chore: refine llmapi examples for 1.0 (part1) by @Superjomn in #5431
  • chore: Mass integration of release/0.21 by @dc3671 in #5507
  • refactor: Clean up DecodingInput and DecodingOutput by @Funatiq in #5617
  • perf: Use tokenizers API to optimize incremental detokenization perf by @kaiyux in #5574
  • [feat] Support torch compile for attention dp by @liji-nv in #5086
  • feat: add LLmArgs option to force using dynamic quantization by @achartier in #5346
  • [TRTLLM-5644][infra] Update the community action to more appropriate api by @poweiw in #4883
  • fix: add missing self. from PR #5346 by @achartier in #5653
  • [Bug] attention DP doesn't work with embedding TP by @PerkzZheng in #5642
  • fix: Add back allreduce_strategy parameter into TorchLlmArgs by @HuiGao-NV in #5637
  • perf: better heuristic for allreduce by @yilin-void in #5432
  • feat: fuse w4a8 moe pre-quant scale on Hopper by @xiaoweiw-nv in #5613
  • [chore] 2025-07-02 update github CI allowlist by @niukuo in #5661
  • doc: Add pd dynamic scaling readme by @Shunkangz in #5540
  • chore: enhance yaml loading arbitrary options in LlmArgs by @Superjomn in #5610
  • Feat/pytorch vswa kvcachemanager by @qixiang-99 in #5151
  • [TRTLLM-1316] refactor: Remove unnecessary pipeline parallelism logic from postProcessRequest by @Funatiq in #5489
  • [https://nvbugspro.nvidia.com/bug/5329655] [feat] Pytorch path add spec dec param to attention op by @jhaotingc in #5146
  • [Infra] - Set default timeout to 1hr and remove some specific settings by @EmmaQiaoCh in #5667
  • [TRTLLM-6143] feat: Improve dev container tagging by @ixlmar in #5551
  • feat:[AutoDeploy] E2E build example for llama4 VLM by @Fridah-nv in #3922
  • fix: Fix missing arg to alltoall_prepare_maybe_dispatch by @syuoni in #5669
  • [Infra] - Waive failed tests for main 0702 by @EmmaQiaoCh in #5671
  • chore: bump version to 1.0.0rc2 by @yiqingy0 in #5645
  • [TRTLLM-4923][feat] Enable CUDA graphs for Nemotron-H by @tomeras91 in #5646
  • [Infra] - Fix test stage check for the package sanity check stage by @yiqingy0 in #5694
  • [Infra] - Waive a failed case on main by @EmmaQiaoCh in #5702
  • fix: Set init value for moe expert id by @WeiHaocheng in #5660
  • [ci] small multigpu speedups by @omera-nv in #5643
  • delete duplicate eagle3 and ngram tests by @netanel-haber in #5711
  • chore: Remove unused isFullContextRequest method by @Funatiq in #5666
  • chore: refine the default value by using pydantic default instead of … by @nv-guomingz in #5695
  • [ModelLoad] Concurrent load model by @arekay in #5291
  • [https://nvbugs/5365714] fix(scaffolding): use default LLM rather than trt backend LLM by @dc3671 in #5705
  • [None][infra] Update the auto-community label action to be triggered every hour by @poweiw in #5658
  • MTP and derivatives: Align sample state with trtllm sampler sample state by @netanel-haber in #5675
  • [AutoDeploy] merge feat/ad-2025-06-29 by @lucaslie in #5737
  • feat: support more parameters in openai worker of scaffolding by @ccs96307 in #5115
  • Waive tests: test_openai_lora, test_trtllm_serve_lora_example and test_openai_chat_structural_tag_example by @venkywonka in #5740
  • tests: waive failures on main by @xinhe-nv in #5704
  • chore: Mass integration of release/0.21 by @dc3671 in #5701
  • [fix: nvbugs/5355493] Correctly clamp max sequence len to max attention window by @netanel-haber in #5720
  • Fix none response in PD by @Shunkangz in #5422
  • feat: reduce unnecessary kernel generation by @tongyuantongyu in #5476
  • chore: update doc by replacing use_cuda_graph with cuda_graph_config by @nv-guomingz in #5680
  • Perf: reduce DeepEPLowLatency memory and time by @yuantailing in #5712
  • [Infra] - Waive L0 test by @yiqingy0 in #5748
  • fix: Improve chunking test and skip empty kernel calls by @Funatiq in #5710
  • Cherry pick "[NVBUG:5355009] Modify check for fuse_fp4_quant on SM120" by @farazkh80 in #5724
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5718
  • fix: check file exists in dev container script by @ixlmar in #5755
  • Raise shut down error for each request by @Shunkangz in #4936
  • [Infra] - Waive L0 flaky test by @yiqingy0 in #5759
  • Fix: pass allreduce strategy to pytorchConfig by @HuiGao-NV in #5746
  • Cache transceiver support VSWA by @chuangz0 in #5505
  • [TRTLLM-3442] feat: added beam search support to the PyTorch Workflow by @stnie in #5333
  • [fix] Update to properly set cuda graphs in trtllm-bench overrides. by @FrankD412 in #5634
  • feat: KV events for sliding window attention by @jthomson04 in #5580
  • Add dummy all_reduce for kernel breakdown by @qiaoxj07 in #5745
  • Add wide-ep benchmarking scripts by @qiaoxj07 in #5760
  • Improve documentation of Kv_block_array by @hypdeb in #5765
  • [Infra] - Always use x86 image for the Jenkins agent and few clean-ups by @chzblych in #5753
  • refactor: decoding inputs by @Funatiq in #5679
  • [Test] - Waive or fix few known test failures by @chzblych in #5769
  • [TRTLLM-5878] add stage for image registration to nspect by @niukuo in #5699

New Contributors

Full Changelog: v1.0.0rc1...v1.0.0rc2
