Announcement Highlights:
- Model Support
- Feature
- Add KV events support for sliding window attention (#5580; example below)
- Add beam search support to the PyTorch workflow (#5333; example below)
- Support more parameters in the OpenAI worker of scaffolding (#5115)
- Enable CUDA graphs for Nemotron-H (#5646)
- Add speculative decoding param to the attention op for the PyTorch workflow (#5146)
- Fuse w4a8 moe pre-quant scale on Hopper (#5613)
- Support torch.compile for attention DP (#5086)
- Add W4A16 GEMM support for the PyTorch workflow (#4232)
- Add request_perf_metrics to the Triton LLMAPI backend (#5554)
- Add AutoDeploy fp8 quantization support for bmm (#3849)
- Refactor MoE permute and finalize op by removing duplicated code (#5557)
- Support duplicate_kv_weight for Qwen3 blockwise scale (#5459)
- Add LoRA support for the PyTorch backend in trtllm-serve (#5376)
- API
- Enhance YAML loading of arbitrary options in LlmArgs (#5610; example below)
- Add back allreduce_strategy parameter into TorchLlmArgs (#5637)
- Add LlmArgs option to force dynamic quantization (#5346)
- Remove ptuning knobs from TorchLlmArgs (#5595)
- BREAKING CHANGE: Enhance the LLM args PyTorch config, part 1 (cuda_graph_config) (#5014; example below)
- Bug Fixes
- Benchmark
- Add wide-ep benchmarking scripts (#5760)
- Performance
- Reduce DeepEPLowLatency memory and time (#5712)
- Use tokenizers API to optimize incremental detokenization perf (#5574)
- Conditionally enable SWAP AB for speculative decoding (#5404)
- Unify new_tokens format sample state to trtllm sampler tokens format (#5513)
- Replace allgather with AllToAllPrepare (#5570)
- Optimize the weight-only batched GEMV kernel (#5420)
- Optimize MoE sort kernels for large-scale EP (#5435)
- Avoid reswizzle_sf after allgather (#5504)
- Infrastructure
- Always use the x86 image for the Jenkins agent and a few clean-ups (#5753)
- Reduce unnecessary kernel generation (#5476)
- Update the auto-community label action to be triggered every hour (#5658)
- Improve dev container tagging (#5551)
- Update the community action to a more appropriate API (#4883)
- Update NCCL to 2.27.5 (#5539)
- Upgrade xgrammar to 0.1.18 (#5364)
- Documentation
- Known Issues
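
The sketches below illustrate a few of the highlighted changes. First, KV cache events: a minimal polling sketch, assuming the LLM API's `KvCacheConfig.event_buffer_max_size` knob and `llm.get_kv_cache_events()`; with #5580, events are also emitted for sliding window attention layers. The model name is a placeholder.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Assumed API: enabling the event buffer makes the KV cache manager publish
# block stored/removed events, which #5580 extends to sliding-window layers.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    kv_cache_config=KvCacheConfig(enable_block_reuse=True,
                                  event_buffer_max_size=1024),
)
llm.generate(["The sliding window keeps only recent tokens"])
for event in llm.get_kv_cache_events(timeout=2):
    print(event)  # events describing created/stored/removed cache blocks
```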
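Beam search on the PyTorch workflow (#5333): a sketch assuming the LLM API's `max_beam_width` argument and the `use_beam_search` flag on `SamplingParams`; check your build for the exact knobs.

```python
from tensorrt_llm import LLM, SamplingParams

# max_beam_width reserves decoder state for up to 4 beams per request.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", max_beam_width=4)

# best_of=4 tracks four beams; n=4 returns all of them.
params = SamplingParams(max_tokens=32, best_of=4, n=4, use_beam_search=True)
for output in llm.generate(["The capital of France is"], params):
    for beam in output.outputs:
        print(beam.cumulative_logprob, beam.text)
```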
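Arbitrary LlmArgs options via YAML (#5610): a sketch that writes an extra-options file for `trtllm-serve --extra_llm_api_options`. The specific keys shown are illustrative; top-level keys map onto LlmArgs fields.

```python
import yaml

# Illustrative options only; any LlmArgs field can appear as a top-level key.
extra_options = {
    "cuda_graph_config": {"batch_sizes": [1, 2, 4, 8], "enable_padding": True},
    "kv_cache_config": {"free_gpu_memory_fraction": 0.8},
}
with open("extra_llm_api_options.yaml", "w") as f:
    yaml.safe_dump(extra_options, f)

# Then: trtllm-serve <model> --extra_llm_api_options extra_llm_api_options.yaml
```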
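The cuda_graph_config breaking change (#5014): the flat CUDA-graph knobs on TorchLlmArgs are grouped into a single config object, and the `cuda_graph_` prefix is dropped from its field members (#5585). A migration sketch, assuming `CudaGraphConfig` is exported from `tensorrt_llm.llmapi`:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import CudaGraphConfig

# Before (removed): LLM(..., use_cuda_graph=True, cuda_graph_batch_sizes=[...])
# After: one nested config; field names here follow the prefix removal.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    cuda_graph_config=CudaGraphConfig(
        batch_sizes=[1, 2, 4, 8],  # capture CUDA graphs for these batch sizes
        enable_padding=True,       # pad ragged batches up to a captured size
    ),
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```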
What's Changed
- [TRTLLM-5831][feat] Add LoRA support for pytorch backend in trtllm-serve by @talorabr in #5376
- [CI] reduce mamba2 ssm test parameterization by @tomeras91 in #5571
- perf: Avoid reswizzle_sf after allgather. by @bobboli in #5504
- [feat][test] reuse MPI pool executor across tests by @omera-nv in #5566
- [TRTLLM-5965] perf: Optimize MoE sort kernels for large-scale EP by @syuoni in #5435
- [feat] Optimizations on weight-only batched gemv kernel by @Njuapp in #5420
- [ci] remove MMLU if followed by GSM8K by @omera-nv in #5578
- [TRTLLM-5530][BREAKING CHANGE]: enhance the llm args pytorch config part 1 (cuda_graph_config) by @nv-guomingz in #5014
- Deduplicate waive list by @yiqingy0 in #5546
- [fix] speedup modeling unittests by @omera-nv in #5579
- feat : support duplicate_kv_weight for qwen3 blockwise scale by @dongjiyingdjy in #5459
- [TRTLLM-5331] large-scale EP: perf - Replace allgather with AllToAllPrepare by @WeiHaocheng in #5570
- doc: Minor update to DeepSeek R1 best practice by @kaiyux in #5600
- [nvbug/5354946][fix] Fix mtp vanilla draft inputs by @lfr-0531 in #5568
- refactor: decoder state setup by @Funatiq in #5093
- [Infra][main] Cherry-pick from release/0.21: Update nccl to 2.27.5 (#5539) by @EmmaQiaoCh in #5587
- [TRTLLM-5989, TRTLLM-5991, TRTLLM-5993] doc: Update container instructions (#5490) by @ixlmar in #5605
- [ci] move eagle1 and medusa tests to post-merge by @omera-nv in #5604
- chore [TRTLLM-6009]: remove ptuning knobs from TorchLlmArgs by @Superjomn in #5595
- [fix][ci] missing class names in post-merge test reports by @omera-nv in #5603
- refactor: [TRTLLM-6150] Refactor moe permute and finalize op by removing duplicated code by @limin2021 in #5557
- chore: remove cuda_graph_ prefix from cuda_graph_config field members by @nv-guomingz in #5585
- feat: AutoDeploy fp8 quantization support for bmm by @meenchen in #3849
- feature: unify new_tokens format sample state to trtllm sampler tokens format by @netanel-haber in #5513
- [fix]: Fix main test skip issue by @yizhang-nv in #5503
- chores: [TRTLLM-6072] 1.0 LLMAPI doc updates by @hchings in #5629
- add feature support matrix for PyTorch backend by @QiJune in #5037
- test: [CI] remove closed bugs by @xinhe-nv in #5572
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5569
- rcca: test default kv_cache_reuse option for pytorch multimodal by @StanleySun639 in #5544
- [TRTLLM-6104] feat: add request_perf_metrics to triton LLMAPI backend by @xuanzic in #5554
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5582
- feat: W4A16 GEMM by @danielafrimi in #4232
- test: Reduce number of C++ test cases by @Funatiq in #5437
- [https://nvbugs/5318059][test] Unwaive test by @pamelap-nvidia in #5624
- [Infra] - Add some timeout and unwaive a test which dev fixed by @EmmaQiaoCh in #5631
- [#5403][perf] Conditionally enable SWAP AB for speculative decoding by @zoheth in #5404
- [TRTLLM-5277] chore: refine llmapi examples for 1.0 (part1) by @Superjomn in #5431
- chore: Mass integration of release/0.21 by @dc3671 in #5507
- refactor: Clean up DecodingInput and DecodingOutput by @Funatiq in #5617
- perf: Use tokenizers API to optimize incremental detokenization perf by @kaiyux in #5574
- [feat] Support torch compile for attention dp by @liji-nv in #5086
- feat: add LlmArgs option to force using dynamic quantization by @achartier in #5346
- [TRTLLM-5644][infra] Update the community action to more appropriate api by @poweiw in #4883
- fix: add missing self. from PR #5346 by @achartier in #5653
- [Bug] attention DP doesn't work with embedding TP by @PerkzZheng in #5642
- fix: Add back allreduce_strategy parameter into TorchLlmArgs by @HuiGao-NV in #5637
- perf: better heuristic for allreduce by @yilin-void in #5432
- feat: fuse w4a8 moe pre-quant scale on Hopper by @xiaoweiw-nv in #5613
- [chore] 2025-07-02 update github CI allowlist by @niukuo in #5661
- doc: Add pd dynamic scaling readme by @Shunkangz in #5540
- chore: enhance yaml loading arbitrary options in LlmArgs by @Superjomn in #5610
- feat: PyTorch VSWA KVCacheManager by @qixiang-99 in #5151
- [TRTLLM-1316] refactor: Remove unnecessary pipeline parallelism logic from postProcessRequest by @Funatiq in #5489
- [https://nvbugspro.nvidia.com/bug/5329655] [feat] Pytorch path add spec dec param to attention op by @jhaotingc in #5146
- [Infra] - Set default timeout to 1hr and remove some specific settings by @EmmaQiaoCh in #5667
- [TRTLLM-6143] feat: Improve dev container tagging by @ixlmar in #5551
- feat:[AutoDeploy] E2E build example for llama4 VLM by @Fridah-nv in #3922
- fix: Fix missing arg to alltoall_prepare_maybe_dispatch by @syuoni in #5669
- [Infra] - Waive failed tests for main 0702 by @EmmaQiaoCh in #5671
- chore: bump version to 1.0.0rc2 by @yiqingy0 in #5645
- [TRTLLM-4923][feat] Enable CUDA graphs for Nemotron-H by @tomeras91 in #5646
- [Infra] - Fix test stage check for the package sanity check stage by @yiqingy0 in #5694
- [Infra] - Waive a failed case on main by @EmmaQiaoCh in #5702
- fix: Set init value for moe expert id by @WeiHaocheng in #5660
- [ci] small multigpu speedups by @omera-nv in #5643
- delete duplicate eagle3 and ngram tests by @netanel-haber in #5711
- chore: Remove unused isFullContextRequest method by @Funatiq in #5666
- chore: refine the default value by using pydantic default instead of … by @nv-guomingz in #5695
- [ModelLoad] Concurrent load model by @arekay in #5291
- [https://nvbugs/5365714] fix(scaffolding): use default LLM rather than trt backend LLM by @dc3671 in #5705
- [None][infra] Update the auto-community label action to be triggered every hour by @poweiw in #5658
- MTP and derivatives: Align sample state with trtllm sampler sample state by @netanel-haber in #5675
- [AutoDeploy] merge feat/ad-2025-06-29 by @lucaslie in #5737
- feat: support more parameters in openai worker of scaffolding by @ccs96307 in #5115
- Waive tests: test_openai_lora, test_trtllm_serve_lora_example and test_openai_chat_structural_tag_example by @venkywonka in #5740
- tests: waive failures on main by @xinhe-nv in #5704
- chore: Mass integration of release/0.21 by @dc3671 in #5701
- [fix: nvbugs/5355493] Correctly clamp max sequence len to max attention window by @netanel-haber in #5720
- Fix none response in PD by @Shunkangz in #5422
- feat: reduce unnecessary kernel generation by @tongyuantongyu in #5476
- chore: update doc by replacing use_cuda_graph with cuda_graph_config by @nv-guomingz in #5680
- Perf: reduce DeepEPLowLatency memory and time by @yuantailing in #5712
- [Infra] - Waive L0 test by @yiqingy0 in #5748
- fix: Improve chunking test and skip empty kernel calls by @Funatiq in #5710
- Cherry pick "[NVBUG:5355009] Modify check for fuse_fp4_quant on SM120" by @farazkh80 in #5724
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #5718
- fix: check file exists in dev container script by @ixlmar in #5755
- Raise shutdown error for each request by @Shunkangz in #4936
- [Infra] - Waive L0 flaky test by @yiqingy0 in #5759
- Fix: pass allreduce strategy to pytorchConfig by @HuiGao-NV in #5746
- Cache transceiver support VSWA by @chuangz0 in #5505
- [TRTLLM-3442] feat: added beam search support to the PyTorch Workflow by @stnie in #5333
- [fix] Update to properly set cuda graphs in trtllm-bench overrides. by @FrankD412 in #5634
- feat: KV events for sliding window attention by @jthomson04 in #5580
- Add dummy all_reduce for kernel breakdown by @qiaoxj07 in #5745
- Add wide-ep benchmarking scripts by @qiaoxj07 in #5760
- Improve documentation of Kv_block_array by @hypdeb in #5765
- [Infra] - Always use the x86 image for the Jenkins agent and a few clean-ups by @chzblych in #5753
- refactor: decoding inputs by @Funatiq in #5679
- [Test] - Waive or fix few known test failures by @chzblych in #5769
- [TRTLLM-5878] add stage for image registration to nspect by @niukuo in #5699
New Contributors
- @talorabr made their first contribution in #5376
- @Njuapp made their first contribution in #5420
- @meenchen made their first contribution in #3849
- @xuanzic made their first contribution in #5554
- @zoheth made their first contribution in #5404
- @ccs96307 made their first contribution in #5115
- @jthomson04 made their first contribution in #5580
Full Changelog: v1.0.0rc1...v1.0.0rc2