Announcement Highlights:
Model Support
- Refactor llama4 for multimodal encoder in-flight batching (IFB) (#6844)
API
- Add standalone multimodal encoder (#6743)
- Enable Cross-Attention to use XQA kernels for Whisper (#7035)
- Enable nanobind as the default binding library (#6608)
- Integrate trtllm-serve with AutoDeploy (#7141)
- Chat Completions API for gpt-oss (#7261); see the sketch after this list
- KV Cache Connector API (#7228)
- Create PyExecutor from TorchLlmArgs Part 1 (#7105)
- Read TP sharding from the model config (#6972)
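
The new Chat Completions support (#7261) rides on the existing OpenAI-compatible trtllm-serve endpoint, so standard OpenAI clients should work unchanged. A minimal sketch, assuming gpt-oss is served locally with something like `trtllm-serve openai/gpt-oss-20b`; the model name and port are illustrative assumptions, not taken from the PR:

```python
# Minimal sketch: query a local trtllm-serve instance through the standard
# OpenAI Python client. Assumes the server was started with something like
# `trtllm-serve openai/gpt-oss-20b` and listens on the default port 8000.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # OpenAI-compatible endpoint of trtllm-serve
    api_key="not-used",                   # a local server needs no real key
)

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",           # must match the served model name
    messages=[{"role": "user", "content": "Give me one fun fact about GPUs."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```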
Benchmark
- Add llama4 TP4 tests (#6989)
- Add test_multi_nodes_eval tests (#7108)
- Add a kernel classifier for nsys profile output (#7020)
- Add KV cache size to bench metrics and fix failed cases (#7160)
- Add a perf metrics endpoint to the OpenAI server and OpenAI disagg server (#6985); see the sketch after this list
- Add gpt-oss tests to the sanity list (#7158)
- Add an L20-specific QA test list (#7067)
- Add beam search CudaGraph + Overlap Scheduler tests (#7326)
- Update qwen3 timeout to 60 minutes (#7200)
- Update maxnt of llama_v3.2_1b bench (#7279)
- Improve performance of PyTorchModelEngine._get_lora_params_from_requests (#7033)
- Accelerate global scale calculations for deepEP fp4 combine (#7126)
- Remove and fuse some element-wise ops in the ds-r1-fp8 model (#7238)
- Balance requests based on the number of tokens in AttentionDP (#7183)
- Wrap swiglu in a custom op to avoid a redundant device copy (#7021)
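
For the new perf metrics endpoint (#6985), a plain HTTP GET is enough to inspect what the server reports. A minimal sketch; the `/perf_metrics` route and JSON payload are assumptions here, so check the serve docs for the exact path:

```python
# Minimal sketch: poll the perf metrics endpoint of a locally running
# OpenAI-compatible trtllm-serve instance. The /perf_metrics route is an
# assumption for illustration; the server may expose a different path.
import json

import requests

resp = requests.get("http://localhost:8000/perf_metrics", timeout=5)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))  # e.g. per-request timing and KV cache stats
```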
Feature
- Add QWQ-32b torch test (#7284)
- Fix llama4 multimodal by skipping request validation (#6957)
- Add group attention pattern for solar-pro-preview (#7054)
- Add Mistral Small 3.1 multimodal in Triton Backend (#6714)
- Update LoRA for phi4-mm (#6817)
- Refactor the CUDA graph runner to manage all CUDA graphs (#6846)
- Enable chunked prefill for Nemotron-H (#6334); see the sketch after this list
- Add customized default routing method (#6818)
- Testing cache transmission functionality in Python (#7025)
- Simplify decoder state initialization for speculative decoding (#6869)
- Support MMMU for multimodal models (#6828)
- DeepSeek: start Eagle work (#6210)
- Optimize and refactor alltoall in WideEP (#6973)
- Apply AutoTuner to fp8_block_scale_deep_gemm to trigger JIT ahead of time (#7113)
- Hopper FP8 context MLA (#7116)
- Padding for piecewise CUDA graph (#6750)
- Add low-precision all2all for MNNVL (#7155)
- Use NUMA to bind CPUs (#7304)
- Skip prefetching consolidated safetensors when appropriate (#7013)
- Unify the sampler's handle-logits implementation (#6867)
- Move fusion, kvcache, and compile to modular inference optimizer (#7057)
- Make finalize fusion part of the tactic selection logic (#6915)
- Fuse slicing into MoE (#6728)
- Add logging for the OpenAI disagg server (#7232)
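
To illustrate the chunked-prefill change (#6334): with the LLM API, a long prompt can be processed in fixed-size prefill chunks instead of one large prefill pass, which bounds per-iteration activation memory. A minimal sketch, assuming the documented `enable_chunked_prefill` flag applies to Nemotron-H as it does to other models; the model id and token budget are illustrative assumptions:

```python
# Minimal sketch: enable chunked prefill through the PyTorch LLM API.
# The model id and numbers below are illustrative assumptions.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Nemotron-H-8B-Base-8K",  # assumed model id for illustration
    enable_chunked_prefill=True,           # split long prompts into prefill chunks
    max_num_tokens=2048,                   # per-iteration token budget; bounds chunk size
)
outputs = llm.generate(
    ["Summarize chunked prefill in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```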
Documentation
- Update gpt-oss deployment guide to latest release image (#7101)
- Update stale link for AutoDeploy (#7135)
- Add GPT-OSS Deployment Guide into official doc site (#7143)
- Refine GPT-OSS doc (#7180)
- Update the feature_combination_matrix doc (#6691)
- Update the disagg doc about UCX_MAX_RNDV_RAILS (#7205)
- Display tech blog for nvidia.github.io domain (#7241)
- Update blog9_Deploying_GPT_OSS_on_TRTLLM (#7260)
- Update autodeploy README.md, deprecate lm_eval in examples folder (#7233)
- Add the ADP balance tech blog (#7213)
- Fix doc formula (#7367)
- Update disagg README and scripts for pipeline parallelism (#6875)
What's Changed
- [None][fix] Fix assertion errors of quantization when using online EPLB by @jinyangyuan-nvidia in #6922
- [None][autodeploy] Add group attention pattern that supports attention masks by @Fridah-nv in #7054
- [None][chore] unwaive test_disaggregated_genbs1 by @bo-nv in #6944
- [None][fix] fix llmapi import error by @crazydemo in #7030
- [TRTLLM-7326][feat] Add standalone multimodal encoder by @chang-l in #6743
- [None][infra] update feature_combination_matrix of disaggregated and chunked prefill by @leslie-fang25 in #6661
- [TRTLLM-7205][feat] add llama4 tp4 tests by @xinhe-nv in #6989
- [None][infra] "[TRTLLM-6960][fix] enable scaled_mm tests (#6936)" by @Tabrizian in #7059
- [TRTLLM-6341][chore] Preliminary refactors on the kv cache manager before supporting swa kv cache reuse by @eopXD in #6767
- [None][fix] fix scaffolding dynasor test by @dc3671 in #7070
- [None][chore] Update namelist in blossom-ci by @karljang in #7015
- [None][ci] move unittests to sub-directories by @Funatiq in #6635
- [None][infra] Waive failed tests on main branch 8/20 by @EmmaQiaoCh in #7092
- [None][fix] Fix W4A8 MoE kernel issue by @yuhyao in #7072
- [TRTLLM-7348] [feat] Enable Cross-Attention to use XQA kernels for Whisper by @DomBrown in #7035
- [None][chore] Only check the bindings lib for current build by @liji-nv in #7026
- [None][ci] move some tests of b200 to post merge by @QiJune in #7093
- [https://nvbugs/5457489][fix] unwaive some tests by @byshiue in #6991
- [TRTLLM-6771][feat] Support MMMU for multimodal models by @yechank-nvidia in #6828
- [None][fix] Fix llama4 multimodal by skipping request validation by @chang-l in #6957
- [None][infra] Upgrade UCX to v1.19.x and NIXL to 0.5.0 by @BatshevaBlack in #7024
- [None][fix] update accelerate dependency to 1.7+ for AutoDeploy by @Fridah-nv in #7077
- [None][fix] Fix const modifier inconsistency in log function declaration/implementation by @Fan-Yunfan in #6679
- [None][chore] waive failed cases on H100 by @xinhe-nv in #7084
- [None][fix] Use safeInitRowMax instead of fp32_lowest to avoid NaN by @lowsfer in #7087
- [https://nvbugs/5443039][fix] Fix AutoDeploy pattern matcher for torch 2.8 by @Fridah-nv in #7076
- [https://nvbugs/5437405][fix] qwen3 235b eagle3 ci by @byshiue in #7000
- [None][doc] Update gpt-oss deployment guide to latest release image by @farshadghodsian in #7101
- [https://nvbugs/5392414] [fix] Add customized default routing method by @ChristinaZ in #6818
- [https://nvbugs/5453827][fix] Fix RPATH of th_common shared library to find pip-installed NCCL by @tongyuantongyu in #6984
- [None][chore] No-op changes to support context parallelism in disaggregated serving later by @brb-nv in #7063
- [https://nvbugs/5394409][feat] Support Mistral Small 3.1 multimodal in Triton Backend by @dbari in #6714
- [None][infra] Waive failed case for main branch 08/21 by @EmmaQiaoCh in #7129
- [#4403][refactor] Move fusion, kvcache, and compile to modular inference optimizer by @Fridah-nv in #7057
- [None][perf] Make finalize fusion part of the tactic selection logic by @djns99 in #6915
- [None][chore] Mass integration of release/1.0 by @dominicshanshan in #6864
- [None][docs] update stale link for AutoDeploy by @suyoggupta in #7135
- [TRTLLM-6825][fix] Update lora for phi4-mm by @Wanli-Jiang in #6817
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7109
- [None][fix] Fix mm_placholder_counts extraction issue. by @hyukn in #7118
- [TRTLLM-7155][feat] Unify sampler handle logits implementation. by @dcampora in #6867
- [TRTLLM-5801][infra] Add more RTX Pro 6000 test stages by @EmmaQiaoCh in #5126
- [None][feat] Enable nanobind as the default binding library by @Linda-Stadter in #6608
- [TRTLLM-7321][doc] Add GPT-OSS Deployment Guide into official doc site by @dongfengy in #7143
- [TRTLLM-7245][feat] add test_multi_nodes_eval tests by @xinhe-nv in #7108
- [None][ci] move all B200 TensorRT test cases to post merge by @QiJune in #7165
- [None][chore] Bump version to 1.1.0rc2 by @yiqingy0 in #7167
- [#7136][feat] trtllm-serve + autodeploy integration by @suyoggupta in #7141
- [TRTLLM-4921][feat] Enable chunked prefill for Nemotron-H by @tomeras91 in #6334
- [None][refactor] Simplify decoder state initialization for speculative decoding by @Funatiq in #6869
- [None][feat] Deepseek: Start Eagle work by @IzzyPutterman in #6210
- [None][fix] Correct KV cache percentage report out. by @FrankD412 in #7102
- [None] [feat] nsys profile output kernel classifier by @gracehonv in #7020
- [None][fix] Waive test by @Tabrizian in #7185
- [https://nvbugs/5467232][fix] Fix load_torch_hf_lora to override lora_config.trtllm_modules_to_hf_modules with default only when it has no value by @amitz-nv in #7132
- [TRTLLM-6743][feat] Optimize and refactor alltoall in WideEP by @dongxuy04 in #6973
- [TRTLLM-7321][doc] Refine GPT-OSS doc by @dongfengy in #7180
- [None][infra] Prepare for single GPU GB200 test pipeline by @chzblych in #7073
- [None][chore] Enable auto deploy accuracy test in CI by @ajrasane in #7179
- [None] [ci] Reorganize CMake and Python integration test infrastructure for C++ tests by @Funatiq in #6754
- [None][infra] Split DGX_B200 stage into multiple parts and pre-/post-merge by @yiqingy0 in #7074
- [TRTLLM-7096][infra] Testing cache transmission functionality in Python by @bo-nv in #7025
- [None][feat] add gpt-oss tests to sanity list by @xinhe-nv in #7158
- [None][chore] cherry-pick 6940 by @bo-nv in #7097
- [None][feat] Apply AutoTuner to fp8_block_scale_deep_gemm to trigger JIT ahead of time. by @hyukn in #7113
- [None][ci] waive test_mamba2_chunk_scan_combined_prefill_chunking[seqlens1-8] by @QiJune in #7194
- [None][test] add l20 specific qa test list by @crazydemo in #7067
- [None][fix] Fix MoE load balancer config loading by @syuoni in #7150
- [TRTLLM-7346][fix] Improve performance of PyTorchModelEngine._get_lora_params_from_requests by @amitz-nv in #7033
- [None][chore] remove CLI support for mamba cache dtype setting by @shaharmor98 in #7119
- [None][refactor] refactor the CUDA graph runner to manage all CUDA graphs by @QiJune in #6846
- [None][infra] Waive failed tests on main branch by @EmmaQiaoCh in #7201
- [https://nvbugs/5440241][fix] Fix 70B GSM8K Accuracy drop by @chenfeiz0326 in #6967
- [None][fix] Update to pull LLM from a central location. by @FrankD412 in #6458
- [None][chore] Refactored the handle logits pp communication by @dcampora in #7154
- [TRTLLM-7319][perf] Fuse slicing into MoE. by @bobboli in #6728
- [None][fix][AutoDeploy] canonicalize_graph before shape prop for consistent state_dict by @lucaslie in #7223
- [TRTLLM-6342][feat] TP Sharding read from the model config by @greg-kwasniewski1 in #6972
- [None][doc] update feature_combination_matrix doc by @leslie-fang25 in #6691
- [None][test] add kv cache size in bench metric and fix failed cases by @ruodil in #7160
- [None][chore] Create PyExecutor from TorchLlmArgs Part 1 by @leslie-fang25 in #7105
- [https://nvbugs/5452463][doc] update disagg doc about UCX_MAX_RNDV_RAILS by @zhengd-nv in #7205
- [None][feat] Skip prefetching consolidated safetensors when appropriate by @2ez4bz in #7013
- [None] [fix] improve kvcache allocation in PyTorch runtime by @qixiang-99 in #5933
- [None][chore] Update CI allowlist 2025-08-25 by @yuanjingx87 in #7229
- [None][test] Update qwen3 timeout to 60 minutes by @nvamyt in #7200
- [https://nvbugs/5457504][fix] fix kv cache event test in disaggregated worker tests by @zhengd-nv in #7028
- [TRTLLM-6549][feat] add perf metrics endpoint to openai server and openai disagg server by @zhengd-nv in #6985
- [None][doc] Display tech blog for nvidia.github.io domain. by @nv-guomingz in #7241
- [https://nvbugs/5477332][fix] Relax atol in test_mamba2_chunk_scan_combined_prefill_chunking by @amitz-nv in #7215
- [None][feat] Hopper Fp8 context mla by @zhou-yuxin in #7116
- [None][infra] Add retry 3 times if ssh cluster failed by @EmmaQiaoCh in #6859
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7251
- [None][fix] Updated blog9_Deploying_GPT_OSS_on_TRTLLM by @Maurits-de-Groot in #7260
- [None][ci] move qwen3 tests from b200 to gb200 by @QiJune in #7257
- [None][perf] Accelerate global scale calculations for deepEP fp4 combine by @yilin-void in #7126
- [None][fix] Fix data type of KV Cache percentage in bench. by @FrankD412 in #7230
- [None][doc] Update autodeploy README.md, deprecate lm_eval in examples folder by @Fridah-nv in #7233
- [None][update] Update disagg code owners by @Tabrizian in #7266
- [TRTLLM-6633][feat] Padding for piecewise cudagraph by @liji-nv in #6750
- [https://nvbugs/5412456][fix] Remove from waives.txt by @zhou-yuxin in #7248
- [None][fix] Remove and fuse some element-wise ops in the ds-r1-fp8 model by @lfr-0531 in #7238
- [None][opt] Balance the request based on number of tokens in AttentionDP by @Shunkangz in #7183
- [TRTLLM-6960][fix] replace flaky scaled_mm test with more stable config by @dc3671 in #7089
- [None][feat] Add logging for OAI disagg server by @Tabrizian in #7232
- [TRTLLM-7457][ci] Update & cleanup unittest parallel config by @tongyuantongyu in #7254
- [None][chore] update disagg readme and scripts for pipeline parallelism by @raayandhar in #6875
- [None][chore] Wrap the swiglu into custom op to avoid redundant device copy. by @hyukn in #7021
- [None][fix] Fix possible hang issue in WideEP and move some tests to pre-merge by @dongxuy04 in #7262
- [None][ci] remove test_llm_api_autodeploy from B200 test db by @QiJune in #7282
- [https://nvbugs/5453727][fix] Fix bug of how GPT-OSS setup the parameters in CI by @byshiue in #7151
- [None][fix] Update maxnt of llama_v3.2_1b bench by @nvamyt in #7279
- [None][refactor] Move draft token padding out of Drafter by @mikeiovine in #7134
- [TRTLLM-7250][fix] waive failed cases by @xinhe-nv in #7292
- [None][infra] Waive failed tests on main 08/27 by @EmmaQiaoCh in #7300
- [None][ci] parallelize unit tests of auto deploy in B200 by @QiJune in #7291
- [https://nvbugs/5458798][fix] AD perf test outliers handling, tightened threshold, re-enabled in CI, fixed mem threshold by @MrGeva in #7189
- [https://nvbugs/5453727][fix] unwaive qwen3 CI tests by @byshiue in #7293
- [None][fix] Remove the wheel from intermediate docker storage by @MartinMarciniszyn in #7175
- [None] [chore] Make disagg example compatible with recommended usage by @kaiyux in #7121
- [TRTLLM-6822][infra] Add PR-Checklist github action and modify PR template by @venkywonka in #6029
- [TRTLLM-7207][feat] Chat completions API for gpt-oss by @LinPoly in #7261
- [None][ci] fix test list name by @QiJune in #7321
- [None][fix] Disable mandatory PR checklist enforcement by @venkywonka in #7325
- [https://nvbugs/5430124][ci] Unwaive Mistral 3.1 Small tests by @2ez4bz in #7274
- [None][ci] skip TestGPTOSS by @QiJune in #7333
- [TRTLLM-6876][feat] Add low precision all2all for mnnvl by @zongfeijing in #7155
- [None] [feat] Use numa to bind CPU by @kaiyux in #7304
- [https://nvbugs/5474453][fix] fix path to tested model by @nzmora-nvidia in #7272
- [None][doc] add adp balance blog by @yunruis in #7213
- [None][infra] Waive failed tests on main branch 08/26 by @EmmaQiaoCh in #7346
- [None][fix] mxfp4 padding bug for TRT-LLM and CUTLASS MoE backends by @nekorobov in #7214
- [None][chore] Some improvements for CI stability by @chzblych in #7199
- [None][feat] Refactor llama4 for multimodal encoder IFB by @dongfengy in #6844
- [https://nvbugs/5445466][fix] Bypass MLP TP split for MNNVL in DeepSeek V3 to avoid hanging. by @timlee0212 in #6886
- [TRTLLM-7457][ci] Update unittest parallel config by @tongyuantongyu in #7297
- [None][perf] Disable Swap AB when num tokens exceeds N dimension by @djns99 in #7104
- [TRTLLM-6646][test] NIM migration to TRT-LLM LLMAPI : Add QWQ-32b torch test by @aalanwyr in #7284
- [None][feat] KV Cache Connector API by @richardhuo-nv in #7228
- [None] [chore] Update .coderabbit.yaml review configuration by @venkywonka in #7351
- [https://nvbugs/5445466][fix] Eliminate race when loading HF dynamic modules by @chang-l in #7268
- [TRTLLM-7280][test] Add beam search CudaGraph + Overlap Scheduler tests by @fredricz-20070104 in #7326
- [None][fix] fix doc formula by @yunruis in #7367
- [https://nvbugs/5481385][fix] Fix max_seq_len in cuda graph warmup and intermediate_size in fused_moe_deepgemm by @lfr-0531 in #7345
- [None][chore] Update pre-merge test to add DeepSeek/LLaMA and gpt-oss by @pengbowang-nv in #7192
- [None][infra] Waive failed tests on main branch 08/29 by @EmmaQiaoCh in #7370
New Contributors
- @karljang made their first contribution in #7015
- @yuhyao made their first contribution in #7072
- @gracehonv made their first contribution in #7020
- @greg-kwasniewski1 made their first contribution in #6972
- @nvamyt made their first contribution in #7200
- @Maurits-de-Groot made their first contribution in #7260
- @nzmora-nvidia made their first contribution in #7272
- @aalanwyr made their first contribution in #7284
Full Changelog: v1.1.0rc1...v1.1.0rc2