Announcement Highlights:

Model Support
- Add model gpt-oss (#6645)
- Support Aggregate mode for phi4-mm (#6184)
- Add support for Eclairv2 model - cherry-pick changes and minor fix (#6493)
- Support running heterogeneous model execution for Nemotron-H (#6866)
- Add whisper support (Bert Attention on SM100 and GPTAttention for cross attention on SM100) (#5527)

API
- BREAKING CHANGE Enable TRTLLM sampler by default (#6216)
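Note on the breaking change above: since #6216 switches the default to the TRTLLM sampler, sampling-sensitive workloads are worth a quick re-check after upgrading. The snippet below is a minimal sketch using the LLM API quickstart pattern (the model name is only a placeholder); it does not assume any particular option for restoring the previous sampler, so consult the documentation if you need to opt out.

```python
# Minimal sanity check of sampling behavior under the new default sampler.
# The model below is only a placeholder; substitute any supported checkpoint.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)
outputs = llm.generate(["The capital of France is"], sampling_params)

for output in outputs:
    # Each request result carries its generated completions in .outputs
    print(output.outputs[0].text)
```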

Benchmark

Feature
- Support LoRA reload CPU cache evicted adapter (#6510)
- Add FP8 context MLA support for SM120 (#6059)
- Enable guided decoding with speculative decoding (part 1: two-model engine) (#6300) (see the sketch after this list)
- Include attention dp rank info with KV cache events (#6563)
- Clean up ngram auto mode, add max_concurrency to configs (#6676)
- Add NCCL Symmetric Integration for All Reduce (#4500)
- Remove input_sf swizzle for module WideEPMoE (#6231)
- Enable guided decoding with disagg serving (#6704)
- Make fused_moe_cute_dsl work on blackwell (#6616)
- Move kv cache measure into transfer session (#6633)
- Optimize CUDA graph memory usage for spec decode cases (#6718)
- Core Metrics Implementation (#5785)
- Resolve KV cache divergence issue (#6628)
- AutoDeploy: Optimize prepare_inputs (#6634)
- Enable FP32 mamba ssm cache (#6574)
- Support SharedTensor on MultimodalParams (#6254)
- Improve dataloading for benchmark_dataset by using batch processing (#6548)
- Store the block of context request into kv cache (#6683)
- Add standardized GitHub issue templates and disable blank issues (#6494)
- Improve the performance of online EPLB on Hopper by better overlapping (#6624)
- Enable guided decoding with CUDA graph padding and draft model chunked prefill (#6774)
- CUTLASS MoE FC2+Finalize fusion (#3294)
- Add GPT OSS support for AutoDeploy (#6641)
- Add LayerNorm module (#6625)
- Support custom repo_dir for SLURM script (#6546)
- DeepEP LL combine FP4 (#6822)
- AutoTuner tuning config refactor and valid tactic generalization (#6545)
- Hopper W4A8 MoE supports ModelOpt ckpt for PyT backend (#6200)
- Add support for Hopper MLA chunked prefill (#6655)
- Helix: extend mapping to support different CP types (#6816)
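Several Feature items above extend guided decoding (#6300, #6704, #6774) to new execution paths. As a rough sketch of the underlying feature only, the example below constrains generation to a JSON schema; GuidedDecodingParams, the guided_decoding keyword, and the guided_decoding_backend argument are assumed from the LLM API docs and may differ by version, so treat the exact names as unverified.

```python
# Hedged sketch of guided (constrained) decoding with the LLM API.
# GuidedDecodingParams / guided_decoding_backend are assumed names; verify
# them against the installed tensorrt_llm version before relying on them.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.sampling_params import GuidedDecodingParams

schema = '{"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}'

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          guided_decoding_backend="xgrammar")

params = SamplingParams(
    max_tokens=64,
    # Constrain the output to the JSON schema above (assumed parameter name).
    guided_decoding=GuidedDecodingParams(json=schema),
)

for out in llm.generate(["Reply in JSON: what is 2 + 2?"], params):
    print(out.outputs[0].text)
```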

Documentation
- Remove the outdated features which marked as Experimental (#5995)
- Add LoRA feature usage doc (#6603)
- Add deployment guide section for VDR task (#6669)
- Add doc for multimodal feature support matrix (#6619)
- Move AutoDeploy README.md to torch docs (#6528)
- Add checkpoint refactor docs (#6592)
- Add K2 tool calling examples (#6667)
- Add the workaround doc for H200 OOM (#6853)
- Update moe support matrix for DS R1 (#6883)
- BREAKING CHANGE: Mismatch between docs and actual commands (#6323)

What's Changed
- Qwen3: Fix eagle hidden states by @IzzyPutterman in #6199
- [None][fix] Upgrade dependencies version to avoid security vulnerability by @yibinl-nvidia in #6506
- [None][chore] update readme for perf release test by @ruodil in #6664
- [None][test] remove trt backend cases in release perf test and move NIM cases to llm_perf_nim.yml by @ruodil in #6662
- [None][fix] Explicitly add tiktoken as required by kimi k2 by @pengbowang-nv in #6663
- [None][doc]: remove the outdated features which marked as Experimental by @nv-guomingz in #5995
- [https://nvbugs/5375966][chore] Unwaive test_disaggregated_deepseek_v3_lite_fp8_attention_dp_one by @yweng0828 in #6658
- [TRTLLM-6892][infra] Run guardwords scan first in Release Check stage by @yiqingy0 in #6659
- [None][chore] optimize kv cache transfer for context TEP and gen DEP by @chuangz0 in #6657
- [None][chore] Bump version to 1.1.0rc0 by @yiqingy0 in #6651
- [TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter by @amitz-nv in #6510
- [None][test] correct test-db context for perf yaml file by @ruodil in #6686
- [None] [feat] Add model gpt-oss by @hlu1 in #6645
- [https://nvbugs/5409414][fix] fix Not registered specs by @xinhe-nv in #6660
- [None][feat] : Add FP8 context MLA support for SM120 by @peaceh-nv in #6059
- [TRTLLM-6092][doc] Add LoRA feature usage doc by @shaharmor98 in #6603
- [TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) by @syuoni in #6300
- [TRTLLM-6881][feat] Include attention dp rank info with KV cache events by @pcastonguay in #6563
- [None][infra] Fix guardwords by @EmmaQiaoCh in #6711
- [None][package] Pin cuda-python version to >=12,<13 by @yiqingy0 in #6702
- [None][doc] Add deployment guide section to the official doc website by @nv-guomingz in #6669
- [None][fix] disagg ctx pp4 + gen pp4 integ test by @raayandhar in #6489
- [None][feat] Clean up ngram auto mode, add max_concurrency to configs by @mikeiovine in #6676
- [None][chore] Remove py_executor from disagg gh team by @pcastonguay in #6716
- [https://nvbugs/5423962][fix] Address broken links by @chenopis in #6531
- [None][fix] Migrate to new cuda binding package name by @tongyuantongyu in #6700
- [https://nvbugs/5410687][fix] Hopper w4a8 groupwise MoE interleave by @symphonylyh in #6708
- [None][feat] Add NCCL Symmetric Integration for All Reduce by @Tabrizian in #4500
- [TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default by @dcampora in #6216
- [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6719
- [TRTLLM-5252][test] add for mistral_small_3.1_24b perf test by @ruodil in #6685
- [TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE by @StudyingShao in #6231
- [None][fix] Fix unnecessary GPU synchronization in torch sampler caused by incorrect tensor reference by @zhanghaotong in #6626
- [TRTLLM-6854][feat] Enable guided decoding with disagg serving by @syuoni in #6704
- [TRTLLM-5252][fix] Propagate mapping to intermediate layers by @2ez4bz in #6611
- [None][test] fix yml condition error under qa folder by @ruodil in #6734
- [None][doc] Add doc for multimodal feature support matrix by @chang-l in #6619
- [TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell by @limin2021 in #6616
- [https://nvbugs/5436461][infra] Adjust free_gpu_memory_fraction of test_eagle3 to prevent OOM on CI by @leslie-fang25 in #6631
- [None][refactor] Combine resmooth_to_fp8_e8m0 and transform_sf_into_required_layout by @yuxianq in #6654
- [https://nvbugs/5437106][fix] Fix llama4 scout TRTLLM attn_backend by @JunyiXu-nv in #6690
- [None][fix] Remove lock related typo in py_executor by @lancelly in #6653
- [None][feat] move kv cache measure into transfer session by @zhengd-nv in #6633
- [None][fix]revert kvcache transfer by @chuangz0 in #6709
- [TRTLLM-6650][fix] Enhance CUDA graph + Beam search to correctly handle padding by @stnie in #6665
- [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #6184
- [None][feat] Optimize CUDA graph memory usage for spec decode cases by @mikeiovine in #6718
- [TRTLLM-7025][infra] Reorganize CODEOWNERS to rectify `examples` mapping by @venkywonka in #6762
- [None][doc] Move AutoDeploy README.md to torch docs by @Fridah-nv in #6528
- [None][fix] WAR GPT OSS on H20 with Triton MOE by @dongfengy in #6721
- [TRTLLM-6420][feat] add support for Eclairv2 model - cherry-pick changes and minor fix by @yibinl-nvidia in #6493
- [None][feat] Core Metrics Implementation by @hcyezhang in #5785
- [https://nvbugs/5398180][feat] Improve Llama4 performance for small max_seqlen cases by @nv-yilinf in #6306
- [TRTLLM-6637][feat] Resolve KV cache divergence issue by @ziyixiong-nv in #6628
- [None][infra] Waive test main 0808 by @EmmaQiaoCh in #6751
- [#5048][enhance] AutoDeploy: Optimize prepare_inputs by @galagam in #6634
- [None][chore] Dead code elimination, we no longer record/fetch through WindowBlockManager:: mContextBlocksByHash by @eopXD in #6249
- [TRTLLM-6174][feat] Enable FP32 mamba ssm cache by @shaharmor98 in #6574
- [https://nvbugs/5444937][fix] Fixing kv_cache_event unit test by @pcastonguay in #6753
- [TRTLLM-6823][doc] Add checkpoint refactor docs by @shaharmor98 in #6592
- [None][feat] Support SharedTensor on MultimodalParams by @yechank-nvidia in #6254
- [None][feat] improve dataloading for benchmark_dataset by using batch… by @zerollzeng in #6548
- [https://nvbugs/5431127][fix] Run test_disaggregated_deepseek_v3_lite_fp8_nixl[DeepSeek-V3-Lite-fp8] only on hopper by @bo-nv in #6736
- [None][fix] fix same pp disagg by @chuangz0 in #6730
- [None][feat] Add gpt-oss GSM8K test. by @Tracin in #6732
- [None][test] Test trtllm-bench AD vs. PT BEs on H100 single gpu by @MrGeva in #6487
- [TRTLLM-5633][infra] Force set changed file diff to empty string for post-merge CI by @yiqingy0 in #6777
- [None][chore] remove closed bugs by @xinhe-nv in #6772
- [None][infra] Waive failed tests on main 0811 by @EmmaQiaoCh in #6778
- fix: Ensure that Python stub generation works against libnvidia-ml stubs by @MartinMarciniszyn in #6188
- [TRTLLM-5532][feat] store the block of context request into kv cache by @byshiue in #6683
- [None][doc] Add K2 tool calling examples by @lancelly in #6667
- [None][infra] Unwaive an updated case to test by @EmmaQiaoCh in #6791
- [None][chore] always try-catch when clear build folder in build_wheel.py by @zhenhuaw-me in #6748
- [TRTLLM-6812][feat] Add standardized GitHub issue templates and disable blank issues by @venkywonka in #6494
- [None][fix] Refactoring to avoid circular import when importing torch models by @rakib-hasan in #6720
- [None][chore] Find LLM_ROOT and LLM_BACKEND_ROOT dynamically by @achartier in #6763
- [https://nvbugs/5385987][fix] Fix Qwen2 quantization issue by pinning transformers version by @chang-l in #6673
- [None][perf] Improve the performance of online EPLB on Hopper by better overlapping by @jinyangyuan-nvidia in #6624
- [https://nvbugs/5441438][fix] Set correct draft length for the cuda graph dummy request by @ziyixiong-nv in #6701
- [TRTLLM-6854][feat] Enable guided decoding with CUDA graph padding and draft model chunked prefill by @syuoni in #6774
- [#4403][autodeploy] Refactor: Move more transformations to new inf optimizer, Add quantization_source to factory interface by @Fridah-nv in #6760
- [None][feat] CUTLASS MoE FC2+Finalize fusion by @sklevtsov-nvidia in #3294
- [TRTLLM-6906][chore] Using pybind to bind functions in thop/attentionOp by @lancelly in #6745
- [None][fix] Fix attention dp log by @Shunkangz in #6570
- [None][fix] fix ci by @QiJune in #6814
- [TRTQA-2920][chore] improve hang tests by @xinhe-nv in #6781
- [https://nvbugs/5438869][fix] Set nvfp4 expert w1 w3 weight scale to the same value if they're not by @jhaotingc in #6656
- [None][feat] Add GPT OSS support for AutoDeploy by @nvchenghaoz in #6641
- [#6187][feat] add LayerNorm module by @Funatiq in #6625
- [None][refactor] Simplify decoder state initialization by @Funatiq in #6559
- [TRTLLM-7008][fix] fix wideEP weights loading and args by @dongxuy04 in #6789
- [None][fix] Refactoring input prep to allow out-of-tree models by @rakib-hasan in #6497
- feat: Support custom repo_dir for SLURM script by @kaiyux in #6546
- [None][fix] Pre-allocate workspaces for DeepGEMM MoE to avoid frequent cudaFree/cudaMalloc by @lfr-0531 in #6811
- [TRTLLM-6772][feat] Multimodal benchmark_serving support by @yechank-nvidia in #6622
- [https://nvbugs/5452167][fix] Fix ngram padding issue by @mikeiovine in #6837
- [#6530][fix] Fix script when using calibration tensors from modelopt by @achartier in #6803
- [https://nvbugs/5412456][fix] Fix an illegal instruction was encountered by @zhou-yuxin in #6776
- [None][feat] DeepEP LL combine FP4 by @yilin-void in #6822
- [TRTLLM-4501][feat] AutoTuner tuning config refactor and valid tactic generalization. by @hyukn in #6545
- [TRTLLM-7030][fix] Refactor the example doc of dist-serving by @Shixiaowei02 in #6766
- [TRTLLM-7093][fix] the perf regression to cvt_fp4 kernels by @PerkzZheng in #6851
- [https://nvbugs/5412885][doc] Add the workaround doc for H200 OOM by @zhenhuaw-me in #6853
- [https://nvbugs/5378031] [feat] Hopper W4A8 MoE supports ModelOpt ckpt for PyT backend by @rosenrodt in #6200
- [None][infra] Waive failed cases on main by @EmmaQiaoCh in #6863
- [None][feat] Support running heterogeneous model execution for Nemotron-H by @danielafrimi in #6866
- [https://nvbugs/5302040][feat] Add whisper support (Bert Attention on SM100 and GPTAttention for cross attention on SM100) by @wu6u3tw in #5527
- [https://nvbugs/5394685][fix] the bug with spec-decoding + SWA && an accuracy issue related to 2CTA MLA by @PerkzZheng in #6834
- [https://nvbugs/5410399][chore] Unwaive mtp llmapi test by @mikeiovine in #6833
- [None][fix] max_num_sequences argument in nanobind by @Linda-Stadter in #6862
- [None][feat] Add test for speculative rejection sampler (2-model) by @IzzyPutterman in #6542
- [None][chore] fix markdown format for the deployment guide by @zhenhuaw-me in #6879
- [None][feat] Add support for Hopper MLA chunked prefill by @jmydurant in #6655
- [TRTLLM-6675][infra] Cherry-pick #6623 by @bo-nv in #6735
- [https://nvbugs/5427043][fix] request length exceeds max_num_tokens by @Superjomn in #6821
- [None][fix] Add FP4 all2all unitest and fix a bug for module WideEPMoE by @StudyingShao in #6784
- [None][doc] update moe support matrix for DS R1 by @litaotju in #6883
- [None][test] Add perf-sweep scripts by @chenfeiz0326 in #6738
- [TRTLLM-7030][fix] BREAKING CHANGE: Mismatch between docs and actual commands by @Shixiaowei02 in #6323
- [https://nvbugs/5445466][fix] fix deepseek r1 hang by not enabling mnnvl by default by @pengbowang-nv in #6860
- [TRTLLM-6853][feat] refactor deepseekv3 model by @kris1025 in #6698
- [None][fix] Fix python-only build that uses TRTLLM_USE_PRECOMPILED by @jiaganc in #6825
- [None][infra] Waive failed cases on main 08/14 by @EmmaQiaoCh in #6902
- [TRTLLM-5966][feat] Helix: extend mapping to support different CP types by @MatthiasKohl in #6816
- [https://nvbugs/5450262][fix] Fix unsupported alltoall use case by @bobboli in #6882

New Contributors
- @chenopis made their first contribution in #6531
- @zhanghaotong made their first contribution in #6626
- @lancelly made their first contribution in #6653
- @hcyezhang made their first contribution in #5785
- @galagam made their first contribution in #6634
- @MrGeva made their first contribution in #6487
- @sklevtsov-nvidia made their first contribution in #3294
- @nvchenghaoz made their first contribution in #6641
- @jiaganc made their first contribution in #6825
Full Changelog: v1.0.0rc6...v1.1.0rc0