NVIDIA/TensorRT-LLM v1.0.0rc4

Pre-release

Announcement Highlights:

  • Model Support
    • Add phi-4-multimodal model support (#5644)
    • Add EXAONE 4.0 model support (#5696)
  • Feature
    • Add support for two-model engine KV cache reuse (#6133)
    • Unify name of NGram speculative decoding (#5937)
    • Add retry knobs and handling in disaggregated serving (#5808)
    • Add Eagle-3 support for qwen3 dense model (#5879)
    • Remove padding of FusedMoE in attention DP (#6064)
    • Enhance handling of decoder requests and logits within the batch manager (#6055)
    • Add support for Modelopt fp8_pb_wo quantization scheme (#6106)
    • Update deepep dispatch API (#6037)
    • Add support for benchmarking individual gemms in MOE benchmark (#6080)
    • Simplify token availability calculation for VSWA (#6134)
    • Migrate EAGLE3 and draft/target speculation to Drafter (#6007)
    • Enable guided decoding with overlap scheduler (#6000)
    • Use cacheTransceiverConfig as knobs for disagg service (#5234)
    • Add vectorized loading for finalize kernel in MoE Trtllm backend (#5919)
    • Enhance ModelConfig for kv cache size calculations (#5868)
    • Clean up drafter/resource manager creation logic (#5805)
    • Add core infrastructure to enable loading of custom checkpoint formats (#5372)
    • Cleanup disable_fp4_allgather (#6006)
    • Use session abstraction in data transceiver and cache formatter (#5611)
    • Add support for Triton request cancellation (#5898)
    • Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs (#5684) (see the sketch after the highlights list)
    • Remove enforced sorted order of batch slots (#3502)
    • Use huge page mapping for host accessible memory on GB200 (#5963)
  • API
    • [BREAKING CHANGE] Unify KvCacheConfig in LLM class for pytorch backend (#5752)
    • [BREAKING CHANGE] Rename cuda_graph_config padding_enabled field (#6003) (a migration sketch for both breaking changes follows the highlights list)
  • Bug Fixes
    • Skip prompt length checking for generation only requests (#6146)
    • Avoid memory calls during broadcast for single GPU (#6010)
    • Record kv-cache size in MLACacheFormatter (#6181)
    • Always use py_seq_slot in runtime (#6147)
    • Update beam search workspace estimation to new upper bound (#5926)
    • Update disaggregation handling in sampler (#5762)
    • Fix TMA error with GEMM+AR on TP=2 (#6075)
    • Fix scaffolding aime test in test_e2e (#6140)
    • Fix KV Cache overrides in trtllm-bench (#6103)
    • Remove duplicated KVCache transmission check (#6022)
    • Release slots with spec decode + disagg (#5975) (#6032)
    • Add propagation for trust_remote_code to AutoConfig (#6001)
    • Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop (#6053)
    • Pad DeepEP fp4 recv tensors if empty (#6048)
    • Adjust window sizes of VSWA at torch backend (#5880)
    • Fix MOE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
    • Fix eagle3 two model disaggregated serving test (#6014)
    • Update torch.compile option to fix triton store_cubin error (#5865)
    • Fix chunked prefill + overlap scheduling (#5761)
    • Fix mgmn postprocess error (#5835)
    • Fall back to cubins for fp8 fmha kernels on Ada (#5779)
    • Enhance _check_arguments to filter illegal requests for pytorch backend (#5541)
    • Rewrite completion API to avoid repetitive tokens (#5201)
    • Fix disagg + speculative decoding (#5558)
  • Benchmark
    • Add latency support for trtllm bench (#3730)
  • Performance
    • Optimize TRTLLM Sampler perf single beam single step (#5550)
    • Performance Optimization for MNNVL TwoShot Kernel (#5925)
    • Enable 128x256 tile shapes for FP4 MOE CUTLASS backend (#5986)
    • Enable cuda graph by default (#5480)
  • Infrastructure
    • Add script to map tests <-> jenkins stages & vice-versa (#5431)
    • Speedup beam search unit tests with fixtures for LLM (#5843)
    • Fix issue where a failed single-GPU stage would not raise an error (#6165)
    • Update bot help messages (#5277)
    • Update jenkins container images (#6094)
    • Set up the initial config for CodeRabbit (#6128)
    • Upgrade NIXL to 0.3.1 (#5991)
    • Upgrade modelopt to 0.33 (#6058)
    • Show the full stage name list when the stage name check fails (#5946)
    • Run docs build only if PR contains only doc changes (#5184)
  • Documentation
    • Update broken link of PyTorchModelEngine in arch_overview (#6171)
    • Add initial documentation for the trtllm-bench CLI (#5734)
    • Add documentation for eagle3+disagg+dynamo (#6072)
    • Update llama-3.3-70B guide (#6028)
  • Known Issues

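Migration note for the two breaking API changes above: the sketch below shows how the unified KvCacheConfig and the renamed cuda_graph_config field are expected to be used together with the PyTorch-backend LLM API, and reflects CUDA graphs now being on by default (#5480). This is a minimal sketch, not an excerpt from the release; the field name enable_padding and the model id are assumptions to verify against the shipped API.

```python
# Minimal migration sketch (assumptions noted inline; verify against v1.0.0rc4).
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import CudaGraphConfig, KvCacheConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id or local path
    # #5752: KV cache knobs now go through the unified KvCacheConfig on the LLM class.
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.8),
    # #5480 enables CUDA graphs by default; pass cuda_graph_config=None to opt out.
    # #6003 renamed the old padding_enabled field (assumed new name: enable_padding).
    cuda_graph_config=CudaGraphConfig(enable_padding=True),
)
print(llm.generate("Hello, my name is").outputs[0].text)
```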
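Likewise for #5684: TRTLLM_DEEP_EP_TOKEN_LIMIT is an environment variable, so a hedged sketch is simply setting it before TensorRT-LLM initializes; the value and exact semantics below are illustrative, not prescriptive.

```python
import os

# #5684 (assumed semantics): cap the tokens handled per DeepEP dispatch so that
# memory-constrained GPUs can still run DeepEP. 1024 is an arbitrary example value.
os.environ["TRTLLM_DEEP_EP_TOKEN_LIMIT"] = "1024"
```
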
What's Changed

  • [TRTLLM-6164][TRTLLM-6165] chore: add runtime example for pytorch by @Superjomn in #5956
  • fix: Fix MoE benchmark by @syuoni in #5966
  • [TRTLLM-6160] chore: add sampling examples for pytorch by @Superjomn in #5951
  • Use huge page mapping for host accessible memory on GB200 by @dongxuy04 in #5963
  • Breaking change: perf: [TRTLLM-4662] Enable cuda graph by default by @dominicshanshan in #5480
  • fix: set allreduce strategy to model config by @WeiHaocheng in #5955
  • chore: Mass integration of release/0.21 (part 3) by @dc3671 in #5909
  • infra: [TRTLLM-6242] install cuda-toolkit to fix sanity check by @ZhanruiSunCh in #5709
  • Waive L0 test by @yiqingy0 in #6002
  • [Nvbug/5383670] fix: switch test case to non-fp4 ckpt for more GPU coverage by @kaiyux in #5882
  • fix #4974: A thread leak issue in scaffolding unittest by @ccs96307 in #5020
  • feat: EXAONE4.0 support by @yechank-nvidia in #5696
  • [TRTLLM-5653][infra] Run docs build only if PR contains only doc changes by @zhanga5 in #5184
  • feat: Update Gemma3 Vision Encoder by @brb-nv in #5973
  • enh: Bidirectional mask with multiple images for Gemma3 by @brb-nv in #5976
  • refactor: Remove enforced sorted order of batch slots by @Funatiq in #3502
  • [fix] fix eagle3 two model disaggregated serving test by @Tabrizian in #6014
  • perf: Enable 128x256 tile shapes for FP4 MOE CUTLASS backend by @djns99 in #5986
  • [nvbugs-5318143] fix: restrict PyTorch memory usage to avoid OOMs by @ixlmar in #5964
  • doc: update EXAONE 4.0 news by @yechank-nvidia in #6034
  • [Model load] Fix llama min-latency model load by @arekay in #5883
  • fix: Fix MOE benchmark to rotate buffers to prevent L2 cache reuse by @djns99 in #4135
  • Doc: Update llama-3.3-70B guide by @jiahanc in #6028
  • infra: [TRTLLM-6331] Show the full stage name list when the stage name check fails by @ZhanruiSunCh in #5946
  • [Infra][TRTLLM-6013] - Fix stage name in single stage test rerun report by @yiqingy0 in #5672
  • [Fix] check for ImportError or ModuleNotFoundError for deep_ep_utils by @lucaslie in #6026
  • infra: [TRTLLM-6313] Fix the package sanity stage 'Host Node Name' in… by @ZhanruiSunCh in #5945
  • chore: [Breaking Change] Rename cuda_graph_config padding_enabled field by @nv-guomingz in #6003
  • test: add recursive updating pytorch config and change MOE backend format in perf test by @ruodil in #6046
  • test: add llama_v3.3_70b_cases in perf test by @ruodil in #6035
  • [infra] add more log on reuse-uploading by @niukuo in #6036
  • fix: adjust window sizes of VSWA at torch backend by @jaedeok-nvidia in #5880
  • [nvbugs/5385972][nvbugs/5387423][Fix] Minor fix for llava_next/llava_onevision by @MinaHuai in #5998
  • Fix: pad DeepEP fp4 recv tensors if empty by @yuantailing in #6048
  • [fix] Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop by @jinyangyuan-nvidia in #6053
  • support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs by @ttyio in #5684
  • Cherry-pick #5947 by @lfr-0531 in #5989
  • test: Add regression tests for Gemma3 VLM by @brb-nv in #6033
  • feat/add latency support for trtllm bench by @danielafrimi in #3730
  • feat: Add support for Triton request cancellation by @achartier in #5898
  • [fix] Fix Triton build by @Tabrizian in #6076
  • fix: Unable to load phi4-model with tp_size>1 by @Wanli-Jiang in #5962
  • chore: Bump version to 1.0.0rc4 by @yiqingy0 in #6086
  • chore: upgrade modelopt to 0.33 by @nv-guomingz in #6058
  • [nvbug/5347489][nvbug/5388036] increase timeout in disagg worker test by @zhengd-nv in #6041
  • feat: use session abstraction in data transceiver and cache formatter by @zhengd-nv in #5611
  • [TRTLLM-6471] Infra: Upgrade NIXL to 0.3.1 by @bo-nv in #5991
  • feat: Add deepseek-lite tests for RTX pro 6000 by @peaceh-nv in #5903
  • [nvbug/5359218][tests] add test llm api test case on lookahead with chunked prefill by @crazydemo in #6051
  • [nvbug/5387226] chore: add propagation for trust_remote_code to AutoConfig by @Superjomn in #6001
  • tests: add QA test cases by @crazydemo in #5959
  • fix: Add $HOME/.local/bin to PATH when running docker in local user mode by @MartinMarciniszyn in #6062
  • [TRTLLM-5530][BREAKING CHANGE] refactor: unify KvCacheConfig in LLM class for pytorch backend by @Superjomn in #5752
  • BlockManager copy constructor fix by @tshmilnvidia in #5982
  • update spec_dec by @qsang-nv in #6079
  • chore: Cleanup disable_fp4_allgather. by @bobboli in #6006
  • infra: [TRTLLM-5879] Split single GPU test and multi GPU test into 2 pipelines by @ZhanruiSunCh in #5199
  • [Infra] - Waive failed cases in post-merge on main by @EmmaQiaoCh in #6096
  • Add documentation for eagle3+disagg+dynamo by @Tabrizian in #6072
  • fix: Update trtllm args issues with extra nested config by @Wanli-Jiang in #5996
  • [TRTLLM-5493] Add core infrastructure to enable loading of custom checkpoint formats by @shaharmor98 in #5372
  • [refactor] Clean up drafter/resource manager creation logic by @mikeiovine in #5805
  • Fix: Enhance ModelConfig for kv cache size calculations by @qixiang-99 in #5868
  • feat: TRTLLM-5574 Add phi-4-multimodal pytorch-backend support by @Wanli-Jiang in #5644
  • [TRTLLM-6070] docs: Add initial documentation for trtllm-bench CLI. by @FrankD412 in #5734
  • test: Update Llama4 Scout FP4 & FP8 accuracy tests by @chenfeiz0326 in #5901
  • [fix] Performance Optimization for MNNVL TwoShot Kernel by @timlee0212 in #5934
  • infra: fix SBSA test stage by @ZhanruiSunCh in #6113
  • Feat: Add vectorized loading for finalize kernel in MoE Trtllm backend by @ChristinaZ in #5919
  • [fix] Release slots with spec decode + disagg (#5975) by @Tabrizian in #6032
  • [None][infra] Set up the initial config for CodeRabbit by @chzblych in #6128
  • CI: update multi gpu test trigger file list by @QiJune in #6131
  • fix: convert venv_prefix to str before comparison with base_prefix by @dc3671 in #6121
  • [Infra] - Add waive list for pytest when using slurm by @EmmaQiaoCh in #6130
  • chore: [BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service by @chuangz0 in #5234
  • [TRTLLM-6406, TRTLLM-5172] feat: Enable guided decoding with overlap scheduler by @syuoni in #6000
  • chores: unwaive a few tests for v1.0 by @hchings in #6107
  • test: update max_beam_width to 1 due to torchsampler changes. by @nv-guomingz in #6101
  • fix: Fix DeepSeek R1 CI by @yizhang-nv in #6129
  • test: fix PytestUnknownMarkWarning: Unknown pytest.mark.timeout by @StanleySun639 in #6115
  • [TRTLLM-6352][feat] Migrate EAGLE3 and draft/target speculation to Drafter by @ziyixiong-nv in #6007
  • feat: nanobind bindings by @Linda-Stadter in #5961
  • [fix] Update jenkins container images by @ixlmar in #6094
  • [fix] Remove duplicated KVCache transmission check by @Tabrizian in #6022
  • [fix] Fix Mistral3VLM weight-loading & enable in pre-merge by @2ez4bz in #6105
  • [fix] Fixes KV Cache overrides in trtllm-bench by @FrankD412 in #6103
  • Refactor KVCacheManager: Simplify token availability calculation and … by @qixiang-99 in #6134
  • feat: Add support for benchmarking individual gemms in MOE benchmark by @djns99 in #6080
  • Revert "feat: nanobind bindings (#5961)" by @Tabrizian in #6160
  • [TRTLLM-6368] Update deepep dispatch API by @yifeizhang-c in #6037
  • fix TMA error with GEMM+AR on TP=2 by @xavier-nvidia in #6075
  • [https://nvbugs/5387375] fix(scaffolding): fix scaffolding aime test in test_e2e by @dc3671 in #6140
  • feat: add support for Modelopt fp8_pb_wo quantization scheme by @achartier in #6106
  • fix single_disagg_test by @chuangz0 in #6166
  • [TRTLLM-5179] - Update bot help messages by @yiqingy0 in #5277
  • [None][infra] Update the allow list of CI trigger by @niukuo in #6168
  • chore: add more log in FmhaDispatcher by @QiJune in #6170
  • [Infra] - Waive failed tests in post-merge by @EmmaQiaoCh in #6176
  • refactor: Enhanced handling of decoder requests and logits within the batch manager by @Funatiq in #6055
  • update broken link of PyTorchModelEngine in arch_overview by @leslie-fang25 in #6171
  • fix: NVBug 5385576 py_batch_idx issue by @hchings in #6153
  • infra: fix issue where a failed single-GPU stage would not raise an error by @ZhanruiSunCh in #6165
  • [ci] Speedup beam search unit tests with fixtures for LLM by @stnie in #5843
  • feat: Remove padding in attention DP. by @bobboli in #6064
  • [TRTLLM-6471] Infra: unwaive nixl tests and some disagg-serve tests by @bo-nv in #6095
  • enh: Add script to map tests <-> jenkins stages & vice-versa by @venkywonka in #5177
  • feat(eagle3):support qwen3 dense model by @xq25478 in #5879
  • [nvbugs/5369799] fix: Update disaggregation handling in sampler by @stnie in #5762
  • [nvbugs/5354884][fix] Update beam search workspace estimation to new upper bound by @stnie in #5926
  • [nvbug/5393888][nvbug/5393042] Always use py_seq_slot by @netanel-haber in #6147
  • [https://nvbugs/5393961][fix] record kv-cache size in MLACacheFormatter by @bo-nv in #6181
  • [Issue 5927][fix] Avoid memory calls during broadcast for single GPU by @johncalesp in #6010
  • [Disaggregated] Add retry knobs and handling by @arekay in #5808
  • [refactor] Unify name of NGram speculative decoding by @wili-65535 in #5937
  • [TRTLLM-6452][feat]: Two-model engine KV cache reuse support by @ziyixiong-nv in #6133
  • [fix]: Skip prompt length checking for generation only requests by @LinPoly in #6146

New Contributors

Full Changelog: v1.0.0rc3...v1.0.0rc4
