NVIDIA/TensorRT-LLM v1.0.0rc4

Pre-release

Announcement Highlights:

  • Model Support
    • Add phi-4-multimodal model support (#5644)
    • Add EXAONE 4.0 model support (#5696)
  • Feature
    • Add support for two-model engine KV cache reuse (#6133)
    • Unify name of NGram speculative decoding (#5937)
    • Add retry knobs and handling in disaggregated serving (#5808)
    • Add Eagle-3 support for qwen3 dense model (#5879)
    • Remove padding of FusedMoE in attention DP (#6064)
    • Enhance handling of decoder requests and logits within the batch manager (#6055)
    • Add support for Modelopt fp8_pb_wo quantization scheme (#6106)
    • Update deepep dispatch API (#6037)
    • Add support for benchmarking individual gemms in MOE benchmark (#6080)
    • Simplify token availability calculation for VSWA (#6134)
    • Migrate EAGLE3 and draft/target speculation to Drafter (#6007)
    • Enable guided decoding with overlap scheduler (#6000)
    • Use cacheTransceiverConfig as knobs for disagg service (#5234)
    • Add vectorized loading for finalize kernel in MoE Trtllm backend (#5919)
    • Enhance ModelConfig for kv cache size calculations (#5868)
    • Clean up drafter/resource manager creation logic (#5805)
    • Add core infrastructure to enable loading of custom checkpoint formats (#5372)
    • Cleanup disable_fp4_allgather (#6006)
    • Use session abstraction in data transceiver and cache formatter (#5611)
    • Add support for Triton request cancellation (#5898)
    • Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs (#5684) (see the sketch after the highlights list)
    • Remove enforced sorted order of batch slots (#3502)
    • Use huge page mapping for host accessible memory on GB200 (#5963)
  • API
    • [BREAKING CHANGE] Unify KvCacheConfig in LLM class for pytorch backend (#5752)
    • [BREAKING CHANGE] Rename cuda_graph_config padding_enabled field (#6003) (a migration sketch for both breaking changes follows the highlights list)
  • Bug Fixes
    • Skip prompt length checking for generation only requests (#6146)
    • Avoid memory calls during broadcast for single GPU (#6010)
    • Record kv-cache size in MLACacheFormatter (#6181)
    • Always use py_seq_slot in runtime (#6147)
    • Update beam search workspace estimation to new upper bound (#5926)
    • Update disaggregation handling in sampler (#5762)
    • Fix TMA error with GEMM+AR on TP=2 (#6075)
    • Fix scaffolding aime test in test_e2e (#6140)
    • Fix KV Cache overrides in trtllm-bench (#6103)
    • Remove duplicated KVCache transmission check (#6022)
    • Release slots with spec decode + disagg (#5975) (#6032)
    • Add propagation for trust_remote_code to AutoConfig (#6001)
    • Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop (#6053)
    • Pad DeepEP fp4 recv tensors if empty (#6048)
    • Adjust window sizes of VSWA at torch backend (#5880)
    • Fix MOE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
    • Fix eagle3 two model disaggregated serving test (#6014)
    • Update torch.compile option to fix triton store_cubin error (#5865)
    • Fix chunked prefill + overlap scheduling (#5761)
    • Fix mgmn postprocess error (#5835)
    • Fall back to cubins for fp8 fmha kernels on Ada (#5779)
    • Enhance _check_arguments to filter illegal requests for pytorch backend (#5541)
    • Rewrite completion API to avoid repetitive tokens (#5201)
    • Fix disagg + speculative decoding (#5558)
  • Benchmark
    • Add latency support for trtllm bench (#3730)
  • Performance
    • Optimize TRTLLM Sampler perf single beam single step (#5550)
    • Performance Optimization for MNNVL TwoShot Kernel (#5925)
    • Enable 128x256 tile shapes for FP4 MOE CUTLASS backend (#5986)
    • Enable cuda graph by default (#5480)
  • Infrastructure
    • Add script to map tests <-> jenkins stages & vice-versa (#5431)
    • Speedup beam search unit tests with fixtures for LLM (#5843)
    • Fix issue where a failed single-GPU stage would not raise an error (#6165)
    • Update bot help messages (#5277)
    • Update jenkins container images (#6094)
    • Set up the initial config for CodeRabbit (#6128)
    • Upgrade NIXL to 0.3.1 (#5991)
    • Upgrade modelopt to 0.33 (#6058)
    • Show the full stage name list when the stage name check fails (#5946)
    • Run docs build only if PR contains only doc changes (#5184)
  • Documentation
    • Update broken link of PyTorchModelEngine in arch_overview (#6171)
    • Add initial documentation for the trtllm-bench CLI (#5734)
    • Add documentation for eagle3+disagg+dynamo (#6072)
    • Update llama-3.3-70B guide (#6028)
  • Known Issues

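Migration note for the two breaking API changes above: the sketch below shows how the unified KvCacheConfig and the renamed cuda_graph_config field are expected to be used together with the PyTorch-backend LLM API, and reflects CUDA graphs now being on by default (#5480). This is a minimal sketch, not an excerpt from the release; the field name enable_padding and the model id are assumptions to verify against the shipped API.

```python
# Minimal migration sketch (assumptions noted inline; verify against v1.0.0rc4).
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import CudaGraphConfig, KvCacheConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id or local path
    # #5752: KV cache knobs now go through the unified KvCacheConfig on the LLM class.
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.8),
    # #5480 enables CUDA graphs by default; pass cuda_graph_config=None to opt out.
    # #6003 renamed the old padding_enabled field (assumed new name: enable_padding).
    cuda_graph_config=CudaGraphConfig(enable_padding=True),
)
print(llm.generate("Hello, my name is").outputs[0].text)
```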
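Likewise for #5684: TRTLLM_DEEP_EP_TOKEN_LIMIT is an environment variable, so a hedged sketch is simply setting it before TensorRT-LLM initializes; the value and exact semantics below are illustrative, not prescriptive.

```python
import os

# #5684 (assumed semantics): cap the tokens handled per DeepEP dispatch so that
# memory-constrained GPUs can still run DeepEP. 1024 is an arbitrary example value.
os.environ["TRTLLM_DEEP_EP_TOKEN_LIMIT"] = "1024"
```
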
What's Changed

  • [TRTLLM-6164][TRTLLM-6165] chore: add runtime example for pytorch by @Superjomn in #5956
  • fix: Fix MoE benchmark by @syuoni in #5966
  • [TRTLLM-6160] chore: add sampling examples for pytorch by @Superjomn in #5951
  • Use huge page mapping for host accessible memory on GB200 by @dongxuy04 in #5963
  • Breaking change: perf: [TRTLLM-4662] Enable cuda graph by default by @dominicshanshan in #5480
  • fix: set allreduce strategy to model config by @WeiHaocheng in #5955
  • chore: Mass integration of release/0.21 (part 3) by @dc3671 in #5909
  • infra: [TRTLLM-6242] install cuda-toolkit to fix sanity check by @ZhanruiSunCh in #5709
  • Waive L0 test by @yiqingy0 in #6002
  • [Nvbug/5383670] fix: switch test case to non-fp4 ckpt for more GPU coverage by @kaiyux in #5882
  • fix #4974: A thread leak issue in scaffolding unittest by @ccs96307 in #5020
  • feat: EXAONE4.0 support by @yechank-nvidia in #5696
  • [TRTLLM-5653][infra] Run docs build only if PR contains only doc changes by @zhanga5 in #5184
  • feat: Update Gemma3 Vision Encoder by @brb-nv in #5973
  • enh: Bidirectional mask with multiple images for Gemma3 by @brb-nv in #5976
  • refactor: Remove enforced sorted order of batch slots by @Funatiq in #3502
  • [fix] fix eagle3 two model disaggregated serving test by @Tabrizian in #6014
  • perf: Enable 128x256 tile shapes for FP4 MOE CUTLASS backend by @djns99 in #5986
  • [nvbugs-5318143] fix: restrict PyTorch memory usage to avoid OOMs by @ixlmar in #5964
  • doc: update EXAONE 4.0 news by @yechank-nvidia in #6034
  • [Model load] Fix llama min-latency model load by @arekay in #5883
  • fix: Fix MOE benchmark to rotate buffers to prevent L2 cache reuse by @djns99 in #4135
  • Doc: Update llama-3.3-70B guide by @jiahanc in #6028
  • infra: [TRTLLM-6331] Show the full stage name list when the stage name check fails by @ZhanruiSunCh in #5946
  • [Infra][TRTLLM-6013] - Fix stage name in single stage test rerun report by @yiqingy0 in #5672
  • [Fix] check for ImportError or ModuleNotFoundError for deep_ep_utils by @lucaslie in #6026
  • infra: [TRTLLM-6313] Fix the package sanity stage 'Host Node Name' in… by @ZhanruiSunCh in #5945
  • chore: [Breaking Change] Rename cuda_graph_config padding_enabled field by @nv-guomingz in #6003
  • test: add recursive updating pytorch config and change MOE backend format in perf test by @ruodil in #6046
  • test: add llama_v3.3_70b_cases in perf test by @ruodil in #6035
  • [infra] add more log on reuse-uploading by @niukuo in #6036
  • fix: adjust window sizes of VSWA at torch backend by @jaedeok-nvidia in #5880
  • [nvbugs/5385972][nvbugs/5387423][Fix] Minor fix for llava_next/llava_onevision by @MinaHuai in #5998
  • Fix: pad DeepEP fp4 recv tensors if empty by @yuantailing in #6048
  • [fix] Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop by @jinyangyuan-nvidia in #6053
  • support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs by @ttyio in #5684
  • Cherry-pick #5947 by @lfr-0531 in #5989
  • test: Add regression tests for Gemma3 VLM by @brb-nv in #6033
  • feat/add latency support for trtllm bench by @danielafrimi in #3730
  • feat: Add support for Triton request cancellation by @achartier in #5898
  • [fix] Fix Triton build by @Tabrizian in #6076
  • fix: Unable to load phi4-model with tp_size>1 by @Wanli-Jiang in #5962
  • chore: Bump version to 1.0.0rc4 by @yiqingy0 in #6086
  • chore: upgrade modelopt to 0.33 by @nv-guomingz in #6058
  • [nvbug/5347489][nvbug/5388036] increase timeout in disagg worker test by @zhengd-nv in #6041
  • feat: use session abstraction in data transceiver and cache formatter by @zhengd-nv in #5611
  • [TRTLLM-6471] Infra: Upgrade NIXL to 0.3.1 by @bo-nv in #5991
  • feat: Add deepseek-lite tests for RTX pro 6000 by @peaceh-nv in #5903
  • [nvbug/5359218][tests] add test llm api test case on lookahead with chunked prefill by @crazydemo in #6051
  • [nvbug/5387226] chore: add propagation for trust_remote_code to AutoConfig by @Superjomn in #6001
  • tests: add QA test cases by @crazydemo in #5959
  • fix: Add $HOME/.local/bin to PATH when running docker in local user mode by @MartinMarciniszyn in #6062
  • [TRTLLM-5530][BREAKING CHANGE] refactor: unify KvCacheConfig in LLM class for pytorch backend by @Superjomn in #5752
  • BlockManager copy constructor fix by @tshmilnvidia in #5982
  • update spec_dec by @qsang-nv in #6079
  • chore: Cleanup disable_fp4_allgather. by @bobboli in #6006
  • infra: [TRTLLM-5879] Split single GPU test and multi GPU test into 2 pipelines by @ZhanruiSunCh in #5199
  • [Infra] - Waive failed cases in post-merge on main by @EmmaQiaoCh in #6096
  • Add documentation for eagle3+disagg+dynamo by @Tabrizian in #6072
  • fix: Update trtllm args issues with extra nested config by @Wanli-Jiang in #5996
  • [TRTLLM-5493] Add core infrastructure to enable loading of custom checkpoint formats by @shaharmor98 in #5372
  • [refactor] Clean up drafter/resource manager creation logic by @mikeiovine in #5805
  • Fix: Enhance ModelConfig for kv cache size calculations by @qixiang-99 in #5868
  • feat: TRTLLM-5574 Add phi-4-multimodal pytorch-backend support by @Wanli-Jiang in #5644
  • [TRTLLM-6070] docs: Add initial documentation for trtllm-bench CLI. by @FrankD412 in #5734
  • test: Update Llama4 Scout FP4 & FP8 accuracy tests by @chenfeiz0326 in #5901
  • [fix] Performance Optimization for MNNVL TwoShot Kernel by @timlee0212 in #5934
  • infra: fix SBSA test stage by @ZhanruiSunCh in #6113
  • Feat: Add vectorized loading for finalize kernel in MoE Trtllm backend by @ChristinaZ in #5919
  • [fix] Release slots with spec decode + disagg (#5975) by @Tabrizian in #6032
  • [None][infra] Set up the initial config for CodeRabbit by @chzblych in #6128
  • CI: update multi gpu test trigger file list by @QiJune in #6131
  • fix: convert venv_prefix to str before comparison with base_prefix by @dc3671 in #6121
  • [Infra] - Add waive list for pytest when using slurm by @EmmaQiaoCh in #6130
  • chore: [BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service by @chuangz0 in #5234
  • [TRTLLM-6406, TRTLLM-5172] feat: Enable guided decoding with overlap scheduler by @syuoni in #6000
  • chores: unwaive a few tests for v1.0 by @hchings in #6107
  • test: update max_beam_width to 1 due to torchsampler changes. by @nv-guomingz in #6101
  • fix: Fix DeepSeek R1 CI by @yizhang-nv in #6129
  • test: fix PytestUnknownMarkWarning: Unknown pytest.mark.timeout by @StanleySun639 in #6115
  • [TRTLLM-6352][feat] Migrate EAGLE3 and draft/target speculation to Drafter by @ziyixiong-nv in #6007
  • feat: nanobind bindings by @Linda-Stadter in #5961
  • [fix] Update jenkins container images by @ixlmar in #6094
  • [fix] Remove duplicated KVCache transmission check by @Tabrizian in #6022
  • [fix] Fix Mistral3VLM weight-loading & enable in pre-merge by @2ez4bz in #6105
  • [fix] Fixes KV Cache overrides in trtllm-bench by @FrankD412 in #6103
  • Refactor KVCacheManager: Simplify token availability calculation and … by @qixiang-99 in #6134
  • feat: Add support for benchmarking individual gemms in MOE benchmark by @djns99 in #6080
  • Revert "feat: nanobind bindings (#5961)" by @Tabrizian in #6160
  • [TRTLLM-6368] Update deepep dispatch API by @yifeizhang-c in #6037
  • fix TMA error with GEMM+AR on TP=2 by @xavier-nvidia in #6075
  • [https://nvbugs/5387375] fix(scaffolding): fix scaffolding aime test in test_e2e by @dc3671 in #6140
  • feat: add support for Modelopt fp8_pb_wo quantization scheme by @achartier in #6106
  • fix single_disagg_test by @chuangz0 in #6166
  • [TRTLLM-5179] - Update bot help messages by @yiqingy0 in #5277
  • [None][infra] Update the allow list of CI trigger by @niukuo in #6168
  • chore: add more log in FmhaDispatcher by @QiJune in #6170
  • [Infra] - Waive failed tests in post-merge by @EmmaQiaoCh in #6176
  • refactor: Enhanced handling of decoder requests and logits within the batch manager by @Funatiq in #6055
  • update broken link of PyTorchModelEngine in arch_overview by @leslie-fang25 in #6171
  • fix: NVBug 5385576 py_batch_idx issue by @hchings in #6153
  • infra: fix issue where a failed single-GPU stage would not raise an error by @ZhanruiSunCh in #6165
  • [ci] Speedup beam search unit tests with fixtures for LLM by @stnie in #5843
  • feat: Remove padding in attention DP. by @bobboli in #6064
  • [TRTLLM-6471] Infra: unwaive nixl tests and some disagg-serve tests by @bo-nv in #6095
  • enh: Add script to map tests <-> jenkins stages & vice-versa by @venkywonka in #5177
  • feat(eagle3):support qwen3 dense model by @xq25478 in #5879
  • [nvbugs/5369799] fix: Update disaggregation handling in sampler by @stnie in #5762
  • [nvbugs/5354884][fix] Update beam search workspace estimation to new upper bound by @stnie in #5926
  • [nvbug/5393888][nvbug/5393042] Always use py_seq_slot by @netanel-haber in #6147
  • [https://nvbugs/5393961][fix] record kv-cache size in MLACacheFormatter by @bo-nv in #6181
  • [Issue 5927][fix] Avoid memory calls during broadcast for single GPU by @johncalesp in #6010
  • [Disaggregated] Add retry knobs and handling by @arekay in #5808
  • [refactor] Unify name of NGram speculative decoding by @wili-65535 in #5937
  • [TRTLLM-6452][feat]: Two-model engine KV cache reuse support by @ziyixiong-nv in #6133
  • [fix]: Skip prompt length checking for generation only requests by @LinPoly in #6146

New Contributors

Full Changelog: v1.0.0rc3...v1.0.0rc4
