Announcement Highlights:
- Model Support
- Feature
- Add support for two-model engine KV cache reuse (#6133)
- Unify name of NGram speculative decoding (#5937)
- Add retry knobs and handling in disaggregated serving (#5808)
- Add Eagle-3 support for qwen3 dense model (#5879)
- Remove padding of FusedMoE in attention DP (#6064)
- Enhanced handling of decoder requests and logits within the batch manager (#6055)
- Add support for Modelopt fp8_pb_wo quantization scheme (#6106)
- Update deepep dispatch API (#6037)
- Add support for benchmarking individual gemms in MOE benchmark (#6080)
- Simplify token availability calculation for VSWA (#6134)
- Migrate EAGLE3 and draft/target speculation to Drafter (#6007)
- Enable guided decoding with overlap scheduler (#6000)
- Use cacheTransceiverConfig as knobs for disagg service (#5234)
- Add vectorized loading for finalize kernel in MoE Trtllm backend (#5919)
- Enhance ModelConfig for kv cache size calculations (#5868)
- Clean up drafter/resource manager creation logic (#5805)
- Add core infrastructure to enable loading of custom checkpoint formats (#5372)
- Cleanup disable_fp4_allgather (#6006)
- Use session abstraction in data transceiver and cache formatter (#5611)
- Add support for Triton request cancellation (#5898)
- Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs (#5684)
- Remove enforced sorted order of batch slots (#3502)
- Use huge page mapping for host accessible memory on GB200 (#5963)
- API
- Bug Fixes
- Skip prompt length checking for generation only requests (#6146)
- Avoid memory calls during broadcast for single GPU (#6010)
- Record kv-cache size in MLACacheFormatter (#6181)
- Always use py_seq_slot in runtime (#6147)
- Update beam search workspace estimation to new upper bound (#5926)
- Update disaggregation handling in sampler (#5762)
- Fix TMA error with GEMM+AR on TP=2 (#6075)
- Fix scaffolding aime test in test_e2e (#6140)
- Fix KV Cache overrides in trtllm-bench (#6103)
- Remove duplicated KVCache transmission check (#6022)
- Release slots with spec decode + disagg (#5975) (#6032)
- Add propagation of trust_remote_code to AutoConfig (#6001)
- Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop (#6053)
- Pad DeepEP fp4 recv tensors if empty (#6048)
- Adjust window sizes of VSWA at torch backend (#5880)
- Fix MOE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
- Fix eagle3 two model disaggregated serving test (#6014)
- Update torch.compile option to fix triton store_cubin error (#5865)
- Fix chunked prefill + overlap scheduling (#5761)
- Fix mgmn postprocess error (#5835)
- Fallback to cubins for fp8 fmha kernels on Ada (#5779)
- Enhance _check_arguments to filter illegal requests for pytorch backend (#5541)
- Rewrite completion API to avoid repetitive tokens (#5201)
- Fix disagg + speculative decoding (#5558)
- Benchmark
- Add latency support for trtllm bench (#3730)
- Performance
- Infrastructure
- Add script to map tests <-> jenkins stages & vice-versa (#5431)
- Speedup beam search unit tests with fixtures for LLM (#5843)
- Fix issue where a failed single-GPU stage would not raise an error (#6165)
- Update bot help messages (#5277)
- Update jenkins container images (#6094)
- Set up the initial config for CodeRabbit (#6128)
- Upgrade NIXL to 0.3.1 (#5991)
- Upgrade modelopt to 0.33 (#6058)
- Show the full stage name list when the stage name check fails (#5946)
- Run docs build only if PR contains only doc changes (#5184)
- Documentation
- Known Issues
What's Changed
- [TRTLLM-6164][TRTLLM-6165] chore: add runtime example for pytorch by @Superjomn in #5956
- fix: Fix MoE benchmark by @syuoni in #5966
- [TRTLLM-6160] chore: add sampling examples for pytorch by @Superjomn in #5951
- Use huge page mapping for host accessible memory on GB200 by @dongxuy04 in #5963
- Breaking change: perf: [TRTLLM-4662] Enable cuda graph by default by @dominicshanshan in #5480
- fix: set allreduce strategy to model config by @WeiHaocheng in #5955
- chore: Mass integration of release/0.21 (part 3) by @dc3671 in #5909
- infra: [TRTLLM-6242] install cuda-toolkit to fix sanity check by @ZhanruiSunCh in #5709
- Waive L0 test by @yiqingy0 in #6002
- [Nvbug/5383670] fix: switch test case to non-fp4 ckpt for more GPU coverage by @kaiyux in #5882
- fix #4974: A thread leak issue in scaffolding unittest by @ccs96307 in #5020
- feat: EXAONE4.0 support by @yechank-nvidia in #5696
- [TRTLLM-5653][infra] Run docs build only if PR contains only doc changes by @zhanga5 in #5184
- feat: Update Gemma3 Vision Encoder by @brb-nv in #5973
- enh: Bidirectional mask with multiple images for Gemma3 by @brb-nv in #5976
- refactor: Remove enforced sorted order of batch slots by @Funatiq in #3502
- [fix] fix eagle3 two model disaggregated serving test by @Tabrizian in #6014
- perf: Enable 128x256 tile shapes for FP4 MOE CUTLASS backend by @djns99 in #5986
- [nvbugs-5318143] fix: restrict PyTorch memory usage to avoid OOMs by @ixlmar in #5964
- doc: update EXAONE 4.0 news by @yechank-nvidia in #6034
- [Model load] Fix llama min-latency model load by @arekay in #5883
- fix: Fix MOE benchmark to rotate buffers to prevent L2 cache reuse by @djns99 in #4135
- Doc: Update llama-3.3-70B guide by @jiahanc in #6028
- infra: [TRTLLM-6331] Show the full stage name list when the stage name check fails by @ZhanruiSunCh in #5946
- [Infra][TRTLLM-6013] - Fix stage name in single stage test rerun report by @yiqingy0 in #5672
- [Fix] check for ImportError or ModuleNotFoundError for deep_ep_utils by @lucaslie in #6026
- infra: [TRTLLM-6313] Fix the package sanity stage 'Host Node Name' in… by @ZhanruiSunCh in #5945
- chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… by @nv-guomingz in #6003
- test: add recursive updating pytorch config and change MOE backend format in perf test by @ruodil in #6046
- test: add llama_v3.3_70b_cases in perf test by @ruodil in #6035
- [infra] add more log on reuse-uploading by @niukuo in #6036
- fix: adjust window sizes of VSWA at torch backend by @jaedeok-nvidia in #5880
- [nvbugs/5385972][nvbugs/5387423][Fix] Minor fix for llava_next/llava_onevision by @MinaHuai in #5998
- Fix: pad DeepEP fp4 recv tensors if empty by @yuantailing in #6048
- [fix] Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop by @jinyangyuan-nvidia in #6053
- support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running deep-ep on memory-con… by @ttyio in #5684
- Cherry-pick #5947 by @lfr-0531 in #5989
- test: Add regression tests for Gemma3 VLM by @brb-nv in #6033
- feat: Add latency support for trtllm bench by @danielafrimi in #3730
- feat: Add support for Triton request cancellation by @achartier in #5898
- [fix] Fix Triton build by @Tabrizian in #6076
- fix: Unable to load phi4-model with tp_size>1 by @Wanli-Jiang in #5962
- chore: Bump version to 1.0.0rc4 by @yiqingy0 in #6086
- chore: upgrade modelopt to 0.33 by @nv-guomingz in #6058
- [nvbug/5347489][nvbug/5388036] increase timeout in disagg worker test by @zhengd-nv in #6041
- feat: use session abstraction in data transceiver and cache formatter by @zhengd-nv in #5611
- [TRTLLM-6471] Infra: Upgrade NIXL to 0.3.1 by @bo-nv in #5991
- feat: Add deepseek-lite tests for RTX pro 6000 by @peaceh-nv in #5903
- [nvbug/5359218][tests] add test llm api test case on lookahead with chunked prefill by @crazydemo in #6051
- [nvbug/5387226] chore: add propagation of trust_remote_code to AutoConfig by @Superjomn in #6001
- tests: add QA test cases by @crazydemo in #5959
- fix: Add $HOME/.local/bin to PATH when running docker in local user mode by @MartinMarciniszyn in #6062
- [TRTLLM-5530][BREAKING CHANGE] refactor: unify KvCacheConfig in LLM class for pytorch backend by @Superjomn in #5752
- BlockManager copy constructor fix by @tshmilnvidia in #5982
- update spec_dec by @qsang-nv in #6079
- chore: Cleanup disable_fp4_allgather. by @bobboli in #6006
- infra: [TRTLLM-5879] Split single GPU test and multi GPU test into 2 pipelines by @ZhanruiSunCh in #5199
- [Infra] - Waive failed cases in post-merge on main by @EmmaQiaoCh in #6096
- Add documentation for eagle3+disagg+dynamo by @Tabrizian in #6072
- fix: Update trtllm args issues with extra nested config by @Wanli-Jiang in #5996
- [TRTLLM-5493] Add core infrastructure to enable loading of custom checkpoint formats by @shaharmor98 in #5372
- [refactor] Clean up drafter/resource manager creation logic by @mikeiovine in #5805
- Fix: Enhance ModelConfig for kv cache size calculations by @qixiang-99 in #5868
- feat: TRTLLM-5574 Add phi-4-multimodal pytorch-backend support by @Wanli-Jiang in #5644
- [TRTLLM-6070] docs: Add initial documentation for trtllm-bench CLI. by @FrankD412 in #5734
- test: Update Llama4 Scout FP4 & FP8 accuracy tests by @chenfeiz0326 in #5901
- [fix] Performance Optimization for MNNVL TwoShot Kernel by @timlee0212 in #5934
- infra: fix SBSA test stage by @ZhanruiSunCh in #6113
- Feat: Add vectorized loading for finalize kernel in MoE Trtllm backend by @ChristinaZ in #5919
- [fix] Release slots with spec decode + disagg (#5975) by @Tabrizian in #6032
- [None][infra] Set up the initial config for CodeRabbit by @chzblych in #6128
- CI: update multi gpu test trigger file list by @QiJune in #6131
- fix: convert venv_prefix to str before comparison with base_prefix by @dc3671 in #6121
- [Infra] - Add waive list for pytest when using slurm by @EmmaQiaoCh in #6130
- chore: [BREAKING CHANGE] use cacheTransceiverConfig as knobs for disagg service by @chuangz0 in #5234
- [TRTLLM-6406, TRTLLM-5172] feat: Enable guided decoding with overlap scheduler by @syuoni in #6000
- chore: unwaive a few tests for v1.0 by @hchings in #6107
- test: update max_beam_width to 1 due to torchsampler changes. by @nv-guomingz in #6101
- fix: Fix DeepSeek R1 CI by @yizhang-nv in #6129
- test: fix PytestUnknownMarkWarning: Unknown pytest.mark.timeout by @StanleySun639 in #6115
- [TRTLLM-6352][feat] Migrate EAGLE3 and draft/target speculation to Drafter by @ziyixiong-nv in #6007
- feat: nanobind bindings by @Linda-Stadter in #5961
- [fix] Update jenkins container images by @ixlmar in #6094
- [fix] Remove duplicated KVCache transmission check by @Tabrizian in #6022
- [fix] Fix Mistral3VLM weight-loading & enable in pre-merge by @2ez4bz in #6105
- [fix] Fixes KV Cache overrides in trtllm-bench by @FrankD412 in #6103
- Refactor KVCacheManager: Simplify token availability calculation and … by @qixiang-99 in #6134
- feat: Add support for benchmarking individual gemms in MOE benchmark by @djns99 in #6080
- Revert "feat: nanobind bindings (#5961)" by @Tabrizian in #6160
- [TRTLLM-6368] Update deepep dispatch API by @yifeizhang-c in #6037
- fix TMA error with GEMM+AR on TP=2 by @xavier-nvidia in #6075
- [https://nvbugs/5387375] fix(scaffolding): fix scaffolding aime test in test_e2e by @dc3671 in #6140
- feat: add support for Modelopt fp8_pb_wo quantization scheme by @achartier in #6106
- fix single_disagg_test by @chuangz0 in #6166
- [TRTLLM-5179] - Update bot help messages by @yiqingy0 in #5277
- [None][infra] Update the allow list of CI trigger by @niukuo in #6168
- chore: add more log in FmhaDispatcher by @QiJune in #6170
- [Infra] - Waive failed tests in post-merge by @EmmaQiaoCh in #6176
- refactor: Enhanced handling of decoder requests and logits within the batch manager by @Funatiq in #6055
- update broken link of PyTorchModelEngine in arch_overview by @leslie-fang25 in #6171
- fix: NVBug 5385576 py_batch_idx issue by @hchings in #6153
- infra: fix issue where a failed single-GPU stage would not raise an error by @ZhanruiSunCh in #6165
- [ci] Speedup beam search unit tests with fixtures for LLM by @stnie in #5843
- feat: Remove padding in attention DP. by @bobboli in #6064
- [TRTLLM-6471] Infra: unwaive nixl tests and some disagg-serve tests by @bo-nv in #6095
- enh: Add script to map tests <-> jenkins stages & vice-versa by @venkywonka in #5177
- feat(eagle3):support qwen3 dense model by @xq25478 in #5879
- [nvbugs/5369799] fix: Update disaggregation handling in sampler by @stnie in #5762
- [nvbugs/5354884][fix] Update beam search workspace estimation to new upper bound by @stnie in #5926
- [nvbug/5393888][nvbug/5393042] Always use py_seq_slot by @netanel-haber in #6147
- [https://nvbugs/5393961][fix] record kv-cache size in MLACacheFormatter by @bo-nv in #6181
- [Issue 5927][fix] Avoid memory calls during broadcast for single GPU by @johncalesp in #6010
- [Disaggregated] Add retry knobs and handling by @arekay in #5808
- [refactor] Unify name of NGram speculative decoding by @wili-65535 in #5937
- [TRTLLM-6452][feat]: Two-model engine KV cache reuse support by @ziyixiong-nv in #6133
- [fix]: Skip prompt length checking for generation only requests by @LinPoly in #6146
New Contributors
- @zhanga5 made their first contribution in #5184
- @jaedeok-nvidia made their first contribution in #5880
- @bo-nv made their first contribution in #5991
- @tshmilnvidia made their first contribution in #5982
- @yifeizhang-c made their first contribution in #6037
- @leslie-fang25 made their first contribution in #6171
- @xq25478 made their first contribution in #5879
- @johncalesp made their first contribution in #6010
Full Changelog: v1.0.0rc3...v1.0.0rc4