NVIDIA/TensorRT-LLM v1.3.0rc3

Pre-release

Highlights:

Model Support
  - Support LoRA BF16 checkpoints with Llama 3.3-70B FP8 (#9808)
  - Add Eagle3 support for Nemotron H (#11131)
  - Enhance support for complex models (#11254)

API
  - Allow overriding quantization configs (#11062)
  - Set continuous_usage_stats default to False to follow the OpenAI protocol (#10644); see the request sketch after this list
  - Set max_num_tokens_in_buffer default based on max_seq_len/max_input_len (#11082)
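
The continuous_usage_stats change only alters the default; streaming clients that want per-chunk usage statistics can still request them explicitly. Below is a minimal sketch using the OpenAI Python client against an OpenAI-compatible TensorRT-LLM server; the server URL, the model id, and the placement of continuous_usage_stats under stream_options are illustrative assumptions, not details confirmed by this release note.

```python
# Minimal sketch: explicitly opt back in to per-chunk usage stats now that
# the server default is False. The endpoint URL, model id, and the
# "continuous_usage_stats" key location are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

stream = client.chat.completions.create(
    model="my-served-model",  # hypothetical model id registered with the server
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    stream_options={
        "include_usage": True,           # standard OpenAI streaming option
        "continuous_usage_stats": True,  # request usage on every chunk
    },
)

for chunk in stream:
    if chunk.usage is not None:
        print(chunk.usage)
```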

Feature
  - Export ONNX for DriveOS LLM (#10117)
  - Add L2 norm pattern matcher and fusion transform (#10767)
  - Add PDL support for moeAlltoAllKernels (#10591)
  - Integrate KVCacheManager V2 into TRTLLM runtime (#10659)
  - Integrate cuda.tile RMS norm kernels (#9725)
  - Refactor request fetching logic for better separation of concerns (#10988)
  - Implement gen-first disagg_service (#11020)
  - Support disagg SLURM job rescheduling (#11218)
  - Improve layer classification for sharding (#10718)
  - Add priority-based KV cache offload filtering (#10751)
  - Optimize beam search performance (remove GPU sync, fix batching, refactor) (#11276)
  - Avoid sync in PyTorchModelEngine when using beam search (#11341)
  - Adjust DeepGEMM tuning buckets to cover a larger num_tokens range (#11259)
  - Add CuteDSL FP8 GEMM for Blackwell (#10130)
  - Reduce host memory usage during model loading (#11119)
  - Add perfect routing for DeepSeek models (#11127)
  - Modularize transceiver for KV manager v2 (step 4) (#11225)

Fix
  - Fix AttributeError with return_perf_metrics on TensorRT backend (#10662)
  - Prevent routing context and generation requests to the same worker; document unique disagg ID (#11095)
  - Prevent out-of-bounds read (#10868)
  - Add __syncthreads() to TinyGEMM to resolve intermittent accuracy issues (#10873)
  - Fix PD disaggregation for VLMs that use mrope (#10865)
  - Always reset drafting states for GuidedDecoder (#10899)
  - Use NCCL as a fallback to avoid crashes due to insufficient memory (#10928)
  - Fix Llama SM120 speculative decoding (#10765)
  - Fix MTP one-model sampler (#10369)
  - Align kv_scales with ModelOpt HF checkpoint (#10745)
  - Fix selective_state_update perf regression for T=1 decode path (#11194)
  - Make health_generate work with beam search (#11097)
  - Work around accuracy issue by enforcing paged_context_fmha on Hopper for fmha_v2 (#11192)
  - Fix CuteDSL argmax on SM120 (#11181)
  - Fix amax to avoid NaN issue in fp8_blockscale_gemm_kernel (#11256)
  - Fix VSWA initialization with spec-dec and boundary condition in context input preparation (#10798)
  - Fix partial KV cache reuse being disabled for disaggregated serving (#11247)
  - Retake ownership of mrope tensors in prefill worker (#11217)
  - Fix proto-to-SamplingParams conversion bugs and add gRPC tests (#11292)
  - Fix accuracy drop in VSWA with KV cache block reuse (#10875)

Documentation
  - Add Glm4MoeForCausalLM to model support matrix (#11156)
  - Fix GLM4-MoE Eagle support documentation (#11198)
  - Add CUDA Graph + LoRA to feature combination matrix (#11187)
  - Fix comments for KV cache manager v2 (#11207)
  - Add Skip Softmax Attention blog and docs (#10592)
  - Add sparse attention docs to index (#11342)

Test & Infra
  - Update GB200 test configs to use frontend SLURM platforms (#11085)
  - Fix jaraco-context and wheel vulnerability (#10901)
  - Add --high-priority in bot help message (#11133)
  - Print memory usage before/after accuracy test in CI (#11155)
  - Fix mocking of HuggingFace downloads in with_mocked_hf_download (#11200)
  - Set rerun report stage UNSTABLE and pipeline SUCCESS when rerun tests pass (#11210)
  - Move 6x H100 test stage to AIHub platform (#11039)
  - Add disagg perf tests (#10912)
  - Provide uniform test framework to test all MoE backends (#11128)
  - Move disagg scripts' env configs from bash to submit.py (#10223)
  - Use free port for serve test (#10878)
  - Fix test_auto_scaling for 2 GPUs (#10866)
  - Update test list (#10883)
  - Fix an invalid test name (#11195)
  - Refine QA test list for SM120 (#11248)
  - Fix multimodal serve test (#11296)
  - Pass without_comm to Cutlass and DeepGEMM (#11229)
  - Promote SampleState to TypeVar and fix typing (#11281)
  - Fix bench script test (#10483)

What's Changed

  • [None][feat] Export ONNX for DriveOS LLM by @nvyocox in #10117
  • [#9525][feat] add L2 norm pattern matcher and fusion transform by @karthikvetrivel in #10767
  • [TRTINFRA-7548][infra] Update GB200 test configs to use frontend SLURM platforms by @mlefeb01 in #11085
  • [None][doc] Add Glm4MoeForCausalLM to model support matrix by @venkywonka in #11156
  • [None][feat] Perfect routing for Deepseek models by @brb-nv in #11127
  • [TRTLLM-10398][feat] Enable TRTLLM moe backend for Nemotron Super by @nv-guomingz in #10791
  • [#8242][feat] Add int4 GPTQ support for AutoDeploy by @Fridah-nv in #8248
  • [https://nvbugs/5804683][infra] unwaive Mistral Large3 test by @byshiue in #10680
  • [TRTLLM-9771][feat] Allow overriding quantization configs by @shuyixiong in #11062
  • [None][ci] Waive a flaky test on A10 by @chzblych in #11163
  • [None][infra] Waive failed cases for main on 1/30 by @EmmaQiaoCh in #11142
  • [None][fix] AttributeError with return_perf_metrics on tensorrt backend by @riZZZhik in #10662
  • [https://nvbugs/5834212][fix] prevent routing ctx and gen requests to the same worker; update doc for unique disagg ID by @reasonsolo in #11095
  • [TRTLLM-10666][chore] Refactor request fetching logic for better separation of concerns by @lancelly in #10988
  • [https://nvbugs/5823284][fix] Unwaive no repro hang issue by @liji-nv in #11138
  • [None] [feat] Add PDL support for moeAlltoAllKernels by @kaiyux in #10591
  • [None][infra] Waive failed cases and disable a stage on 02/02 by @EmmaQiaoCh in #11177
  • [TRTLLM-9766][feat] Integration of the KVCacheManager V2 to TRTLLM Runtime by @yizhang-nv in #10659
  • [None][chore] Mass integration of release/1.2 - 2nd by @dominicshanshan in #11088
  • [None][feat] Integrate cuda.tile RMS norm kernels by @lirundong in #9725
  • [None][test] Fix an invalid test name by @chzblych in #11195
  • [None][feat] Nemotron H: Eagle3 support by @IzzyPutterman in #11131
  • [#10826][feat] AutoDeploy: Eagle One-Model [2/n]: Prefill-Only Implementation by @govind-ramnarayan in #11073
  • [None][doc] Fix GLM4-MoE Eagle support documentation by @venkywonka in #11198
  • [TRTLLM-10561][infra] Fix jaraco-context and wheel vulnerability by @yiqingy0 in #10901
  • [TRTLLM-10307][infra] Add --high-priority in bot help message by @mzweilz in #11133
  • [None][chore] Print memory usage before/after accuracy test in CI by @taylor-yb-lee in #11155
  • [TRTLLM-10803][fix] Fix mocking of HuggingFace downloads in with_mocked_hf_download by @anish-shanbhag in #11200
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11193
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11202
  • [TRTLLM-10839][infra] Set rerun report stage UNSTABLE and pipeline SUCCESS in post-merge when there are passed rerun tests by @yiqingy0 in #11210
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11216
  • [None][fix] Align kv_scales with modelopt HF checkpoint by @cjluo-nv in #10745
  • [https://nvbugs/5739981][fix] unwaive tests using opt-125M by @ixlmar in #11100
  • [TRTLLM-10019][infra] Move 6 h100 test stage to aihub platform by @yuanjingx87 in #11039
  • [TRTLLM-8921][feat] implement gen-first disagg_service by @reasonsolo in #11020
  • [#11086][feat] Optimize Auto Deploy weight loading by preloading weights to CPU by @taylor-yb-lee in #11059
  • [None][fix] Set continuous_usage_stats default to False to follow OpenAI protocol by @riZZZhik in #10644
  • [None][chore] bump version to 1.3.0rc3 by @tburt-nv in #11238
  • [TRTLLM-8263][feat] Add Disagg Perf Tests by @chenfeiz0326 in #10912
  • [None][fix] Fix selective_state_update perf regression for T=1 decode path by @galagam in #11194
  • [TRTLLM-9111][feat] provide the uniform test framework to test all MoE backends by @xxi-nv in #11128
  • [None][fix] make health_generate work with beam search by @ixlmar in #11097
  • [None][feat] move some disagg script's env configs from bash to submit.py by @dc3671 in #10223
  • [https://nvbugs/5624818][fix] Work around accuracy issue by enforcing paged_context_fmha on Hopper for fmha_v2 by @eopXD in #11192
  • [None][feat] Support disagg slurm jobs rescheduling by @qiaoxj07 in #11218
  • [#10966][feat] AutoDeploy: kv cache manager integration [2/2] by @lucaslie in #11149
  • [TRTLLM-10673][feat] Improved layer classification for sharding by @greg-kwasniewski1 in #10718
  • [None][chore] AutoDeploy: Set nanov3 and superv3 configs to use flashinfer ssm by @galagam in #11183
  • [https://nvbugs/5674665][fix] Fix accuracy drop in VSWA with KV cache block reuse by @SimengLiu-nv in #10875
  • [https://nvbugs/5849697][fix] Refine QA Test List for SM120 by @dongfengy in #11248
  • [https://nvbugs/5854860][fix] Fix cutedsl argmax on sm120 by @dongfengy in #11181
  • [None][fix] Fix comments for kv cache manager v2 by @yizhang-nv in #11207
  • [https://nvbugs/5837275][fix] Unwaive the failing case that cannot be… by @liji-nv in #11137
  • [https://nvbugs/5800679][fix] Re-enable test after bug fixed by @dongfengy in #11249
  • [TRTLLM-9210][fix] Add failed cases into waives.txt by @xinhe-nv in #11223
  • [https://nvbugs/5747920][fix] Fix multimodal serve test by @yechank-nvidia in #11296
  • [None][chore] Pass without_comm to cutlass and deepgemm by @xxi-nv in #11229
  • [None][feat] Enhance support for complex models by @lowsfer in #11254
  • [#11037][fix] Fix proto-to-SamplingParams conversion bugs and add gRPC tests by @CatherineSue in #11292
  • [None][feat] Add priority-based KV cache offload filtering support by @nv-yna in #10751
  • [None][docs] Add CUDA Graph + LoRA in Feature Combination Matrix by @JyChang012 in #11187
  • [TRTLLM-10030][perf] beam search (remove GPU sync + fix batching + refactor) by @ixlmar in #11276
  • [https://nvbugs/5820874][fix] Adjust deepgemm tuning buckets to cover larger num_tokens's scope by @chenfeiz0326 in #11259
  • [TRTLLM-10030][chore] promote SampleState to TypeVar + typing fixes by @ixlmar in #11281
  • [None][fix] Fix amax to avoid NaN issue in fp8_blockscale_gemm_kernel. by @yuxianq in #11256
  • [#11234][test] Move test_ad_export_onnx to integration examples by @nvyocox in #11260
  • [None][fix] Reduce host memory usage during model loading by @jthomson04 in #11119
  • [None][chore] Remove outdated comment in model_engine.py by @hnover-nv in #11240
  • [TRTLLM-10752][chore] set default val of max_num_tokens_in_buffer as max_seq_len or max_input_len by @chuangz0 in #11082
  • [https://nvbugs/5859869][fix] remove test waive since test is already deprecated by @lucaslie in #11288
  • [TRTLLM-9457][feat] Add cute dsl fp8 gemm for Blackwell by @yifeizhang-c in #10130
  • [https://nvbugs/5856637][ci] Remove the skip for fixed tests. by @SimengLiu-nv in #11285
  • [https://nvbugs/5744432][fix] fix bench script test by @Superjomn in #10483
  • [TRTLLM-10021][docs] Skip Softmax Attention blog and docs. by @bobboli in #10592
  • [#11148][feat] AutoDeploy: Better structure the custom op by @nvchenghaoz in #11152
  • [None][feat] AutoDeploy: add triton backend for causal conv by @nvchenghaoz in #11124
  • [https://nvbugs/5722629] [fix] Remove waive for nvbug 5722629 by @zongfeijing in #11278
  • [None][infra] Waive failed case and delete the redundant waives by @EmmaQiaoCh in #11331
  • [https://nvbugs/5756028][fix] Fix VSWA initialization with spec-dec and boundary condition in context input preparation by @eopXD in #10798
  • [None][doc] Add sparse attention docs to index. by @bobboli in #11342
  • [TRTLLM-9524][feat] Modularization of the transceiver for KV manager v2 (step 4) by @Shixiaowei02 in #11225
  • [None][chore] AutoDeploy update SuperV3 checkpoints and accuracy thresholds by @galagam in #11107
  • [https://nvbugs/5863392][fix] fix partial reuse disabled for disagg by @Tabrizian in #11247
  • [https://nvbugs/5848756][fix] Re-take ownership of mrope tensors in prefill worker by @2ez4bz in #11217
  • [TRTLLM-10030][perf] avoid sync in PyTorchModelEngine when using beam search by @ixlmar in #11341
  • [None][ci] Waive test failures on main 02/08 by @chzblych in #11365

Full Changelog: v1.3.0rc2...v1.3.0rc3
