NVIDIA/TensorRT-LLM v1.3.0rc3

Pre-release

Highlights:

Model Support
  - Support LoRA BF16 checkpoints with Llama 3.3-70B FP8 (#9808)
  - Add Eagle3 support for Nemotron H (#11131)
  - Enhance support for complex models (#11254)

API
  - Allow overriding quantization configs (#11062)
  - Set continuous_usage_stats default to False to follow the OpenAI protocol (#10644); see the request sketch after this list
  - Set max_num_tokens_in_buffer default based on max_seq_len/max_input_len (#11082)
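
The continuous_usage_stats change only alters the default; streaming clients that want per-chunk usage statistics can still request them explicitly. Below is a minimal sketch using the OpenAI Python client against an OpenAI-compatible TensorRT-LLM server; the server URL, the model id, and the placement of continuous_usage_stats under stream_options are illustrative assumptions, not details confirmed by this release note.

```python
# Minimal sketch: explicitly opt back in to per-chunk usage stats now that
# the server default is False. The endpoint URL, model id, and the
# "continuous_usage_stats" key location are assumptions for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

stream = client.chat.completions.create(
    model="my-served-model",  # hypothetical model id registered with the server
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
    stream_options={
        "include_usage": True,           # standard OpenAI streaming option
        "continuous_usage_stats": True,  # request usage on every chunk
    },
)

for chunk in stream:
    if chunk.usage is not None:
        print(chunk.usage)
```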

Feature
  - Export ONNX for DriveOS LLM (#10117)
  - Add L2 norm pattern matcher and fusion transform (#10767)
  - Add PDL support for moeAlltoAllKernels (#10591)
  - Integrate KVCacheManager V2 into TRTLLM runtime (#10659)
  - Integrate cuda.tile RMS norm kernels (#9725)
  - Refactor request fetching logic for better separation of concerns (#10988)
  - Implement gen-first disagg_service (#11020)
  - Support disagg SLURM job rescheduling (#11218)
  - Improve layer classification for sharding (#10718)
  - Add priority-based KV cache offload filtering (#10751)
  - Optimize beam search performance (remove GPU sync, fix batching, refactor) (#11276)
  - Avoid sync in PyTorchModelEngine when using beam search (#11341)
  - Adjust DeepGEMM tuning buckets to cover a larger num_tokens range (#11259)
  - Add CuteDSL FP8 GEMM for Blackwell (#10130)
  - Reduce host memory usage during model loading (#11119)
  - Add perfect routing for DeepSeek models (#11127)
  - Modularize transceiver for KV manager v2 (step 4) (#11225)

Fix
  - Fix AttributeError with return_perf_metrics on TensorRT backend (#10662)
  - Prevent routing context and generation requests to the same worker; document unique disagg ID (#11095)
  - Prevent out-of-bounds read (#10868)
  - Add __syncthreads() to TinyGEMM to resolve intermittent accuracy issues (#10873)
  - Fix PD disaggregation for VLMs that use mrope (#10865)
  - Always reset drafting states for GuidedDecoder (#10899)
  - Use NCCL as a fallback to avoid crashes due to insufficient memory (#10928)
  - Fix Llama SM120 speculative decoding (#10765)
  - Fix MTP one-model sampler (#10369)
  - Align kv_scales with ModelOpt HF checkpoint (#10745)
  - Fix selective_state_update perf regression for T=1 decode path (#11194)
  - Make health_generate work with beam search (#11097)
  - Work around accuracy issue by enforcing paged_context_fmha on Hopper for fmha_v2 (#11192)
  - Fix CuteDSL argmax on SM120 (#11181)
  - Fix amax to avoid NaN issue in fp8_blockscale_gemm_kernel (#11256)
  - Fix VSWA initialization with spec-dec and boundary condition in context input preparation (#10798)
  - Fix partial KV cache reuse being disabled for disaggregated serving (#11247)
  - Retake ownership of mrope tensors in prefill worker (#11217)
  - Fix proto-to-SamplingParams conversion bugs and add gRPC tests (#11292)
  - Fix accuracy drop in VSWA with KV cache block reuse (#10875)

Documentation
  - Add Glm4MoeForCausalLM to model support matrix (#11156)
  - Fix GLM4-MoE Eagle support documentation (#11198)
  - Add CUDA Graph + LoRA to feature combination matrix (#11187)
  - Fix comments for KV cache manager v2 (#11207)
  - Add Skip Softmax Attention blog and docs (#10592)
  - Add sparse attention docs to index (#11342)

Test & Infra
  - Update GB200 test configs to use frontend SLURM platforms (#11085)
  - Fix jaraco-context and wheel vulnerability (#10901)
  - Add --high-priority in bot help message (#11133)
  - Print memory usage before/after accuracy test in CI (#11155)
  - Fix mocking of HuggingFace downloads in with_mocked_hf_download (#11200)
  - Set rerun report stage UNSTABLE and pipeline SUCCESS when rerun tests pass (#11210)
  - Move 6x H100 test stage to AIHub platform (#11039)
  - Add disagg perf tests (#10912)
  - Provide uniform test framework to test all MoE backends (#11128)
  - Move disagg scripts' env configs from bash to submit.py (#10223)
  - Use free port for serve test (#10878)
  - Fix test_auto_scaling for 2 GPUs (#10866)
  - Update test list (#10883)
  - Fix an invalid test name (#11195)
  - Refine QA test list for SM120 (#11248)
  - Fix multimodal serve test (#11296)
  - Pass without_comm to Cutlass and DeepGEMM (#11229)
  - Promote SampleState to TypeVar and fix typing (#11281)
  - Fix bench script test (#10483)

What's Changed

  • [None][feat] Export ONNX for DriveOS LLM by @nvyocox in #10117
  • [#9525][feat] add L2 norm pattern matcher and fusion transform by @karthikvetrivel in #10767
  • [TRTINFRA-7548][infra] Update GB200 test configs to use frontend SLURM platforms by @mlefeb01 in #11085
  • [None][doc] Add Glm4MoeForCausalLM to model support matrix by @venkywonka in #11156
  • [None][feat] Perfect routing for Deepseek models by @brb-nv in #11127
  • [TRTLLM-10398][feat] Enable TRTLLM moe backend for Nemotron Super by @nv-guomingz in #10791
  • [#8242][feat] Add int4 GPTQ support for AutoDeploy by @Fridah-nv in #8248
  • [https://nvbugs/5804683][infra] unwaive Mistral Large3 test by @byshiue in #10680
  • [TRTLLM-9771][feat] Allow overriding quantization configs by @shuyixiong in #11062
  • [None][ci] Waive a flaky test on A10 by @chzblych in #11163
  • [None][infra] Waive failed cases for main on 1/30 by @EmmaQiaoCh in #11142
  • [None][fix] AttributeError with return_perf_metrics on tensorrt backend by @riZZZhik in #10662
  • [https://nvbugs/5834212][fix] prevent routing ctx and gen requests to the same worker; update doc for unique disagg ID by @reasonsolo in #11095
  • [TRTLLM-10666][chore] Refactor request fetching logic for better separation of concerns by @lancelly in #10988
  • [https://nvbugs/5823284][fix] Unwaive no repro hang issue by @liji-nv in #11138
  • [None] [feat] Add PDL support for moeAlltoAllKernels by @kaiyux in #10591
  • [None][infra] Waive failed cases and disable a stage on 02/02 by @EmmaQiaoCh in #11177
  • [TRTLLM-9766][feat] Integration of the KVCacheManager V2 to TRTLLM Runtime by @yizhang-nv in #10659
  • [None][chore] Mass integration of release/1.2 - 2nd by @dominicshanshan in #11088
  • [None][feat] Integrate cuda.tile RMS norm kernels by @lirundong in #9725
  • [None][test] Fix an invalid test name by @chzblych in #11195
  • [None][feat] Nemotron H: Eagle3 support by @IzzyPutterman in #11131
  • [#10826][feat] AutoDeploy: Eagle One-Model [2/n]: Prefill-Only Implementation by @govind-ramnarayan in #11073
  • [None][doc] Fix GLM4-MoE Eagle support documentation by @venkywonka in #11198
  • [TRTLLM-10561][infra] Fix jaraco-context and wheel vulnerability by @yiqingy0 in #10901
  • [TRTLLM-10307][infra] Add --high-priority in bot help message by @mzweilz in #11133
  • [None][chore] Print memory usage before/after accuracy test in CI by @taylor-yb-lee in #11155
  • [TRTLLM-10803][fix] Fix mocking of HuggingFace downloads in with_mocked_hf_download by @anish-shanbhag in #11200
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11193
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11202
  • [TRTLLM-10839][infra] Set rerun report stage UNSTABLE and pipeline SUCCESS in post-merge when there are passed rerun tests by @yiqingy0 in #11210
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #11216
  • [None][fix] Align kv_scales with modelopt HF checkpoint by @cjluo-nv in #10745
  • [https://nvbugs/5739981][fix] unwaive tests using opt-125M by @ixlmar in #11100
  • [TRTLLM-10019][infra] Move 6 h100 test stage to aihub platform by @yuanjingx87 in #11039
  • [TRTLLM-8921][feat] implement gen-first disagg_service by @reasonsolo in #11020
  • [#11086][feat] Optimize Auto Deploy weight loading by preloading weights to CPU by @taylor-yb-lee in #11059
  • [None][fix] Set continuous_usage_stats default to False to follow OpenAI protocol by @riZZZhik in #10644
  • [None][chore] bump version to 1.3.0rc3 by @tburt-nv in #11238
  • [TRTLLM-8263][feat] Add Disagg Perf Tests by @chenfeiz0326 in #10912
  • [None][fix] Fix selective_state_update perf regression for T=1 decode path by @galagam in #11194
  • [TRTLLM-9111][feat] provide the uniform test framework to test all MoE backends by @xxi-nv in #11128
  • [None][fix] make health_generate work with beam search by @ixlmar in #11097
  • [None][feat] move some disagg script's env configs from bash to submit.py by @dc3671 in #10223
  • [https://nvbugs/5624818][fix] Work around accuracy issue by enforcing paged_context_fmha on Hopper for fmha_v2 by @eopXD in #11192
  • [None][feat] Support disagg slurm jobs rescheduling by @qiaoxj07 in #11218
  • [#10966][feat] AutoDeploy: kv cache manager integration [2/2] by @lucaslie in #11149
  • [TRTLLM-10673][feat] Improved layer classification for sharding by @greg-kwasniewski1 in #10718
  • [None][chore] AutoDeploy: Set nanov3 and superv3 configs to use flashinfer ssm by @galagam in #11183
  • [https://nvbugs/5674665][fix] Fix accuracy drop in VSWA with KV cache block reuse by @SimengLiu-nv in #10875
  • [https://nvbugs/5849697][fix] Refine QA Test List for SM120 by @dongfengy in #11248
  • [https://nvbugs/5854860][fix] Fix cutedsl argmax on sm120 by @dongfengy in #11181
  • [None][fix] Fix comments for kv cache manager v2 by @yizhang-nv in #11207
  • [https://nvbugs/5837275][fix] Unwaive the failing case that cannot be… by @liji-nv in #11137
  • [https://nvbugs/5800679][fix] Re-enable test after bug fixed by @dongfengy in #11249
  • [TRTLLM-9210][fix] Add failed cases into waives.txt by @xinhe-nv in #11223
  • [https://nvbugs/5747920][fix] Fix multimodal serve test by @yechank-nvidia in #11296
  • [None][chore] Pass without_comm to cutlass and deepgemm by @xxi-nv in #11229
  • [None][feat] Enhance support for complex models by @lowsfer in #11254
  • [#11037][fix] Fix proto-to-SamplingParams conversion bugs and add gRPC tests by @CatherineSue in #11292
  • [None][feat] Add priority-based KV cache offload filtering support by @nv-yna in #10751
  • [None][docs] Add CUDA Graph + LoRA in Feature Combination Matrix by @JyChang012 in #11187
  • [TRTLLM-10030][perf] beam search (remove GPU sync + fix batching + refactor) by @ixlmar in #11276
  • [https://nvbugs/5820874][fix] Adjust deepgemm tuning buckets to cover larger num_tokens's scope by @chenfeiz0326 in #11259
  • [TRTLLM-10030][chore] promote SampleState to TypeVar + typing fixes by @ixlmar in #11281
  • [None][fix] Fix amax to avoid NaN issue in fp8_blockscale_gemm_kernel. by @yuxianq in #11256
  • [#11234][test] Move test_ad_export_onnx to integration examples by @nvyocox in #11260
  • [None][fix] Reduce host memory usage during model loading by @jthomson04 in #11119
  • [None][chore] Remove outdated comment in model_engine.py by @hnover-nv in #11240
  • [TRTLLM-10752][chore] set default val of max_num_tokens_in_buffer as max_seq_len or max_input_len by @chuangz0 in #11082
  • [https://nvbugs/5859869][fix] remove test waive since test is already deprecated by @lucaslie in #11288
  • [TRTLLM-9457][feat] Add cute dsl fp8 gemm for Blackwell by @yifeizhang-c in #10130
  • [https://nvbugs/5856637][ci] Remove the skip for fixed tests. by @SimengLiu-nv in #11285
  • [https://nvbugs/5744432][fix] fix bench script test by @Superjomn in #10483
  • [TRTLLM-10021][docs] Skip Softmax Attention blog and docs. by @bobboli in #10592
  • [#11148][feat] AutoDeploy: Better structure the custom op by @nvchenghaoz in #11152
  • [None][feat] AutoDeploy: add triton backend for causal conv by @nvchenghaoz in #11124
  • [https://nvbugs/5722629] [fix] Remove waive for nvbug 5722629 by @zongfeijing in #11278
  • [None][infra] Waive failed case and delete the redundant waives by @EmmaQiaoCh in #11331
  • [https://nvbugs/5756028][fix] Fix VSWA initialization with spec-dec and boundary condition in context input preparation by @eopXD in #10798
  • [None][doc] Add sparse attention docs to index. by @bobboli in #11342
  • [TRTLLM-9524][feat] Modularization of the transceiver for KV manager v2 (step 4) by @Shixiaowei02 in #11225
  • [None][chore] AutoDeploy update SuperV3 checkpoints and accuracy thresholds by @galagam in #11107
  • [https://nvbugs/5863392][fix] fix partial reuse disabled for disagg by @Tabrizian in #11247
  • [https://nvbugs/5848756][fix] Re-take ownership of mrope tensors in prefill worker by @2ez4bz in #11217
  • [TRTLLM-10030][perf] avoid sync in PyTorchModelEngine when using beam search by @ixlmar in #11341
  • [None][ci] Waive test failures on main 02/08 by @chzblych in #11365

Full Changelog: v1.3.0rc2...v1.3.0rc3
