NVIDIA/TensorRT-LLM v1.3.0rc12

Pre-release

Highlights

  • Model Support

    • Add LTX-2 two-stage pipeline support (#12361)
    • Add CUDA graph support for LTX-2 with torch.compile compatibility (#12653)
    • Add video temporal compression for Nemotron Nano and RADIO (#12649)
    • Extend the Python cache transceiver to support Qwen-Next (#12772)
    • Add CuteDSL MoE backend support for Qwen3.5 (#12799)
    • Fix LoRA support for Qwen3 models (#12785)
    • Support loading FP8 LoRA weight files (#12848)
    • Add support for speculative decoding with LoRA (#12661)
    • Fix OOM with large numbers of LoRA adapters (#12815)
    • Partially fix LoRA overallocation for Nemotron NAS (#12817)
    • Skip inference_mode() when torch.compile=True for Gemma3 FP8 (#12367)
    • Skip NVFP4 fused norm when the dimension does not meet requirements (#12901)
    • Update MoE hidden_size in the communicator for Nemotron-H (#12890)
    • Unify image-as-tensor handling to avoid repeated conversions for nano models (#12994)
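Among the model-support items above, the video temporal compression added for Nemotron Nano and RADIO (#12649) reduces the number of frames fed to the vision encoder. The real implementation is model-specific and lives inside TensorRT-LLM; the sketch below only illustrates the general idea of uniform temporal sampling, with all names chosen for this example.

```python
# Illustrative sketch (not TensorRT-LLM's actual code) of temporal
# compression in the spirit of #12649: reduce a clip's frame count
# before vision encoding by uniformly sampling frames.
def compress_temporal(frames, target_frames):
    """Uniformly sample up to `target_frames` frames from `frames`."""
    n = len(frames)
    if n <= target_frames:
        return list(frames)  # nothing to compress
    # Pick evenly spaced indices across the clip.
    step = n / target_frames
    return [frames[int(i * step)] for i in range(target_frames)]
```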
  • API

    • Refine the VisualGen API structure (#12807)
    • Convert VisualGenParams to Pydantic with request validation, per-model defaults, and extra_params support (#12922)
    • Align AttentionPlugin with the EdgeLLM interface (#12233)
    • Add shorthand KVConnector paths for lmcache and kvbm (#12626)
    • Add the missing allow_partial_loading parameter to CuteDSL and ConfigurableMoE load_weights (#12761)
    • Improve KV cache statistics monitoring (#12413)
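The VisualGenParams change (#12922) moves request parameters to Pydantic with validation, per-model defaults, and an extra_params escape hatch. As a self-contained sketch of that validation pattern, the example below uses stdlib dataclasses instead of Pydantic, and every field name and default here is hypothetical rather than the actual TensorRT-LLM API.

```python
# Stdlib sketch (dataclasses in place of Pydantic, for self-containment)
# of the pattern behind #12922: per-model defaults merged under explicit
# overrides, request validation on construction, and an extra_params
# escape hatch. All names are illustrative.
from dataclasses import dataclass, field

@dataclass
class GenParams:
    height: int = 512
    width: int = 512
    num_inference_steps: int = 30
    extra_params: dict = field(default_factory=dict)

    def __post_init__(self):
        # Validate eagerly so bad requests fail before reaching the model.
        for name in ("height", "width"):
            v = getattr(self, name)
            if v <= 0 or v % 8 != 0:
                raise ValueError(f"{name} must be a positive multiple of 8")
        if self.num_inference_steps <= 0:
            raise ValueError("num_inference_steps must be positive")

# Hypothetical per-model defaults, applied beneath user overrides.
MODEL_DEFAULTS = {"ltx-2": {"num_inference_steps": 40}}

def make_params(model: str, **overrides) -> GenParams:
    merged = {**MODEL_DEFAULTS.get(model, {}), **overrides}
    return GenParams(**merged)
```

Merging defaults before construction keeps validation in one place: a request for "ltx-2" picks up its default step count unless the caller overrides it, and invalid dimensions raise immediately.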
  • Feature

    • Add NvTelemetry/GXT-compliant usage telemetry (#12384)
    • Add production-level Prometheus metrics for iteration stats, config info, token counters, and phase histograms (#12545)
    • Add conversation-affinity routing for disaggregated serving (#12526)
    • Enable block reuse with the overlap scheduler (#12816)
    • Unify VisualGen parallelism (#12509)
    • Consolidate piecewise CUDA graph VLM updates (#12852)
    • Add tunable NVFP4 quantization with an additional FlashInfer backend (#12126)
    • Optimize GDN prefill with indexed in-kernel state updates (#12791)
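The conversation-affinity routing for disaggregated serving (#12526) pins requests from the same conversation to the same worker so its KV cache can be reused. The sketch below shows the core idea with a sticky map over round-robin assignment; the actual router's policy and interfaces may differ, and all names here are invented for illustration.

```python
# Minimal sketch of conversation-affinity routing in the spirit of
# #12526: the first request of a conversation is assigned round-robin,
# and every later request with the same conversation id is routed to
# the same worker, preserving KV cache locality. Illustrative only.
import itertools

class AffinityRouter:
    def __init__(self, workers):
        self._rr = itertools.cycle(list(workers))  # fallback assignment
        self._pin = {}  # conversation id -> pinned worker

    def route(self, conversation_id: str) -> str:
        # Pin on first sight; stay pinned thereafter.
        if conversation_id not in self._pin:
            self._pin[conversation_id] = next(self._rr)
        return self._pin[conversation_id]
```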
  • Fix

    • Propagate disaggregated_params through PostprocWorker (#12513)
    • Prebuild disaggregated context responses to avoid ctx_request_id races (#12466)
    • Generate HMAC keys for MGMN IPC servers in disaggregated serving (#12670)
    • Enable HMAC authentication in VisualGen ZMQ IPC channels (#12680)
    • Fix disaggregated gen-only hangs caused by blocking KV transfers (#12640)
    • Replace busy-poll sleep in get_async_noblock with the ZMQ async poller (#12189)
    • Make trust_remote_code opt-in in MultimodalModelRunner (#12669)
    • Fix VLM guided decoding startup crashes caused by missing vocab_size_padded (#12284)
    • Eliminate double PNG encoding in visual generation serving (#12903)
    • Treat whitespace-only content correctly in nano-v3 reasoning swap (#12912)
    • Clamp usedNumBlocks to non-negative values in KV cache statistics (#11922)
    • Fix moe_chunking_tokens handling during MoE A2A (#12929)
    • Guard CUDA event elapsed_time in perf_metrics_manager to prevent executor crashes (#12868)
    • Remove leftover onboardBlocks parameters in kvCacheManagerTest (#13107)
    • Add CUDA device setup before load_remote_agent (#12619)
    • Fix Mooncake transfer agent binding (#12723)
    • Fix multi_stream_moe accuracy with MLIR and piecewise CUDA graphs (#12847)
    • Fix Nano chunked prefill (#12782)
    • Fix constrained decoding for GLM5 (#12869)
    • Fix benchmark disaggregated deadlocks by removing a blocking fill loop (#12208)
    • Update CUTLASS C++ to 4.4.2 (#12897)
    • Pin Ray to 2.54.1 (#13071)
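Two of the fixes above harden IPC channels with HMAC (#12670, #12680): the server generates a per-session key and each frame carries a tag the receiver verifies before deserializing. The stdlib sketch below illustrates that pattern; the framing and function names are invented for this example and are not TensorRT-LLM's actual wire format.

```python
# Sketch of the HMAC authentication pattern behind #12670/#12680:
# a randomly generated session key, a tag prepended to each payload,
# and constant-time verification on receipt. Stdlib only; illustrative.
import hashlib
import hmac
import os

TAG_LEN = 32  # SHA-256 digest size

def generate_key() -> bytes:
    return os.urandom(32)  # 256-bit session key

def sign(key: bytes, payload: bytes) -> bytes:
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload  # frame = tag || payload

def verify(key: bytes, frame: bytes) -> bytes:
    tag, payload = frame[:TAG_LEN], frame[TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids timing side channels.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed")
    return payload
```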
  • Documentation

    • Add the attention developer guide (#12693)
    • Add a README for custom Claude Code skills and agents (#12920)
    • Update coding guidelines to require Python >= 3.10 (#13094)
  • Benchmark

    • Optimize the Qwen3.5 decode delta kernel (#12740)
    • Reduce host overhead in DSA MLA attention (#12631)
    • Add a host performance regression test suite for PyExecutor (#12148)
    • Add benchmark coverage for allreduce backends (#12887)
    • Restore DSR1/DSV32/K2 disaggregated performance tests (#12688)
    • Support NV SA benchmarks in CI performance testing (#13004)
    • Add K2.5 performance tests into CI (#12931)
  • Test & Infra

    • Update Perf Sanity System code paths (#12430)
    • Bump etcd to 3.6.9 to pick up the gRPC fix (#12594)
    • Fix the PLC nightly pipeline and expose more pipeline data (#12940)
    • Exclude QA nodes when running TRTLLM CI (#13102)
    • Add a unit test for lifecycle race condition errors in disaggregated serving (#12803)
    • Add an end-to-end test for PP + disagg + block reuse + chunked prefill hangs (#12913)
    • Add Nemotron-3-Super-120B-A12B-NVFP4 functional and performance cases on DGX Spark (#12830)
    • Remove obsolete RTX-6000 OOM tests (#12800)
    • Remove unused tests (#12625)
    • Check unused fixtures (#12730)
    • Fix Qwen3 skip-softmax attention CI tests (#12789)
    • Fix failing KV cache transceiver tests from the perf sanity changes (#12554)
    • Fix Wan unit tests (#13026)
    • Remove obsolete waivers (#12979)
    • Move the PY312-UB2404 sanity check test to A100X nodes (#13077)
    • Pin Ray to 2.54.1 in the Slurm CI stage (#13085)

What's Changed

  • [None][test] Unwaive Nemotron H flaky case by @nv-guomingz in #11236
  • [https://nvbugs/5997543][fix] unwaive test_disaggregated_overlap_transceiver_runtime_python by @chuangz0 in #12580
  • [TRTLLM-11574][feat] Some updates on Perf Sanity System codes by @chenfeiz0326 in #12430
  • [None][doc] add attention developer guide by @QiJune in #12693
  • [https://nvbugs/5991957][fix] Propagate disaggregated_params through PostprocWorker by @peihu-nv in #12513
  • [https://nvbugs/5883590][fix] Generate HMAC key for MGMN IPC server in disaggregated serving by @yibinl-nvidia in #12670
  • [https://nvbugs/5941242][fix] Fix SigLIP test failure by @tijyojwad in #12717
  • [None][feat] Optimize qwen3.5 decode delta kernel by @nv-guomingz in #12740
  • [https://nvbugs/5961736][fix] Prebuild disagg ctx response to avoid ctx_request_id race by @peihu-nv in #12466
  • [https://nvbugs/5922880][fix] Enable HMAC authentication in VisualGen ZMQ IPC channels by @yibinl-nvidia in #12680
  • [None][fix] Add missing allow_partial_loading param to CuteDSL and ConfigurableMoE load_weights by @qiaoxj07 in #12761
  • [None][chore] Waive hanging Nemotron Super test by @brb-nv in #12821
  • [None][fix] add cuda set device before load_remote_agent by @chuangz0 in #12619
  • [None][chore] Remove closed bugs by @xinhe-nv in #12766
  • [None][test] Remove RTX-6000 OOM test cases by @yufeiwu-nv in #12800
  • [None][fix] Fix LoRA support for Qwen3 models by @achartier in #12785
  • [TRTLLM-11343][feat] LTX-2 Two Stage pipeline support by @yibinl-nvidia in #12361
  • [#12808][feat] AutoDeploy: Add Gemma4 Support by @bmarimuthu-nv in #12710
  • [None][feat] Add Claude Code agents and skills for kernel dev, perf analysis, and compilation by @kaiyux in #12831
  • [#11879][fix] Clamp usedNumBlocks to non-negative in KV cache stats by @wojciech-wais in #11922
  • [https://nvbugs/6029864][fix] Fix flaky ray test failure by @brb-nv in #12697
  • [https://nvbugs/5813192][fix] Make trust_remote_code opt-in in MultimodalModelRunner by @yibinl-nvidia in #12669
  • [None][infra] Bump etcd to 3.6.9 to involve grpc fix by @yuanjingx87 in #12594
  • [https://nvbugs/5658258][fix] Fix OOM with large number of LoRA adapters by @brb-nv in #12815
  • [None][feat] AutoDeploy: Add the Triton kernel for MLA by @nvchenghaoz in #12664
  • [None][fix] replace busy-poll sleep in get_async_noblock with zmq async poller by @edenfunf in #12189
  • [https://nvbugs/6018647][test] Add unit test for Lifecycle Race Condition error in disagg server by @yingguo-trt in #12803
  • [None][infra] Add DSR1 DSV32 K2 Disagg Perf Tests Back by @chenfeiz0326 in #12688
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12765
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12814
  • [None][fix] Fix VLM guided decoding startup crash due to missing vocab_size_padded property by @stefanpantic in #12284
  • [None][fix] Fix Nano chunked prefill by @2ez4bz in #12782
  • [https://nvbugs/6029220][fix] Disable multi-stream in maybe_execute_i… by @liji-nv in #12659
  • [None][test] remove unused tests by @xinhe-nv in #12625
  • [https://nvbugs/6000658][fix] Fix disagg gen-only hang where 10s sleep in can_forward blocks KV transfers and overflows CTX memory by @peihu-nv in #12640
  • [#12593][feat] AutoDeploy: onboard DeepSeek-R1 by @galagam in #12601
  • [#11548][feat] AutoDeploy: Optimize Qwen3.5 perf by @taylor-yb-lee in #12265
  • [None][chore] Set the use_one_model flag to True by default on llm ap… by @nv-guomingz in #12836
  • [https://nvbugs/5921674][fix] unwaive TestNemotronNanoV3 fp8 tests by @tcherckez-nvidia in #12792
  • [None][feat] Add NvTelemetry/GXT-compliant usage telemetry by @venkywonka in #12384
  • [https://nvbugs/5996776][fix] Fix test OOM by @dongfengy in #12856
  • [None][feat] Support loading FP8 LoRA weight files by @achartier in #12848
  • [None][test] check unused fixtures by @xinhe-nv in #12730
  • [TRTLLM-11804][feat] Mechanical refactoring VisualGen API by @zhenhuaw-me in #12807
  • [TRTLLM-11324][perf] Add host performance regression test suite for PyExecutor by @hyukn in #12148
  • [None][chore] unwaive some dis-agg tests by @Shixiaowei02 in #12828
  • [TRTLLM-11707][feat] Add CUDA graph support (torch compile compatible) for LTX-2 by @luyiyun1021 in #12653
  • [https://nvbugs/6055474][test] Fix RTX-6000 with wrong moe backend by @yufeiwu-nv in #12886
  • [None][chore] Waive failing pre-merge test by @brb-nv in #12916
  • [None][docs] Add README for custom Claude Code skills and agents by @kaiyux in #12920
  • [TRTLLM-11421][feat] Support better kv cache statistics monitoring by @eopXD in #12413
  • [https://nvbugs/5448464][fix] Partially fix LoRA overallocation for Nemotron NAS by @brb-nv in #12817
  • [https://nvbugs/5996776][fix] Unwaive tests after fix by @dongfengy in #12906
  • [https://nvbugs/5940463][fix] remove test_cli_flow.py::TestSantacoder case by @QiJune in #12845
  • [None][test] Add Nemotron-3-Super-120B-A12B-NVFP4 func and perf cases on DGX-spark by @JennyLiu-nv in #12830
  • [None][infra] Fix plc nightly pipeline and show more data by @yuanjingx87 in #12940
  • [TRTLLM-11268][feat] Video temporal compression to Nemotron Nano and RADIO by @2ez4bz in #12649
  • [https://nvbugs/5910749][https://nvbugs/5995486][test] Fix Qwen3 skip softmax attention CI tests by @bobboli in #12789
  • [https://nvbugs/6043312][fix] fix_mooncake_transfer_agent_binding by @chuangz0 in #12723
  • [#12699][feat] consolidate piecewise CUDA graph VLM updates by @nvchenghaoz in #12852
  • [TRTLLM-11770][feat] Skip nvfp4 fused norm if the dim doesn't meet the requirement by @pamelap-nvidia in #12901
  • [None][fix] skip inference_mode() when torch.compile=True for gemma3 fp8 by @amukkara in #12367
  • [None][feat] AutoDeploy: Onboard google/gemma-4-31B-it dense model, including nvfp4 by @suyoggupta in #12866
  • [#12634][feat] AutoDeploy: Support rank 256 MLA in flashinfer_mla by @bmarimuthu-nv in #12519
  • [https://nvbugs/5997534][fix] AutoDeploy: Skip Eagle3 One Model Test on pre-Hopper by @govind-ramnarayan in #12757
  • [None][fix] Fix multi_stream_moe accuracy with MLIR and piecewise cudagraphs by @suyoggupta in #12847
  • [https://nvbugs/5961739][fix] Unwaiving failing tests by @greg-kwasniewski1 in #12936
  • [#12954][fix] AutoDeploy: Fix Gemma4 MoE config (disable multi_stream_moe, lower free_gpu_memory_fraction) by @suyoggupta in #12955
  • [TRTLLM-11540][feat] Add EAGLE3 dynamic tree speculative decoding support by @sunnyqgg in #12062
  • [https://nvbugs/6064029][fix] Eliminate double PNG encoding in visual gen serving by @karljang in #12903
  • [TRTLLM-11532][refactor] Unify VisualGen parallelism by @NVShreyas in #12509
  • [None][fix] Fix 'max_batch_size' conflict in AD dashboard script by @tcherckez-nvidia in #12967
  • [TRTLLM-11797][feat] Add cutedsl moe backend supporting for qwen3.5. by @nv-guomingz in #12799
  • [TRTLLM-11315][feat] Extend python cache transceiver to support Qwen-Next by @bo-nv in #12772
  • [https://nvbugs/5991576][test] Add E2E test for PP+disagg+block_reuse+chunked_prefill hang by @yingguo-trt in #12913
  • [None][feat] Align AttentionPlugin with EdgeLLM interface by @nvyocox in #12233
  • [None][infra] Waive 1 failed cases for main in post-merge 2648 by @ZhanruiSunCh in #12975
  • [https://nvbugs/5983390][perf] Reduce host overhead in DSA MLA attent… by @liji-nv in #12631
  • [None][infra] Waive 8 failed cases for main in post-merge 2646 by @ZhanruiSunCh in #12934
  • [None][fix] Unwaive phi4 accuracy tests by @Wanli-Jiang in #12832
  • [None][feat] Add benchmark for all allreduce backend by @yilin-void in #12887
  • [TRTLLM-11893][feat] Convert VisualGenParams to Pydantic with extra_params, per-model defaults, and request validation by @zhenhuaw-me in #12922
  • [None][infra] Waive 4 failed cases for main in pre-merge 33523 by @ZhanruiSunCh in #12977
  • [None][infra] Waive 4 failed cases for main in post-merge by @xinhe-nv in #12973
  • [TRTLLM-11657][feat] Conversation affinity disagg router by @reasonsolo in #12526
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12953
  • [None][feat] optimize GDN prefill with indexed in-kernel state updates by @nv-guomingz in #12791
  • [https://nvbugs/6061812][fix] Unblock ruff check by @VALLIS-NERIA in #12996
  • [None][fix] Update moe hidden_size in communicator for nemotron-h by @Wanli-Jiang in #12890
  • [TRTLLM-11492][fix] Fix benchmark disagg deadlock by eliminating blocking fill loop by @chienchunhung in #12208
  • [TRTLLM-10938][feat] Enable block reuse with overlap scheduler by @chienchunhung in #12816
  • [#12617][feat] Add support for speculative decoding with LoRA by @Funatiq in #12661
  • [None][fix] Fix constrained decoding for GLM5 by @cascade812 in #12869
  • [None][test] Waive two dsv3lite cases due to nvbug 6071081. by @nv-guomingz in #13001
  • [None][feat] Add production-level Prometheus metrics (iteration stats, config info, token counters, phase histograms) by @nvyutwu in #12545
  • [None][infra] Remove invalid test case in waive list by @yuanjingx87 in #13008
  • [None][chore] Fix failing KV Cache Transceiver Tests from #11574 by @ekou24 in #12554
  • [https://nvbugs/6060281][fix] Treat whitespace-only content in nano-v3 reasoning swap by @tijyojwad in #12912
  • [None][feat] KVConnector shorthand paths for "lmcache" and "kvbm" with examples by @sammshen in #12626
  • [#12712][feat] AutoDeploy Model Onboarding Sprint 03/19 - Part 1 (infra only) by @govind-ramnarayan in #12708
  • [https://nvbugs/6059036][fix] AutoDeploy fix registry accuracy tests by @nvchenghaoz in #12942
  • [https://nvbugs/5963665][refactor] Refactor warmup orchestration in ModelEngine by @liji-nv in #12407
  • [https://nvbugs/5973214][fix] unwaive qwen3 ci test by @byshiue in #12237
  • [https://nvbugs/5781383][chore] Unwaive test by @shuyixiong in #12282
  • [None][chore] AutoDeploy: Added Qwen3.5 accuracy test for NVFP4 by @taylor-yb-lee in #13014
  • [None][chore] Unify code path for reuse/non-reuse when adding sequence in kv cache manager by @eopXD in #10437
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #13016
  • [None][chore] Waive failed tests by @yiqingy0 in #13035
  • [TRTLLM-11091][feat] Add tunable nvfp4 quantize with additional FlashInfer backend by @chang-l in #12126
  • [TRTLLM-11540][feat] Revert EAGLE3 dynamic tree speculative decoding support (#12062) by @brb-nv in #13006
  • [None][fix] fix Wan unit tests by @zhenhuaw-me in #13026
  • [None][fix] Update CUTLASS C++ to 4.4.2 by @depaulmillz in #12897
  • [None][chore] Waive failing tests 04/14 by @brb-nv in #13049
  • [None][chore] Unwaive broader test lists by @brb-nv in #13053
  • [None][chore] Update waived test name by @brb-nv in #13058
  • [None][infra] Waive 4 failed cases for main in post-merge 2652 by @ZhanruiSunCh in #13067
  • [None][fix] Fix moe_chunking_tokens during MoE A2A by @Wanli-Jiang in #12929
  • [None][infra] Support nv sa benchmark in CI Perf Test by @chenfeiz0326 in #13004
  • [None][fix] Pin Ray version to 2.54.1 by @shuyixiong in #13071
  • [None][infra] Add K2.5 Perf Tests into CI by @chenfeiz0326 in #12931
  • [https://nvbugs/5846024][fix] Remove waivers by @VALLIS-NERIA in #12979
  • [https://nvbugs/5838178][fix] Fix failing lora test for Llama by @brb-nv in #12950
  • [None][fix] Guard CUDA event elapsed_time in perf_metrics_manager to prevent executor crash by @yifjiang in #12868
  • [None][fix] Pin Ray version to 2.54.1 in slurm CI stage by @shuyixiong in #13085
  • [TRTLLM-11266][feat] Unify image as tensor to avoid multiple converting for nano model by @Wanli-Jiang in #12994
  • [None][fix] Update CODING_GUIDELINES.md to say Python >= 3.10 by @hnover-nv in #13094
  • [None][chore] Remove onboard block switch for KV cache manager by @eopXD in #12449
  • [None][infra] Waive 3 failed cases for main in post-merge 2652 by @ZhanruiSunCh in #13070
  • [None][infra] Exclude QA nodes when running TRTLLM CI by @yuanjingx87 in #13102
  • [None][fix] Remove leftover onboardBlocks param in kvCacheManagerTest by @eopXD in #13107
  • [None][infra] Waive 1 failed cases for main in post-merge 2653 by @ZhanruiSunCh in #13109
  • [TRTLLM-11990][infra] Move PY312-UB2404 sanityCheck test to A100X node by @yiqingy0 in #13077
  • [None][chore] Bump version to 1.3.0rc12 by @VALLIS-NERIA in #13129

New Contributors

  • @stefanpantic made their first contribution in #12284
  • @nvyutwu made their first contribution in #12545
  • @sammshen made their first contribution in #12626
  • @depaulmillz made their first contribution in #12897

Full Changelog: v1.3.0rc11...v1.3.0rc12
