NVIDIA/TensorRT-LLM v1.3.0rc12

Pre-release

Highlights

  • Model Support

    • Add LTX-2 two-stage pipeline support (#12361)
    • Add CUDA graph support for LTX-2 with torch.compile compatibility (#12653)
    • Add video temporal compression for Nemotron Nano and RADIO (#12649)
    • Extend the Python cache transceiver to support Qwen-Next (#12772)
    • Add CuteDSL MoE backend support for Qwen3.5 (#12799)
    • Fix LoRA support for Qwen3 models (#12785)
    • Support loading FP8 LoRA weight files (#12848)
    • Add support for speculative decoding with LoRA (#12661)
    • Fix OOM with large numbers of LoRA adapters (#12815)
    • Partially fix LoRA overallocation for Nemotron NAS (#12817)
    • Skip inference_mode() when torch.compile=True for Gemma3 FP8 (#12367)
    • Skip NVFP4 fused norm when the dimension does not meet requirements (#12901)
    • Update MoE hidden_size in the communicator for Nemotron-H (#12890)
    • Unify image-as-tensor handling to avoid repeated conversions for nano models (#12994)
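Among the model-support items above, the video temporal compression added for Nemotron Nano and RADIO (#12649) reduces the number of frames fed to the vision encoder. The real implementation is model-specific and lives inside TensorRT-LLM; the sketch below only illustrates the general idea of uniform temporal sampling, with all names chosen for this example.

```python
# Illustrative sketch (not TensorRT-LLM's actual code) of temporal
# compression in the spirit of #12649: reduce a clip's frame count
# before vision encoding by uniformly sampling frames.
def compress_temporal(frames, target_frames):
    """Uniformly sample up to `target_frames` frames from `frames`."""
    n = len(frames)
    if n <= target_frames:
        return list(frames)  # nothing to compress
    # Pick evenly spaced indices across the clip.
    step = n / target_frames
    return [frames[int(i * step)] for i in range(target_frames)]
```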
  • API

    • Refine the VisualGen API structure (#12807)
    • Convert VisualGenParams to Pydantic with request validation, per-model defaults, and extra_params support (#12922)
    • Align AttentionPlugin with the EdgeLLM interface (#12233)
    • Add shorthand KVConnector paths for lmcache and kvbm (#12626)
    • Add the missing allow_partial_loading parameter to CuteDSL and ConfigurableMoE load_weights (#12761)
    • Improve KV cache statistics monitoring (#12413)
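The VisualGenParams change (#12922) moves request parameters to Pydantic with validation, per-model defaults, and an extra_params escape hatch. As a self-contained sketch of that validation pattern, the example below uses stdlib dataclasses instead of Pydantic, and every field name and default here is hypothetical rather than the actual TensorRT-LLM API.

```python
# Stdlib sketch (dataclasses in place of Pydantic, for self-containment)
# of the pattern behind #12922: per-model defaults merged under explicit
# overrides, request validation on construction, and an extra_params
# escape hatch. All names are illustrative.
from dataclasses import dataclass, field

@dataclass
class GenParams:
    height: int = 512
    width: int = 512
    num_inference_steps: int = 30
    extra_params: dict = field(default_factory=dict)

    def __post_init__(self):
        # Validate eagerly so bad requests fail before reaching the model.
        for name in ("height", "width"):
            v = getattr(self, name)
            if v <= 0 or v % 8 != 0:
                raise ValueError(f"{name} must be a positive multiple of 8")
        if self.num_inference_steps <= 0:
            raise ValueError("num_inference_steps must be positive")

# Hypothetical per-model defaults, applied beneath user overrides.
MODEL_DEFAULTS = {"ltx-2": {"num_inference_steps": 40}}

def make_params(model: str, **overrides) -> GenParams:
    merged = {**MODEL_DEFAULTS.get(model, {}), **overrides}
    return GenParams(**merged)
```

Merging defaults before construction keeps validation in one place: a request for "ltx-2" picks up its default step count unless the caller overrides it, and invalid dimensions raise immediately.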
  • Feature

    • Add NvTelemetry/GXT-compliant usage telemetry (#12384)
    • Add production-level Prometheus metrics for iteration stats, config info, token counters, and phase histograms (#12545)
    • Add conversation-affinity routing for disaggregated serving (#12526)
    • Enable block reuse with the overlap scheduler (#12816)
    • Unify VisualGen parallelism (#12509)
    • Consolidate piecewise CUDA graph VLM updates (#12852)
    • Add tunable NVFP4 quantization with an additional FlashInfer backend (#12126)
    • Optimize GDN prefill with indexed in-kernel state updates (#12791)
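The conversation-affinity routing for disaggregated serving (#12526) pins requests from the same conversation to the same worker so its KV cache can be reused. The sketch below shows the core idea with a sticky map over round-robin assignment; the actual router's policy and interfaces may differ, and all names here are invented for illustration.

```python
# Minimal sketch of conversation-affinity routing in the spirit of
# #12526: the first request of a conversation is assigned round-robin,
# and every later request with the same conversation id is routed to
# the same worker, preserving KV cache locality. Illustrative only.
import itertools

class AffinityRouter:
    def __init__(self, workers):
        self._rr = itertools.cycle(list(workers))  # fallback assignment
        self._pin = {}  # conversation id -> pinned worker

    def route(self, conversation_id: str) -> str:
        # Pin on first sight; stay pinned thereafter.
        if conversation_id not in self._pin:
            self._pin[conversation_id] = next(self._rr)
        return self._pin[conversation_id]
```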
  • Fix

    • Propagate disaggregated_params through PostprocWorker (#12513)
    • Prebuild disaggregated context responses to avoid ctx_request_id races (#12466)
    • Generate HMAC keys for MGMN IPC servers in disaggregated serving (#12670)
    • Enable HMAC authentication in VisualGen ZMQ IPC channels (#12680)
    • Fix disaggregated gen-only hangs caused by blocking KV transfers (#12640)
    • Replace busy-poll sleep in get_async_noblock with the ZMQ async poller (#12189)
    • Make trust_remote_code opt-in in MultimodalModelRunner (#12669)
    • Fix VLM guided decoding startup crashes caused by missing vocab_size_padded (#12284)
    • Eliminate double PNG encoding in visual generation serving (#12903)
    • Treat whitespace-only content correctly in nano-v3 reasoning swap (#12912)
    • Clamp usedNumBlocks to non-negative values in KV cache statistics (#11922)
    • Fix moe_chunking_tokens handling during MoE A2A (#12929)
    • Guard CUDA event elapsed_time in perf_metrics_manager to prevent executor crashes (#12868)
    • Remove leftover onboardBlocks parameters in kvCacheManagerTest (#13107)
    • Add CUDA device setup before load_remote_agent (#12619)
    • Fix Mooncake transfer agent binding (#12723)
    • Fix multi_stream_moe accuracy with MLIR and piecewise CUDA graphs (#12847)
    • Fix Nano chunked prefill (#12782)
    • Fix constrained decoding for GLM5 (#12869)
    • Fix benchmark disaggregated deadlocks by removing a blocking fill loop (#12208)
    • Update CUTLASS C++ to 4.4.2 (#12897)
    • Pin Ray to 2.54.1 (#13071)
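Two of the fixes above harden IPC channels with HMAC (#12670, #12680): the server generates a per-session key and each frame carries a tag the receiver verifies before deserializing. The stdlib sketch below illustrates that pattern; the framing and function names are invented for this example and are not TensorRT-LLM's actual wire format.

```python
# Sketch of the HMAC authentication pattern behind #12670/#12680:
# a randomly generated session key, a tag prepended to each payload,
# and constant-time verification on receipt. Stdlib only; illustrative.
import hashlib
import hmac
import os

TAG_LEN = 32  # SHA-256 digest size

def generate_key() -> bytes:
    return os.urandom(32)  # 256-bit session key

def sign(key: bytes, payload: bytes) -> bytes:
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload  # frame = tag || payload

def verify(key: bytes, frame: bytes) -> bytes:
    tag, payload = frame[:TAG_LEN], frame[TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids timing side channels.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed")
    return payload
```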
  • Documentation

    • Add the attention developer guide (#12693)
    • Add a README for custom Claude Code skills and agents (#12920)
    • Update coding guidelines to require Python >= 3.10 (#13094)
  • Benchmark

    • Optimize the Qwen3.5 decode delta kernel (#12740)
    • Reduce host overhead in DSA MLA attention (#12631)
    • Add a host performance regression test suite for PyExecutor (#12148)
    • Add benchmark coverage for allreduce backends (#12887)
    • Restore DSR1/DSV32/K2 disaggregated performance tests (#12688)
    • Support NV SA benchmarks in CI performance testing (#13004)
    • Add K2.5 performance tests into CI (#12931)
  • Test & Infra

    • Update Perf Sanity System code paths (#12430)
    • Bump etcd to 3.6.9 to pick up the gRPC fix (#12594)
    • Fix the PLC nightly pipeline and expose more pipeline data (#12940)
    • Exclude QA nodes when running TRTLLM CI (#13102)
    • Add a unit test for lifecycle race condition errors in disaggregated serving (#12803)
    • Add an end-to-end test for PP + disagg + block reuse + chunked prefill hangs (#12913)
    • Add Nemotron-3-Super-120B-A12B-NVFP4 functional and performance cases on DGX Spark (#12830)
    • Remove obsolete RTX-6000 OOM tests (#12800)
    • Remove unused tests (#12625)
    • Check unused fixtures (#12730)
    • Fix Qwen3 skip-softmax attention CI tests (#12789)
    • Fix failing KV cache transceiver tests from the perf sanity changes (#12554)
    • Fix Wan unit tests (#13026)
    • Remove obsolete waivers (#12979)
    • Move the PY312-UB2404 sanity check test to A100X nodes (#13077)
    • Pin Ray to 2.54.1 in the Slurm CI stage (#13085)

What's Changed

  • [None][test] Unwaive Nemotron H flaky case by @nv-guomingz in #11236
  • [https://nvbugs/5997543][fix] unwaive test_disaggregated_overlap_transceiver_runtime_python by @chuangz0 in #12580
  • [TRTLLM-11574][feat] Some updates on Perf Sanity System codes by @chenfeiz0326 in #12430
  • [None][doc] add attention developer guide by @QiJune in #12693
  • [https://nvbugs/5991957][fix] Propagate disaggregated_params through PostprocWorker by @peihu-nv in #12513
  • [https://nvbugs/5883590][fix] Generate HMAC key for MGMN IPC server in disaggregated serving by @yibinl-nvidia in #12670
  • [https://nvbugs/5941242][fix] Fix SigLIP test failure by @tijyojwad in #12717
  • [None][feat] Optimize qwen3.5 decode delta kernel by @nv-guomingz in #12740
  • [https://nvbugs/5961736][fix] Prebuild disagg ctx response to avoid ctx_request_id race by @peihu-nv in #12466
  • [https://nvbugs/5922880][fix] Enable HMAC authentication in VisualGen ZMQ IPC channels by @yibinl-nvidia in #12680
  • [None][fix] Add missing allow_partial_loading param to CuteDSL and ConfigurableMoE load_weights by @qiaoxj07 in #12761
  • [None][chore] Waive hanging Nemotron Super test by @brb-nv in #12821
  • [None][fix] add cuda set device before load_remote_agent by @chuangz0 in #12619
  • [None][chore] Remove closed bugs by @xinhe-nv in #12766
  • [None][test] Remove RTX-6000 OOM test cases by @yufeiwu-nv in #12800
  • [None][fix] Fix LoRA support for Qwen3 models by @achartier in #12785
  • [TRTLLM-11343][feat] LTX-2 Two Stage pipeline support by @yibinl-nvidia in #12361
  • [#12808][feat] AutoDeploy: Add Gemma4 Support by @bmarimuthu-nv in #12710
  • [None][feat] Add Claude Code agents and skills for kernel dev, perf analysis, and compilation by @kaiyux in #12831
  • [#11879][fix] Clamp usedNumBlocks to non-negative in KV cache stats by @wojciech-wais in #11922
  • [https://nvbugs/6029864][fix] Fix flaky ray test failure by @brb-nv in #12697
  • [https://nvbugs/5813192][fix] Make trust_remote_code opt-in in MultimodalModelRunner by @yibinl-nvidia in #12669
  • [None][infra] Bump etcd to 3.6.9 to involve grpc fix by @yuanjingx87 in #12594
  • [https://nvbugs/5658258][fix] Fix OOM with large number of LoRA adapters by @brb-nv in #12815
  • [None][feat] AutoDeploy: Add the Triton kernel for MLA by @nvchenghaoz in #12664
  • [None][fix] replace busy-poll sleep in get_async_noblock with zmq async poller by @edenfunf in #12189
  • [https://nvbugs/6018647][test] Add unit test for Lifecycle Race Condition error in disagg server by @yingguo-trt in #12803
  • [None][infra] Add DSR1 DSV32 K2 Disagg Perf Tests Back by @chenfeiz0326 in #12688
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12765
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12814
  • [None][fix] Fix VLM guided decoding startup crash due to missing vocab_size_padded property by @stefanpantic in #12284
  • [None][fix] Fix Nano chunked prefill by @2ez4bz in #12782
  • [https://nvbugs/6029220][fix] Disable multi-stream in maybe_execute_i… by @liji-nv in #12659
  • [None][test] remove unused tests by @xinhe-nv in #12625
  • [https://nvbugs/6000658][fix] Fix disagg gen-only hang where 10s sleep in can_forward blocks KV transfers and overflows CTX memory by @peihu-nv in #12640
  • [#12593][feat] AutoDeploy: onboard DeepSeek-R1 by @galagam in #12601
  • [#11548][feat] AutoDeploy: Optimize Qwen3.5 perf by @taylor-yb-lee in #12265
  • [None][chore] Set the use_one_model flag to True by default on llm ap… by @nv-guomingz in #12836
  • [https://nvbugs/5921674][fix] unwaive TestNemotronNanoV3 fp8 tests by @tcherckez-nvidia in #12792
  • [None][feat] Add NvTelemetry/GXT-compliant usage telemetry by @venkywonka in #12384
  • [https://nvbugs/5996776][fix] Fix test OOM by @dongfengy in #12856
  • [None][feat] Support loading FP8 LoRA weight files by @achartier in #12848
  • [None][test] check unused fixtures by @xinhe-nv in #12730
  • [TRTLLM-11804][feat] Mechanical refactoring VisualGen API by @zhenhuaw-me in #12807
  • [TRTLLM-11324][perf] Add host performance regression test suite for PyExecutor by @hyukn in #12148
  • [None][chore] unwaive some dis-agg tests by @Shixiaowei02 in #12828
  • [TRTLLM-11707][feat] Add CUDA graph support (torch compile compatible) for LTX-2 by @luyiyun1021 in #12653
  • [https://nvbugs/6055474][test] Fix RTX-6000 with wrong moe backend by @yufeiwu-nv in #12886
  • [None][chore] Waive failing pre-merge test by @brb-nv in #12916
  • [None][docs] Add README for custom Claude Code skills and agents by @kaiyux in #12920
  • [TRTLLM-11421][feat] Support better kv cache statistics monitoring by @eopXD in #12413
  • [https://nvbugs/5448464][fix] Partially fix LoRA overallocation for Nemotron NAS by @brb-nv in #12817
  • [https://nvbugs/5996776][fix] Unwaive tests after fix by @dongfengy in #12906
  • [https://nvbugs/5940463][fix] remove test_cli_flow.py::TestSantacoder case by @QiJune in #12845
  • [None][test] Add Nemotron-3-Super-120B-A12B-NVFP4 func and perf cases on DGX-spark by @JennyLiu-nv in #12830
  • [None][infra] Fix plc nightly pipeline and show more data by @yuanjingx87 in #12940
  • [TRTLLM-11268][feat] Video temporal compression to Nemotron Nano and RADIO by @2ez4bz in #12649
  • [https://nvbugs/5910749][https://nvbugs/5995486][test] Fix Qwen3 skip softmax attention CI tests by @bobboli in #12789
  • [https://nvbugs/6043312][fix] fix_mooncake_transfer_agent_binding by @chuangz0 in #12723
  • [#12699][feat] consolidate piecewise CUDA graph VLM updates by @nvchenghaoz in #12852
  • [TRTLLM-11770][feat] Skip nvfp4 fused norm if the dim doesn't meet the requirement by @pamelap-nvidia in #12901
  • [None][fix] skip inference_mode() when torch.compile=True for gemma3 fp8 by @amukkara in #12367
  • [None][feat] AutoDeploy: Onboard google/gemma-4-31B-it dense model, including nvfp4 by @suyoggupta in #12866
  • [#12634][feat] AutoDeploy: Support rank 256 MLA in flashinfer_mla by @bmarimuthu-nv in #12519
  • [https://nvbugs/5997534][fix] AutoDeploy: Skip Eagle3 One Model Test on pre-Hopper by @govind-ramnarayan in #12757
  • [None][fix] Fix multi_stream_moe accuracy with MLIR and piecewise cudagraphs by @suyoggupta in #12847
  • [https://nvbugs/5961739][fix] Unwaiving failing tests by @greg-kwasniewski1 in #12936
  • [#12954][fix] AutoDeploy: Fix Gemma4 MoE config (disable multi_stream_moe, lower free_gpu_memory_fraction) by @suyoggupta in #12955
  • [TRTLLM-11540][feat] Add EAGLE3 dynamic tree speculative decoding support by @sunnyqgg in #12062
  • [https://nvbugs/6064029][fix] Eliminate double PNG encoding in visual gen serving by @karljang in #12903
  • [TRTLLM-11532][refactor] Unify VisualGen parallelism by @NVShreyas in #12509
  • [None][fix] Fix 'max_batch_size' conflict in AD dashboard script by @tcherckez-nvidia in #12967
  • [TRTLLM-11797][feat] Add cutedsl moe backend supporting for qwen3.5. by @nv-guomingz in #12799
  • [TRTLLM-11315][feat] Extend python cache transceiver to support Qwen-Next by @bo-nv in #12772
  • [https://nvbugs/5991576][test] Add E2E test for PP+disagg+block_reuse+chunked_prefill hang by @yingguo-trt in #12913
  • [None][feat] Align AttentionPlugin with EdgeLLM interface by @nvyocox in #12233
  • [None][infra] Waive 1 failed cases for main in post-merge 2648 by @ZhanruiSunCh in #12975
  • [https://nvbugs/5983390][perf] Reduce host overhead in DSA MLA attent… by @liji-nv in #12631
  • [None][infra] Waive 8 failed cases for main in post-merge 2646 by @ZhanruiSunCh in #12934
  • [None][fix] Unwaive phi4 accuracy tests by @Wanli-Jiang in #12832
  • [None][feat] Add benchmark for all allreduce backend by @yilin-void in #12887
  • [TRTLLM-11893][feat] Convert VisualGenParams to Pydantic with extra_params, per-model defaults, and request validation by @zhenhuaw-me in #12922
  • [None][infra] Waive 4 failed cases for main in pre-merge 33523 by @ZhanruiSunCh in #12977
  • [None][infra] Waive 4 failed cases for main in post-merge by @xinhe-nv in #12973
  • [TRTLLM-11657][feat] Conversation affinity disagg router by @reasonsolo in #12526
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12953
  • [None][feat] optimize GDN prefill with indexed in-kernel state updates by @nv-guomingz in #12791
  • [https://nvbugs/6061812][fix] Unblock ruff check by @VALLIS-NERIA in #12996
  • [None][fix] Update moe hidden_size in communicator for nemotron-h by @Wanli-Jiang in #12890
  • [TRTLLM-11492][fix] Fix benchmark disagg deadlock by eliminating blocking fill loop by @chienchunhung in #12208
  • [TRTLLM-10938][feat] Enable block reuse with overlap scheduler by @chienchunhung in #12816
  • [#12617][feat] Add support for speculative decoding with LoRA by @Funatiq in #12661
  • [None][fix] Fix constrained decoding for GLM5 by @cascade812 in #12869
  • [None][test] Waive two dsv3lite cases due to nvbug 6071081. by @nv-guomingz in #13001
  • [None][feat] Add production-level Prometheus metrics (iteration stats, config info, token counters, phase histograms) by @nvyutwu in #12545
  • [None][infra] Remove invalid test case in waive list by @yuanjingx87 in #13008
  • [None][chore] Fix failing KV Cache Transceiver Tests from #11574 by @ekou24 in #12554
  • [https://nvbugs/6060281][fix] Treat whitespace-only content in nano-v3 reasoning swap by @tijyojwad in #12912
  • [None][feat] KVConnector shorthand paths for "lmcache" and "kvbm" with examples by @sammshen in #12626
  • [#12712][feat] AutoDeploy Model Onboarding Sprint 03/19 - Part 1 (infra only) by @govind-ramnarayan in #12708
  • [https://nvbugs/6059036][fix] AutoDeploy fix registry accuracy tests by @nvchenghaoz in #12942
  • [https://nvbugs/5963665][refactor] Refactor warmup orchestration in ModelEngine by @liji-nv in #12407
  • [https://nvbugs/5973214][fix] unwaive qwen3 ci test by @byshiue in #12237
  • [https://nvbugs/5781383][chore] Unwaive test by @shuyixiong in #12282
  • [None][chore] AutoDeploy: Added Qwen3.5 accuracy test for NVFP4 by @taylor-yb-lee in #13014
  • [None][chore] Unify code path for reuse/non-reuse when adding sequence in kv cache manager by @eopXD in #10437
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #13016
  • [None][chore] Waive failed tests by @yiqingy0 in #13035
  • [TRTLLM-11091][feat] Add tunable nvfp4 quantize with additional FlashInfer backend by @chang-l in #12126
  • [TRTLLM-11540][feat] Revert EAGLE3 dynamic tree speculative decoding support (#12062) by @brb-nv in #13006
  • [None][fix] fix Wan unit tests by @zhenhuaw-me in #13026
  • [None][fix] Update CUTLASS C++ to 4.4.2 by @depaulmillz in #12897
  • [None][chore] Waive failing tests 04/14 by @brb-nv in #13049
  • [None][chore] Unwaive broader test lists by @brb-nv in #13053
  • [None][chore] Update waived test name by @brb-nv in #13058
  • [None][infra] Waive 4 failed cases for main in post-merge 2652 by @ZhanruiSunCh in #13067
  • [None][fix] Fix moe_chunking_tokens during MoE A2A by @Wanli-Jiang in #12929
  • [None][infra] Support nv sa benchmark in CI Perf Test by @chenfeiz0326 in #13004
  • [None][fix] Pin Ray version to 2.54.1 by @shuyixiong in #13071
  • [None][infra] Add K2.5 Perf Tests into CI by @chenfeiz0326 in #12931
  • [https://nvbugs/5846024][fix] Remove waivers by @VALLIS-NERIA in #12979
  • [https://nvbugs/5838178][fix] Fix failing lora test for Llama by @brb-nv in #12950
  • [None][fix] Guard CUDA event elapsed_time in perf_metrics_manager to prevent executor crash by @yifjiang in #12868
  • [None][fix] Pin Ray version to 2.54.1 in slurm CI stage by @shuyixiong in #13085
  • [TRTLLM-11266][feat] Unify image as tensor to avoid multiple converting for nano model by @Wanli-Jiang in #12994
  • [None][fix] Update CODING_GUIDELINES.md to say Python >= 3.10 by @hnover-nv in #13094
  • [None][chore] Remove onboard block switch for KV cache manager by @eopXD in #12449
  • [None][infra] Waive 3 failed cases for main in post-merge 2652 by @ZhanruiSunCh in #13070
  • [None][infra] Exclude QA nodes when running TRTLLM CI by @yuanjingx87 in #13102
  • [None][fix] Remove leftover onboardBlocks param in kvCacheManagerTest by @eopXD in #13107
  • [None][infra] Waive 1 failed cases for main in post-merge 2653 by @ZhanruiSunCh in #13109
  • [TRTLLM-11990][infra] Move PY312-UB2404 sanityCheck test to A100X node by @yiqingy0 in #13077
  • [None][chore] Bump version to 1.3.0rc12 by @VALLIS-NERIA in #13129

New Contributors

  • @stefanpantic made their first contribution in #12284
  • @nvyutwu made their first contribution in #12545
  • @sammshen made their first contribution in #12626
  • @depaulmillz made their first contribution in #12897

Full Changelog: v1.3.0rc11...v1.3.0rc12
