GitHub: NVIDIA/TensorRT-LLM v1.3.0rc9

Pre-release · 8 hours ago

Highlights

  • Model Support
    • Add Qwen3-next attention DP support (#10218)
    • Improve DeepSeek-V3.2 NVFP4 indexer GEMMs and routing kernels (#11989, #12055)
    • Support KV cache and speculative decoding in the Trtllm-Gen attention backend (#11667, #12267)
    • Add audio support and chunked-prefix enablement for Nemotron models (#12191, #12414)
    • Add GLM 5 support and fix DSA MTP issues (#11990)
    • Add initial Qwen3.5 text model support for the PyTorch backend with BF16/FP8 (#12242)
  • API
    • Add energy metrics to trtllm-serve and benchmarking workflows (#11855)
    • Expose video_pruning_rate in LlmArgs and improve Nano V2 VL handling (#12194)
    • Add TLLM_PROFILE_LOG_RANKS to control per-rank step logging (#12263)
    • Improve the serve CLI with renamed flags and mm_embedding_serve enhancements (#12105)
    • Add an auto option for tool and reasoning parsers (#12104)
    • Support interleaved thinking in trtllm-serve (#12199)
    • BREAKING: Set the default KV cache transfer timeout to 60 seconds (#12249)
  • Feature
    • Add FP8 combine support in moe_a2a (#11844)
    • Add batch generation support to visual generation pipelines (#12121)
    • Improve request management in the sampler (#11861)
    • Add fused AllReduce + RMSNorm with optional residual support (#12201)
    • Add constraint-based memory partitioning and a Python scheduler for KVCacheManagerV2 (#12212, #11939)
    • Add LM head sharding (#12252)
    • Add an interactive recipe selector with curated configs and button-grid UI (#11917)
    • Improve DSA and FlashMLA performance with new kernel fusions and cached tile-scheduler metadata (#12322, #12161)
    • Improve model performance with CuteDSL indexer_top_k, FlashInfer MLP activation, and refined KV cache buffer sizing (#12236, #12131, #12274)
  • Fix
    • Fix disaggregated perf test result generation, env export, and port allocation issues (#12211, #12140)
    • Fix harmony and tool-calling parsers for agentic coding use cases (#12045)
    • Fix torch.compile compatibility by routing DSA attention through the MLA custom op (#12186)
    • Fix min_tokens handling for long prompts and return explicit scheduling errors when requests cannot be placed (#12166, #12206)
    • Fix KV cache V2 OOMs and weight-loading OOMs in disaggregated serving (#12188, #12377)
    • Fix lost requests, dummy-request crashes, and GUIDE_TYPE_STRUCTURAL_TAG handling in request management paths (#12197, #12403, #12330)
    • Fix W4A16 AWQ bias handling on SM100 and add bias support to WeightOnlyQuantLinearMethod (#12190, #12317)
    • Fix MiniMax model loading and multimodal loading error propagation (#12182, #12331)
    • Fix MTP/DSA reliability, PARD accuracy, and NVFP4 MoE mixed-precision scales (#12010, #12360, #12240)
    • Fix DGX Spark multi-node hangs, cross-node rollout issues in Verl, and CUDA_VISIBLE_DEVICES propagation in scripts (#12316, #11924, #12370)
    • Fix build and runtime issues for SM103 context-attention kernels, L40s IB transfers, LlavaNext dtype fallback, and MnnvlMemory resource cleanup (#12248, #12152, #12169, #11979)
    • Add warmups to avoid AIPerf timeouts and I2V torch.compile recompilation (#12178, #12351)
    • Pre-cache aesthetic predictor weights to avoid VBench 429 failures (#12127)
  • Documentation
    • Add the NVLink one-sided AlltoAll blog post and improve tech blog sequencing and links (#12195, #12386, #12425)
    • Update the Nemotron 3 Super deployment guide for tool calling and reasoning parsers (#12215)
    • Update the README and other developer-facing documentation (#12307, #12258)
  • Test & Infra
    • Limit pre-merge pre-commit checks to changed files (#11379)
    • Use CPU affinity instead of raw CPU count for default build parallelism (#12167)
    • Add broader performance, accuracy, and end-to-end coverage for Nemotron, DeepSeek-V3.2, disaggregated serving, FLUX, and DSA host-cache offload (#12184, #12142, #12275, #12279, #12278, #12153)
    • Update multi-node and MPI-related test coverage (#12075, #12300)
    • Add SSH key authentication support for SLURM clusters (#12172)
    • Use the public PyTorch index as a CI fallback and update the CI allowlist (#12261, #12296)
    • Enable type checking for sampler modules and improve Python KV transceiver coverage (#11678, #11574)
    • Remove outdated QA coverage and refactor benchmarking and test infrastructure (#12277, #12344, #12124, #11720, #12192)
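
One of the highlighted API additions, the TLLM_PROFILE_LOG_RANKS environment variable (#12263), controls which ranks emit per-rank step logging. The release notes do not spell out the variable's format; assuming a comma-separated list of rank IDs (an assumption, not confirmed here), the gating pattern can be sketched in plain Python:

```python
import os

def should_log(rank: int, env_var: str = "TLLM_PROFILE_LOG_RANKS") -> bool:
    """Sketch of rank-filtered step logging.

    Returns True if `rank` appears in the env var (assumed to be a
    comma-separated list of rank IDs), or if the var is unset/empty
    (no filter: log on every rank). The actual TensorRT-LLM semantics
    may differ; see #12263 for the real implementation.
    """
    raw = os.environ.get(env_var, "").strip()
    if not raw:
        return True  # no filter configured: log everywhere
    allowed = {int(r) for r in raw.split(",") if r.strip()}
    return rank in allowed

# Example: restrict step logging to ranks 0 and 4 out of 8.
os.environ["TLLM_PROFILE_LOG_RANKS"] = "0,4"
print([r for r in range(8) if should_log(r)])  # → [0, 4]
```

This kind of opt-in filter keeps multi-node step logs readable: without it, every rank writes the same per-step profile lines, which quickly drowns the output at high world sizes.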

What's Changed

  • [TRTLLM-10929][feat] add fp8 combine in moe_a2a by @dc3671 in #11844
  • [TRTLLM-9767][feat] Enable attention dp for qwen3-next. by @nv-guomingz in #10218
  • [None][fix] Fix Disagg Perf Test No result.xml Bug by @chenfeiz0326 in #12211
  • [https://nvbugs/5955188][fix] Fix harmony parsers for agentic coding use cases by @dongfengy in #12045
  • [https://nvbugs/5973536][fix] Route DSA attention through MLA custom op for torch.compile compatibility by @yizhang-nv in #12186
  • [https://nvbugs/5823135][fix] Fix min_tokens not respected when prompt is long by @JunyiXu-nv in #12166
  • [None][doc] Blog18 for NVLinkOneSided AlltoAll. by @bobboli in #12195
  • [None][chore] Remove closed bugs by @xinhe-nv in #12222
  • [None][fix] Fix KV cache V2 OOM with separate draft KV cache (EAGLE3/MTP) by @yizhang-nv in #12188
  • [None][doc] AutoDeploy: ad-model-onboard skill updates by @bmarimuthu-nv in #12234
  • [TRTLLM-10569][infra] Only check the changed files in pre-commit in pre-merge CI by @yiqingy0 in #11379
  • [https://nvbugs/5948878][fix] fix lost requests by @bo-nv in #12197
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12218
  • [None][chore] fix deepep trtllm backend MXFP4 by @leslie-fang25 in #12219
  • [None][chore] Alltoall benchmark script refine (second time). by @bobboli in #12192
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12220
  • [None][fix] Fix W4A16 AWQ bias not applied on SM100 (Blackwell) by @Tracin in #12190
  • [None][fix] Export computed env vars to env_vars.json and fix port allocation in disagg benchmark by @qiaoxj07 in #12140
  • [TRTLLM-11288][fix] Adapt LTX2 pipeline to CompilationConfig warmup interface by @luyiyun1021 in #12232
  • [https://nvbugs/5955927][fix] Add warm up before aiperf to fix timeout issue. by @dominicshanshan in #12178
  • [None][refactor] Improve request management in sampler by @Funatiq in #11861
  • [None][chore] Use affinity rather than CPU count for default build parallelism by @achartier in #12167
  • [None][feat] Support kv cache in Trtllm-Gen attention backend by @yihwang-nv in #11667
  • [None][docs] Update nemotron 3 super deployment to include tool calling and reasoning parser by @tijyojwad in #12215
  • [None][fix] Add more models to increase perf test coverage by @chenfeiz0326 in #12184
  • [TRTLLM-9521][feat] Unfuse indexer.wk from attention GEMM for DS-V3.2 NVFP4 by @peihu-nv in #11989
  • [https://nvbugs/5879588][fix] fix MiniMax model loading bugs by @jmydurant in #12182
  • [TRTLLM-10333][feat] Add energy metrics in trtllm-serve and benchmark… by @JunyiXu-nv in #11855
  • [None][test] Update nemotron super test cases with official ckpt. by @nv-guomingz in #12142
  • [None][fix] Reliability fixes for MTP with DSA and support host cache offload for DSA by @dmtri35 in #12010
  • [None][infra] Waive 5 failed cases for main in post-merge 2599 by @ZhanruiSunCh in #12283
  • [None][infra] use public torch index as CI backup by @tburt-nv in #12261
  • [TRTLLM-11362][feat] Add batch generation support to visual gen pipelines by @karljang in #12121
  • [https://nvbugs/5973801][fix] exclude subproc_worker_timer from thread leak checks by @MrGeva in #12286
  • [#11432][feat] AutoDeploy: Enable fp8 quantization fusion part 1 by @galagam in #11910
  • [#10931][feat] AutoDeploy: one-model spec dec by @lucaslie in #11701
  • [https://nvbugs/5973536][fix] Add NVFP4+FP8KV+MTP accuracy specs for DeepSeek-V3.2-Exp by @yizhang-nv in #12269
  • [#11368][fix] FP4 CUTLASS GEMM shared memory overflow on GB10 (SM121) by @mihai-chiorean in #12141
  • [TRTLLM-11267][feat] Add audio support for nemotron by @2ez4bz in #12191
  • [None][feat] GLM 5 support and DSA MTP fixes by @NVShreyas in #11990
  • [None][fix] Relax MoE test tolerance for fp16 TP mode accuracy mismatch by @xxi-nv in #12244
  • [None][test] update function multi nodes test by @xinhe-nv in #12075
  • [TRTLLM-11285][feat] Fuse indexer wk + weights_proj into single GEMM in TF32 for DS-V3.2 by @peihu-nv in #12055
  • [None][docs] Fix AGENTS.md accuracy and reduce context bloat by @kaiyux in #12258
  • [None][doc] Update README. by @bobboli in #12307
  • [None][test] Add E2E logprobs test for disaggregated serving via OpenAI API by @yingguo-trt in #12275
  • [https://nvbugs/5981841][fix] AutoDeploy: Disable match_swiglu_pattern for Llama 3.3 70B Instruct by @govind-ramnarayan in #12299
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12293
  • [https://nvbugs/5969726][fix] exclude IB transfer on L40s by @chuangz0 in #12152
  • [TRTLLM-9019][feat] Expose video_pruning_rate as llmargs and fix nano-v2-vl by @Wanli-Jiang in #12194
  • [TRTLLM-11517][feat] Add TLLM_PROFILE_LOG_RANKS env var to control per-rank step logging by @longlee0622 in #12263
  • [None][chore] Bump version to 1.3.0rc9 by @yuanjingx87 in #12295
  • [TRTINFRA-7698][infra] - Add SSH key authentication support for SLURM clusters by @mlefeb01 in #12172
  • [None][infra] Update CI allowedlist by @yuanjingx87 in #12296
  • [TRTLLM-8804][chore] enable type checking for sampler modules by @ixlmar in #11678
  • [None][feat] Add fused allreduce+RMSNorm op and optional residual in … by @lfr-0531 in #12201
  • [https://nvbugs/5969206][fix] BREAKING: Setting default value of KV cache transfer timeout to 60s by @pcastonguay in #12249
  • [None][infra] PLC nightly source code scanning by @yuanjingx87 in #12124
  • [None][fix] LlavaNext dtype fallback when text_config.torch_dtype is None by @indrajit96 in #12169
  • [#11694][feat] AutoDeploy: Improve the piecewise CG memory usage by @nvchenghaoz in #11993
  • [https://nvbugs/5979443][chore] Refine the trtllm MoE unit test by @leslie-fang25 in #12318
  • [TRTLLM-11257][fix] release GPU memory and FDs in MnnvlMemory on pidfd failure to prevent leak by @zhaoyangwang-nvidia in #11979
  • [None][test] Fix mpi-type issue and add wideep acc test to dev's l0 local flow by @fredricz-20070104 in #12300
  • [None][fix] Fix the issue of excluding all context attention kernels when building for sm103 by @yifeizhang-c in #12248
  • [None][infra] Waive 4 failed cases for main in post-merge 2603 by @ZhanruiSunCh in #12334
  • [https://nvbugs/5937478][test] Add RCCA test for DeepSeek-V3.2 multi-turn tool_call encoding by @crazydemo in #12279
  • [https://nvbugs/5389100][test] Remove TensorRT integration test list and add trtllm-serve for test_perf.py by @yufeiwu-nv in #12277
  • [#11526][chore] AutoDeploy accuracy tests: use nemotron-3 official checkpoints by @galagam in #12243
  • [TRTLLM-10407][perf] Enable CuteDSL indexer_top_k in model by @limin2021 in #12236
  • [None][test] Add DSA host cache offload tests to CI and QA test lists by @longlee0622 in #12278
  • [TRTLLM-10076][feat] Serve CLI improvements: renames, new flags, and mm_embedding_serve enhancements by @JunyiXu-nv in #12105
  • [None][chore] Refine kv cache buffer calculation by @yihwang-nv in #12274
  • [None][feat] Constraint-based memory partitioning to KVCacheManagerV2 by @lowsfer in #12212
  • [None][infra] Waive 5 failed cases for main in post-merge 2604 by @ZhanruiSunCh in #12345
  • [None][feat] Enable speculative decoding in TrtllmGen attention backend by @yihwang-nv in #12267
  • [https://nvbugs/5893116][fix] fix disagg llama oom by @chuangz0 in #12281
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12328
  • [https://nvbugs/5808603][fix] Add bias support to WeightOnlyQuantLinearMethod by @stnie in #12317
  • [https://nvbugs/5949524][fix] Fix hang issue on DGX-Spark multinode by @JennyLiu-nv in #12316
  • [None][chore] Improved test coverage for Python KV Transceiver by @ekou24 in #11574
  • [#10607][feat] added AutoDeploy serving perf test with Super test by @MrGeva in #12287
  • [#12183][fix] Fix TRTLLM-Gen NVFP4 MoE scales for mixed-precision che… by @tcherckez-nvidia in #12240
  • [TRTLLM-11358][test] Add trtllm-serve e2e tests for FLUX by @JunyiXu-nv in #12153
  • [None][perf] enable flashinfer mlp activation and fix piecewise graph for gemma3-1B by @amukkara in #12131
  • [https://nvbugs/5875031][fix] Compile XQA with sm_120f by @pamelap-nvidia in #12170
  • [None][fix] Properly raise errors from multimodal loading by @2ez4bz in #12331
  • [#11992][fix] Handle GUIDE_TYPE_STRUCTURAL_TAG in gRPC request manager by @CatherineSue in #12330
  • [TRTLLM-10688][fix] fix cross-node rollout issues in verl by @hchings in #11924
  • [None][fix] Relax W8A16 MoE test tolerance for DTP mode by @xxi-nv in #12335
  • [https://nvbugs/5964329][fix] fix PARD accuracy issue by @cascade812 in #12360
  • [None][fix] Pass CUDA_VISIBLE_DEVICES as script arg instead of srun --export by @qiaoxj07 in #12370
  • [None][fix] return an explicit error if the requests can't be schedul… by @Tabrizian in #12206
  • [None][feat] Initial Qwen3.5 text model support for PyT backend (BF16/FP8) by @rosenrodt in #12242
  • [https://nvbugs/5725811][test] Remove outdated llama-v4 and ministral-8b models out of QA scope by @yufeiwu-nv in #12344
  • [TRTLLM-10077][feat] Add 'auto' option for tool and reasoning parsers by @JunyiXu-nv in #12104
  • [https://nvbugs/5814350][fix] Fix OOM killed during weight loading in disaggregated server by @yingguo-trt in #12377
  • [TRTLLM-11357][feat] Support interleaved thinking for trtllm-serve by @JunyiXu-nv in #12199
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12363
  • [None][doc] Optimize the tech blog sequence. by @nv-guomingz in #12386
  • [TRTLLM-12250][feat] added lm head sharding by @greg-kwasniewski1 in #12252
  • [TRTLLM-9523][chore] Refactor the transfer logic (step 6) by @Shixiaowei02 in #12231
  • [https://nvbugs/5895249][fix] Update test waives by @greg-kwasniewski1 in #12247
  • [TRTLLM-11497][fix] Add I2V warmup to prevent torch.compile recompilation by @luyiyun1021 in #12351
  • [TRTLLM-11287][feat] Implement python based scheduler for KVCacheManagerV2 by @lancelly in #11939
  • [https://nvbugs/5961414][fix] Pre-cache aesthetic predictor weights to avoid VBench 429 errors by @chang-l in #12127
  • [TRTLLMINF-10][chore] move repeated apt-get installs into tritondevel Docker … by @dpitman-nvda in #11720
  • [https://nvbugs/5991576][fix] fix dummy request crash with PP + ADP + disagg + block reuse by @Tabrizian in #12403
  • [None][feat] Interactive recipe selector with curated configs and button-grid UI by @venkywonka in #11917
  • [TRTLLM-11587][feat] Enable chunked prefix for Nemotron models on sm120 by @pamelap-nvidia in #12414
  • [https://nvbugs/5983390][perf] Kernel fusions in _gather_k_cache_for_chunk of Indexer in DSA by @hyukn in #12322
  • [None][perf] Cache FlashMLA tile-scheduler metadata across attention layers by @bobboli in #12161
  • [None][doc] Fix invalid links in tech blogs. by @nv-guomingz in #12425

Full Changelog: v1.3.0rc8...v1.3.0rc9
