Highlights
Model Support
- Add LTX-2 two-stage pipeline support (#12361)
- Add CUDA graph support for LTX-2 with `torch.compile` compatibility (#12653)
- Add video temporal compression for Nemotron Nano and RADIO (#12649)
- Extend the Python cache transceiver to support Qwen-Next (#12772)
- Add CuteDSL MoE backend support for Qwen3.5 (#12799)
- Fix LoRA support for Qwen3 models (#12785)
- Support loading FP8 LoRA weight files (#12848)
- Add support for speculative decoding with LoRA (#12661)
- Fix OOM with large numbers of LoRA adapters (#12815)
- Partially fix LoRA overallocation for Nemotron NAS (#12817)
- Skip `inference_mode()` when `torch.compile=True` for Gemma3 FP8 (#12367)
- Skip NVFP4 fused norm when the dimension does not meet requirements (#12901)
- Update MoE `hidden_size` in the communicator for Nemotron-H (#12890)
- Unify image-as-tensor handling to avoid repeated conversions for nano models (#12994)
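The NVFP4 fused-norm skip (#12901) guards the fused path against shapes the quantization kernel cannot handle. A minimal sketch of that kind of check; the block size and helper name here are assumptions for illustration, not the actual kernel code:

```python
NVFP4_BLOCK = 16  # assumed quantization block size, for illustration only


def can_use_fused_nvfp4_norm(hidden_dim: int) -> bool:
    # Fused norm+quantize kernels typically require the hidden dimension to
    # be a multiple of the quantization block; otherwise the caller should
    # fall back to the unfused norm-then-quantize path.
    return hidden_dim % NVFP4_BLOCK == 0


assert can_use_fused_nvfp4_norm(4096)      # divisible: fused path is safe
assert not can_use_fused_nvfp4_norm(4100)  # not divisible: fall back
```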
API
- Refine the VisualGen API structure (#12807)
- Convert `VisualGenParams` to Pydantic with request validation, per-model defaults, and `extra_params` support (#12922)
- Align `AttentionPlugin` with the EdgeLLM interface (#12233)
- Add shorthand `KVConnector` paths for `lmcache` and `kvbm` (#12626)
- Add the missing `allow_partial_loading` parameter to CuteDSL and ConfigurableMoE `load_weights` (#12761)
- Improve KV cache statistics monitoring (#12413)
Feature
- Add NvTelemetry/GXT-compliant usage telemetry (#12384)
- Add production-level Prometheus metrics for iteration stats, config info, token counters, and phase histograms (#12545)
- Add conversation-affinity routing for disaggregated serving (#12526)
- Enable block reuse with the overlap scheduler (#12816)
- Unify VisualGen parallelism (#12509)
- Consolidate piecewise CUDA graph VLM updates (#12852)
- Add tunable NVFP4 quantization with an additional FlashInfer backend (#12126)
- Optimize GDN prefill with indexed in-kernel state updates (#12791)
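Conversation-affinity routing (#12526) pins all turns of a conversation to the same generation worker so its KV cache stays warm across requests. A minimal sketch of the idea using a hypothetical hash-based router, not the actual implementation:

```python
import hashlib


def affinity_route(conversation_id: str, workers: list[str]) -> str:
    """Hypothetical helper: map a conversation id to a stable worker.

    Hashing the id (rather than round-robin) means every turn of one
    conversation lands on the same worker, so that worker's KV cache for
    the conversation can be reused instead of recomputed.
    """
    digest = hashlib.sha256(conversation_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


workers = ["gen-0", "gen-1", "gen-2"]
first = affinity_route("conv-42", workers)
# Repeated requests for the same conversation always hit the same worker.
assert all(affinity_route("conv-42", workers) == first for _ in range(10))
```

A production router would also need to handle worker membership changes (e.g. via consistent hashing) so that scaling the pool does not remap every conversation.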
Fix
- Propagate `disaggregated_params` through `PostprocWorker` (#12513)
- Prebuild disaggregated context responses to avoid `ctx_request_id` races (#12466)
- Generate HMAC keys for MGMN IPC servers in disaggregated serving (#12670)
- Enable HMAC authentication in VisualGen ZMQ IPC channels (#12680)
- Fix disaggregated gen-only hangs caused by blocking KV transfers (#12640)
- Replace busy-poll sleep in `get_async_noblock` with the ZMQ async poller (#12189)
- Make `trust_remote_code` opt-in in `MultimodalModelRunner` (#12669)
- Fix VLM guided decoding startup crashes caused by missing `vocab_size_padded` (#12284)
- Eliminate double PNG encoding in visual generation serving (#12903)
- Treat whitespace-only content correctly in nano-v3 reasoning swap (#12912)
- Clamp `usedNumBlocks` to non-negative values in KV cache statistics (#11922)
- Fix `moe_chunking_tokens` handling during MoE A2A (#12929)
- Guard CUDA event `elapsed_time` in `perf_metrics_manager` to prevent executor crashes (#12868)
- Remove leftover `onboardBlocks` parameters in `kvCacheManagerTest` (#13107)
- Add CUDA device setup before `load_remote_agent` (#12619)
- Fix Mooncake transfer agent binding (#12723)
- Fix `multi_stream_moe` accuracy with MLIR and piecewise CUDA graphs (#12847)
- Fix Nano chunked prefill (#12782)
- Fix constrained decoding for GLM5 (#12869)
- Fix benchmark disaggregated deadlocks by removing a blocking fill loop (#12208)
- Update CUTLASS C++ to 4.4.2 (#12897)
- Pin Ray to 2.54.1 (#13071)
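The busy-poll fix (#12189) swaps a sleep-and-recheck loop for an event-driven wait. The real change uses `zmq.asyncio`'s poller; the pattern can be sketched with stdlib asyncio (function and variable names below are illustrative, not the project's actual code):

```python
import asyncio


async def get_busy_poll(queue_has_item, get_item, interval=0.01):
    # Anti-pattern: wake on a timer and check, burning CPU and adding
    # up to `interval` of latency per item.
    while not queue_has_item():
        await asyncio.sleep(interval)
    return get_item()


async def get_event_driven(event: asyncio.Event, get_item):
    # Event-driven: the waiter is woken exactly when an item arrives,
    # analogous to awaiting socket readiness via zmq.asyncio.Poller.
    await event.wait()
    return get_item()


async def demo():
    items = []
    event = asyncio.Event()

    async def producer():
        await asyncio.sleep(0.05)
        items.append("payload")
        event.set()  # wakes the consumer immediately

    consumer = asyncio.create_task(get_event_driven(event, items.pop))
    await producer()
    return await consumer


result = asyncio.run(demo())
print(result)  # payload
```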
Documentation
- Add an attention developer guide (#12693)
- Add a README for the custom Claude Code skills and agents (#12920)
Benchmark
- Optimize the Qwen3.5 decode delta kernel (#12740)
- Reduce host overhead in DSA MLA attention (#12631)
- Add a host performance regression test suite for PyExecutor (#12148)
- Add benchmark coverage for allreduce backends (#12887)
- Restore DSR1/DSV32/K2 disaggregated performance tests (#12688)
- Support NV SA benchmarks in CI performance testing (#13004)
- Add K2.5 performance tests into CI (#12931)
Test & Infra
- Update Perf Sanity System code paths (#12430)
- Bump etcd to 3.6.9 to pick up the gRPC fix (#12594)
- Fix the PLC nightly pipeline and expose more pipeline data (#12940)
- Exclude QA nodes when running TRTLLM CI (#13102)
- Add a unit test for lifecycle race condition errors in disaggregated serving (#12803)
- Add an end-to-end test for PP + disagg + block reuse + chunked prefill hangs (#12913)
- Add Nemotron-3-Super-120B-A12B-NVFP4 functional and performance cases on DGX Spark (#12830)
- Remove obsolete RTX-6000 OOM tests (#12800)
- Remove unused tests (#12625)
- Check unused fixtures (#12730)
- Fix Qwen3 skip-softmax attention CI tests (#12789)
- Fix failing KV cache transceiver tests from the perf sanity changes (#12554)
- Fix Wan unit tests (#13026)
- Remove obsolete waivers (#12979)
- Move the `PY312-UB2404` sanity check test to A100X nodes (#13077)
- Pin Ray to 2.54.1 in the Slurm CI stage (#13085)
What's Changed
- [None][test] Unwaive Nemotron H flaky case by @nv-guomingz in #11236
- [https://nvbugs/5997543][fix] unwaive test_disaggregated_overlap_transceiver_runtime_python by @chuangz0 in #12580
- [TRTLLM-11574][feat] Some updates on Perf Sanity System codes by @chenfeiz0326 in #12430
- [None][doc] add attention developer guide by @QiJune in #12693
- [https://nvbugs/5991957][fix] Propagate disaggregated_params through PostprocWorker by @peihu-nv in #12513
- [https://nvbugs/5883590][fix] Generate HMAC key for MGMN IPC server in disaggregated serving by @yibinl-nvidia in #12670
- [https://nvbugs/5941242][fix] Fix SigLIP test failure by @tijyojwad in #12717
- [None][feat] Optimize qwen3.5 decode delta kernel by @nv-guomingz in #12740
- [https://nvbugs/5961736][fix] Prebuild disagg ctx response to avoid ctx_request_id race by @peihu-nv in #12466
- [https://nvbugs/5922880][fix] Enable HMAC authentication in VisualGen ZMQ IPC channels by @yibinl-nvidia in #12680
- [None][fix] Add missing allow_partial_loading param to CuteDSL and ConfigurableMoE load_weights by @qiaoxj07 in #12761
- [None][chore] Waive hanging Nemotron Super test by @brb-nv in #12821
- [None][fix] add cuda set device before load_remote_agent by @chuangz0 in #12619
- [None][chore] Remove closed bugs by @xinhe-nv in #12766
- [None][test] Remove RTX-6000 OOM test cases by @yufeiwu-nv in #12800
- [None][fix] Fix LoRA support for Qwen3 models by @achartier in #12785
- [TRTLLM-11343][feat] LTX-2 Two Stage pipeline support by @yibinl-nvidia in #12361
- [#12808][feat] AutoDeploy: Add Gemma4 Support by @bmarimuthu-nv in #12710
- [None][feat] Add Claude Code agents and skills for kernel dev, perf analysis, and compilation by @kaiyux in #12831
- [#11879][fix] Clamp usedNumBlocks to non-negative in KV cache stats by @wojciech-wais in #11922
- [https://nvbugs/6029864][fix] Fix flaky ray test failure by @brb-nv in #12697
- [https://nvbugs/5813192][fix] Make trust_remote_code opt-in in MultimodalModelRunner by @yibinl-nvidia in #12669
- [None][infra] Bump etcd to 3.6.9 to involve grpc fix by @yuanjingx87 in #12594
- [https://nvbugs/5658258][fix] Fix OOM with large number of LoRA adapters by @brb-nv in #12815
- [None][feat] AutoDeploy: Add the Triton kernel for MLA by @nvchenghaoz in #12664
- [None][fix] replace busy-poll sleep in get_async_noblock with zmq async poller by @edenfunf in #12189
- [https://nvbugs/6018647][test] Add unit test for Lifecycle Race Condition error in disagg sever by @yingguo-trt in #12803
- [None][infra] Add DSR1 DSV32 K2 Disagg Perf Tests Back by @chenfeiz0326 in #12688
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12765
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12814
- [None][fix] Fix VLM guided decoding startup crash due to missing vocab_size_padded property by @stefanpantic in #12284
- [None][fix] Fix Nano chunked prefill by @2ez4bz in #12782
- [https://nvbugs/6029220][fix] Disable multi-stream in maybe_execute_i… by @liji-nv in #12659
- [None][test] remove unused tests by @xinhe-nv in #12625
- [https://nvbugs/6000658][fix] Fix disagg gen-only hang where 10s sleep in can_forward blocks KV transfers and overflows CTX memory by @peihu-nv in #12640
- [#12593][feat] AutoDeploy: onboard DeepSeek-R1 by @galagam in #12601
- [#11548][feat] AutoDeploy: Optimize Qwen3.5 perf by @taylor-yb-lee in #12265
- [None][chore] Set the use_one_model flag to True by default on llm ap… by @nv-guomingz in #12836
- [https://nvbugs/5921674][fix] unwaive TestNemotronNanoV3 fp8 tests by @tcherckez-nvidia in #12792
- [None][feat] Add NvTelemetry/GXT-compliant usage telemetry by @venkywonka in #12384
- [https://nvbugs/5996776][fix] Fix test OOM by @dongfengy in #12856
- [None][feat] Support loading FP8 LoRA weight files by @achartier in #12848
- [None][test] check unused fixtures by @xinhe-nv in #12730
- [TRTLLM-11804][feat] Mechanical refactoring VisualGen API by @zhenhuaw-me in #12807
- [TRTLLM-11324][perf] Add host performance regression test suite for PyExecutor by @hyukn in #12148
- [None][chore] unwaive some dis-agg tests by @Shixiaowei02 in #12828
- [TRTLLM-11707][feat] Add CUDA graph support (torch compile compatible) for LTX-2 by @luyiyun1021 in #12653
- [https://nvbugs/6055474][test] Fix RTX-6000 with wrong moe backend by @yufeiwu-nv in #12886
- [None][chore] Waive failing pre-merge test by @brb-nv in #12916
- [None][docs] Add README for custom Claude Code skills and agents by @kaiyux in #12920
- [TRTLLM-11421][feat] Support better kv cache statistics monitoring by @eopXD in #12413
- [https://nvbugs/5448464][fix] Partially fix LoRA overallocation for Nemotron NAS by @brb-nv in #12817
- [https://nvbugs/5996776][fix] Unwaive tests after fix by @dongfengy in #12906
- [https://nvbugs/5940463][fix] remove test_cli_flow.py::TestSantacoder case by @QiJune in #12845
- [None][test] Add Nemotron-3-Super-120B-A12B-NVFP4 func and perf cases on DGX-spark by @JennyLiu-nv in #12830
- [None][infra] Fix plc nightly pipeline and show more data by @yuanjingx87 in #12940
- [TRTLLM-11268][feat] Video temporal compression to Nemotron Nano and RADIO by @2ez4bz in #12649
- [https://nvbugs/5910749][https://nvbugs/5995486][test] Fix Qwen3 skip softmax attention CI tests by @bobboli in #12789
- [https://nvbugs/6043312][fix] fix_mooncake_transfer_agent_binding by @chuangz0 in #12723
- [#12699][feat] consolidate piecewise CUDA graph VLM updates by @nvchenghaoz in #12852
- [TRTLLM-11770][feat] Skip nvfp4 fused norm if the dim doesn't meet the requirement by @pamelap-nvidia in #12901
- [None][fix] skip inference_mode() when torch.compile=True for gemma3 fp8 by @amukkara in #12367
- [None][feat] AutoDeploy: Onboard google/gemma-4-31B-it dense model, including nvfp4 by @suyoggupta in #12866
- [#12634][feat] AutoDeploy: Support rank 256 MLA in flashinfer_mla by @bmarimuthu-nv in #12519
- [https://nvbugs/5997534][fix] AutoDeploy: Skip Eagle3 One Model Test on pre-Hopper by @govind-ramnarayan in #12757
- [None][fix] Fix multi_stream_moe accuracy with MLIR and piecewise cudagraphs by @suyoggupta in #12847
- [https://nvbugs/5961739][fix] Unwaiving failing tests by @greg-kwasniewski1 in #12936
- [#12954][fix] AutoDeploy: Fix Gemma4 MoE config (disable multi_stream_moe, lower free_gpu_memory_fraction) by @suyoggupta in #12955
- [TRTLLM-11540][feat] Add EAGLE3 dynamic tree speculative decoding support by @sunnyqgg in #12062
- [https://nvbugs/6064029][fix] Eliminate double PNG encoding in visual gen serving by @karljang in #12903
- [TRTLLM-11532][refactor] Unify VisualGen parallelism by @NVShreyas in #12509
- [None][fix] Fix 'max_batch_size' conflict in AD dashboard script by @tcherckez-nvidia in #12967
- [TRTLLM-11797][feat] Add cutedsl moe backend supporting for qwen3.5. by @nv-guomingz in #12799
- [TRTLLM-11315][feat] Extend python cache transceiver to support Qwen-Next by @bo-nv in #12772
- [https://nvbugs/5991576][test] Add E2E test for PP+disagg+block_reuse+chunked_prefill hang by @yingguo-trt in #12913
- [None][feat] Align AttentionPlugin with EdgeLLM interface by @nvyocox in #12233
- [None][infra] Waive 1 failed cases for main in post-merge 2648 by @ZhanruiSunCh in #12975
- [https://nvbugs/5983390][perf] Reduce host overhead in DSA MLA attent… by @liji-nv in #12631
- [None][infra] Waive 8 failed cases for main in post-merge 2646 by @ZhanruiSunCh in #12934
- [None][fix] Unwaive phi4 accuracy tests by @Wanli-Jiang in #12832
- [None][feat] Add benchmark for all allreduce backend by @yilin-void in #12887
- [TRTLLM-11893][feat] Convert VisualGenParams to Pydantic with extra_params, per-model defaults, and request validation by @zhenhuaw-me in #12922
- [None][infra] Waive 4 failed cases for main in pre-merge 33523 by @ZhanruiSunCh in #12977
- [None][infra] Waive 4 failed cases for main in post-merge by @xinhe-nv in #12973
- [TRTLLM-11657][feat] Conversation affinity disagg router by @reasonsolo in #12526
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #12953
- [None][feat] optimize GDN prefill with indexed in-kernel state updates by @nv-guomingz in #12791
- [https://nvbugs/6061812][fix] Unblock ruff check by @VALLIS-NERIA in #12996
- [None][fix] Update moe hidden_size in communicator for nemotron-h by @Wanli-Jiang in #12890
- [TRTLLM-11492][fix] Fix benchmark disagg deadlock by eliminating blocking fill loop by @chienchunhung in #12208
- [TRTLLM-10938][feat] Enable block reuse with overlap scheduler by @chienchunhung in #12816
- [#12617][feat] Add support for speculative decoding with LoRA by @Funatiq in #12661
- [None][fix] Fix contrained decoding for GLM5 by @cascade812 in #12869
- [None][test] Waive two dsv3lite cases due to nvbug 6071081. by @nv-guomingz in #13001
- [None][feat] Add production-level Prometheus metrics (iteration stats, config info, token counters, phase histograms) by @nvyutwu in #12545
- [None][infra] Remove invalid test case in waive list by @yuanjingx87 in #13008
- [None][chore] Fix failing KV Cache Transceiver Tests from #11574 by @ekou24 in #12554
- [https://nvbugs/6060281][fix] Treat whitespace-only content in nano-v3 reasoning swap by @tijyojwad in #12912
- [None][feat] KVConnector shorthand paths for "lmcache" and "kvbm" with examples by @sammshen in #12626
- [#12712][feat] AutoDeploy Model Onboarding Sprint 03/19 - Part 1 (infra only) by @govind-ramnarayan in #12708
- [https://nvbugs/6059036][fix] AutoDeploy fix registry accuracy tests by @nvchenghaoz in #12942
- [https://nvbugs/5963665][refactor] Refactor warmup orchestration in ModelEngine by @liji-nv in #12407
- [https://nvbugs/5973214][fix] unwaive qwen3 ci test by @byshiue in #12237
- [https://nvbugs/5781383][chore] Unwaive test by @shuyixiong in #12282
- [None][chore] AutoDeploy: Added Qwen3.5 accuracy test for NVFP4 by @taylor-yb-lee in #13014
- [None][chore] Unify code path for reuse/non-reuse when adding sequence in kv cache manager by @eopXD in #10437
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #13016
- [None][chore] Waive failed tests by @yiqingy0 in #13035
- [TRTLLM-11091][feat] Add tunable nvfp4 quantize with additional FlashInfer backend by @chang-l in #12126
- [TRTLLM-11540][feat] Revert EAGLE3 dynamic tree speculative decoding support (#12062) by @brb-nv in #13006
- [None][fix] fix Wan unit tests by @zhenhuaw-me in #13026
- [None][fix] Update CUTLASS C++ to 4.4.2 by @depaulmillz in #12897
- [None][chore] Waive failing tests 04/14 by @brb-nv in #13049
- [None][chore] Unwaive broader test lists by @brb-nv in #13053
- [None][chore] Update waived test name by @brb-nv in #13058
- [None][infra] Waive 4 failed cases for main in post-merge 2652 by @ZhanruiSunCh in #13067
- [None][fix] Fix moe_chunking_tokens during MoE A2A by @Wanli-Jiang in #12929
- [None][infra] Support nv sa benchmark in CI Perf Test by @chenfeiz0326 in #13004
- [None][fix] Pin Ray version to 2.54.1 by @shuyixiong in #13071
- [None][infra] Add K2.5 Perf Tests into CI by @chenfeiz0326 in #12931
- [https://nvbugs/5846024][fix] Remove waivers by @VALLIS-NERIA in #12979
- [https://nvbugs/5838178][fix] Fix failing lora test for Llama by @brb-nv in #12950
- [None][fix] Guard CUDA event elapsed_time in perf_metrics_manager to prevent executor crash by @yifjiang in #12868
- [None][fix] Pin Ray version to 2.54.1 in slurm CI stage by @shuyixiong in #13085
- [TRTLLM-11266][feat] Unify image as tensor to avoid multiple converting for nano model by @Wanli-Jiang in #12994
- [None][fix] Update CODING_GUIDELINES.md to say Python >= 3.10 by @hnover-nv in #13094
- [None][chore] Remove onboard block switch for KV cache manager by @eopXD in #12449
- [None][infra] Waive 3 failed cases for main in post-merge 2652 by @ZhanruiSunCh in #13070
- [None][infra] Exclude QA nodes when running TRTLLM CI by @yuanjingx87 in #13102
- [None][fix] Remove leftover onboardBlocks param in kvCacheManagerTest by @eopXD in #13107
- [None][infra] Waive 1 failed cases for main in post-merge 2653 by @ZhanruiSunCh in #13109
- [TRTLLM-11990][infra] Move PY312-UB2404 sanityCheck test to A100X node by @yiqingy0 in #13077
- [None][chore] Bump version to 1.3.0rc12 by @VALLIS-NERIA in #13129
New Contributors
- @stefanpantic made their first contribution in #12284
- @nvyutwu made their first contribution in #12545
- @sammshen made their first contribution in #12626
- @depaulmillz made their first contribution in #12897
Full Changelog: v1.3.0rc11...v1.3.0rc12