Highlights
- Known Issues
- DeepSeek V3.2 will crash with an illegal memory access during long-running performance tests under various agg/disagg configurations.
- Model Support
- Add MoT World Model support (#14012)
- Enable multi-node tensor parallelism for MiniMax-M2 (#14314)
- Restore Mistral Large 3 text-only processor (#14248)
- Support Gemma4 multi-head_dim pools and host-side slicing for SWA Triton kernels (#13745)
- Add a reasoning parser for Qwen3.5 (#14659)
- Add LTX-2 Ulysses cross-attention for v2a with audio padding (#14044)
- Add Poolside Laguna tool parser (#14638)
- Replace Parakeet audio encoder with native TensorRT-LLM layers (#14474)
- Set Mamba SSM cache to fp32 for NemotronV2 (#14448)
- API
- Allow
content: nullinCustomChatCompletionMessageParam(#14368) - Enforce
trust_remote_codeflag (#13527) - Add thinking token budget control (#14665)
- Expose host/GPU per-iter time and clarify iter labeling in
/metrics(#14127) - Make attention backend case-insensitive (#14635)
- Feature
- Add FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron (#13773)
- Integrate the FlashInfer GDN prefill kernel for Qwen3.5 (#13644)
- Add LoRA support to LLMAPI Triton backend (#14079)
- Log KV cache utilization and context tokens per iteration (#14206)
- Remove one-warp-per-token policy from MoE A2A kernels (#14550)
- Support non-divisible expert parallelism in MoE all-to-all and Slurm benchmark (#13888)
- Add CuTe DSL attention via exported binaries in VisualGen (#13721)
- Enable NVFP4 KV cache support in trtllm-gen attention (#12544)
- Add GMS-only weight sharing support (#13926)
- Add VisualGen tensor parallelism support (#13614)
- Enable NCCL symmetric zero-copy by default (#14472)
- Improve disaggregated TTFT (#14719)
- Fix
- Restore K2.5 multimodal dep8 accuracy test on Transformers 5.5.x (#14392)
- Remove sync after FlashInfer attention
plan()(#14634) - Add a compatibility shim in
load_hf_tokenizerforbytes_to_unicode(#14090) - Route
trtllm-benchandtrtllm-servetokenizer load throughTransformersTokenizer(#14452) - Fix crash in
deep_ep.pyby falling back to the pre-quant dispatch path whenhidden_states_sfis missing (#14404) - Fix gpt-oss accuracy issue by moving TinyGEMM PDL release after reduction (#14537)
- Fix Mistral-Large-3 weight loading crash (#14033)
- Bypass FlashInfer SSD prefill to fix state dtype precision (#14600)
- Fix qwen3 hang on SM120/121 (#14424)
- Fix NVFP4 engine size estimation and attention DP batch size in
trtllm-bench(#13498) - Catch
OSErrorinconfig_file_lockfor NFS compatibility (#11960) - Fix MoE DeepGEMM workspace size with attention DP (#13310)
- Fix inf/NaN issues in Triton Mamba softplus (#14652)
- Cap per-rank
max_num_active_requestsbymax_num_tokensunder attention DP (#14481) - Propagate external SWA window to FMHA kernel in V2 KV cache (#13719)
- Resolve NVML device index mismatch in
get_numa_aware_cpu_affinitywhenCUDA_VISIBLE_DEVICESis set (#12985) - Replace fixed disagg fill throttle with slow-start ramp (#14475)
- Reuse
batch_indices_cudaacross CUDA graph captures in EAGLE3 (#14381) - Make FA4 a proper pip dependency (#13788)
- Fix GSM8K accuracy tests for LagunaXS on B200/GB200/B300 (#14580)
- Documentation
- Test & Infra
- Update flashinfer-python from 0.6.11.post1 to 0.6.12rc2 (#14512, #14607)
- Add disagg local one-step run script for CI submit (#14557)
- Update model path definitions in
test_perf.pyand clean upwaives.txt(#14393) - Dedup executor unit tests on H100/B200 (#14556)
- Add disagg cancellation stress-test harness skeleton (#14375)
- Add UCX TLS env in disagg-related tests (#14626)
- Replace ONNX spec with
onnx>=1.21.0inrequirements.txt(#14577) - Add test lists with multi-GPU tests to CI multi-GPU test trigger files (#14087)
- Add offline equivalence test for sharding IR (#13963)
- Enable
kv_cache_manager_v2test for A10 (#12885) - Remove two-model EAGLE3 spec-decoding tests (#14735)
- Add
TLLM_SPEC_DECODE_FORCE_NUM_ACCEPTED_TOKENSin spec decoding perf test (#14438)
What's Changed
- [https://nvbugs/6182617][fix] Restore K2.5 multimodal dep8 accuracy test on transformers 5.5.x by @tianyuxbear in #14392
- [None][feat] FlashInfer NVFP4 MoE backend (SM120/SM121) for Nemotron … by @farazkh80 in #13773
- [None][perf] Integrate the flashinfer gdn prefill kernel for qwen3.5 by @nv-guomingz in #13644
- [None][chore] Update flashinfer-python from 0.6.11.post1 to 0.6.12rc1 by @yihwang-nv in #14512
- [https://nvbugs/6162328][fix] Add a tiny compat shim in
load_hf_tokenizerthat, whenbytes_to_unicodeis m by @tensorrt-cicd in #14090 - [https://nvbugs/6114610][test] unwaive disagg tests fixed by UCX_TLS setter by @xwang233 in #14440
- [None][fix] Route trtllm-bench and trtllm-serve tokenizer load through TransformersTokenizer by @dc3671 in #14452
- [https://nvbugs/6184914][test] Unwaive related tests by @yuxianq in #14523
- [https://nvbugs/6186880][fix] In deep_ep.py, fall back to the pre-quant dispatch path when hidden_states_sf is by @tensorrt-cicd in #14404
- [None][infra] Waive 2 failed cases for main in post-merge 2734 by @ZhanruiSunCh in #14526
- [None][infra] Waive 1 failed cases for main in post-merge 2735 by @ZhanruiSunCh in #14542
- [#11257][feat] Add LoRA support to llmapi triton backend by @karljang in #14079
- [None][chore] Include layer_idx in MoE backend fallback warnings by @dc3671 in #13409
- [None][chore] Add disagg local one-step run script for CI submit by @fredricz-20070104 in #14557
- [https://nvbugs/5974335][refactor] Update model path definitions in test_perf.py and clean up waives.txt by @yufeiwu-nv in #14393
- [TRTLLM-12968][ci] Dedup executor unit tests on H100/B200 by @YihuiLu512 in #14556
- [TRTLLM-12949][refactor] visual_gen: unify fused QK-norm+rope dispatch by @luyiyun1021 in #14529
- [https://nvbugs/6143579][fix] Allow content: null in CustomChatCompletionMessageParam by @tijyojwad in #14368
- [None][chore] log KV cache utilization and context tokens per iter by @pcicotti in #14206
- [https://nvbugs/6168859][fix] move tinygemm PDL release after reduction by @dongfengy in #14537
- [None][chore] Unwaive
test_cp_tp_broadcast_objectby @brb-nv in #14328 - [https://nvbugs/6211185][fix] Fix failed GSM8K accuracy tests for LagunaXS on B200/GB200/B300 by @DomBrown in #14580
- [TRTLLMINF-106][infra] Use B300 frontend platforms by @mlefeb01 in #14581
- [None] [refactor] Unify compressed-tensors quant config parsing by @DomBrown in #14468
- [None][feat] AutoDeploy push the rope buffer to later stage by @nvchenghaoz in #13859
- [https://nvbugs/6215736][infra] Unwaive test_fp8_blockscale[throughput_mtp] by @bobboli in #14541
- [https://nvbugs/6175923][test] Revert gpt_oss_20b perf MoE-backend pin by @ruodil in #14612
- [https://nvbugs/6221621][test] Update trust_remote to nemotron and phi4 models by @yufeiwu-nv in #14570
- [None][chore] update VisualGen codeowner settings by @zhenhuaw-me in #14530
- [None][infra] Waive 8 failed cases for main in post-merge 2738 by @ZhanruiSunCh in #14615
- [None][perf] Fuse FlashInfer GDN prefill state I/O into Triton kernels by @nv-guomingz in #14548
- [https://nvbugs/6164924][fix] Lower free_gpu_memory_fraction for Exaone tests by @tensorrt-cicd in #14486
- [https://nvbugs/6163033][fix] Guard
q_a_proj.weightdict access behindnvfp4_fused_a; update test to `chec by @tensorrt-cicd in #14033 - [None][fix] Bypass FlashInfer SSD prefill to fix state dtype precision by @tijyojwad in #14600
- [None][fix] Exclude Qwen3 VL vision model from quantization by @2ez4bz in #12851
- [https://nvbugs/6162860][fix] Set free_gpu_memory_fraction=0.6 only when torch_compile=True for test_bfloat16_ by @tensorrt-cicd in #14109
- [None][chore] Remove one-warp-per-token policy from MoE A2A kernels by @bobboli in #14550
- [None][test] Waive 7 failed cases for main in QA CI by @xinhe-nv in #14498
- [None][doc] Add CUTLASS DSL uninstall step to installation guide by @yihwang-nv in #14621
- [https://nvbugs/6099723][fix] Gate supports_mnnvl() False on SM120/121 in _mnnvl_utils.py and add the same Mnn by @tensorrt-cicd in #14424
- [https://nvbugs/6114464][fix] Add kv_cache_config to TestQwen3VL_MOE::test_auto_dtype by @tensorrt-cicd in #13668
- [None][chore] Update flashinfer-python from 0.6.12rc1 to 0.6.12rc2 by @yihwang-nv in #14607
- [https://nvbugs/6109750][test] Unwaive passing GPTOSS tests by @dongfengy in #14596
- [https://nvbugs/6215690][fix] AutoDeploy: FlashInfer 128-byte alignment for Mamba inputs (also addresses nvbugs/6162114) by @galagam in #14535
- [None][docs] Add deprecation notice to legacy support-matrix.md by @fuergaosi233 in #14495
- [None][fix] make FA4 proper pip dependency by @o-stoner in #13788
- [TRTLLM-12648][test] add disagg cancellation stress-test harness skeleton by @chienchunhung in #14375
- [None][feat] support non-divisible EP in MoE alltoall and slurm benchmark by @JacobHu-NV in #13888
- [https://nvbugs/6094100][fix] add ucx tls env in disagg related tests by @chuangz0 in #14626
- [None][test] Update stress tests by @xinhe-nv in #14454
- [None][chore] Bump version to 1.3.0rc17 by @VALLIS-NERIA in #14657
- [None][infra] Fix hang when generating report by @EmmaQiaoCh in #14625
- [None][infra] Update blossom-ci allowlist: add nv-anants, guqiqi, jonghyunchoe, belgarten-nv by @yiqingy0 in #14662
- [https://nvbugs/6115560][fix] catch OSError in config_file_lock for NFS compatibility by @sara4dev in #11960
- [None][fix] Fix OSRB source header and provenance issues in AutoDeploy modeling code by @bmarimuthu-nv in #14670
- [https://nvbugs/6185173][fix] Set mamba ssm cache to fp32 for NemotronV2 by @tensorrt-cicd in #14448
- [https://nvbugs/5800725][fix] Restore Mistral Large 3 text-only processor by @byshiue in #14248
- [None][test] Unwaive fp8 blockscale baseline mtp1 by @sunnyqgg in #14666
- [None][docs] fix incorrect auto sampler behavior description for beam search by @fuergaosi233 in #14487
- [None][feat] Expose host/GPU per-iter time and clarify iter labeling in /metrics by @eopXD in #14127
- [None][chore] Add test lists with multi-gpu test to CI multi-gpu test trigger files by @pengbowang-nv in #14087
- [None][refactor] Add derived properties for the thop.attention call site by @yuxianq in #14279
- [https://nvbugs/6160085][fix] At
tensorrt_llm/tokenizer/tokenizer.pyimport time, re-export `bytes_to_unicod by @tensorrt-cicd in #14116 - [None][infra] Waive 1 failed cases for main in post-merge 2740 by @ZhanruiSunCh in #14688
- [https://nvbugs/6221483][fix] Revert auto_deploy _mamba_ssm_prepare_metadata to pre-#13566 state by @greg-kwasniewski1 in #14640
- [TRTLLM-12762][fix] Enable multi-node TP for MiniMax-M2 by @pcicotti in #14314
- [https://nvbugs/6043248][fix] Validate tensor payload size on deserialization by @yibinl-nvidia in #14648
- [None][perf] Add AutoDeploy NVFP4 RMSNorm quant fusion by @tcherckez-nvidia in #14361
- [None][feat] Support Gemma4 multi-head_dim pools and host-side slicing to provide local view to Triton kernels for SWA by @eopXD in #13745
- [TRTLLM-13960][test] Offline equivalence test for sharding IR by @greg-kwasniewski1 in #13963
- [TRTLLM-11410][feat] MoT World Model Support by @NVShreyas in #14012
- [https://nvbugs/6115036][fix] Fix NVFP4 engine size estimation and attention DP batch size in trtllm-bench by @hyukn in #13498
- [https://nvbugs/5972776][fix] Pass IPC HMAC key through file descriptor by @yibinl-nvidia in #14378
- [https://nvbugs/5911594][fix] Restrict HTTP cluster storage to loopback by @yibinl-nvidia in #14161
- [None][fix] Exclude post-merge stages from CBTS force-keep filters by @achartier in #14594
- [TRTLLMINF-67][infra] use pre-configured idle GPU exemption by @tburt-nv in #14587
- [https://nvbugs/6207749][fix] Replace the spec with
onnx>=1.21.0inrequirements.txt; mirror in `security_ by @tensorrt-cicd in #14577 - [https://nvbugs/6185480][fix] Autodeploy skip the GLM accuracy test for pre-hopper by @nvchenghaoz in #14656
- [https://nvbugs/6165866][infra] Waive 1 failed cases for main in pre-merge 40081 by @taylor-yb-lee in #14653
- [https://nvbugs/6187185][fix] Apply the existing
low_memory_overrides()helper in `TestNemotronV2.test_auto_ by @tensorrt-cicd in #14584 - [https://nvbugs/6192201][fix] AutoDeploy: unwaive llama perf test and increase its concurrency to 256 by @MrGeva in #14691
- [TRTLLM-12982][feat] improve attention backend selection by @ixlmar in #14635
- [None][test] Enable test for kv_cache_manager_v2 for A10 by @lowsfer in #12885
- [None][infra] Generate json with cmake fetched contents in build stage by @yuanjingx87 in #13607
- [TRTLLM-12436][feat] visual_gen: add CuTe DSL attention via exported binaries by @xrq-phys in #13721
- [https://nvbugs/6229221][fix] Add a reasoning parser for qwen3_5 by @moraxu in #14659
- [None][fix] Reuse batch_indices_cuda across CUDA graph captures in EAGLE3 by @achartier in #14381
- [https://nvbugs/6084447][fix] Fix MoE DeepGEMM workspace size with attention_dp by @tensorrt-cicd in #13310
- [https://nvbugs/6189416][fix] Add a Blackwell-specific reference entry (extra_acc_spec=sm100_fp8, accuracy=46. by @tensorrt-cicd in #14484
- [#14619][perf] AutoDeploy: tune Llama-3.1-8B-Instruct-FP8 TP=2/4 config and handle CG max bs when it is unset in the yaml by @MrGeva in #14622
- [#13561][feat] AutoDeploy: enable MLIR elementwise fusion and trtllm_gen MoE on Nano NVFP4 by @MrGeva in #14554
- [https://nvbugs/6221841][fix] Detect via the raw config_dict whether the user actually set a top-level rope_th by @tensorrt-cicd in #14624
- [None][perf] Replace Parakeet audio encoder with native trtllm layers by @aswinvisva in #14474
- [https://nvbugs/6194552][fix] stabilize Triton Mamba softplus by @hnover-nv in #14652
- [TRTLLM-13043][chroe] add VisualGen context to AGENTS.md by @zhenhuaw-me in #14732
- [None][feature] Add thinking token budget control by @tijyojwad in #14665
- [TRTLLM-13050][test] Remove two-model eagle3 spec-decoding tests by @QiJune in #14735
- [TRTLLM-12653][feat] LTX-2 Ulysses cross-attention for v2a with audio padding by @luyiyun1021 in #14044
- [None][refactor] Flatten thop.attention sequence kwargs + rename rotary_embedding_* to rope_* by @yuxianq in #14569
- [https://nvbugs/6162857][fix] Use generation metrics for VisualGen perf sanity by @taianz-nv in #14176
- [https://nvbugs/6045177][fix] resolve mypy error by @ixlmar in #14689
- [None][feat] add Poolside Laguna tool parser by @DomBrown in #14638
- [TRTLLM-10004][chore] Enable NCCL symmetric zero-copy by default by @nv-lschneider in #14472
- [None][infra] Fix cbts tokenmacro b64 by @crazydemo in #14718
- [TRTLLM-12982][perf] remove sync after FlashInfer attention plan() by @ixlmar in #14634
- [https://nvbugs/6156233][test] unwaive GPT-OSS dflash test since bug has been closed as fix unknown by @dongfengy in #14713
- [TRTLLM-12901][fix] cap per-rank max_num_active_requests by max_num_tokens under attention DP by @xwang233 in #14481
- [None][feat] Enable NVFP4 KV cache support in trtllm-gen attention by @yihwang-nv in #12544
- [https://nvbugs/6185480][fix] autodeploy unwaive the test by @nvchenghaoz in #14716
- [https://nvbugs/6136737][fix] Propagate external SWA window to FMHA kernel in V2 KV cache by @tensorrt-cicd in #13719
- [https://nvbugs/5996024][fix] Enforce trust_remote_code flag by @yibinl-nvidia in #13527
- [https://nvbugs/5979710][fix] Bound transfer destinations by @yibinl-nvidia in #13525
- [None][fix] Resolve NVML device index mismatch in get_numa_aware_cpu_affinity when CUDA_VISIBLE_DEVICES is set by @YPxHolic in #12985
- [None][infra] revert #13607 by @tburt-nv in #14757
- [TRTLLM-12535][chore] Refactor fast path (token ID space) preprocessing logic out to the input preprocessor methods only by @moraxu in #14370
- [None][infra] Waive 2 failed cases for main in post-merge 2741 by @ZhanruiSunCh in #14737
- [None][infra] Waive 1 failed cases for main in pre-merge 40562 by @ZhanruiSunCh in #14776
- [TRTLLM-12440][feat] Add GMS-only weight sharing support by @chienchunhung in #13926
- [None][test] Unwaive some Perf Tests by @chenfeiz0326 in #14664
- [https://nvbugs/6204488][fix] Replace fixed disagg fill throttle with slow-start ramp by @chienchunhung in #14475
- [None][chore] Waive failing multi-gpu test by @brb-nv in #14788
- [TRTLLM-11408][feat] Add VisualGen TP Support by @belgarten-nv in #13614
- [None][test] Add TLLM_SPEC_DECODE_FORCE_NUM_ACCEPTED_TOKENS in Spec Decoding Perf Test by @chenfeiz0326 in #14438
- [https://nvbugs/6165866][infra] Waive 1 failed cases for main in pre-merge 40081 - Fix prefix by @taylor-yb-lee in #14756
- [https://nvbugs/6196391][fix] Carryover disagg TTFT improvements by @brb-nv in #14719
- [https://nvbugs/6244695][fix] Revert Pass IPC HMAC key through file descriptor by @chenfeiz0326 in #14782
New Contributors
- @fuergaosi233 made their first contribution in #14495
- @sara4dev made their first contribution in #11960
- @YPxHolic made their first contribution in #12985
- @belgarten-nv made their first contribution in #13614
Full Changelog: v1.3.0rc16...v1.3.0rc17