NVIDIA/TensorRT-LLM v1.2.0rc8

Pre-release

Highlights

  • Model Support

    • Add export patch for GraniteMoe MoE models to enable torch.export compatibility (#10169)
    • Eagle: capture hidden states for Qwen2 (#10091)
    • Add pipeline-parallel (PP) support for DeepSeek-V3.2 (#10449)
    • Pass lora_params through Qwen2/3 model forward (#10174)
    • Fix export for microsoft/Phi-3-medium-128k-instruct (#10455)
    • Mistral Large 3: minor code refinements (#10405)
    • EPD for Qwen3 VL (#10470)
    • Remove some model support; add device constraint (#10563)
    • Enable AttentionDP on Qwen3-VL and fix test (#10435)
  • API

    • Add stability tags for serve subcommand (#10012)
  • Feature

    • Better align MLA chunking with indexer chunking when chunked prefill is enabled for DSV32 (#10552)
    • SM100 weight-only kernel (#10190)
    • AutoTuner Cache: Support cache file lock and merge all ranks into one (#10336)
    • Apply AutoTuner to AllReduce Op for strategy tuning (#8531)
    • Add transferAgent binding (step 1) (#10113)
    • Add the eos tokens in generation config to stop words in the sampler (#10389)
    • Apply fusion for W4AFP8_AWQ MoE (#9838)
    • Further reduce tuning time for cuteDSL nvFP4 dense gemm (#10339)
    • Run sample_async on extra stream (#10215)
    • Optimize qk rope/nope concat for DSA (#10571)
  • Fix

    • Fix a bug in Mistral-Small-3.1-24B-Instruct-2503 (#10394)
    • Use port 0 as an arbitrary port when disagg service discovery is enabled (#10383; see the port-0 sketch after this list)
    • Fix buffer reuse for CUDA graph attention metadata (#10393)
    • Force release of torch memory when the LLM is destroyed (#10314; see the teardown sketch after this list)
    • Swap TP-CP grouping order (#10350)
    • TRTLLM MoE maps to lower tuning buckets when ep>1 (#9998)
    • Fix draft token tree chain crash and depth=1 corner case (#10386, #10385)
    • Fix recursive node traversals (#10379)
    • Fix undefined tokens_per_block (#10438)
    • Skip spec dec for non-last rank (#10445; reverted by #10547)
    • Set up dist before using the AutoTuner (#10491)
    • Fix broken cast (#9975)
    • Fix sm120 speculation (#10049)
    • Fix mamba_cache_manager when cuda_graph_padding is enabled, and add test coverage for this case (#9873)
    • Choose the registered model config over the root config for VLMs (#10553)
  • Documentation

    • Update SWA + spec dec support matrix (#10421)
    • Document the preference for --config over --extra_llm_api_options in CODING_GUIDELINES.md (#10426)
    • Add parallelism types to the feature combination matrix (#9849)
    • Update GPTOSS Doc (#10536)
    • Blog: Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs (#10565)
    • Update Qwen3-Next doc by adding known issues section (#10582)
  • Test & Infra

    • Add tests for DeepSeek v3.2 (#10561)
    • Add accuracy tests for super-v3 with multiple GPUs (#10234)
    • Layer-wise benchmarks: support TEP balance, polish slurm scripts (#10237)
    • Add disaggregated-serving Kimi K2 Thinking tests (#10357)
    • Partition test_llm_pytorch.py for parallel execution (#10400)
    • Only Use Throughput Metrics to Check Regression (#10404)
    • Add VSWA test case coverage (#10146)
    • Use random port in container port section (#10432)
    • Remove redundant retries while binding to arbitrary port (#10452)
    • Add qwen3-4b accuracy test case (#10382)
    • Update kimi-k2-1k1k dataset (#10473)
    • Fix concurrency list in Wide-EP perf tests (#10529)
    • Restrict max_num_tokens in disagg mtp config (#10442)
    • Add kimi_k2 single node perf test (#10436)
    • Add MMMU test for mistral small (#10530)
    • Workaround OCI-NRT slowdown issue (#10587)
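
The port-handling changes above (#10383, #10432, #10452) all build on the standard OS facility of binding to port 0 so the kernel assigns a free ephemeral port. A minimal sketch of that technique (the helper name is ours, not TRT-LLM's):

```python
import socket

def pick_free_port(host: str = "127.0.0.1") -> int:
    # Binding to port 0 asks the kernel for any unused ephemeral port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]

print(pick_free_port())
```

Note the returned port is only reserved while the socket is open, so a service should ideally bind once and keep the socket rather than probe and rebind.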

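For the teardown fix (#10314), the generic PyTorch pattern for reclaiming GPU memory once an object is dropped looks roughly like this (a sketch assuming a CUDA device; `llm` here is a stand-in tensor, not TRT-LLM's actual LLM object):

```python
import gc
import torch

llm = torch.empty(1 << 20, device="cuda")  # stand-in for an LLM's CUDA state

del llm                    # drop the last Python reference
gc.collect()               # collect reference cycles that may pin tensors
torch.cuda.empty_cache()   # hand cached allocator blocks back to the driver
print(torch.cuda.memory_reserved())
```
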
What's Changed

  • [#8391][chore] added deepseek_r1_distill_qwen_32b AutoDeploy perf test to L0 by @MrGeva in #10377
  • [https://nvbugs/5670469][fix] Filter 0s and choose min of kv_head for Nemotron model by @farazkh80 in #10206
  • [https://nvbugs/5772363][fix] fix bug of Mistral-Small-3.1-24B-Instruct-2503 by @byshiue in #10394
  • [https://nvbugs/5649010][fix] use 0 port as arbitrary port when disagg service discovery is enabled by @reasonsolo in #10383
  • [TRTLLM-10065][feat] Add accuracy tests for super-v3 with multiple-gpus by @Wanli-Jiang in #10234
  • [https://nvbugs/5779534][fix] fix buffer reuse for CUDA graph attention metadata by @lfr-0531 in #10393
  • [None][feat] sm100 weight-only kernel by @Njuapp in #10190
  • [https://nvbugs/5701425][chore] Unwaive tests. by @yuxianq in #10269
  • [None][feat] Layer-wise benchmarks: support TEP balance, polish slurm scripts by @yuantailing in #10237
  • [None][infra] Waive failed cases in post-merge on 1/5 by @EmmaQiaoCh in #10399
  • [TRTLLM-10185][feat] AutoTuner Cache: Support cache file lock and merge all ranks into one by @hyukn in #10336
  • [TRTLLM-8242][feat] Add stability tags for serve subcommand by @LinPoly in #10012
  • [https://nvbugs/5752521][fix] Unwaive test_trtllm_flashinfer_symbol_collision.py by @yihwang-nv in #10227
  • [None][infra] Waive failed cases again on 1/5 by @EmmaQiaoCh in #10403
  • [https://nvbugs/5715568][fix] Force to release torch memory when LLM is destroyed by @HuiGao-NV in #10314
  • [TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning. by @hyukn in #8531
  • [None][feat] update deepgemm to the DeepGEMM/nv_dev branch by @lfr-0531 in #9898
  • [TRTLLM-9381][test] add disag-serving kimi k2 thinking tests by @xinhe-nv in #10357
  • [#10374][fix] fixed race condition in AutoDeploy's mp tests port acquisition by @MrGeva in #10366
  • [TRTLLM-9465][fix] Swap TP-CP grouping order by @brb-nv in #10350
  • [None][perf] TRTLLM MoE maps to lower tuning buckets when ep>1 by @rosenrodt in #9998
  • [TRTLLM-10053][feat] AutoDeploy: Add Super v3 config file, improve test runtime by @galagam in #10397
  • [https://nvbugs/5772521][fix] Fix draft token tree chain crash by @mikeiovine in #10386
  • [https://nvbugs/5772414][fix] Fix draft token tree depth=1 corner case by @mikeiovine in #10385
  • [TRTLLM-9767][feat] Fixed recursive node traversals by @greg-kwasniewski1 in #10379
  • [TRTLLM-9551][infra] Partition test_llm_pytorch.py for parallel execution by @Superjomn in #10400
  • [https://nvbugs/5695984][fix] Unwaive llama3 eagle test by @mikeiovine in #10092
  • [https://nvbugs/5745152][fix] Unwaive gpt oss spec decode test by @mikeiovine in #10370
  • [#10170][fix] Add export patch for GraniteMoe MoE models to enable torch.export compatibility by @karthikvetrivel in #10169
  • [https://nvbugs/5777044][chore] Remove solved bugs from waives.txt by @SimengLiu-nv in #10422
  • [None][feat] precompiled installation from local src dir by @lucaslie in #10419
  • [TRTLLM-9527][feat] Add transferAgent binding (step 1) by @chuangz0 in #10113
  • [None][fix] Only Use Throughput Metrics to Check Regression by @chenfeiz0326 in #10404
  • [None][feat] add the eos tokens in generation config to stop words in the sampler by @JadoTu in #10389 (see the EOS sketch after this list)
  • [None][chore] Update SWA + spec dec support matrix by @mikeiovine in #10421
  • [None][feat] CuteDSL MOE FC1 Enhancement by @liyuhannnnn in #10088
  • [https://nvbugs/5726962][feat] Apply fusion for W4AFP8_AWQ MoE by @yumin066 in #9838
  • [#2511][fix] eagle: qwen2 capture hidden states by @XiaoXuan42 in #10091
  • [None][docs] Add --config preference over --extra_llm_api_options in CODING_GUIDELINES.md by @venkywonka in #10426
  • [#8460][feat] Revive and simplify Model Explorer visualization integration by @karthikvetrivel in #10150
  • [None][chore] unwaive qwen3 30b test by @kris1025 in #10115
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10384
  • [None][test] update test case constraint by @crazydemo in #10381
  • [https://nvbugs/5769926] [fix] Add no container mount home WAR by @kaiyux in #10431
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10367
  • [TRTLLM-9622][infra] Enable DGX_B300 multi-gpu testing in pre-merge pipeline by @yiqingy0 in #9699
  • [TRTLLM-9896][test] add vswa test cases coverage by @crazydemo in #10146
  • [None] [fix] Fix undefined tokens_per_block by @kaiyux in #10438
  • [https://nvbugs/5772361][ci] Unwaive tests that have been fixed by @2ez4bz in #10424
  • [https://nvbugs/5785206][infra] Waive TestQwen3_30B_A3B::test_fp8[latency-torch_compile=False]. by @bobboli in #10441
  • [None][infra] Waive failed cases on 1/6 by @EmmaQiaoCh in #10440
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10427
  • [https://nvbugs/5760726][fix] Use random port in container port section by @JunyiXu-nv in #10432
  • [None][chore] remove redundant retries while binding to arbitrary port by @reasonsolo in #10452
  • [https://nvbugs/5748600][ci] Unwaive disagg guided decoding test by @syuoni in #10409
  • [https://nvbugs/5749988][fix] Remove redundant qwen3 spec dec test by @mikeiovine in #10387
  • [None][feat] precompiled installation from local src dir with fnmatch only by @lucaslie in #10430
  • [https://nvbugs/5732942][fix] AutoDeploy: handle transformers 4.57.1 upgrade fixes by @lucaslie in #10466
  • [None] [feat] Add test script and raster M for gather fc1 kernel by @zongfeijing in #10429
  • [https://nvbugs/5721907][fix] AutoDeploy: improve numerical stability of flashinfer attention test by @lucaslie in #10467
  • [https://nvbugs/5698434][test] add qwen3-4b accuracy test case by @crazydemo in #10382
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10365
  • [https://nvbugs/5767223][feat] add pp support for DeepSeek-v3.2 by @lfr-0531 in #10449
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10457
  • [https://nvbugs/5726086][fix] update kimi-k2-1k1k dataset by @yingguo-trt in #10473
  • [#4745][fix] Pass lora_params through Qwen2/3 model forward by @karljang in #10174
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10474
  • [None][bug] fix export for microsoft/Phi-3-medium-128k-instruct by @tcherckez-nvidia in #10455
  • [None][test] Remove most TRT-backend test cases in llm_perf_nim.yml by @yufeiwu-nv in #10487
  • [None][doc] Adding parallelism types in feature combination matrix by @pcastonguay in #9849
  • [https://nvbugs/5781589][fix] Skip spec dec for non-last rank by @ziyixiong-nv in #10445
  • [https://nvbugs/5761665][fix] AutoDeploy: using Dim.DYNAMIC for robust dynamic shape by @lucaslie in #10511
  • [https://nvbugs/5707392][fix] unwaive test_fused_moe_fp8_blockwise_wide_ep[NotEnabled] by @xxi-nv in #10428
  • [TRTLLM-9661][chore] Further reduce tuning time for cuteDSL nvFP4 dense gemm. by @hyukn in #10339
  • [https://nvbugs/5784543][fix] Setup dist before using autotuner. by @yuxianq in #10491
  • [None][chore] Support multiple job submission at the same time by @yingguo-trt in #10492
  • [https://nvbugs/5747878][fix] unwaive llama4 scout tests by @lucaslie in #10468
  • [https://nvbugs/5775402][fix] Fix concurrency list in Wide-EP perf tests by @Barry-Delaney in #10529
  • [None][chore] Enable seg fault cases since one race condition is fixed by @HuiGao-NV in #10398
  • [None][doc] Update GPTOSS Doc by @dongfengy in #10536
  • [TRTLLM-9642][infra] Increase pytest verbosity for failed tests by @EmmaQiaoCh in #9657
  • [None][chore] Bump version to 1.2.0rc8 by @yiqingy0 in #10542
  • [None][fix] Mistral large 3 few code refine by @byshiue in #10405
  • [#10417][fix] AutoDeploy - Reverted to direct computation of minusA by @MrGeva in #10509
  • [None][feat] EPD for Qwen3 VL by @2ez4bz in #10470
  • [TRTLLM-9522][fix] broken cast by @ixlmar in #9975
  • [#10513][fix] AutoDeploy: removed self.mlp_type leftovers from last moe refactor by @MrGeva in #10512
  • [https://nvbugs/5740075][fix] Fix sm120 speculation by @mikeiovine in #10049
  • [None][chore] Waive tests blocking premerge 01/08 by @brb-nv in #10555
  • [None][fix] revert #10445. by @yuxianq in #10547
  • [None][test] restrict max_num_tokens in disagg mtp config by @ruodil in #10442
  • [None][chore] Add failed cases into waives.txt by @jieli-matrix in #10541
  • [None][fix] Setup dist for AutoTuner in Layerwise benchmarking. by @hyukn in #10534
  • [TRTLLM-9676][fix] Fix mamba_cache_manager when enabling cuda_graph_padding and let test cover this case by @JadoTu in #9873
  • [https://nvbugs/5785206][infra] unwaive the accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B by @byshiue in #10560
  • [https://nvbugs/5787453][fix] Better align MLA chunking with indexer chunking when chunked prefill enabled for DSV32 by @chang-l in #10552
  • [https://nvbugs/5622938][feat] Run sample_async on extra stream. by @yuxianq in #10215 (see the side-stream sketch after this list)
  • [None][doc] blog: Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs by @lfr-0531 in #10565
  • [TRTLLM-9932][test] add kimi_k2 single node perf test by @ruodil in #10436
  • [None][chore] remove some model support; add device constraint by @jieli-matrix in #10563
  • [https://nvbugs/5756008][fix] unwaive test by @Tabrizian in #10523
  • [TRTLLM-10309] [feat] Optimize qk rope/nope concat for DSA by @kaiyux in #10571
  • [None][fix] Enable AttentionDP on Qwen3-VL and fix test by @yechank-nvidia in #10435
  • [None][feat] Add support for DeepSeek v3.2 tests by @yingguo-trt in #10561
  • [https://nvbugs/5752687][fix] Choose register model config over root config for VLM by @farazkh80 in #10553
  • [https://nvbugs/5628848][fix] Fix nanobind stub generation by @Linda-Stadter in #10516
  • [https://nvbugs/5548861][fix] AutoDeploy: Fix the test by @nvchenghaoz in #10521
  • [https://nvbugs/5669097][tests] Add MMMU test for mistral small by @2ez4bz in #10530
  • [None][chore] Update AutoDeploy model list by @tcherckez-nvidia in #10505
  • [None][chore] Fix Gitlab CI termination issues by @fredricz-20070104 in #10576
  • [None][chore] waive test case by @HuiGao-NV in #10581
  • [None][doc] Update Qwen3-Next doc by adding known issues section by @nv-guomingz in #10582
  • [None][ci] Workaround OCI-NRT slowdown issue by @chzblych in #10587
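
As an illustration of the EOS/stop-word change (#10389), the general idea is to fold every EOS id declared by the model's generation config into the sampler's stop set. A minimal sketch with Hugging Face's `GenerationConfig` (the ids and stop set below are made up):

```python
from transformers import GenerationConfig

# Toy config; real models ship their own, often with several EOS ids.
gen_cfg = GenerationConfig(eos_token_id=[50256, 50257])

eos = gen_cfg.eos_token_id              # may be None, an int, or a list
eos_ids = eos if isinstance(eos, list) else [eos]

stop_token_ids = {7, 11}                # hypothetical user-supplied stop tokens
stop_token_ids.update(i for i in eos_ids if i is not None)
print(sorted(stop_token_ids))           # sampler now also stops on EOS
```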

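The side-stream change (#10215) follows the usual CUDA overlap pattern: run sampling-like work on a secondary stream and fence it against producer and consumer. A generic PyTorch sketch (assumes a CUDA device; this is not TRT-LLM's sample_async code):

```python
import torch

main = torch.cuda.current_stream()
side = torch.cuda.Stream()

logits = torch.randn(8, 32000, device="cuda")  # produced on the main stream

side.wait_stream(main)                  # logits must be ready before sampling
with torch.cuda.stream(side):
    token = torch.argmax(logits, dim=-1)       # "sampling" on the side stream

main.wait_stream(side)                  # main-stream consumers wait for result
print(token.shape)
```
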
Full Changelog: v1.2.0rc7...v1.2.0rc8
