NVIDIA/TensorRT-LLM v1.2.0rc8

Pre-release

Highlights

  • Model Support

    • Add export patch for GraniteMoe MoE models to enable torch.export compatibility (#10169)
    • Eagle: capture hidden states for Qwen2 (#10091)
    • Add pipeline-parallel (PP) support for DeepSeek-V3.2 (#10449)
    • Pass lora_params through Qwen2/3 model forward (#10174)
    • Fix export for microsoft/Phi-3-medium-128k-instruct (#10455)
    • Mistral Large 3: minor code refinements (#10405)
    • EPD for Qwen3 VL (#10470)
    • Remove some model support; add device constraint (#10563)
    • Enable AttentionDP on Qwen3-VL and fix test (#10435)
  • API

    • Add stability tags for serve subcommand (#10012)
  • Feature

    • Better align MLA chunking with indexer chunking when chunked prefill is enabled for DSV32 (#10552)
    • SM100 weight-only kernel (#10190)
    • AutoTuner Cache: Support cache file lock and merge all ranks into one (#10336)
    • Apply AutoTuner to AllReduce Op for strategy tuning (#8531)
    • Add transferAgent binding (step 1) (#10113)
    • Add the eos tokens in generation config to stop words in the sampler (#10389)
    • Apply fusion for W4AFP8_AWQ MoE (#9838)
    • Further reduce tuning time for cuteDSL nvFP4 dense gemm (#10339)
    • Run sample_async on extra stream (#10215)
    • Optimize qk rope/nope concat for DSA (#10571)
  • Fix

    • Fix a bug in Mistral-Small-3.1-24B-Instruct-2503 (#10394)
    • Use port 0 as an arbitrary port when disagg service discovery is enabled (#10383; see the port-0 sketch after this list)
    • Fix buffer reuse for CUDA graph attention metadata (#10393)
    • Force release of torch memory when the LLM is destroyed (#10314; see the teardown sketch after this list)
    • Swap TP-CP grouping order (#10350)
    • TRTLLM MoE maps to lower tuning buckets when ep>1 (#9998)
    • Fix draft token tree chain crash and depth=1 corner case (#10386, #10385)
    • Fix recursive node traversals (#10379)
    • Fix undefined tokens_per_block (#10438)
    • Skip spec dec for non-last rank (#10445; reverted by #10547)
    • Set up dist before using the AutoTuner (#10491)
    • Fix broken cast (#9975)
    • Fix sm120 speculation (#10049)
    • Fix mamba_cache_manager when cuda_graph_padding is enabled, and add test coverage for this case (#9873)
    • Choose the registered model config over the root config for VLMs (#10553)
  • Documentation

    • Update SWA + spec dec support matrix (#10421)
    • Document the preference for --config over --extra_llm_api_options in CODING_GUIDELINES.md (#10426)
    • Add parallelism types to the feature combination matrix (#9849)
    • Update GPTOSS Doc (#10536)
    • Blog: Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs (#10565)
    • Update Qwen3-Next doc by adding known issues section (#10582)
  • Test & Infra

    • Add tests for DeepSeek v3.2 (#10561)
    • Add accuracy tests for super-v3 with multiple GPUs (#10234)
    • Layer-wise benchmarks: support TEP balance, polish slurm scripts (#10237)
    • Add disaggregated-serving Kimi K2 Thinking tests (#10357)
    • Partition test_llm_pytorch.py for parallel execution (#10400)
    • Only Use Throughput Metrics to Check Regression (#10404)
    • Add VSWA test case coverage (#10146)
    • Use random port in container port section (#10432)
    • Remove redundant retries while binding to arbitrary port (#10452)
    • Add qwen3-4b accuracy test case (#10382)
    • Update kimi-k2-1k1k dataset (#10473)
    • Fix concurrency list in Wide-EP perf tests (#10529)
    • Restrict max_num_tokens in disagg mtp config (#10442)
    • Add kimi_k2 single node perf test (#10436)
    • Add MMMU test for mistral small (#10530)
    • Workaround OCI-NRT slowdown issue (#10587)
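
The port-handling changes above (#10383, #10432, #10452) all build on the standard OS facility of binding to port 0 so the kernel assigns a free ephemeral port. A minimal sketch of that technique (the helper name is ours, not TRT-LLM's):

```python
import socket

def pick_free_port(host: str = "127.0.0.1") -> int:
    # Binding to port 0 asks the kernel for any unused ephemeral port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]

print(pick_free_port())
```

Note the returned port is only reserved while the socket is open, so a service should ideally bind once and keep the socket rather than probe and rebind.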

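For the teardown fix (#10314), the generic PyTorch pattern for reclaiming GPU memory once an object is dropped looks roughly like this (a sketch assuming a CUDA device; `llm` here is a stand-in tensor, not TRT-LLM's actual LLM object):

```python
import gc
import torch

llm = torch.empty(1 << 20, device="cuda")  # stand-in for an LLM's CUDA state

del llm                    # drop the last Python reference
gc.collect()               # collect reference cycles that may pin tensors
torch.cuda.empty_cache()   # hand cached allocator blocks back to the driver
print(torch.cuda.memory_reserved())
```
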
What's Changed

  • [#8391][chore] added deepseek_r1_distill_qwen_32b AutoDeploy perf test to L0 by @MrGeva in #10377
  • [https://nvbugs/5670469][fix] Filter 0s and choose min of kv_head for Nemotron model by @farazkh80 in #10206
  • [https://nvbugs/5772363][fix] fix bug of Mistral-Small-3.1-24B-Instruct-2503 by @byshiue in #10394
  • [https://nvbugs/5649010][fix] use 0 port as arbitrary port when disagg service discovery is enabled by @reasonsolo in #10383
  • [TRTLLM-10065][feat] Add accuracy tests for super-v3 with multiple-gpus by @Wanli-Jiang in #10234
  • [https://nvbugs/5779534][fix] fix buffer reuse for CUDA graph attention metadata by @lfr-0531 in #10393
  • [None][feat] sm100 weight-only kernel by @Njuapp in #10190
  • [https://nvbugs/5701425][chore] Unwaive tests. by @yuxianq in #10269
  • [None][feat] Layer-wise benchmarks: support TEP balance, polish slurm scripts by @yuantailing in #10237
  • [None][infra] Waive failed cases in post-merge on 1/5 by @EmmaQiaoCh in #10399
  • [TRTLLM-10185][feat] AutoTuner Cache: Support cache file lock and merge all ranks into one by @hyukn in #10336
  • [TRTLLM-8242][feat] Add stability tags for serve subcommand by @LinPoly in #10012
  • [https://nvbugs/5752521][fix] Unwaive test_trtllm_flashinfer_symbol_collision.py by @yihwang-nv in #10227
  • [None][infra] Waive failed cases again on 1/5 by @EmmaQiaoCh in #10403
  • [https://nvbugs/5715568][fix] Force to release torch memory when LLM is destroyed by @HuiGao-NV in #10314
  • [TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning. by @hyukn in #8531
  • [None][feat] update deepgemm to the DeepGEMM/nv_dev branch by @lfr-0531 in #9898
  • [TRTLLM-9381][test] add disag-serving kimi k2 thinking tests by @xinhe-nv in #10357
  • [#10374][fix] fixed race condition in AutoDeploy's mp tests port acquisition by @MrGeva in #10366
  • [TRTLLM-9465][fix] Swap TP-CP grouping order by @brb-nv in #10350
  • [None][perf] TRTLLM MoE maps to lower tuning buckets when ep>1 by @rosenrodt in #9998
  • [TRTLLM-10053][feat] AutoDeploy: Add Super v3 config file, improve test runtime by @galagam in #10397
  • [https://nvbugs/5772521][fix] Fix draft token tree chain crash by @mikeiovine in #10386
  • [https://nvbugs/5772414][fix] Fix draft token tree depth=1 corner case by @mikeiovine in #10385
  • [TRTLLM-9767][feat] Fixed recursive node traversals by @greg-kwasniewski1 in #10379
  • [TRTLLM-9551][infra] Partition test_llm_pytorch.py for parallel execution by @Superjomn in #10400
  • [https://nvbugs/5695984][fix] Unwaive llama3 eagle test by @mikeiovine in #10092
  • [https://nvbugs/5745152][fix] Unwaive gpt oss spec decode test by @mikeiovine in #10370
  • [#10170][fix] Add export patch for GraniteMoe MoE models to enable torch.export compatibility by @karthikvetrivel in #10169
  • [https://nvbugs/5777044][chore] Remove solved bugs from waives.txt by @SimengLiu-nv in #10422
  • [None][feat] precompiled installation from local src dir by @lucaslie in #10419
  • [TRTLLM-9527][feat] Add transferAgent binding (step 1) by @chuangz0 in #10113
  • [None][fix] Only Use Throughput Metrics to Check Regression by @chenfeiz0326 in #10404
  • [None][feat] add the eos tokens in generation config to stop words in the sampler by @JadoTu in #10389 (see the EOS sketch after this list)
  • [None][chore] Update SWA + spec dec support matrix by @mikeiovine in #10421
  • [None][feat] CuteDSL MOE FC1 Enhancement by @liyuhannnnn in #10088
  • [https://nvbugs/5726962][feat] Apply fusion for W4AFP8_AWQ MoE by @yumin066 in #9838
  • [#2511][fix] eagle: qwen2 capture hidden states by @XiaoXuan42 in #10091
  • [None][docs] Add --config preference over --extra_llm_api_options in CODING_GUIDELINES.md by @venkywonka in #10426
  • [#8460][feat] Revive and simplify Model Explorer visualization integration by @karthikvetrivel in #10150
  • [None][chore] unwaive qwen3 30b test by @kris1025 in #10115
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10384
  • [None][test] update test case constraint by @crazydemo in #10381
  • [https://nvbugs/5769926] [fix] Add no container mount home WAR by @kaiyux in #10431
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10367
  • [TRTLLM-9622][infra] Enable DGX_B300 multi-gpu testing in pre-merge pipeline by @yiqingy0 in #9699
  • [TRTLLM-9896][test] add vswa test cases coverage by @crazydemo in #10146
  • [None] [fix] Fix undefined tokens_per_block by @kaiyux in #10438
  • [https://nvbugs/5772361][ci] Unwaive tests that have been fixed by @2ez4bz in #10424
  • [https://nvbugs/5785206][infra] Waive TestQwen3_30B_A3B::test_fp8[latency-torch_compile=False]. by @bobboli in #10441
  • [None][infra] Waive failed cases on 1/6 by @EmmaQiaoCh in #10440
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10427
  • [https://nvbugs/5760726][fix] Use random port in container port section by @JunyiXu-nv in #10432
  • [None][chore] remove redundant retries while binding to arbitrary port by @reasonsolo in #10452
  • [https://nvbugs/5748600][ci] Unwaive disagg guided decoding test by @syuoni in #10409
  • [https://nvbugs/5749988][fix] Remove redundant qwen3 spec dec test by @mikeiovine in #10387
  • [None][feat] precompiled installation from local src dir with fnmatch only by @lucaslie in #10430
  • [https://nvbugs/5732942][fix] AutoDeploy: handle transformers 4.57.1 upgrade fixes by @lucaslie in #10466
  • [None] [feat] Add test script and raster M for gather fc1 kernel by @zongfeijing in #10429
  • [https://nvbugs/5721907][fix] AutoDeploy: improve numerical stability of flashinfer attention test by @lucaslie in #10467
  • [https://nvbugs/5698434][test] add qwen3-4b accuracy test case by @crazydemo in #10382
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10365
  • [https://nvbugs/5767223][feat] add pp support for DeepSeek-v3.2 by @lfr-0531 in #10449
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10457
  • [https://nvbugs/5726086][fix] update kimi-k2-1k1k dataset by @yingguo-trt in #10473
  • [#4745][fix] Pass lora_params through Qwen2/3 model forward by @karljang in #10174
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10474
  • [None][bug] fix export for microsoft/Phi-3-medium-128k-instruct by @tcherckez-nvidia in #10455
  • [None][test] Remove most TRT-backend test cases in llm_perf_nim.yml by @yufeiwu-nv in #10487
  • [None][doc] Adding parallelism types in feature combination matrix by @pcastonguay in #9849
  • [https://nvbugs/5781589][fix] Skip spec dec for non-last rank by @ziyixiong-nv in #10445
  • [https://nvbugs/5761665][fix] AutoDeploy: using Dim.DYNAMIC for robust dynamic shape by @lucaslie in #10511
  • [https://nvbugs/5707392][fix] unwaive test_fused_moe_fp8_blockwise_wide_ep[NotEnabled] by @xxi-nv in #10428
  • [TRTLLM-9661][chore] Further reduce tuning time for cuteDSL nvFP4 dense gemm. by @hyukn in #10339
  • [https://nvbugs/5784543][fix] Setup dist before using autotuner. by @yuxianq in #10491
  • [None][chore] Support multiple job submission at the same time by @yingguo-trt in #10492
  • [https://nvbugs/5747878][fix] unwaive llama4 scout tests by @lucaslie in #10468
  • [https://nvbugs/5775402][fix] Fix concurrency list in Wide-EP perf tests by @Barry-Delaney in #10529
  • [None][chore] Enable seg fault cases since one race condition is fixed by @HuiGao-NV in #10398
  • [None][doc] Update GPTOSS Doc by @dongfengy in #10536
  • [TRTLLM-9642][infra] Increase pytest verbosity for failed tests by @EmmaQiaoCh in #9657
  • [None][chore] Bump version to 1.2.0rc8 by @yiqingy0 in #10542
  • [None][fix] Mistral large 3 few code refine by @byshiue in #10405
  • [#10417][fix] AutoDeploy - Reverted to direct computation of minusA by @MrGeva in #10509
  • [None][feat] EPD for Qwen3 VL by @2ez4bz in #10470
  • [TRTLLM-9522][fix] broken cast by @ixlmar in #9975
  • [#10513][fix] AutoDeploy: removed self.mlp_type leftovers from last moe refactor by @MrGeva in #10512
  • [https://nvbugs/5740075][fix] Fix sm120 speculation by @mikeiovine in #10049
  • [None][chore] Waive tests blocking premerge 01/08 by @brb-nv in #10555
  • [None][fix] revert #10445. by @yuxianq in #10547
  • [None][test] restrict max_num_tokens in disagg mtp config by @ruodil in #10442
  • [None][chore] Add failed cases into waives.txt by @jieli-matrix in #10541
  • [None][fix] Setup dist for AutoTuner in Layerwise benchmarking. by @hyukn in #10534
  • [TRTLLM-9676][fix] Fix mamba_cache_manager when enabling cuda_graph_padding and let test cover this case by @JadoTu in #9873
  • [https://nvbugs/5785206][infra] unwaive the accuracy/test_llm_api_pytorch.py::TestQwen3_30B_A3B by @byshiue in #10560
  • [https://nvbugs/5787453][fix] Better align MLA chunking with indexer chunking when chunked prefill enabled for DSV32 by @chang-l in #10552
  • [https://nvbugs/5622938][feat] Run sample_async on extra stream. by @yuxianq in #10215 (see the side-stream sketch after this list)
  • [None][doc] blog: Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs by @lfr-0531 in #10565
  • [TRTLLM-9932][test] add kimi_k2 single node perf test by @ruodil in #10436
  • [None][chore] remove some model support; add device constraint by @jieli-matrix in #10563
  • [https://nvbugs/5756008][fix] unwaive test by @Tabrizian in #10523
  • [TRTLLM-10309] [feat] Optimize qk rope/nope concat for DSA by @kaiyux in #10571
  • [None][fix] Enable AttentionDP on Qwen3-VL and fix test by @yechank-nvidia in #10435
  • [None][feat] Add support for DeepSeek v3.2 tests by @yingguo-trt in #10561
  • [https://nvbugs/5752687][fix] Choose register model config over root config for VLM by @farazkh80 in #10553
  • [https://nvbugs/5628848][fix] Fix nanobind stub generation by @Linda-Stadter in #10516
  • [https://nvbugs/5548861][fix] AutoDeploy: Fix the test by @nvchenghaoz in #10521
  • [https://nvbugs/5669097][tests] Add MMMU test for mistral small by @2ez4bz in #10530
  • [None][chore] Update AutoDeploy model list by @tcherckez-nvidia in #10505
  • [None][chore] Fix Gitlab CI termination issues by @fredricz-20070104 in #10576
  • [None][chore] waive test case by @HuiGao-NV in #10581
  • [None][doc] Update Qwen3-Next doc by adding known issues section by @nv-guomingz in #10582
  • [None][ci] Workaround OCI-NRT slowdown issue by @chzblych in #10587
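
As an illustration of the EOS/stop-word change (#10389), the general idea is to fold every EOS id declared by the model's generation config into the sampler's stop set. A minimal sketch with Hugging Face's `GenerationConfig` (the ids and stop set below are made up):

```python
from transformers import GenerationConfig

# Toy config; real models ship their own, often with several EOS ids.
gen_cfg = GenerationConfig(eos_token_id=[50256, 50257])

eos = gen_cfg.eos_token_id              # may be None, an int, or a list
eos_ids = eos if isinstance(eos, list) else [eos]

stop_token_ids = {7, 11}                # hypothetical user-supplied stop tokens
stop_token_ids.update(i for i in eos_ids if i is not None)
print(sorted(stop_token_ids))           # sampler now also stops on EOS
```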

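The side-stream change (#10215) follows the usual CUDA overlap pattern: run sampling-like work on a secondary stream and fence it against producer and consumer. A generic PyTorch sketch (assumes a CUDA device; this is not TRT-LLM's sample_async code):

```python
import torch

main = torch.cuda.current_stream()
side = torch.cuda.Stream()

logits = torch.randn(8, 32000, device="cuda")  # produced on the main stream

side.wait_stream(main)                  # logits must be ready before sampling
with torch.cuda.stream(side):
    token = torch.argmax(logits, dim=-1)       # "sampling" on the side stream

main.wait_stream(side)                  # main-stream consumers wait for result
print(token.shape)
```
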
Full Changelog: v1.2.0rc7...v1.2.0rc8
