NVIDIA/TensorRT-LLM v1.2.0rc6

Pre-release

Highlights

  • Model Support

    • Add DeepSeek-V3.2 to the supported models (#9893)
    • Add llama4 scaling (#9771)
    • Support Mistral Large3 LLM part (#9820)
  • API

    • Add docs and examples for Responses API (#9946)
    • Add eos_token_id in generation_config to sampling params (#9514)
    • Update nspect version for api change (#9899)
  • Feature

    • 2D parallel EP TP support (#9459)
    • Fused kernels (qknormrope + moe routing) and two-model MTP support for glm4moe (#9852)
    • Add gather fc1 kernel by cuteDSL (#9618)
    • Add set_segment arg to slurm scripts, since GB300 does not support segments (#9731)
    • Add helixPostProcessNative kernel for cp_dim=2 (#9924)
    • Added symmetric memory AllReduce strategy (#8919)
    • ConfigurableMoE support (#9772, #9858)
    • Enable multistream for Linear Attention in Qwen3 (#9696)
    • Enable PDL for indexer topK (#9843)
    • Implement distributed tuning system (#9621)
    • Implement sampling on 1-model EAGLE3 (#9885)
    • Move D->H copies to a worker thread (#8463)
    • Optimize the host overhead of _sample_async (#9935)
    • Port fp4 quantization kernel optimization from FlashInfer (#9854)
    • Support larger topK for NVLinkOneSided AlltoAll. (#9816)
  • Fix

    • Fix CUDA stream sync issue in ModelRunnerCPP (#6426)
    • Fix accuracy issue in TRTLLM MoE (#9999)
    • Fix PDL in TRTLLM MOE for dsv3 (#9799)
    • Fix unterminated process issue for RemoteOpenAIServer (#9490)
    • Fix PDL bugs with trtllm-gen fmha kernels (#9863)
    • Use first PP rank's schedule result in other PP ranks to fix PP hang (#9659)
  • Documentation

    • Add config db and docs (#9420)
    • Update doc for NVFP4 KV cache (#9475)
    • Update documents for GB300 NVL72 (#9987)
    • Update wide EP documents (#9724)
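Among the feature highlights, the 2D parallel EP TP support (#9459) lays expert parallelism and tensor parallelism over a single 2D grid of ranks. As a rough sketch of how a flat rank can decompose into (ep_rank, tp_rank) coordinates in such a grid — the TP-fastest-varying layout here is an assumption for illustration, not necessarily TensorRT-LLM's actual mapping:

```python
def decompose_rank(rank, ep_size, tp_size):
    """Map a flat rank onto a 2D (ep, tp) grid, TP fastest-varying.
    Illustrative sketch only; the real mapping may differ."""
    assert 0 <= rank < ep_size * tp_size
    return rank // tp_size, rank % tp_size  # (ep_rank, tp_rank)

def tp_group(ep_rank, tp_size):
    """Ranks that share one expert-parallel shard and together form
    a tensor-parallel group under the layout above."""
    return [ep_rank * tp_size + t for t in range(tp_size)]
```

For example, with ep_size=2 and tp_size=4, flat rank 5 lands at (ep_rank=1, tp_rank=1), and the TP group for ep_rank 1 is ranks 4 through 7.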
What's Changed

  • [https://nvbugs/5703953][fix] Preserving ip:port for trtllm-serve before initializing llm by @JunyiXu-nv in #9646
  • [None][infra] Waive failed cases for main branch on 12/07 by @EmmaQiaoCh in #9769
  • [None][fix] Several minor fixes to CI setting by @chzblych in #9765
  • [OMNIML-3036][doc] Re-branding TensorRT-Model-Optimizer as Nvidia Model-Optimizer by @cjluo-nv in #9679
  • [None][feat] Enable NCCL_SYMMETRIC as default fallback for AllReduce by @nv-lschneider in #9314
  • [TRTLLM-9000][feat] Add multi-node Perf Tests into CI by @chenfeiz0326 in #8800
  • [None][test] add ntp tolerance in time metrics verification by @zhengd-nv in #9741
  • [TRTLLM-9603][feat] Enable ConfigurableMoE test in the CI by @xxi-nv in #9645
  • [https://nvbugs/5422621][test] Add GB 200 WIDEEP test case for RCCA 5422621 by @fredricz-20070104 in #9506
  • [None][fix] Fix two tuning cache miss issues. by @hyukn in #9743
  • [TRTLLM-9706] [doc] Update wide EP documents by @kaiyux in #9724
  • [https://nvbugs/5666804][test] only adding sampler config for limited models by @ruodil in #9512
  • [None][infra] Waive failed cases for main on 12/08 by @EmmaQiaoCh in #9773
  • [None][chore] Move the rocketkv e2e test to post-merge by @lfr-0531 in #9768
  • [None][chore] Enable tvm_ffi for cute dsl nvfp4_gemm to reduce host overhead. by @limin2021 in #9690
  • [TRTLLM-9431][perf] Enable multistream for Linear Attention in Qwen3-… by @nv-guomingz in #9696
  • [None][chore] Remove closed bugs by @xinhe-nv in #9770
  • [None][infra] update mooncake in docker images by @zhengd-nv in #9584
  • [None][test] Add Kimi k2 WIDEEP perf and accuracy cases by @fredricz-20070104 in #9686
  • [https://nvbugs/5527655][test] Add test case for RCCA 5527655 by @fredricz-20070104 in #9511
  • [http://nvbugs/5649010][fix] fix test_auto_scaling.py::test_worker_restart timeout by @reasonsolo in #9775
  • [None][fix] Switch AutoDeploy's default allreduce strategy to NCCL by @MrGeva in #9666
  • [TRTLLM-9506][fix] Fix AR for DeepSeek-R1 2 model path by @sunnyqgg in #9661
  • [TRTLLM-9089][chore] Port prepare_dataset into trtllm-bench by @FrankD412 in #9250
  • [https://nvbugs/5567586][feat] Ampere xqa swa specdec for GPT-OSS Eagle3-one-model by @jhaotingc in #8383
  • [TRTLLM-7967][chore] Add more tests by @yibinl-nvidia in #9415
  • [https://nvbugs/5508267][fix] Proper handling of inactive canceled requests by @thorjohnsen in #9280
  • [#8921][feat] Added symetric memory AllReduce strategy by @MrGeva in #8919
  • [None][fix] Fix #8383 introduced TRTLLM backend python error by @jhaotingc in #9804
  • [#9753][feat] AutoDeploy: Implement add rms_norm fusion by @nvchenghaoz in #9754
  • [None][infra] Correct the waived test names due to a merge conflict by @yuanjingx87 in #9803
  • [None][fix] Fix PDL in TRTLLM MOE for dsv3 by @dmtri35 in #9799
  • [None][feat] Add llama4 scaling by @byshiue in #9771
  • [https://nvbugs/5677746][fix] Use first PP rank's schedule result in other PP ranks to fix PP hang by @jiaganc in #9659
  • [None][fix] Fix unterminated process issue for RemoteOpenAIServer by @JunyiXu-nv in #9490
  • [https://nvbugs/5726066][infra] Waive timeout disaggregated/test_auto_scaling tests. by @bobboli in #9815
  • [None][chore] Fix tests failing on pre-merge 12/08 by @brb-nv in #9819
  • [https://nvbugs/5722653][fix] Fix config file used by disagg_client by @JunyiXu-nv in #9783
  • [TRTLLM-6537][chore] Shorten the time limit for dis-agg accuracy testing by @Shixiaowei02 in #9614
  • [None][infra] Use artifactory pypi mirror for Cython install by @ZhanruiSunCh in #9774
  • [TRTLLM-9794][ci] remove duplicated test cases in DGX B200 by @QiJune in #9817
  • [None][test] Refactor qa/llm_perf_nim.yml test list by @yufeiwu-nv in #9700
  • [None][chore] Generate lock file for release/1.2.0rc4.post1 branch automatically by @yiqingy0 in #9829
  • [None][fix] Additional model outputs for pipeline parallelism by @Funatiq in #9794
  • [TRTLLM-6756][feat] Update BeamSearch for TorchSampler by @stnie in #9660
  • [TRTLLM-9794][ci] move qwen3-next test cases to gb200 by @QiJune in #9827
  • [None][infra] Waive failed cases for main branch on 12/09 by @EmmaQiaoCh in #9839
  • [https://nvbugs/5575841] [fix] Nvbug 5575841: Remove additional test waivers for TestMoEFP4 by @DomBrown in #9788
  • [None][feat] Make 2-model spec dec use the 1-model kernels (Hopper) by @mikeiovine in #8810
  • [None][chore] Adding flaky auto scaling test to waives by @pcastonguay in #9851
  • [#8921][chore] AutoDeploy NanoV3 to use SYMM_MEM allreduce strategy by @MrGeva in #9797
  • [TRTINFRA-7328][infra] Consume SlurmCluster scratchPath and cleanup mounts by @mlefeb01 in #9600
  • [https://nvbugs/5688388][chore] Unwaiving fixed disagg test by @pcastonguay in #9800
  • [https://nvbugs/5719561][chore] Unwaive tests for nvbug 5719561 by @pcastonguay in #9801
  • [https://nvbugs/5508301][feat] Move D->H copies to a worker thread whe… by @dhansen-nvidia in #8463
  • [None][chore] Add unittest for otlp tracing by @zhanghaotong in #8716
  • [None][chore] Support larger topK for NVLinkOneSided AlltoAll. by @bobboli in #9816
  • [TRTLLM-9794][ci] move some deepseek test cases to gb200 by @QiJune in #9841
  • [TRTLLM-9661][fix] Fix nvfp4 gemm allowed backends arg passing by @hyukn in #9837
  • [https://nvbugs/5702791][fix] Unwaive fixed test by @dominicshanshan in #9844
  • [TRTLLM-9811][infra] Update urllib3 version >= 2.6.0 to fix high vulnerability issue by @ZhanruiSunCh in #9823
  • [None][chore] Enable L0 multi-gpus testing for Qwen3-next by @nv-guomingz in #9789
  • [https://nvbugs/5727952][fix] PDL bugs with trtllm-gen fmha kernels by @PerkzZheng in #9863
  • [None][infra] Fail fast if SLURM entrypoint fails by @mlefeb01 in #9744
  • [None][feat] Port fp4 quantization kernel optimization from FlashInfer by @bkryu in #9854
  • [TRTINFRA-7328][infra] - Move half B200 tests to lbd by @mlefeb01 in #9853
  • [None][fix] Fully resolve the tactic recovery issues in AutoTuner serialized cache by @hyukn in #9835
  • [None][chore] bump version to 1.2.0rc6 by @yiqingy0 in #9874
  • [TRTLLM-9228][infra] Verify thirdparty C++ process by @cheshirekow in #9367
  • [None][doc] Update doc for NVFP4 KV cache by @Tom-Zheng in #9475
  • [https://nvbugs/5601682][fix] Unwaiving disagg test by @pcastonguay in #9627
  • [None][chore] Add set_segment arg to slurm scripts by @fredricz-20070104 in #9731
  • [https://nvbugs/5582258][fix] unwaive by @bo-nv in #9650
  • [None][chore] Fix warning when capturing CUDA graph by @ziyixiong-nv in #9746
  • [https://nvbugs/5718004][fix] Add warmup for cancellation test by @JunyiXu-nv in #9860
  • [#2730][fix] Fix circular import bug in medusa/weight.py by @karljang in #9866
  • [None][feat] Enable PDL for indexer topK by @ChristinaZ in #9843
  • [TRTLLM-9685] [feat] Add gather fc1 kernel by cuteDSL by @zongfeijing in #9618
  • [None][chore] enable test_ipc.py by @Superjomn in #9865
  • [None][doc] Add DeepSeek-V3.2 to the supported models by @lfr-0531 in #9893
  • [TRTLLM-8959][feat] ConfigurableMoE support CUTLASS by @xxi-nv in #9772
  • [None] [feat] add eos_token_id in generation_config to sampling params by @JadoTu in #9514
  • [TRTLLM-9736][feat] AsyncLLM and verl integ by @hchings in #9353
  • [https://nvbugs/5597647][ci] Unwaive fixed tests. by @SimengLiu-nv in #9812
  • [TRTC-43] [feat] Add config db and docs by @venkywonka in #9420
  • [None][infra] Add workflow to auto-label 'waiting for feedback' on team comments by @karljang in #9886
  • [None][perf] Fix TPOT when min_tokens set by @jthomson04 in #9862
  • [None][infra] Ignore comments from bots and CI accounts by @karljang in #9929
  • [https://nvbugs/5727517][fix] Preserve ip:port for disagg by @JunyiXu-nv in #9859
  • [None][infra] update ucx to 1.20 by @chuangz0 in #9786
  • [TRTLLM-9717][fix] fix multi nodes tests cases by @xinhe-nv in #9736
  • [None][infra] Fix mergeWaiveList stage by @yiqingy0 in #9892
  • [https://nvbugs/5599176][fix] Unwaive fixed test for Ray by @dominicshanshan in #9861
  • [None][infra] update nspect version for api change by @niukuo in #9899
  • [None][infra] revert ucx to 1.19 by @chuangz0 in #9936
  • [TRTLLM-9792] [feat] Support multiple instances on single node for slurm scripts by @kaiyux in #9900
  • [TRTLLM-9262][test] add groupgemm ada case for rcca by @crazydemo in #9833
  • [#6425][fix] address CUDA stream sync issue in ModelRunnerCPP by @xsxszab in #6426
  • [None][infra] Replace the deprecated github token by @yuanjingx87 in #9915
  • [https://nvbugs/5736923][infra] Waive timeout disaggregated/test_auto_scaling[http-round_robin] test by @yihwang-nv in #9942
  • [None][chore] Modify python ipc_util to align with C++ path by @yufeiwu-nv in #9894
  • [None][chore] unwaive qwen3 accuracy test by @kris1025 in #9895
  • [https://nvbugs/5727481][ci] Fix Port Conflict in Perf-Sanity CI Test by @chenfeiz0326 in #9896
  • [None][test] fix a typo in model name in script by @ruodil in #9867
  • [None][chore] Degrade log level in cublas fp4 runner when using default configs by @hyukn in #9951
  • [None][feat] AutoDeploy: prepare_metadata revisited by @lucaslie in #9764
  • [None][feat] Upgrade NIXL to v0.8.0 by @zackyoray in #9707
  • [TRTLLM-5972][chore] Load balance decode token KV cache with helix parallelism by @brb-nv in #9757
  • [None][fix] Introduce inline namespace to avoid symbol collision by @yihwang-nv in #9541
  • [TRTLLM-9637][feat] Support tool parser for Kimi K2 by @JunyiXu-nv in #9830
  • [https://nvbugs/5643787][fix] remove the war path for notify to itself by @chuangz0 in #9834
  • [https://nvbugs/5716787][fix] terminate nixl running when exiting by @chuangz0 in #9785
  • [None][feat] Async pp send. by @yuxianq in #9952
  • [None][infra] Remove generate lockfile schedule for 1.2.0rc4.post1 branch by @yuanjingx87 in #9945
  • [https://nvbugs/4141427][chore] Add more details to LICENSE file by @tburt-nv in #9881
  • [TRTLLM-9493][feat] Add helixPostProcessNative kernel for cp_dim=2 by @brb-nv in #9924
  • [None][feat] spark cublas LUT table for llama-8b-bf16 perf by @farazkh80 in #9811
  • [None][feat] Support Mistral Large3 LLM part by @byshiue in #9820
  • [TRTLLM-9784][fix] Resolve port conflicts by @shuyixiong in #9780
  • [TRTLLM-9468][chore] Update disagg benchmarking scripts to support context parallelism by @brb-nv in #9720
  • [TRTLLM-9738][chore] Guard accuracy with nccl allreduce strategy by @shuyixiong in #9793
  • [https://nvbugs/5720482][fix] Fix test rpc streaming by @Superjomn in #9902
  • [None][feat] Graceful Error Handling for Guided Decoder by @jellysnack in #9078
  • [None][feat] Implement sampling on 1-model EAGLE3 by @mikeiovine in #9885
  • [None][chore] Add namespace to header to fix tot failure by @farazkh80 in #9973
  • [None][feat] Fused kernels (qknormrope + moe routing) and two-model MTP support for glm4moe by @nvxuanyuc in #9852
  • [None][fix] disable async pp send for ray cases. by @yuxianq in #9959
  • [https://nvbugs/5666816][fix] Unwaive llama3 eagle3 test by @mikeiovine in #9964
  • [None][infra] Delete container before attempting import by @mlefeb01 in #9967
  • [None][infra] Waive failed tests for main branch on 12/14 by @EmmaQiaoCh in #9982
  • [TRTLLM-9493][noop] Refactor fusedMoeCommKernels to enable code sharing by @brb-nv in #9922
  • [TRTLLM-9601][feat] Expose mmKeys for multimodal to integrate with dynamo. by @SimengLiu-nv in #9604
  • [None] [chore] Comments cleanup by @zongfeijing in #9978
  • [None][fix] Fix regex pattern for cubin filtering by @rosenrodt in #9914
  • [https://nvbugs/5580297][fix] Skip capture request error test from Ray stage by @dominicshanshan in #9947
  • [None][doc] update readme for rpc by @Superjomn in #9972
  • [TRTLLM-8961][feat] ConfigurableMoE support DeepGemm by @xxi-nv in #9858
  • [TRTLLM-9794][ci] move test cases of gpt-oss to gb200 by @QiJune in #9934
  • [TRTLLM-9762] [doc] Update documents for GB300 NVL72 by @kaiyux in #9987
  • [TRTLLM-9416][feat] Skip DS-v3.2 indexer MQA and Top-K for short sequences. by @lfr-0531 in #9524
  • [https://nvbugs/5669114][fix] Switch to MMMU benchmark for Gemma3 27B by @brb-nv in #9966
  • [https://nvbugs/5741060][chore] Waive all pg operator tests by @shuyixiong in #9991
  • [TRTLLM-9854][feat] Optimize the host overhead of _sample_async by @ziyixiong-nv in #9935
  • [TRTLLM-9860][doc] Add docs and examples for Responses API by @JunyiXu-nv in #9946
  • [None][feat] Async pp send for PPCommTorch. by @yuxianq in #9976
  • [https://nvbugs/5655885][fix] fix invalid instruction error in 2shot ar kernel on Ampere by @yilin-void in #9394
  • [None] [fix] Fix nsys_on argument for slurm scripts by @kaiyux in #9995
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #9941
  • [None][infra] Add multi gpu Ray tests into L0 merge change request list. by @dominicshanshan in #9996
  • [TRTLLM-9136][feat] 2D parallel EP TP support by @greg-kwasniewski1 in #9459
  • [None][infra] Fully waive test_worker_restart test_disagg_server_restart. by @bobboli in #9988
  • [https://nvbugs/5661741][fix] Fix accuracy issue in TRTLLM MoE introduced in #9377 by @rosenrodt in #9999
  • [None] [fix] Fix slurm scripts by @kaiyux in #10007
  • [TRTLLM-9615][feat] Implement a distributed tuning system by @hyukn in #9621
  • [None][feat] Update reasoning parser for nano-v3 by @Wanli-Jiang in #9944
  • [https://nvbugs/5540979][fix] Potential fix for 5540979 by @arekay-nv in #9716
  • [None][infra] Update ucx to 1.20.x by @zackyoray in #9977
  • [None] [fix] Revert "[None] [feat] add eos_token_id in generation_config to sampling params" by @kaiyux in #10002
  • [None][feat] disable fused gemm for sm121 by @farazkh80 in #9916
  • [None][infra] Waive failed tests for main branch on 12/15 by @EmmaQiaoCh in #10001
  • [https://nvbugs/5673559][fix] Unwaiving disagg test for nvbug 5673559 by @pcastonguay in #9957

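One fix above is worth a sketch: the PP hang fix (#9659) stops pipeline-parallel ranks from scheduling requests independently (and potentially diverging); only the first PP rank schedules, and the other ranks adopt its result. A toy illustration, where `bcast_q` stands in for a real inter-rank broadcast and the scheduling policy is made up for the example:

```python
import queue

def pp_schedule(rank, requests, bcast_q):
    """Sketch of the first-PP-rank scheduling idea from #9659: rank 0
    decides which requests run this step, and every other rank reuses
    that decision so all ranks stay in lockstep. Illustrative only;
    `bcast_q` stands in for a real broadcast collective."""
    if rank == 0:
        scheduled = [r for r in requests if r["ready"]]  # toy policy
        bcast_q.put(scheduled)   # publish rank 0's decision
        return scheduled
    return bcast_q.get()         # follow rank 0 instead of re-scheduling
```

Because every rank processes the same schedule, no rank can wait on a micro-batch its peers never launched, which is the hang the fix addresses.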
Full Changelog: v1.2.0rc5...v1.2.0rc6
