NVIDIA/TensorRT-LLM v1.2.0rc6

Pre-release

Highlights

  • Model Support

    • Add DeepSeek-V3.2 to the supported models (#9893)
    • Add llama4 scaling (#9771)
    • Support Mistral Large3 LLM part (#9820)
  • API

    • Add docs and examples for Responses API (#9946)
    • Add eos_token_id in generation_config to sampling params (#9514)
    • Update nspect version for api change (#9899)
  • Feature

    • 2D parallel EP TP support (#9459)
    • Fused kernels (qknormrope + moe routing) and two-model MTP support for glm4moe (#9852)
    • Add gather fc1 kernel by cuteDSL (#9618)
    • Add set_segment arg to slurm scripts, since GB300 does not support segments (#9731)
    • Add helixPostProcessNative kernel for cp_dim=2 (#9924)
    • Added symmetric memory AllReduce strategy (#8919)
    • ConfigurableMoE support (#9772, #9858)
    • Enable multistream for Linear Attention in Qwen3 (#9696)
    • Enable PDL for indexer topK (#9843)
    • Implement distributed tuning system (#9621)
    • Implement sampling on 1-model EAGLE3 (#9885)
    • Move D->H copies to a worker thread (#8463)
    • Optimize the host overhead of _sample_async (#9935)
    • Port fp4 quantization kernel optimization from FlashInfer (#9854)
    • Support larger topK for NVLinkOneSided AlltoAll. (#9816)
  • Fix

    • Fix CUDA stream sync issue in ModelRunnerCPP (#6426)
    • Fix accuracy issue in TRTLLM MoE (#9999)
    • Fix PDL in TRTLLM MOE for dsv3 (#9799)
    • Fix unterminated process issue for RemoteOpenAIServer (#9490)
    • Fix PDL bugs with trtllm-gen fmha kernels (#9863)
    • Use first PP rank's schedule result in other PP ranks to fix PP hang (#9659)
  • Documentation

    • Add config db and docs (#9420)
    • Update doc for NVFP4 KV cache (#9475)
    • Update documents for GB300 NVL72 (#9987)
    • Update wide EP documents (#9724)
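Among the feature highlights, the 2D parallel EP TP support (#9459) lays expert parallelism and tensor parallelism over a single 2D grid of ranks. As a rough sketch of how a flat rank can decompose into (ep_rank, tp_rank) coordinates in such a grid — the TP-fastest-varying layout here is an assumption for illustration, not necessarily TensorRT-LLM's actual mapping:

```python
def decompose_rank(rank, ep_size, tp_size):
    """Map a flat rank onto a 2D (ep, tp) grid, TP fastest-varying.
    Illustrative sketch only; the real mapping may differ."""
    assert 0 <= rank < ep_size * tp_size
    return rank // tp_size, rank % tp_size  # (ep_rank, tp_rank)

def tp_group(ep_rank, tp_size):
    """Ranks that share one expert-parallel shard and together form
    a tensor-parallel group under the layout above."""
    return [ep_rank * tp_size + t for t in range(tp_size)]
```

For example, with ep_size=2 and tp_size=4, flat rank 5 lands at (ep_rank=1, tp_rank=1), and the TP group for ep_rank 1 is ranks 4 through 7.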
What's Changed

  • [https://nvbugs/5703953][fix] Preserving ip:port for trtllm-serve before initializing llm by @JunyiXu-nv in #9646
  • [None][infra] Waive failed cases for main branch on 12/07 by @EmmaQiaoCh in #9769
  • [None][fix] Several minor fixes to CI setting by @chzblych in #9765
  • [OMNIML-3036][doc] Re-branding TensorRT-Model-Optimizer as Nvidia Model-Optimizer by @cjluo-nv in #9679
  • [None][feat] Enable NCCL_SYMMETRIC as default fallback for AllReduce by @nv-lschneider in #9314
  • [TRTLLM-9000][feat] Add multi-node Perf Tests into CI by @chenfeiz0326 in #8800
  • [None][test] add ntp tolerance in time metrics verification by @zhengd-nv in #9741
  • [TRTLLM-9603][feat] Enable ConfigurableMoE test in the CI by @xxi-nv in #9645
  • [https://nvbugs/5422621][test] Add GB 200 WIDEEP test case for RCCA 5422621 by @fredricz-20070104 in #9506
  • [None][fix] Fix two tuning cache miss issues. by @hyukn in #9743
  • [TRTLLM-9706] [doc] Update wide EP documents by @kaiyux in #9724
  • [https://nvbugs/5666804][test] only adding sampler config for limited models by @ruodil in #9512
  • [None][infra] Waive failed cases for main on 12/08 by @EmmaQiaoCh in #9773
  • [None][chore] Move the rocketkv e2e test to post-merge by @lfr-0531 in #9768
  • [None][chore] Enable tvm_ffi for cute dsl nvfp4_gemm to reduce host overhead. by @limin2021 in #9690
  • [TRTLLM-9431][perf] Enable multistream for Linear Attention in Qwen3-… by @nv-guomingz in #9696
  • [None][chore] Remove closed bugs by @xinhe-nv in #9770
  • [None][infra] update mooncake in docker images by @zhengd-nv in #9584
  • [None][test] Add Kimi k2 WIDEEP perf and accuracy cases by @fredricz-20070104 in #9686
  • [https://nvbugs/5527655][test] Add test case for RCCA 5527655 by @fredricz-20070104 in #9511
  • [http://nvbugs/5649010][fix] fix test_auto_scaling.py::test_worker_restart timeout by @reasonsolo in #9775
  • [None][fix] Switch AutoDeploy's default allreduce strategy to NCCL by @MrGeva in #9666
  • [TRTLLM-9506][fix] Fix AR for DeepSeek-R1 2 model path by @sunnyqgg in #9661
  • [TRTLLM-9089][chore] Port prepare_dataset into trtllm-bench by @FrankD412 in #9250
  • [https://nvbugs/5567586][feat] Ampere xqa swa specdec for GPT-OSS Eagle3-one-model by @jhaotingc in #8383
  • [TRTLLM-7967][chore] Add more tests by @yibinl-nvidia in #9415
  • [https://nvbugs/5508267][fix] Proper handling of inactive canceled requests by @thorjohnsen in #9280
  • [#8921][feat] Added symetric memory AllReduce strategy by @MrGeva in #8919
  • [None][fix] Fix #8383 introduced TRTLLM backend python error by @jhaotingc in #9804
  • [#9753][feat] AutoDeploy: Implement add rms_norm fusion by @nvchenghaoz in #9754
  • [None][infra] Correct the waived test names due to a merge conflict by @yuanjingx87 in #9803
  • [None][fix] Fix PDL in TRTLLM MOE for dsv3 by @dmtri35 in #9799
  • [None][feat] Add llama4 scaling by @byshiue in #9771
  • [https://nvbugs/5677746][fix] Use first PP rank's schedule result in other PP ranks to fix PP hang by @jiaganc in #9659
  • [None][fix] Fix unterminated process issue for RemoteOpenAIServer by @JunyiXu-nv in #9490
  • [https://nvbugs/5726066][infra] Waive timeout disaggregated/test_auto_scaling tests. by @bobboli in #9815
  • [None][chore] Fix tests failing on pre-merge 12/08 by @brb-nv in #9819
  • [https://nvbugs/5722653][fix] Fix config file used by disagg_client by @JunyiXu-nv in #9783
  • [TRTLLM-6537][chore] Shorten the time limit for dis-agg accuracy testing by @Shixiaowei02 in #9614
  • [None][infra] Use artifactory pypi mirror for Cython install by @ZhanruiSunCh in #9774
  • [TRTLLM-9794][ci] remove duplicated test cases in DGX B200 by @QiJune in #9817
  • [None][test] Refactor qa/llm_perf_nim.yml test list by @yufeiwu-nv in #9700
  • [None][chore] Generate lock file for release/1.2.0rc4.post1 branch automatically by @yiqingy0 in #9829
  • [None][fix] Additional model outputs for pipeline parallelism by @Funatiq in #9794
  • [TRTLLM-6756][feat] Update BeamSearch for TorchSampler by @stnie in #9660
  • [TRTLLM-9794][ci] move qwen3-next test cases to gb200 by @QiJune in #9827
  • [None][infra] Waive failed cases for main branch on 12/09 by @EmmaQiaoCh in #9839
  • [https://nvbugs/5575841] [fix] Nvbug 5575841: Remove additional test waivers for TestMoEFP4 by @DomBrown in #9788
  • [None][feat] Make 2-model spec dec use the 1-model kernels (Hopper) by @mikeiovine in #8810
  • [None][chore] Adding flaky auto scaling test to waives by @pcastonguay in #9851
  • [#8921][chore] AutoDeploy NanoV3 to use SYMM_MEM allreduce strategy by @MrGeva in #9797
  • [TRTINFRA-7328][infra] Consume SlurmCluster scratchPath and cleanup mounts by @mlefeb01 in #9600
  • [https://nvbugs/5688388][chore] Unwaiving fixed disagg test by @pcastonguay in #9800
  • [https://nvbugs/5719561][chore] Unwaive tests for nvbug 5719561 by @pcastonguay in #9801
  • [https://nvbugs/5508301][feat] Move D->H copies to a worker thread whe… by @dhansen-nvidia in #8463
  • [None][chore] Add unittest for otlp tracing by @zhanghaotong in #8716
  • [None][chore] Support larger topK for NVLinkOneSided AlltoAll. by @bobboli in #9816
  • [TRTLLM-9794][ci] move some deepseek test cases to gb200 by @QiJune in #9841
  • [TRTLLM-9661][fix] Fix nvfp4 gemm allowed backends arg passing by @hyukn in #9837
  • [https://nvbugs/5702791][fix] Unwaive fixed test by @dominicshanshan in #9844
  • [TRTLLM-9811][infra] Update urllib3 version >= 2.6.0 to fix high vulnerability issue by @ZhanruiSunCh in #9823
  • [None][chore] Enable L0 multi-gpus testing for Qwen3-next by @nv-guomingz in #9789
  • [https://nvbugs/5727952][fix] PDL bugs with trtllm-gen fmha kernels by @PerkzZheng in #9863
  • [None][infra] Fail fast if SLURM entrypoint fails by @mlefeb01 in #9744
  • [None][feat] Port fp4 quantization kernel optimization from FlashInfer by @bkryu in #9854
  • [TRTINFRA-7328][infra] - Move half B200 tests to lbd by @mlefeb01 in #9853
  • [None][fix] Fully resolve the tactic recovery issues in AutoTuner serialized cache by @hyukn in #9835
  • [None][chore] bump version to 1.2.0rc6 by @yiqingy0 in #9874
  • [TRTLLM-9228][infra] Verify thirdparty C++ process by @cheshirekow in #9367
  • [None][doc] Update doc for NVFP4 KV cache by @Tom-Zheng in #9475
  • [https://nvbugs/5601682][fix] Unwaiving disagg test by @pcastonguay in #9627
  • [None][chore] Add set_segment arg to slurm scripts by @fredricz-20070104 in #9731
  • [https://nvbugs/5582258][fix] unwaive by @bo-nv in #9650
  • [None][chore] Fix warning when capturing CUDA graph by @ziyixiong-nv in #9746
  • [https://nvbugs/5718004][fix] Add warmup for cancellation test by @JunyiXu-nv in #9860
  • [#2730][fix] Fix circular import bug in medusa/weight.py by @karljang in #9866
  • [None][feat] Enable PDL for indexer topK by @ChristinaZ in #9843
  • [TRTLLM-9685] [feat] Add gather fc1 kernel by cuteDSL by @zongfeijing in #9618
  • [None][chore] enable test_ipc.py by @Superjomn in #9865
  • [None][doc] Add DeepSeek-V3.2 to the supported models by @lfr-0531 in #9893
  • [TRTLLM-8959][feat] ConfigurableMoE support CUTLASS by @xxi-nv in #9772
  • [None] [feat] add eos_token_id in generation_config to sampling params by @JadoTu in #9514
  • [TRTLLM-9736][feat] AsyncLLM and verl integ by @hchings in #9353
  • [https://nvbugs/5597647][ci] Unwaive fixed tests. by @SimengLiu-nv in #9812
  • [TRTC-43] [feat] Add config db and docs by @venkywonka in #9420
  • [None][infra] Add workflow to auto-label 'waiting for feedback' on team comments by @karljang in #9886
  • [None][perf] Fix TPOT when min_tokens set by @jthomson04 in #9862
  • [None][infra] Ignore comments from bots and CI accounts by @karljang in #9929
  • [https://nvbugs/5727517][fix] Preserve ip:port for disagg by @JunyiXu-nv in #9859
  • [None][infra] update ucx to 1.20 by @chuangz0 in #9786
  • [TRTLLM-9717][fix] fix multi nodes tests cases by @xinhe-nv in #9736
  • [None][infra] Fix mergeWaiveList stage by @yiqingy0 in #9892
  • [https://nvbugs/5599176][fix] Unwaive fixed test for Ray by @dominicshanshan in #9861
  • [None][infra] update nspect version for api change by @niukuo in #9899
  • [None][infra] revert ucx to 1.19 by @chuangz0 in #9936
  • [TRTLLM-9792] [feat] Support multiple instances on single node for slurm scripts by @kaiyux in #9900
  • [TRTLLM-9262][test] add groupgemm ada case for rcca by @crazydemo in #9833
  • [#6425][fix] address CUDA stream sync issue in ModelRunnerCPP by @xsxszab in #6426
  • [None][infra] Replace the deprecated github token by @yuanjingx87 in #9915
  • [https://nvbugs/5736923][infra] Waive timeout disaggregated/test_auto_scaling[http-round_robin] test by @yihwang-nv in #9942
  • [None][chore] Modify python ipc_util to align with C++ path by @yufeiwu-nv in #9894
  • [None][chore] unwaive qwen3 accuracy test by @kris1025 in #9895
  • [https://nvbugs/5727481][ci] Fix Port Conflict in Perf-Sanity CI Test by @chenfeiz0326 in #9896
  • [None][test] fix a typo in model name in script by @ruodil in #9867
  • [None][chore] Degrade log level in cublas fp4 runner when using default configs by @hyukn in #9951
  • [None][feat] AutoDeploy: prepare_metadata revisited by @lucaslie in #9764
  • [None][feat] Upgrade NIXL to v0.8.0 by @zackyoray in #9707
  • [TRTLLM-5972][chore] Load balance decode token KV cache with helix parallelism by @brb-nv in #9757
  • [None][fix] Introduce inline namespace to avoid symbol collision by @yihwang-nv in #9541
  • [TRTLLM-9637][feat] Support tool parser for Kimi K2 by @JunyiXu-nv in #9830
  • [https://nvbugs/5643787][fix] remove the war path for notify to itself by @chuangz0 in #9834
  • [https://nvbugs/5716787][fix] terminate nixl running when exiting by @chuangz0 in #9785
  • [None][feat] Async pp send. by @yuxianq in #9952
  • [None][infra] Remove generate lockfile schedule for 1.2.0rc4.post1 branch by @yuanjingx87 in #9945
  • [https://nvbugs/4141427][chore] Add more details to LICENSE file by @tburt-nv in #9881
  • [TRTLLM-9493][feat] Add helixPostProcessNative kernel for cp_dim=2 by @brb-nv in #9924
  • [None][feat] spark cublas LUT table for llama-8b-bf16 perf by @farazkh80 in #9811
  • [None][feat] Support Mistral Large3 LLM part by @byshiue in #9820
  • [TRTLLM-9784][fix] Resolve port conflicts by @shuyixiong in #9780
  • [TRTLLM-9468][chore] Update disagg benchmarking scripts to support context parallelism by @brb-nv in #9720
  • [TRTLLM-9738][chore] Guard accuracy with nccl allreduce strategy by @shuyixiong in #9793
  • [https://nvbugs/5720482][fix] Fix test rpc streaming by @Superjomn in #9902
  • [None][feat] Graceful Error Handling for Guided Decoder by @jellysnack in #9078
  • [None][feat] Implement sampling on 1-model EAGLE3 by @mikeiovine in #9885
  • [None][chore] Add namespace to header to fix tot failure by @farazkh80 in #9973
  • [None][feat] Fused kernels (qknormrope + moe routing) and two-model MTP support for glm4moe by @nvxuanyuc in #9852
  • [None][fix] disable async pp send for ray cases. by @yuxianq in #9959
  • [https://nvbugs/5666816][fix] Unwaive llama3 eagle3 test by @mikeiovine in #9964
  • [None][infra] Delete container before attempting import by @mlefeb01 in #9967
  • [None][infra] Waive failed tests for main branch on 12/14 by @EmmaQiaoCh in #9982
  • [TRTLLM-9493][noop] Refactor fusedMoeCommKernels to enable code sharing by @brb-nv in #9922
  • [TRTLLM-9601][feat] Expose mmKeys for multimodal to integrate with dynamo. by @SimengLiu-nv in #9604
  • [None] [chore] Comments cleanup by @zongfeijing in #9978
  • [None][fix] Fix regex pattern for cubin filtering by @rosenrodt in #9914
  • [https://nvbugs/5580297][fix] Skip capture request error test from Ray stage by @dominicshanshan in #9947
  • [None][doc] update readme for rpc by @Superjomn in #9972
  • [TRTLLM-8961][feat] ConfigurableMoE support DeepGemm by @xxi-nv in #9858
  • [TRTLLM-9794][ci] move test cases of gpt-oss to gb200 by @QiJune in #9934
  • [TRTLLM-9762] [doc] Update documents for GB300 NVL72 by @kaiyux in #9987
  • [TRTLLM-9416][feat] Skip DS-v3.2 indexer MQA and Top-K for short sequences. by @lfr-0531 in #9524
  • [https://nvbugs/5669114][fix] Switch to MMMU benchmark for Gemma3 27B by @brb-nv in #9966
  • [https://nvbugs/5741060][chore] Waive all pg operator tests by @shuyixiong in #9991
  • [TRTLLM-9854][feat] Optimize the host overhead of _sample_async by @ziyixiong-nv in #9935
  • [TRTLLM-9860][doc] Add docs and examples for Responses API by @JunyiXu-nv in #9946
  • [None][feat] Async pp send for PPCommTorch. by @yuxianq in #9976
  • [https://nvbugs/5655885][fix] fix invalid instruction error in 2shot ar kernel on Ampere by @yilin-void in #9394
  • [None] [fix] Fix nsys_on argument for slurm scripts by @kaiyux in #9995
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #9941
  • [None][infra] Add multi gpu Ray tests into L0 merge change request list. by @dominicshanshan in #9996
  • [TRTLLM-9136][feat] 2D parallel EP TP support by @greg-kwasniewski1 in #9459
  • [None][infra] Fully waive test_worker_restart test_disagg_server_restart. by @bobboli in #9988
  • [https://nvbugs/5661741][fix] Fix accuracy issue in TRTLLM MoE introduced in #9377 by @rosenrodt in #9999
  • [None] [fix] Fix slurm scripts by @kaiyux in #10007
  • [TRTLLM-9615][feat] Implement a distributed tuning system by @hyukn in #9621
  • [None][feat] Update reasoning parser for nano-v3 by @Wanli-Jiang in #9944
  • [https://nvbugs/5540979][fix] Potential fix for 5540979 by @arekay-nv in #9716
  • [None][infra] Update ucx to 1.20.x by @zackyoray in #9977
  • [None] [fix] Revert "[None] [feat] add eos_token_id in generation_config to sampling params" by @kaiyux in #10002
  • [None][feat] disable fused gemm for sm121 by @farazkh80 in #9916
  • [None][infra] Waive failed tests for main branch on 12/15 by @EmmaQiaoCh in #10001
  • [https://nvbugs/5673559][fix] Unwaiving disagg test for nvbug 5673559 by @pcastonguay in #9957

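One fix above is worth a sketch: the PP hang fix (#9659) stops pipeline-parallel ranks from scheduling requests independently (and potentially diverging); only the first PP rank schedules, and the other ranks adopt its result. A toy illustration, where `bcast_q` stands in for a real inter-rank broadcast and the scheduling policy is made up for the example:

```python
import queue

def pp_schedule(rank, requests, bcast_q):
    """Sketch of the first-PP-rank scheduling idea from #9659: rank 0
    decides which requests run this step, and every other rank reuses
    that decision so all ranks stay in lockstep. Illustrative only;
    `bcast_q` stands in for a real broadcast collective."""
    if rank == 0:
        scheduled = [r for r in requests if r["ready"]]  # toy policy
        bcast_q.put(scheduled)   # publish rank 0's decision
        return scheduled
    return bcast_q.get()         # follow rank 0 instead of re-scheduling
```

Because every rank processes the same schedule, no rank can wait on a micro-batch its peers never launched, which is the hang the fix addresses.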
Full Changelog: v1.2.0rc5...v1.2.0rc6
