Highlights
Model Support
API
Feature
- 2D parallel EP TP support (#9459)
- Fused kernels (qknormrope + moe routing) and two-model MTP support for glm4moe (#9852)
- Add gather FC1 kernel implemented with CuTe DSL (#9618)
- Add GB300 support to Slurm scripts via a set_segment argument, since GB300 does not support segment (#9731)
- Add helixPostProcessNative kernel for cp_dim=2 (#9924)
- Add symmetric memory AllReduce strategy (#8919)
- ConfigurableMoE support (#9772, #9858)
- Enable multistream for Linear Attention in Qwen3 (#9696)
- Enable PDL for indexer topK (#9843)
- Implement a distributed tuning system (#9621)
- Implement sampling on 1-model EAGLE3 (#9885)
- Move D->H copies to a worker thread (#8463)
- Optimize the host overhead of _sample_async (#9935)
- Port fp4 quantization kernel optimization from FlashInfer (#9854)
- Support larger topK for NVLinkOneSided AlltoAll (#9816)
Fix
- Fix CUDA stream sync issue in ModelRunnerCPP (#6426)
- Fix accuracy issue in TRTLLM MoE (#9999)
- Fix PDL in TRTLLM MoE for DeepSeek-V3 (#9799)
- Fix unterminated process issue for RemoteOpenAIServer (#9490)
- Fix PDL bugs with trtllm-gen fmha kernels (#9863)
- Use first PP rank's schedule result in other PP ranks to fix PP hang (#9659)
Documentation
Test & Infra
What's Changed
- [https://nvbugs/5703953][fix] Preserving ip:port for trtllm-serve before initializing llm by @JunyiXu-nv in #9646
- [None][infra] Waive failed cases for main branch on 12/07 by @EmmaQiaoCh in #9769
- [None][fix] Several minor fixes to CI setting by @chzblych in #9765
- [OMNIML-3036][doc] Re-branding TensorRT-Model-Optimizer as Nvidia Model-Optimizer by @cjluo-nv in #9679
- [None][feat] Enable NCCL_SYMMETRIC as default fallback for AllReduce by @nv-lschneider in #9314
- [TRTLLM-9000][feat] Add multi-node Perf Tests into CI by @chenfeiz0326 in #8800
- [None][test] add ntp tolerance in time metrics verification by @zhengd-nv in #9741
- [TRTLLM-9603][feat] Enable ConfigurableMoE test in the CI by @xxi-nv in #9645
- [https://nvbugs/5422621][test] Add GB 200 WIDEEP test case for RCCA 5422621 by @fredricz-20070104 in #9506
- [None][fix] Fix two tuning cache miss issues. by @hyukn in #9743
- [TRTLLM-9706] [doc] Update wide EP documents by @kaiyux in #9724
- [https://nvbugs/5666804][test] only adding sampler config for limited models by @ruodil in #9512
- [None][infra] Waive failed cases for main on 12/08 by @EmmaQiaoCh in #9773
- [None][chore] Move the rocketkv e2e test to post-merge by @lfr-0531 in #9768
- [None][chore] Enable tvm_ffi for cute dsl nvfp4_gemm to reduce host overhead. by @limin2021 in #9690
- [TRTLLM-9431][perf] Enable multistream for Linear Attention in Qwen3-… by @nv-guomingz in #9696
- [None][chore] Remove closed bugs by @xinhe-nv in #9770
- [None][infra] update mooncake in docker images by @zhengd-nv in #9584
- [None][test] Add Kimi k2 WIDEEP perf and accuracy cases by @fredricz-20070104 in #9686
- [https://nvbugs/5527655][test] Add test case for RCCA 5527655 by @fredricz-20070104 in #9511
- [http://nvbugs/5649010][fix] fix test_auto_scaling.py::test_worker_restart timeout by @reasonsolo in #9775
- [None][fix] Switch AutoDeploy's default allreduce strategy to NCCL by @MrGeva in #9666
- [TRTLLM-9506][fix] Fix AR for DeepSeek-R1 2 model path by @sunnyqgg in #9661
- [TRTLLM-9089][chore] Port prepare_dataset into trtllm-bench by @FrankD412 in #9250
- [https://nvbugs/5567586][feat] Ampere xqa swa specdec for GPT-OSS Eagle3-one-model by @jhaotingc in #8383
- [TRTLLM-7967][chore] Add more tests by @yibinl-nvidia in #9415
- [https://nvbugs/5508267][fix] Proper handling of inactive canceled requests by @thorjohnsen in #9280
- [#8921][feat] Added symmetric memory AllReduce strategy by @MrGeva in #8919
- [None][fix] Fix #8383 introduced TRTLLM backend python error by @jhaotingc in #9804
- [#9753][feat] AutoDeploy: Implement add rms_norm fusion by @nvchenghaoz in #9754
- [None][infra] Correct the waived test names due to a merge conflict by @yuanjingx87 in #9803
- [None][fix] Fix PDL in TRTLLM MOE for dsv3 by @dmtri35 in #9799
- [None][feat] Add llama4 scaling by @byshiue in #9771
- [https://nvbugs/5677746][fix] Use first PP rank's schedule result in other PP ranks to fix PP hang by @jiaganc in #9659
- [None][fix] Fix unterminated process issue for RemoteOpenAIServer by @JunyiXu-nv in #9490
- [https://nvbugs/5726066][infra] Waive timeout disaggregated/test_auto_scaling tests. by @bobboli in #9815
- [None][chore] Fix tests failing on pre-merge 12/08 by @brb-nv in #9819
- [https://nvbugs/5722653][fix] Fix config file used by disagg_client by @JunyiXu-nv in #9783
- [TRTLLM-6537][chore] Shorten the time limit for dis-agg accuracy testing by @Shixiaowei02 in #9614
- [None][infra] Use artifactory pypi mirror for Cython install by @ZhanruiSunCh in #9774
- [TRTLLM-9794][ci] remove duplicated test cases in DGX B200 by @QiJune in #9817
- [None][test] Refactor qa/llm_perf_nim.yml test list by @yufeiwu-nv in #9700
- [None][chore] Generate lock file for release/1.2.0rc4.post1 branch automatically by @yiqingy0 in #9829
- [None][fix] Additional model outputs for pipeline parallelism by @Funatiq in #9794
- [TRTLLM-6756][feat] Update BeamSearch for TorchSampler by @stnie in #9660
- [TRTLLM-9794][ci] move qwen3-next test cases to gb200 by @QiJune in #9827
- [None][infra] Waive failed cases for main branch on 12/09 by @EmmaQiaoCh in #9839
- [https://nvbugs/5575841] [fix] Nvbug 5575841: Remove additional test waivers for TestMoEFP4 by @DomBrown in #9788
- [None][feat] Make 2-model spec dec use the 1-model kernels (Hopper) by @mikeiovine in #8810
- [None][chore] Adding flaky auto scaling test to waives by @pcastonguay in #9851
- [#8921][chore] AutoDeploy NanoV3 to use SYMM_MEM allreduce strategy by @MrGeva in #9797
- [TRTINFRA-7328][infra] Consume SlurmCluster scratchPath and cleanup mounts by @mlefeb01 in #9600
- [https://nvbugs/5688388][chore] Unwaiving fixed disagg test by @pcastonguay in #9800
- [https://nvbugs/5719561][chore] Unwaive tests for nvbug 5719561 by @pcastonguay in #9801
- [https://nvbugs/5508301][feat] Move D->H copies to a worker thread whe… by @dhansen-nvidia in #8463
- [None][chore] Add unittest for otlp tracing by @zhanghaotong in #8716
- [None][chore] Support larger topK for NVLinkOneSided AlltoAll. by @bobboli in #9816
- [TRTLLM-9794][ci] move some deepseek test cases to gb200 by @QiJune in #9841
- [TRTLLM-9661][fix] Fix nvfp4 gemm allowed backends arg passing by @hyukn in #9837
- [https://nvbugs/5702791][fix] Unwaive fixed test by @dominicshanshan in #9844
- [TRTLLM-9811][infra] Update urllib3 version >= 2.6.0 to fix high vulnerability issue by @ZhanruiSunCh in #9823
- [None][chore] Enable L0 multi-gpus testing for Qwen3-next by @nv-guomingz in #9789
- [https://nvbugs/5727952][fix] PDL bugs with trtllm-gen fmha kernels by @PerkzZheng in #9863
- [None][infra] Fail fast if SLURM entrypoint fails by @mlefeb01 in #9744
- [None][feat] Port fp4 quantization kernel optimization from FlashInfer by @bkryu in #9854
- [TRTINFRA-7328][infra] - Move half B200 tests to lbd by @mlefeb01 in #9853
- [None][fix] Fully resolve the tactic recovery issues in AutoTuner serialized cache by @hyukn in #9835
- [None][chore] bump version to 1.2.0rc6 by @yiqingy0 in #9874
- [TRTLLM-9228][infra] Verify thirdparty C++ process by @cheshirekow in #9367
- [None][doc] Update doc for NVFP4 KV cache by @Tom-Zheng in #9475
- [https://nvbugs/5601682][fix] Unwaiving disagg test by @pcastonguay in #9627
- [None][chore] Add set_segment arg to slurm scripts by @fredricz-20070104 in #9731
- [https://nvbugs/5582258][fix] unwaive by @bo-nv in #9650
- [None][chore] Fix warning when capturing CUDA graph by @ziyixiong-nv in #9746
- [https://nvbugs/5718004][fix] Add warmup for cancellation test by @JunyiXu-nv in #9860
- [#2730][fix] Fix circular import bug in medusa/weight.py by @karljang in #9866
- [None][feat] Enable PDL for indexer topK by @ChristinaZ in #9843
- [TRTLLM-9685] [feat] Add gather fc1 kernel by cuteDSL by @zongfeijing in #9618
- [None][chore] enable test_ipc.py by @Superjomn in #9865
- [None][doc] Add DeepSeek-V3.2 to the supported models by @lfr-0531 in #9893
- [TRTLLM-8959][feat] ConfigurableMoE support CUTLASS by @xxi-nv in #9772
- [None] [feat] add eos_token_id in generation_config to sampling params by @JadoTu in #9514
- [TRTLLM-9736][feat] AsyncLLM and verl integ by @hchings in #9353
- [https://nvbugs/5597647][ci] Unwaive fixed tests. by @SimengLiu-nv in #9812
- [TRTC-43] [feat] Add config db and docs by @venkywonka in #9420
- [None][infra] Add workflow to auto-label 'waiting for feedback' on team comments by @karljang in #9886
- [None][perf] Fix TPOT when min_tokens set by @jthomson04 in #9862
- [None][infra] Ignore comments from bots and CI accounts by @karljang in #9929
- [https://nvbugs/5727517][fix] Preserve ip:port for disagg by @JunyiXu-nv in #9859
- [None][infra] update ucx to 1.20 by @chuangz0 in #9786
- [TRTLLM-9717][fix] fix multi nodes tests cases by @xinhe-nv in #9736
- [None][infra] Fix mergeWaiveList stage by @yiqingy0 in #9892
- [https://nvbugs/5599176][fix] Unwaive fixed test for Ray by @dominicshanshan in #9861
- [None][infra] update nspect version for api change by @niukuo in #9899
- [None][infra] revert ucx to 1.19 by @chuangz0 in #9936
- [TRTLLM-9792] [feat] Support multiple instances on single node for slurm scripts by @kaiyux in #9900
- [TRTLLM-9262][test] add groupgemm ada case for rcca by @crazydemo in #9833
- [#6425][fix] address CUDA stream sync issue in ModelRunnerCPP by @xsxszab in #6426
- [None][infra] Replace the deprecated github token by @yuanjingx87 in #9915
- [https://nvbugs/5736923][infra] Waive timeout disaggregated/test_auto_scaling[http-round_robin] test by @yihwang-nv in #9942
- [None][chore] Modify python ipc_util to align with C++ path by @yufeiwu-nv in #9894
- [None][chore] unwaive qwen3 accuracy test by @kris1025 in #9895
- [https://nvbugs/5727481][ci] Fix Port Conflict in Perf-Sanity CI Test by @chenfeiz0326 in #9896
- [None][test] fix a typo in model name in script by @ruodil in #9867
- [None][chore] Degrade log level in cublas fp4 runner when using default configs by @hyukn in #9951
- [None][feat] AutoDeploy: prepare_metadata revisited by @lucaslie in #9764
- [None][feat] Upgrade NIXL to v0.8.0 by @zackyoray in #9707
- [TRTLLM-5972][chore] Load balance decode token KV cache with helix parallelism by @brb-nv in #9757
- [None][fix] Introduce inline namespace to avoid symbol collision by @yihwang-nv in #9541
- [TRTLLM-9637][feat] Support tool parser for Kimi K2 by @JunyiXu-nv in #9830
- [https://nvbugs/5643787][fix] remove the war path for notify to itself by @chuangz0 in #9834
- [https://nvbugs/5716787][fix] terminate nixl running when exiting by @chuangz0 in #9785
- [None][feat] Async pp send. by @yuxianq in #9952
- [None][infra] Remove generate lockfile schedule for 1.2.0rc4.post1 branch by @yuanjingx87 in #9945
- [https://nvbugs/4141427][chore] Add more details to LICENSE file by @tburt-nv in #9881
- [TRTLLM-9493][feat] Add helixPostProcessNative kernel for cp_dim=2 by @brb-nv in #9924
- [None][feat] spark cublas LUT table for llama-8b-bf16 perf by @farazkh80 in #9811
- [None][feat] Support Mistral Large3 LLM part by @byshiue in #9820
- [TRTLLM-9784][fix] Resolve port conflicts by @shuyixiong in #9780
- [TRTLLM-9468][chore] Update disagg benchmarking scripts to support context parallelism by @brb-nv in #9720
- [TRTLLM-9738][chore] Guard accuracy with nccl allreduce strategy by @shuyixiong in #9793
- [https://nvbugs/5720482][fix] Fix test rpc streaming by @Superjomn in #9902
- [None][feat] Graceful Error Handling for Guided Decoder by @jellysnack in #9078
- [None][feat] Implement sampling on 1-model EAGLE3 by @mikeiovine in #9885
- [None][chore] Add namespace to header to fix tot failure by @farazkh80 in #9973
- [None][feat] Fused kernels (qknormrope + moe routing) and two-model MTP support for glm4moe by @nvxuanyuc in #9852
- [None][fix] disable async pp send for ray cases. by @yuxianq in #9959
- [https://nvbugs/5666816][fix] Unwaive llama3 eagle3 test by @mikeiovine in #9964
- [None][infra] Delete container before attempting import by @mlefeb01 in #9967
- [None][infra] Waive failed tests for main branch on 12/14 by @EmmaQiaoCh in #9982
- [TRTLLM-9493][noop] Refactor fusedMoeCommKernels to enable code sharing by @brb-nv in #9922
- [TRTLLM-9601][feat] Expose mmKeys for multimodal to integrate with dynamo. by @SimengLiu-nv in #9604
- [None] [chore] Comments cleanup by @zongfeijing in #9978
- [None][fix] Fix regex pattern for cubin filtering by @rosenrodt in #9914
- [https://nvbugs/5580297][fix] Skip capture request error test from Ray stage by @dominicshanshan in #9947
- [None][doc] update readme for rpc by @Superjomn in #9972
- [TRTLLM-8961][feat] ConfigurableMoE support DeepGemm by @xxi-nv in #9858
- [TRTLLM-9794][ci] move test cases of gpt-oss to gb200 by @QiJune in #9934
- [TRTLLM-9762] [doc] Update documents for GB300 NVL72 by @kaiyux in #9987
- [TRTLLM-9416][feat] Skip DS-v3.2 indexer MQA and Top-K for short sequences. by @lfr-0531 in #9524
- [https://nvbugs/5669114][fix] Switch to MMMU benchmark for Gemma3 27B by @brb-nv in #9966
- [https://nvbugs/5741060][chore] Waive all pg operator tests by @shuyixiong in #9991
- [TRTLLM-9854][feat] Optimize the host overhead of _sample_async by @ziyixiong-nv in #9935
- [TRTLLM-9860][doc] Add docs and examples for Responses API by @JunyiXu-nv in #9946
- [None][feat] Async pp send for PPCommTorch. by @yuxianq in #9976
- [https://nvbugs/5655885][fix] fix invalid instruction error in 2shot ar kernel on Ampere by @yilin-void in #9394
- [None] [fix] Fix nsys_on argument for slurm scripts by @kaiyux in #9995
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #9941
- [None][infra] Add multi gpu Ray tests into L0 merge change request list. by @dominicshanshan in #9996
- [TRTLLM-9136][feat] 2D parallel EP TP support by @greg-kwasniewski1 in #9459
- [None][infra] Fully waive test_worker_restart test_disagg_server_restart. by @bobboli in #9988
- [https://nvbugs/5661741][fix] Fix accuracy issue in TRTLLM MoE introduced in #9377 by @rosenrodt in #9999
- [None] [fix] Fix Slurm scripts by @kaiyux in #10007
- [TRTLLM-9615][feat] Implement a distributed tuning system by @hyukn in #9621
- [None][feat] Update reasoning parser for nano-v3 by @Wanli-Jiang in #9944
- [https://nvbugs/5540979][fix] Potential fix for 5540979 by @arekay-nv in #9716
- [None][infra] Update ucx to 1.20.x by @zackyoray in #9977
- [None] [fix] Revert "[None] [feat] add eos_token_id in generation_config to sampling params" by @kaiyux in #10002
- [None][feat] disable fused gemm for sm121 by @farazkh80 in #9916
- [None][infra] Waive failed tests for main branch on 12/15 by @EmmaQiaoCh in #10001
- [https://nvbugs/5673559][fix] Unwaiving disagg test for nvbug 5673559 by @pcastonguay in #9957
New Contributors
- @cjluo-nv made their first contribution in #9679
- @nv-lschneider made their first contribution in #9314
- @bkryu made their first contribution in #9854
- @xsxszab made their first contribution in #6426
- @arekay-nv made their first contribution in #9716
Full Changelog: v1.2.0rc5...v1.2.0rc6