NVIDIA/TensorRT-LLM
Release v1.2.0rc3

Pre-release

Announcement Highlights

  • Model Support

    • Enable Nemotron H MoE sharding (#8744)
    • Support Latent MoE for Nemotron (#8955)
    • Add TP support for DeepSeek-V3.2 (#8943)
    • Support Glm4MoeForCausalLM (#8256; see the usage sketch after this list)
    • Add support for disagg in DSv3.2 (#8735)
    • Add tool call parsing fixes and Qwen3 coder parser (#8817)
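
Newly supported models such as these are exercised through the PyTorch-backend LLM API. A minimal sketch, assuming a GLM-4 MoE checkpoint; the model name is illustrative, not a tested configuration:

```python
from tensorrt_llm import LLM, SamplingParams

def main():
    # Illustrative checkpoint; substitute any model enabled in this release,
    # e.g. a GLM-4 MoE or Nemotron variant.
    llm = LLM(model="zai-org/GLM-4.5-Air")
    params = SamplingParams(temperature=0.8, max_tokens=64)
    for output in llm.generate(["Summarize TensorRT-LLM in one sentence."], params):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```
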
  • API

    • Add trtllm_ prefix for exposed metrics (#8845)
    • Return logprobs incrementally in torch backend (#8785)
    • Enable n > 1 in OpenAI API with PyTorch backend (#8951)
    • Support json_schema in response_format (#8934; both are shown in the request sketch after this list)
    • Add TRTLLM_NIXL_KVCACHE_BACKEND environment variable for NIXL backend selection (#9075)
    • Prevent negative max_tokens passed into tllm request (#9037)
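
Two of the API items above can be combined in a single request against an OpenAI-compatible `trtllm-serve` endpoint. A minimal sketch; the URL, served model name, and the exact `response_format` payload shape are assumptions based on the OpenAI API convention:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}

resp = client.chat.completions.create(
    model="served-model",  # assumption: whatever name the server registers
    messages=[{"role": "user", "content": "Name a large city as JSON."}],
    n=2,  # n > 1 with the PyTorch backend (#8951)
    response_format={  # json_schema support (#8934)
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
)
for choice in resp.choices:
    print(choice.message.content)
```
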
  • Feature

    • Fuse QK down_proj with indexer K + weight_proj for FP4 ckpt (#8771)
    • Add swapsMmaAb sparseMla kernels (#8913)
    • Implement Deep Research with scaffolding (#8452)
    • Add RoPE and uk-bgemm overlap for MLA generation (#8495)
    • Add NUMA-aware CPU affinity autoconfig (#8805; a sketch of the idea follows this list)
    • Add custom indexer k cache scatter op (#8960)
    • Allow env variable to specify spawn process IPC address (#8922)
    • Implement sampling using FlashInfer.sampling (#8581)
    • Enhance the overlap scheduler for two-model spec decoding (#8706)
    • Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011)
    • Unify MPI & Ray's req/response handling with RPC Client/Server (#8765)
    • Use triton kernels for RocketKV prediction module (#8682)
    • Support accuracy test and install from wheel (#9038)
    • Add tree attention support for blackwell arch (#8975)
    • Add simple optimizations for MTP 2-model (#9176)
    • Enable early exit with overlap scheduler (#8587)
    • Add dynamic draft length in spec decode (stage 1) (#8194)
    • Add bias for FP4 TRT-LLM Gen MoE (#9220)
    • Integrate CuteDSL NVFP4 grouped GEMM (#8880)
    • Add ability to cancel disagg request if KV cache resources are exhausted (#9155)
    • Make factory sharding the default (#9144)
    • Enable simple sharding for latent experts (#9099)
    • Update the indexer topK (#9255)
    • Add fp8 dense for sm120 (#9174)
    • Add specdec to nemotron nas (#8985)
    • Use CUDAGraph to improve the tuning accuracy for AutoTuner (#9089)
    • Add ReLU2 to TRTLLM Cutlass MoE BF16 kernels (#9191)
    • Add pp_partition to customize each rank's layer number (#9003)
    • Enable EPLB for trtllm-gen and cutlass backend (#8886)
    • Add optimized trtllm-gen attention kernels on sm103 (#9081)
    • Add MTP>1 support for DS-v3.2 (#9045)
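
The NUMA-aware affinity item above boils down to pinning each worker process to the CPUs local to its GPU's NUMA node. A minimal sketch of that idea using standard Linux sysfs paths; the actual autoconfig logic in TensorRT-LLM may differ, and the PCI address below is illustrative:

```python
import os

def cpus_for_gpu(pci_bdf: str) -> set[int]:
    """CPU ids local to the NUMA node of the GPU at the given PCI address."""
    node = int(open(f"/sys/bus/pci/devices/{pci_bdf}/numa_node").read())
    if node < 0:  # platform does not report a NUMA node
        return set(range(os.cpu_count()))
    cpulist = open(f"/sys/devices/system/node/node{node}/cpulist").read().strip()
    cpus: set[int] = set()
    for part in cpulist.split(","):  # e.g. "0-31,64-95"
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

# Pin this process to the CPUs near the GPU (address is illustrative).
os.sched_setaffinity(0, cpus_for_gpu("0000:17:00.0"))
```
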
  • Benchmark

    • Add Qwen3-Next to layer-wise benchmarks (#9065)
    • Refactor benchmark infrastructure (#9207)
    • Print device info in trtllm-bench report (#8584)
    • Use torch.compile to fuse copy + layernorm within the LayerNorm module (#9052; see the sketch after this list)
    • Add torch.compile + multi-stream support for k-cache scatter and weight scaling (#8988)
    • Adjust select_alltoall_method_type (#8950)
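
The copy + layernorm fusion above follows a common torch.compile pattern: placing the staging copy and the normalization in one compiled region so the compiler can fuse the memory-bound operations. A minimal sketch of the pattern, not the module's actual code; shapes and dtype are illustrative:

```python
import torch

@torch.compile
def copy_then_layernorm(dst: torch.Tensor, src: torch.Tensor,
                        weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    dst.copy_(src)  # eligible for fusion with the norm below
    return torch.nn.functional.layer_norm(dst, dst.shape[-1:], weight, eps=eps)

if torch.cuda.is_available():
    hidden = 4096
    src = torch.randn(8, hidden, device="cuda", dtype=torch.bfloat16)
    dst = torch.empty_like(src)
    w = torch.ones(hidden, device="cuda", dtype=torch.bfloat16)
    out = copy_then_layernorm(dst, src, w)
```
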
  • Documentation

    • Replace the relative links with absolute links in README.md (#8995)
    • Update llama and llama4 example doc (#9048)
    • Update doc/tests/chat_template for nano-v2-vlm (#8840)
    • Add Mixed Precision Context and Generation section to Disagg (#8769)
    • Add DeepSeek-V3.2-Exp document (#9141)
    • Update docs for EPLB (#9166)
    • Update the Flux autodeploy example (#8434)
    • Update DS-R1 example doc (#9231)
    • Update license (#8807)
  • Fix & Infra

    • Fix the logger once key issue and further compress log in AutoTuner (#8873)
    • Fix disagg GPT-OSS test (#8870)
    • Remove PyTorchConfig completely (#8856)
    • Fix boost issue (#8996)
    • Lock onnx version <1.20.0 and remove WAR for TRT 10.13 (#9006)
    • Fix eagle3 accuracy issue on sm120 (#8944)
    • Add customized topk and related unit tests for DSA (#8882)
    • Improve type annotations on ResourceManager.get_resource_manager (#9013)
    • Add sm103 to CutlassFP8RowwiseGemm (#9042)
    • Add context manager to fix FakeTensorProp (#9047)
    • Initialize HF modules in worker_main for models with trust_remote=true (#8931)
    • Use async send_requests_to_next_pp (#9041)
    • Display the GPU memory information in GiB unit (#9070)
    • Add unit tests for TorchSampler batched sampling (#9012)
    • Remove circular dependency between model engine and cuda graph runner (#7572)
    • Fix precision issue due to KV layout mismatch for split/concat kernels (#6917)
    • Clear indexer k cache reference before releasing CUDA memory (#9110)
    • Disable UCC as WAR to MPI allgather issue before NGC PyTorch 25.12 upgrade (#9126)
    • Fix KV cache manager test warnings (#9103)
    • Fix the aux_stream in Llama4MinLatencyFusedMoE (#9035)
    • Avoid torch.compile being applied multiple times (#9135)
    • Upgrade tritonserver DLFW 25.10 (#8929)
    • Make the sliced nvfp4 output contiguous (#9123)
    • Update the attention layers counting for Qwen3-next (#9072)
    • Fix the rank to access all_rank_chunk_size_list when chunked MoE is used (#8723)
    • Fix missing ActivationType issue (#9171)
    • Support enroot/pyxis clusters in multi-node SLURM and enable oci-hsg GB200 in post-merge (#9117)
    • Fix lock file generation script (#9180)
    • Fix a deepseekv3 error when debug mode is on (#9217)
    • Fix DeepSeek V3.2 indexer RoPE (#9232)
    • Exclude number of draft tokens from mMaxSeqLenKv (#9210)
    • Upgrade NIXL to 0.7.1 (#9055)
    • Fix EPLB for DeepSeek-V3.2-Exp (#9245)
    • Log the LLM args for main branch (#9120, #9205)
    • Update TRTLLM MoE cubins, reduce mxfp4 weight padding requirement, and tighten TMA bound (#9025)
    • Upgrade precommit-hooks to v6.0.0 (#9097)

What's Changed

  • [https://nvbugs/5623960][fix] Fix the logger once key issue and further compress log in AutoTuner. by @hyukn in #8873
  • [None][infra] update github token name by @niukuo in #8907
  • [https://nvbugs/5624367][fix] Fix disagg GPT-OSS test by @chuangz0 in #8870
  • [https://nvbugs/5630345][chore] unwaive DS-v32 nvfp4 and fp8 tests by @lfr-0531 in #8887
  • [TRTLLM-7251][test] Get submit eplb slots empty key work by @fredricz-20070104 in #8945
  • [TRTLLM-8768][chore] Fuse QK down_proj with indexer K + weight_proj for FP4 ckpt by @chang-l in #8771
  • [None][feat] add swapsMmaAb sparseMla kernels by @PerkzZheng in #8913
  • [TRTLLM-8201][feat] Nemotron H MoE Sharding by @lucaslie in #8744
  • [#8924][fix] Fix AutoDeploy pattern matcher for torch 2.9 by @Fridah-nv in #8920
  • [https://nvbugs/5606166][fix] AutoDeploy: unwaive test for use tuples for cudagraph shape lookup by @lucaslie in #8957
  • [None][feat] Deep Research Implemented with Scaffolding by @Boreas618 in #8452
  • [None][infra] allow to choose repo when generate lock files by @yuanjingx87 in #8659
  • [None][feat] add waive by sm version by @xinhe-nv in #8928
  • [None][feat] Add trtllm_ prefix for exposed metrics by @nv-yilinf in #8845
  • [TRTLLM-8803][feat] Add rope and uk-bgemm overlap for mla generation by @yunruis in #8495
  • [https://nvbugs/5630345][chore] skip deepseek-v3.2 fp8 kv tests on pre-Blackwell architectures by @lfr-0531 in #8973
  • [None][chore] Use cached model in all ray tests by @shuyixiong in #8962
  • [https://nvbugs/5498478][fix] Fix eagle3 fp8 kv target model + bf16 draft model + chunked prefill by @DylanChen-NV in #8910
  • [TRTLLM-8814][feat] AutoDeploy: Use TRTLLM kernels for FP8 linear by @nvchenghaoz in #8820
  • [https://nvbugs/5527655][feat] Add NUMA-aware CPU affinity autoconfig by @dhansen-nvidia in #8805
  • [None][feat] AutoDeploy: Support Latent MOE for Nemotron by @nvchenghaoz in #8955
  • [None][fix] Fix KV cache clearing with KV Connector API by @jthomson04 in #8750
  • [https://nvbugs/5637012][fix] Bugfix when config is None for MLA by @chang-l in #8978
  • [https://nvbugs/5606136][ci] Remove tests for deprecating models. by @SimengLiu-nv in #8926
  • [None][feat] Return logprobs incrementally in torch backend by @dcaox in #8785
  • [https://nvbugs/5636986][fix] Fix DeepGemmMoe get_buffer calls by @VALLIS-NERIA in #8939
  • [None][fix] Switch AD AllReduce strategy to NCCL by @MrGeva in #8979
  • [https://nvbugs/5633340][fix] kill processes properly after test by @reasonsolo in #8970
  • [TRTLLM-9065][chore] remove PyTorchConfig completely by @QiJune in #8856
  • [https://nvbugs/5508536][fix] Take Over (#8627): Reintroduce: Move stop_criteria to sample_async (#7041) by @stnie in #8794
  • [None][fix] type annotations in fuse_input_embeds by @ixlmar in #8976
  • [None][fix] add missing CLI option in multimodal example by @ixlmar in #8977
  • [None][chore] Bump version to 1.2.0rc3 by @yiqingy0 in #9004
  • [TRTLLM-9213][infra] Fix boost issue by @ZhanruiSunCh in #8996
  • [https://nvbugs/5629790][chore] unwaive test. by @yuxianq in #8967
  • [None][fix] Moving transfer timeout test to test_llm_pytorch, fixing broken kv transfer timeout by @pcastonguay in #8892
  • [None][doc] Replace the relative links with absolute links in README.md. by @nv-guomingz in #8995
  • [None][perf] Add custom indexer k cache scatter op by @chang-l in #8960
  • [None][infra] Update allowed list 2025.11.06 by @yuanjingx87 in #8987
  • [None][feat] Allow env variable to specify spawn process IPC address by @hvagadia in #8922
  • [TRTLLM-8598][feat] enable n > 1 in OpenAI API with PyTorch backend by @ixlmar in #8951
  • [TRTLLM-8999][infra] Reduce gb200 multi-node test stages by @EmmaQiaoCh in #8778
  • [None][infra] Waive failed tests for main 11/07 by @EmmaQiaoCh in #9008
  • [https://nvbugs/5637037][fix] Update unwaive list. by @bobboli in #9001
  • [None][chore] Lock onnx version <1.20.0 and remove WAR for TRT 10.13 by @yiqingy0 in #9006
  • [TRTLLM-9001][feat] add TP support for DeepSeek-V3.2 by @lfr-0531 in #8943
  • [None][fix] fix eagle3 accuracy issue on sm120 by @byshiue in #8944
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #9030
  • [None][feat] Add customized topk and related unit tests for DSA by @ChristinaZ in #8882
  • [None][fix] Improve type annotations on ResourceManager.get_resource_manager by @ixlmar in #9013
  • [https://nvbugs/5619396][fix] Add sm103 to CutlassFP8RowwiseGemm by @VALLIS-NERIA in #9042
  • [https://nvbugs/5625972][fix] Add context manager to fix FakeTensorProp by @Fridah-nv in #9047
  • [https://nvbugs/5644187][fix] Llava-Next MMMU bugfix and Phi4 test bugfix by @yechank-nvidia in #9034
  • [https://nvbugs/5556998][fix] init_hf_modules in worker_main for models with trust_remote=true by @lancelly in #8931
  • [None][chore] Clean up unused and confusing code in moe test by @dongfengy in #9019
  • [None][chore] Relocate rlhf_utils.py by @shuyixiong in #8938
  • [TRTLLM-9198][perf] Add torch.compile + multi-stream support for k-cache scatter and weight scaling by @chang-l in #8988
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #8998
  • [None][infra] Waive failed tests on main 11/11 by @EmmaQiaoCh in #9058
  • [None][infra] install mooncake in docker images by @bo-nv in #8447
  • [None][doc] update llama and llama4 example doc by @jiahanc in #9048
  • [#8763][feature] AutoDeploy: configurable dtype for caching by @lucaslie in #8812
  • [https://nvbugs/5622938][fix] Use async send_requests_to_next_pp. by @yuxianq in #9041
  • [None][chore] Remove duplicated waive test by @yiqingy0 in #9067
  • [None][chore] Add tensorrt_llm/scripts to .gitignore by @elvischenv in #8895
  • [None][ci] waive test_disaggregated_serving.py::TestQwen3_8B::test_auto_dtype[False] by @QiJune in #9069
  • [None][infra] Only print and don't fail the check if there are duplicated items in waives.txt by @EmmaQiaoCh in #9068
  • [https://nvbugs/5616189][fix] Make more cases use local cached models by @HuiGao-NV in #8935
  • [TRTLLM-7723][feat] sampling using FlashInfer.sampling by @ixlmar in #8581
  • [None][fix] Display the GPU memory information in GiB unit. by @nv-guomingz in #9070
  • [TRTLLM-8377][test] unit tests for TorchSampler batched sampling by @ixlmar in #9012
  • [None][fix] type annotation by @ixlmar in #9071
  • [TRTLLM-8119][feat] Update doc/tests/chat_template for nano-v2-vlm by @Wanli-Jiang in #8840
  • [None][feat] AutoDeploy: Perf improvement for mamba layers by @nvchenghaoz in #8991
  • [TRTLLM-8521][chore] remove circular dependency between model engine and cuda graph runner by @QiJune in #7572
  • [None][fix] AutoDeploy: update nano3 accuracy test by @lucaslie in #9061
  • [TRTLLM-9259][perf] Use torch.compile to fuse copy + layernorm within the LayerNorm module by @chang-l in #9052
  • [None][ci] run speculative unit tests serially by @QiJune in #9080
  • [None][fix] Remove unnecessary attention workspace memory check by @jiaganc in #9064
  • [TRTLLM-9018][infra] add mirror for Build-Docker-Images stage by @ZhanruiSunCh in #9063
  • [None][infra] Waive a failed case of disaggregated/test_disaggregated.py by @EmmaQiaoCh in #9074
  • [None][ci] waive some test cases of disaggregated serving by @QiJune in #9085
  • [None][doc] Add Mixed Precision Context and Generation section to Disagg by @timothygao8710 in #8769
  • [https://nvbugs/5568991][test] Remove Phi-3 models by @yufeiwu-nv in #9066
  • [TRTLLM-9175][test] ensure sampling is async by @ixlmar in #9076
  • [TRTLLM-8540][feat] Add support for disagg in DSv3.2 by @Tabrizian in #8735
  • [#9023][feat] reduce AD graph optimization time for non-participating passes by @nzmora-nvidia in #9024
  • [None][feat] Add MTP>1 support for DS-v3.2 by @lfr-0531 in #9045
  • [None][chore] Remove is_disaggregated param in executor request queue by @pcastonguay in #9049
  • [https://nvbugs/5636912][fix] AutoDeploy: Unwaive the test by @nvchenghaoz in #9018
  • [None][feat] Enable EPLB for trtllm-gen and cutlass backend by @dongxuy04 in #8886
  • [None][fix] AutoDeploy: Use tmp folder for the load_moe_align by @nvchenghaoz in #9101
  • [None][ci] waive test_disaggregated_serving.py::TestQwen3_8B::test_chunked_prefill by @QiJune in #9111
  • [TRTLLM-9179][feat] add pp_partition to customize each rank's layer number by @dc3671 in #9003
  • [TRTLLM-9212][chore] move MoeLoadBalancerConfig to llm_args.py by @QiJune in #9002
  • [None][chore] Waive test_llm_rpc_streaming by @Superjomn in #9113
  • [None][infra] Update CODEOWNERS for pre-commit-config.yaml by @venkywonka in #9108
  • [TRTLLM-9209][infra] Upgrade precommit-hooks to v6.0.0 by @cheshirekow in #9097
  • [None][ci] Waive test_llm_rpc and test_llm_rpc_streaming by @Superjomn in #9118
  • [#6507][fix] Fix precision issue due to KV layout mismatch for split/concat kernels by @ZhangGe6 in #6917
  • [TRTLLM-8816][feat] add optimized trtllm-gen attention kernels on sm103 by @PerkzZheng in #9081
  • [https://nvbugs/5640873][fix] Move thop tests to pre-merge by @HuiGao-NV in #9094
  • [None][fix] Clear indexer k cache reference before releasing CUDA memory by @chang-l in #9110
  • [None][test] add deepseek and qwen cases for rtx series by @ruodil in #8839
  • [None][chore] Remove closed bugs by @xinhe-nv in #9114
  • [None][fix] waive failed tests by @xinhe-nv in #9090
  • [None][infra] Waive failed tests for main 11/13 by @EmmaQiaoCh in #9132
  • [https://nvbugs/5633340][chore] waive test_auto_scaling.py::test_disagg_server_restart by @reasonsolo in #9131
  • [None][fix] Disable UCC as WAR to MPI allgather issue before NGC PyTorch 25.12 upgrade by @kaiyux in #9126
  • [None][fixes] Add tool call parsing fixes and Qwen3 coder parser by @2ez4bz in #8817
  • [TRTLLM-8084][feat] Enhance the overlap scheduler for two-model spec decoding by @ziyixiong-nv in #8706
  • [None][fix] Fix KV cache manager test warnings by @Tabrizian in #9103
  • [None][fix] Fix the aux_stream in Llama4MinLatencyFusedMoE by @jinyangyuan-nvidia in #9035
  • [None][autodeploy] minor refactor to rmsnorm transforms by @Fridah-nv in #8657
  • [None][autodeploy] fix weight extraction for graph based quantized checkpoints by @Fridah-nv in #9109
  • [https://nvbugs/5652552][fix] Log the llm args for main branch by @leslie-fang25 in #9120
  • [None][fix] support topk autotuner input for expert slot per group larger than 32 by @dongxuy04 in #9087
  • [#8732][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 by @nzmora-nvidia in #9011
  • [TRTLLM-8988][feat] Unify MPI & Ray's req/response handling with RPC Client/Server by @hchings in #8765
  • [None][chore] Support json_schema in response_format by @JunyiXu-nv in #8934
  • [None][feat] Add Qwen3-Next to layer-wise benchmarks by @yuantailing in #9065
  • [None][feat] Use triton kernels for RocketKV prediction module by @heyuhhh in #8682
  • [None][ci] waive test_disaggregated.py::test_disaggregated_mixed[TinyLlama-1.1B-Chat-v1.0] by @QiJune in #9162
  • [None][feat] Autodeploy add triton configs and optimize mamba prefill by @suyoggupta in #9083
  • [https://nvbugs/5631254][fix] avoid applying torch.compile multiple times by @reasonsolo in #9135
  • [None][doc] Add DeepSeek-V3.2-Exp document by @lfr-0531 in #9141
  • [None][doc] update docs for EPLB by @dongxuy04 in #9166
  • [TRTLLM-9053][feat] Support accuracy test and install from wheel by @zerollzeng in #9038
  • [#9102][feat] AutoDeploy: Support fp8 kv cache by @nvchenghaoz in #9107
  • [None][ci] Waive unittest/_torch/sampler/test_torch_sampler.py::TestBatchedSampling by @yuanjingx87 in #9161
  • [TRTLLM-9295][fix] unflake test_overlap_scheduler.py::test_overlap_scheduler_consis… by @ixlmar in #9146
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #9156
  • [https://nvbugs/5629887][fix] Add missing device count guard for DSv32 multiGPU tests by @chang-l in #9159
  • [None][infra] Lock generation pipeline update by @yuanjingx87 in #9084
  • [None][infra] Fix metadata.json generated by lock file generation pipeline by @yuanjingx87 in #9179
  • [None][infra] Update allowlist 2025.11.14 by @yuanjingx87 in #9183
  • [TRTLLM-9079][infra] upgrade tritonserver DLFW 25.10 by @ZhanruiSunCh in #8929
  • [None][chore] Add placement test for ray executor by @hchings in #9122
  • [None][infra] Add trt-llm-kv-cache-manager-devs as code owner for appropriate files by @thorjohnsen in #9182
  • [None][fix] Make the sliced nvfp4 output contiguous by @JadoTu in #9123
  • [None][chore] Waive failing tests blocking pre-merge by @brb-nv in #9189
  • [None][infra] Waive failed tests for main branch 11/15 by @EmmaQiaoCh in #9187
  • [None][fix] Update the attention layers counting for Qwen3-next. by @nv-guomingz in #9072
  • [TRTLLM-8778][feat] Add tree attention support for blackwell arch by @sunnyqgg in #8975
  • [None][infra] Waive a failed case in pre-merge stage 11/16 by @EmmaQiaoCh in #9192
  • [https://nvbugs/5613089][fix] Fix the rank to access all_rank_chunk_size_list when chunked MoE is used by @jinyangyuan-nvidia in #8723
  • [None][feat] Update TRTLLM MoE cubins; reduce mxfp4 weight padding requirement; tighten TMA bound by @rosenrodt in #9025
  • [None][fix] Fix missing ActivationType issue by @kaiyux in #9171
  • [TRTLLM-8000][infra] Catch error in merge waive list stage by @yiqingy0 in #7289
  • [None][feat] Add simple optimizations for MTP 2-model by @mikeiovine in #9176
  • [TRTLLM-8831][feat] Enable early exit with overlap scheduler by @Funatiq in #8587
  • [TRTINFRA-7280][infra] Support enroot/pyxis clusters in multi-node SLURM and enable oci-hsg GB200 in post-merge by @mlefeb01 in #9117
  • [None][infra] Fix lock file generation script by @yuanjingx87 in #9180
  • [None][feat] Add TRTLLM_NIXL_KVCACHE_BACKEND environment variable for NIXL backend selection by @zackyoray in #9075
  • [None][chore] local imports for AutoDeploy in serve and bench by @lucaslie in #9199
  • [None][ci] split speculative test case into several small cases by @QiJune in #9209
  • [None][feat] Support Glm4MoeForCausalLM by @dmtri35 in #8256
  • [#8732][feat] Add ReLU2 to TRTLLM Cutlass MoE BF16 kernels by @galagam in #9191
  • [None][chore] Change trt-server to trtllm-server in OpenTelemetry readme by @StanleySun639 in #9173
  • [None][chore] benchmark refactor by @zerollzeng in #9207
  • [https://nvbugs/5652552][fix] add printing for llm args by @ruodil in #9205
  • [None][chore] fix a deepseekv3 error when debug mode is on by @reasonsolo in #9217
  • [None][fix] DeepSeek V3.2 indexer RoPE fix by @chang-l in #9232
  • [TRTLLM-8948][test] Add long bench case by @crazydemo in #9165
  • [None][refactor] decoding inputs, part 2 by @Funatiq in #5799
  • [TRTLLM-8949][test] Add rcca test case for eagle3 consistency check by @crazydemo in #9088
  • [TRTLLM-8136][feat] Dynamic draft length in spec decode (stage 1). by @zheyuf in #8194
  • [None][tests] Unwaive wide ep related tests by @kaiyux in #9204
  • [None][chore] Print device info in trtllm-bench report by @galagam in #8584
  • [TRTLLM-9295][fix] restore greedy sampling in _test_openai_chat_guided_decoding by @ixlmar in #9178
  • [None][feat] bias for FP4 TRT-LLM Gen MoE by @nekorobov in #9220
  • [None][feat] AutoDeploy: Perf improvement for small batch size by @nvchenghaoz in #9163
  • [#9152][fix] AutoDeploy fused_allreduce_residual_rmsnorm to support demollm mode by @MrGeva in #9197
  • [https://nvbugs/5590408][fix] Exclude num of draft tokens from mMaxSeqLenKv by @ziyixiong-nv in #9210
  • [None][chore] Update the Flux autodeploy example by @ajrasane in #8434
  • [TRTLLM-9287][infra] Use NIXL backend for accuracy tests by @bo-nv in #9247
  • [https://nvbugs/5649010][fix] increase status-checking interval to avoid instability by @reasonsolo in #9203
  • [TRTLLM-9286][feat] Integration of CuteDSL NVFP4 grouped GEMM by @syuoni in #8880
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #9193
  • [None][feat] Add ability to cancel disagg request if KV cache resources are exhausted by @pcastonguay in #9155
  • [#9137][feat] Factory sharding as default by @greg-kwasniewski1 in #9144
  • [None][fix] Update the default invalid value for deepseek mode of routing by @ChristinaZ in #9222
  • [#9098][feat] Simple sharding latent experts by @greg-kwasniewski1 in #9099
  • [TRTLLM-9050][test] add llama4 disagg case to cover kv cache overflow error by @crazydemo in #9172
  • [None][fix] logits device and shape issues in dynamic draft path by @jellysnack in #9079
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #9242
  • [None][feat] Update the indexer topK by @ChristinaZ in #9255
  • [None][infra] Waive failed cases for main branch on 11/17 by @EmmaQiaoCh in #9266
  • [None][doc] Update DS-R1 example doc by @jiahanc in #9231
  • [None][fix] Update GLM model accuracy test by @nvxuanyuc in #9286
  • [https://nvbugs/5456493][feat] add fp8 dense for sm120 by @CarstyYou in #9174
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #9289
  • [https://nvbugs/5661877][fix] fix test regression in TestBatchedSampling::test_samples by @ixlmar in #9215
  • [None][perf] Adjust select_alltoall_method_type. by @bobboli in #8950
  • [None][feature] AutoDeploy: tighter MoE UT thresholds by @nzmora-nvidia in #9195
  • [None][feat] add specdec to nemotron nas by @NVShreyas in #8985
  • [#9237][feat] enable iter stats in autodeploy by @NVShreyas in #9278
  • [None][fix] change logging for weight loading on unified memory by @farazkh80 in #9177
  • [None][chore] Waive tests timing out on main by @brb-nv in #9315
  • [None][fix] fix EPLB for DeepSeek-V3.2-Exp by @lfr-0531 in #9245
  • [#8476][chore] Update license by @karljang in #8807
  • [TRTLLM-7963][feat] Use CUDAGraph to improve the tuning accuracy for AutoTuner. by @hyukn in #9089
  • [None][chore] Prevent negative max_tokens passed into tllm request by @JunyiXu-nv in #9037
  • [TRTLLM-9247][infra] Upgrade NIXL to 0.7.1 by @bo-nv in #9055

Full Changelog: v1.2.0rc2...v1.2.0rc3
