NVIDIA/TensorRT-LLM
Release v1.2.0rc3

Pre-release

Announcement Highlights

  • Model Support

    • Enable Nemotron H MoE sharding (#8744)
    • Support Latent MoE for Nemotron (#8955)
    • Add TP support for DeepSeek-V3.2 (#8943)
    • Support Glm4MoeForCausalLM (#8256; see the usage sketch after this list)
    • Add support for disagg in DSv3.2 (#8735)
    • Add tool call parsing fixes and Qwen3 coder parser (#8817)
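
Newly supported models such as these are exercised through the PyTorch-backend LLM API. A minimal sketch, assuming a GLM-4 MoE checkpoint; the model name is illustrative, not a tested configuration:

```python
from tensorrt_llm import LLM, SamplingParams

def main():
    # Illustrative checkpoint; substitute any model enabled in this release,
    # e.g. a GLM-4 MoE or Nemotron variant.
    llm = LLM(model="zai-org/GLM-4.5-Air")
    params = SamplingParams(temperature=0.8, max_tokens=64)
    for output in llm.generate(["Summarize TensorRT-LLM in one sentence."], params):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```
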
  • API

    • Add trtllm_ prefix for exposed metrics (#8845)
    • Return logprobs incrementally in torch backend (#8785)
    • Enable n > 1 in OpenAI API with PyTorch backend (#8951)
    • Support json_schema in response_format (#8934; both are shown in the request sketch after this list)
    • Add TRTLLM_NIXL_KVCACHE_BACKEND environment variable for NIXL backend selection (#9075)
    • Prevent negative max_tokens passed into tllm request (#9037)
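
Two of the API items above can be combined in a single request against an OpenAI-compatible `trtllm-serve` endpoint. A minimal sketch; the URL, served model name, and the exact `response_format` payload shape are assumptions based on the OpenAI API convention:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}

resp = client.chat.completions.create(
    model="served-model",  # assumption: whatever name the server registers
    messages=[{"role": "user", "content": "Name a large city as JSON."}],
    n=2,  # n > 1 with the PyTorch backend (#8951)
    response_format={  # json_schema support (#8934)
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
)
for choice in resp.choices:
    print(choice.message.content)
```
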
  • Feature

    • Fuse QK down_proj with indexer K + weight_proj for FP4 ckpt (#8771)
    • Add swapsMmaAb sparseMla kernels (#8913)
    • Implement Deep Research with scaffolding (#8452)
    • Add RoPE and uk-bgemm overlap for MLA generation (#8495)
    • Add NUMA-aware CPU affinity autoconfig (#8805; a sketch of the idea follows this list)
    • Add custom indexer k cache scatter op (#8960)
    • Allow env variable to specify spawn process IPC address (#8922)
    • Implement sampling using FlashInfer.sampling (#8581)
    • Enhance the overlap scheduler for two-model spec decoding (#8706)
    • Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011)
    • Unify MPI & Ray's req/response handling with RPC Client/Server (#8765)
    • Use triton kernels for RocketKV prediction module (#8682)
    • Support accuracy test and install from wheel (#9038)
    • Add tree attention support for blackwell arch (#8975)
    • Add simple optimizations for MTP 2-model (#9176)
    • Enable early exit with overlap scheduler (#8587)
    • Add dynamic draft length in spec decode (stage 1) (#8194)
    • Add bias for FP4 TRT-LLM Gen MoE (#9220)
    • Integrate CuteDSL NVFP4 grouped GEMM (#8880)
    • Add ability to cancel disagg request if KV cache resources are exhausted (#9155)
    • Make factory sharding the default (#9144)
    • Enable simple sharding for latent experts (#9099)
    • Update the indexer topK (#9255)
    • Add fp8 dense for sm120 (#9174)
    • Add specdec to nemotron nas (#8985)
    • Use CUDAGraph to improve the tuning accuracy for AutoTuner (#9089)
    • Add ReLU2 to TRTLLM Cutlass MoE BF16 kernels (#9191)
    • Add pp_partition to customize each rank's layer number (#9003)
    • Enable EPLB for trtllm-gen and cutlass backend (#8886)
    • Add optimized trtllm-gen attention kernels on sm103 (#9081)
    • Add MTP>1 support for DS-v3.2 (#9045)
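
The NUMA-aware affinity item above boils down to pinning each worker process to the CPUs local to its GPU's NUMA node. A minimal sketch of that idea using standard Linux sysfs paths; the actual autoconfig logic in TensorRT-LLM may differ, and the PCI address below is illustrative:

```python
import os

def cpus_for_gpu(pci_bdf: str) -> set[int]:
    """CPU ids local to the NUMA node of the GPU at the given PCI address."""
    node = int(open(f"/sys/bus/pci/devices/{pci_bdf}/numa_node").read())
    if node < 0:  # platform does not report a NUMA node
        return set(range(os.cpu_count()))
    cpulist = open(f"/sys/devices/system/node/node{node}/cpulist").read().strip()
    cpus: set[int] = set()
    for part in cpulist.split(","):  # e.g. "0-31,64-95"
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

# Pin this process to the CPUs near the GPU (address is illustrative).
os.sched_setaffinity(0, cpus_for_gpu("0000:17:00.0"))
```
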
  • Benchmark

    • Add Qwen3-Next to layer-wise benchmarks (#9065)
    • Refactor benchmark infrastructure (#9207)
    • Print device info in trtllm-bench report (#8584)
    • Use torch.compile to fuse copy + layernorm within the LayerNorm module (#9052; see the sketch after this list)
    • Add torch.compile + multi-stream support for k-cache scatter and weight scaling (#8988)
    • Adjust select_alltoall_method_type (#8950)
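
The copy + layernorm fusion above follows a common torch.compile pattern: placing the staging copy and the normalization in one compiled region so the compiler can fuse the memory-bound operations. A minimal sketch of the pattern, not the module's actual code; shapes and dtype are illustrative:

```python
import torch

@torch.compile
def copy_then_layernorm(dst: torch.Tensor, src: torch.Tensor,
                        weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    dst.copy_(src)  # eligible for fusion with the norm below
    return torch.nn.functional.layer_norm(dst, dst.shape[-1:], weight, eps=eps)

if torch.cuda.is_available():
    hidden = 4096
    src = torch.randn(8, hidden, device="cuda", dtype=torch.bfloat16)
    dst = torch.empty_like(src)
    w = torch.ones(hidden, device="cuda", dtype=torch.bfloat16)
    out = copy_then_layernorm(dst, src, w)
```
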
  • Documentation

    • Replace the relative links with absolute links in README.md (#8995)
    • Update llama and llama4 example doc (#9048)
    • Update doc/tests/chat_template for nano-v2-vlm (#8840)
    • Add Mixed Precision Context and Generation section to Disagg (#8769)
    • Add DeepSeek-V3.2-Exp document (#9141)
    • Update docs for EPLB (#9166)
    • Update the Flux autodeploy example (#8434)
    • Update DS-R1 example doc (#9231)
    • Update license (#8807)
  • Fix & Infra

    • Fix the logger once key issue and further compress log in AutoTuner (#8873)
    • Fix disagg GPT-OSS test (#8870)
    • Remove PyTorchConfig completely (#8856)
    • Fix boost issue (#8996)
    • Lock onnx version <1.20.0 and remove WAR for TRT 10.13 (#9006)
    • Fix eagle3 accuracy issue on sm120 (#8944)
    • Add customized topk and related unit tests for DSA (#8882)
    • Improve type annotations on ResourceManager.get_resource_manager (#9013)
    • Add sm103 to CutlassFP8RowwiseGemm (#9042)
    • Add context manager to fix FakeTensorProp (#9047)
    • Initialize HF modules in worker_main for models with trust_remote=true (#8931)
    • Use async send_requests_to_next_pp (#9041)
    • Display the GPU memory information in GiB unit (#9070)
    • Add unit tests for TorchSampler batched sampling (#9012)
    • Remove circular dependency between model engine and cuda graph runner (#7572)
    • Fix precision issue due to KV layout mismatch for split/concat kernels (#6917)
    • Clear indexer k cache reference before releasing CUDA memory (#9110)
    • Disable UCC as WAR to MPI allgather issue before NGC PyTorch 25.12 upgrade (#9126)
    • Fix KV cache manager test warnings (#9103)
    • Fix the aux_stream in Llama4MinLatencyFusedMoE (#9035)
    • Avoid torch.compile being applied multiple times (#9135)
    • Upgrade tritonserver DLFW 25.10 (#8929)
    • Make the sliced nvfp4 output contiguous (#9123)
    • Update the attention layers counting for Qwen3-next (#9072)
    • Fix the rank to access all_rank_chunk_size_list when chunked MoE is used (#8723)
    • Fix missing ActivationType issue (#9171)
    • Support enroot/pyxis clusters in multi-node SLURM and enable oci-hsg GB200 in post-merge (#9117)
    • Fix lock file generation script (#9180)
    • Fix a deepseekv3 error when debug mode is on (#9217)
    • Fix DeepSeek V3.2 indexer RoPE (#9232)
    • Exclude number of draft tokens from mMaxSeqLenKv (#9210)
    • Upgrade NIXL to 0.7.1 (#9055)
    • Fix EPLB for DeepSeek-V3.2-Exp (#9245)
    • Log the LLM args for main branch (#9120, #9205)
    • Update TRTLLM MoE cubins, reduce mxfp4 weight padding requirement, and tighten TMA bound (#9025)
    • Upgrade precommit-hooks to v6.0.0 (#9097)

What's Changed

  • [https://nvbugs/5623960][fix] Fix the logger once key issue and further compress log in AutoTuner. by @hyukn in #8873
  • [None][infra] update github token name by @niukuo in #8907
  • [https://nvbugs/5624367][fix] Fix disagg GPT-OSS test by @chuangz0 in #8870
  • [https://nvbugs/5630345][chore] unwaive DS-v32 nvfp4 and fp8 tests by @lfr-0531 in #8887
  • [TRTLLM-7251][test] Get submit eplb slots empty key work by @fredricz-20070104 in #8945
  • [TRTLLM-8768][chore] Fuse QK down_proj with indexer K + weight_proj for FP4 ckpt by @chang-l in #8771
  • [None][feat] add swapsMmaAb sparseMla kernels by @PerkzZheng in #8913
  • [TRTLLM-8201][feat] Nemotron H MoE Sharding by @lucaslie in #8744
  • [#8924][fix] Fix AutoDeploy pattern matcher for torch 2.9 by @Fridah-nv in #8920
  • [https://nvbugs/5606166][fix] AutoDeploy: unwaive test for use tuples for cudagraph shape lookup by @lucaslie in #8957
  • [None][feat] Deep Research Implemented with Scaffolding by @Boreas618 in #8452
  • [None][infra] allow to choose repo when generate lock files by @yuanjingx87 in #8659
  • [None][feat] add waive by sm version by @xinhe-nv in #8928
  • [None][feat] Add trtllm_ prefix for exposed metrics by @nv-yilinf in #8845
  • [TRTLLM-8803][feat] Add rope and uk-bgemm overlap for mla generation by @yunruis in #8495
  • [https://nvbugs/5630345][chore] skip deepseek-v3.2 fp8 kv tests on pre-Blackwell architectures by @lfr-0531 in #8973
  • [None][chore] Use cached model in all ray tests by @shuyixiong in #8962
  • [https://nvbugs/5498478][fix] Fix eagle3 fp8 kv target model + bf16 draft model + chunked prefill by @DylanChen-NV in #8910
  • [TRTLLM-8814][feat] AutoDeploy: Use TRTLLM kernels for FP8 linear by @nvchenghaoz in #8820
  • [https://nvbugs/5527655][feat] Add NUMA-aware CPU affinity autoconfig by @dhansen-nvidia in #8805
  • [None][feat] AutoDeploy: Support Latent MOE for Nemotron by @nvchenghaoz in #8955
  • [None][fix] Fix KV cache clearing with KV Connector API by @jthomson04 in #8750
  • [https://nvbugs/5637012][fix] Bugfix when config is None for MLA by @chang-l in #8978
  • [https://nvbugs/5606136][ci] Remove tests for deprecating models. by @SimengLiu-nv in #8926
  • [None][feat] Return logprobs incrementally in torch backend by @dcaox in #8785
  • [https://nvbugs/5636986][fix] Fix DeepGemmMoe get_buffer calls by @VALLIS-NERIA in #8939
  • [None][fix] Switch AD AllReduce strategy to NCCL by @MrGeva in #8979
  • [https://nvbugs/5633340][fix] kill processes properly after test by @reasonsolo in #8970
  • [TRTLLM-9065][chore] remove PyTorchConfig completely by @QiJune in #8856
  • [https://nvbugs/5508536][fix] Take Over (#8627): Reintroduce: Move stop_criteria to sample_async (#7041) by @stnie in #8794
  • [None][fix] type annotations in fuse_input_embeds by @ixlmar in #8976
  • [None][fix] add missing CLI option in multimodal example by @ixlmar in #8977
  • [None][chore] Bump version to 1.2.0rc3 by @yiqingy0 in #9004
  • [TRTLLM-9213][infra] Fix boost issue by @ZhanruiSunCh in #8996
  • [https://nvbugs/5629790][chore] unwaive test. by @yuxianq in #8967
  • [None][fix] Moving transfer timeout test to test_llm_pytorch, fixing broken kv transfer timeout by @pcastonguay in #8892
  • [None][doc] Replace the relative links with absolute links in README.md. by @nv-guomingz in #8995
  • [None][perf] Add custom indexer k cache scatter op by @chang-l in #8960
  • [None][infra] Update allowed list 2025.11.06 by @yuanjingx87 in #8987
  • [None][feat] Allow env variable to specify spawn process IPC address by @hvagadia in #8922
  • [TRTLLM-8598][feat] enable n > 1 in OpenAI API with PyTorch backend by @ixlmar in #8951
  • [TRTLLM-8999][infra] Reduce gb200 multi-node test stages by @EmmaQiaoCh in #8778
  • [None][infra] Waive failed tests for main 11/07 by @EmmaQiaoCh in #9008
  • [https://nvbugs/5637037][fix] Update unwaive list. by @bobboli in #9001
  • [None][chore] Lock onnx version <1.20.0 and remove WAR for TRT 10.13 by @yiqingy0 in #9006
  • [TRTLLM-9001][feat] add TP support for DeepSeek-V3.2 by @lfr-0531 in #8943
  • [None][fix] fix eagle3 accuracy issue on sm120 by @byshiue in #8944
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #9030
  • [None][feat] Add customized topk and related unit tests for DSA by @ChristinaZ in #8882
  • [None][fix] Improve type annotations on ResourceManager.get_resource_manager by @ixlmar in #9013
  • [https://nvbugs/5619396][fix] Add sm103 to CutlassFP8RowwiseGemm by @VALLIS-NERIA in #9042
  • [https://nvbugs/5625972][fix] Add context manager to fix FakeTensorProp by @Fridah-nv in #9047
  • [https://nvbugs/5644187][fix] Llava-Next MMMU bugfix and Phi4 test bugfix by @yechank-nvidia in #9034
  • [https://nvbugs/5556998][fix] init_hf_modules in worker_main for models with trust_remote=true by @lancelly in #8931
  • [None][chore] Clean up unused and confusing code in moe test by @dongfengy in #9019
  • [None][chore] Relocate rlhf_utils.py by @shuyixiong in #8938
  • [TRTLLM-9198][perf] Add torch.compile + multi-stream support for k-cache scatter and weight scaling by @chang-l in #8988
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #8998
  • [None][infra] Waive failed tests on main 11/11 by @EmmaQiaoCh in #9058
  • [None][infra] install mooncake in docker images by @bo-nv in #8447
  • [None][doc] update llama and llama4 example doc by @jiahanc in #9048
  • [#8763][feature] AutoDeploy: configurable dtype for caching by @lucaslie in #8812
  • [https://nvbugs/5622938][fix] Use async send_requests_to_next_pp. by @yuxianq in #9041
  • [None][chore] Remove duplicated waive test by @yiqingy0 in #9067
  • [None][chore] Add tensorrt_llm/scripts to .gitignore by @elvischenv in #8895
  • [None][ci] waive test_disaggregated_serving.py::TestQwen3_8B::test_auto_dtype[False] by @QiJune in #9069
  • [None][infra] Only print and don't fail the check if there are duplicated items in waives.txt by @EmmaQiaoCh in #9068
  • [https://nvbugs/5616189][fix] Make more cases use local cached models by @HuiGao-NV in #8935
  • [TRTLLM-7723][feat] sampling using FlashInfer.sampling by @ixlmar in #8581
  • [None][fix] Display the GPU memory information in GiB unit. by @nv-guomingz in #9070
  • [TRTLLM-8377][test] unit tests for TorchSampler batched sampling by @ixlmar in #9012
  • [None][fix] type annotation by @ixlmar in #9071
  • [TRTLLM-8119][feat] Update doc/tests/chat_template for nano-v2-vlm by @Wanli-Jiang in #8840
  • [None][feat] AutoDeploy: Perf improvement for mamba layers by @nvchenghaoz in #8991
  • [TRTLLM-8521][chore] remove circular dependency between model engine and cuda graph runner by @QiJune in #7572
  • [None][fix] AutoDeploy: update nano3 accuracy test by @lucaslie in #9061
  • [TRTLLM-9259][perf] Use torch.compile to fuse copy + layernorm within the LayerNorm module by @chang-l in #9052
  • [None][ci] run speculative unit tests serially by @QiJune in #9080
  • [None][fix] Remove unnecessary attention workspace memory check by @jiaganc in #9064
  • [TRTLLM-9018][infra] add mirror for Build-Docker-Images stage by @ZhanruiSunCh in #9063
  • [None][infra] Waive a failed case of disaggregated/test_disaggregated.py by @EmmaQiaoCh in #9074
  • [None][ci] waive some test cases of disaggregated serving by @QiJune in #9085
  • [None][doc] Add Mixed Precision Context and Generation section to Disagg by @timothygao8710 in #8769
  • [https://nvbugs/5568991][test] Remove Phi-3 models by @yufeiwu-nv in #9066
  • [TRTLLM-9175][test] ensure sampling is async by @ixlmar in #9076
  • [TRTLLM-8540][feat] Add support for disagg in DSv3.2 by @Tabrizian in #8735
  • [#9023][feat] reduce AD graph optimization time for non-participating passes by @nzmora-nvidia in #9024
  • [None][feat] Add MTP>1 support for DS-v3.2 by @lfr-0531 in #9045
  • [None][chore] Remove is_disaggregated param in executor request queue by @pcastonguay in #9049
  • [https://nvbugs/5636912][fix] AutoDeploy: Unwaive the test by @nvchenghaoz in #9018
  • [None][feat] Enable EPLB for trtllm-gen and cutlass backend by @dongxuy04 in #8886
  • [None][fix] AutoDeploy: Use tmp folder for the load_moe_align by @nvchenghaoz in #9101
  • [None][ci] waive test_disaggregated_serving.py::TestQwen3_8B::test_chunked_prefill by @QiJune in #9111
  • [TRTLLM-9179][feat] add pp_partition to customize each rank's layer number by @dc3671 in #9003
  • [TRTLLM-9212][chore] move MoeLoadBalancerConfig to llm_args.py by @QiJune in #9002
  • [None][chore] Waive test_llm_rpc_streaming by @Superjomn in #9113
  • [None][infra] Update CODEOWNERS for pre-commit-config.yaml by @venkywonka in #9108
  • [TRTLLM-9209][infra] Upgrade precommit-hooks to v6.0.0 by @cheshirekow in #9097
  • [None][ci] Waive test_llm_rpc and test_llm_rpc_streaming by @Superjomn in #9118
  • [#6507][fix] Fix precision issue due to KV layout mismatch for split/concat kernels by @ZhangGe6 in #6917
  • [TRTLLM-8816][feat] add optimized trtllm-gen attention kernels on sm103 by @PerkzZheng in #9081
  • [https://nvbugs/5640873][fix] Move thop tests to pre-merge by @HuiGao-NV in #9094
  • [None][fix] Clear indexer k cache reference before releasing CUDA memory by @chang-l in #9110
  • [None][test] add deepseek and qwen cases for rtx series by @ruodil in #8839
  • [None][chore] Remove closed bugs by @xinhe-nv in #9114
  • [None][fix] waive failed tests by @xinhe-nv in #9090
  • [None][infra] Waive failed tests for main 11/13 by @EmmaQiaoCh in #9132
  • [https://nvbugs/5633340][chore] waive test_auto_scaling.py::test_disagg_server_restart by @reasonsolo in #9131
  • [None][fix] Disable UCC as WAR to MPI allgather issue before NGC PyTorch 25.12 upgrade by @kaiyux in #9126
  • [None][fixes] Add tool call parsing fixes and Qwen3 coder parser by @2ez4bz in #8817
  • [TRTLLM-8084][feat] Enhance the overlap scheduler for two-model spec decoding by @ziyixiong-nv in #8706
  • [None][fix] Fix KV cache manager test warnings by @Tabrizian in #9103
  • [None][fix] Fix the aux_stream in Llama4MinLatencyFusedMoE by @jinyangyuan-nvidia in #9035
  • [None][autodeploy] minor refactor to rmsnorm transforms by @Fridah-nv in #8657
  • [None][autodeploy] fix weight extraction for graph based quantized checkpoints by @Fridah-nv in #9109
  • [https://nvbugs/5652552][fix] Log the llm args for main branch by @leslie-fang25 in #9120
  • [None][fix] support topk autotuner input for expert slot per group larger than 32 by @dongxuy04 in #9087
  • [#8732][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 by @nzmora-nvidia in #9011
  • [TRTLLM-8988][feat] Unify MPI & Ray's req/response handling with RPC Client/Server by @hchings in #8765
  • [None][chore] Support json_schema in response_format by @JunyiXu-nv in #8934
  • [None][feat] Add Qwen3-Next to layer-wise benchmarks by @yuantailing in #9065
  • [None][feat] Use triton kernels for RocketKV prediction module by @heyuhhh in #8682
  • [None][ci] waive test_disaggregated.py::test_disaggregated_mixed[TinyLlama-1.1B-Chat-v1.0] by @QiJune in #9162
  • [None][feat] Autodeploy add triton configs and optimize mamba prefill by @suyoggupta in #9083
  • [https://nvbugs/5631254][fix] avoid applying torch.compile multiple times by @reasonsolo in #9135
  • [None][doc] Add DeepSeek-V3.2-Exp document by @lfr-0531 in #9141
  • [None][doc] update docs for EPLB by @dongxuy04 in #9166
  • [TRTLLM-9053][feat] Support accuracy test and install from wheel by @zerollzeng in #9038
  • [#9102][feat] AutoDeploy: Support fp8 kv cache by @nvchenghaoz in #9107
  • [None][ci] Waive unittest/_torch/sampler/test_torch_sampler.py::TestBatchedSampling by @yuanjingx87 in #9161
  • [TRTLLM-9295][fix] unflake test_overlap_scheduler.py::test_overlap_scheduler_consis… by @ixlmar in #9146
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #9156
  • [https://nvbugs/5629887][fix] Add missing device count guard for DSv32 multiGPU tests by @chang-l in #9159
  • [None][infra] Lock generation pipeline update by @yuanjingx87 in #9084
  • [None][infra] Fix metadata.json generated by lock file generation pipeline by @yuanjingx87 in #9179
  • [None][infra] Update allowlist 2025.11.14 by @yuanjingx87 in #9183
  • [TRTLLM-9079][infra] upgrade tritonserver DLFW 25.10 by @ZhanruiSunCh in #8929
  • [None][chore] Add placement test for ray executor by @hchings in #9122
  • [None][infra] Add trt-llm-kv-cache-manager-devs as code owner for appropriate files by @thorjohnsen in #9182
  • [None][fix] Make the sliced nvfp4 output contiguous by @JadoTu in #9123
  • [None][chore] Waive failing tests blocking pre-merge by @brb-nv in #9189
  • [None][infra] Waive failed tests for main branch 11/15 by @EmmaQiaoCh in #9187
  • [None][fix] Update the attention layers counting for Qwen3-next. by @nv-guomingz in #9072
  • [TRTLLM-8778][feat] Add tree attention support for blackwell arch by @sunnyqgg in #8975
  • [None][infra] Waive a failed case in pre-merge stage 11/16 by @EmmaQiaoCh in #9192
  • [https://nvbugs/5613089][fix] Fix the rank to access all_rank_chunk_size_list when chunked MoE is used by @jinyangyuan-nvidia in #8723
  • [None][feat] Update TRTLLM MoE cubins; reduce mxfp4 weight padding requirement; tighten TMA bound by @rosenrodt in #9025
  • [None][fix] Fix missing ActivationType issue by @kaiyux in #9171
  • [TRTLLM-8000][infra] Catch error in merge waive list stage by @yiqingy0 in #7289
  • [None][feat] Add simple optimizations for MTP 2-model by @mikeiovine in #9176
  • [TRTLLM-8831][feat] Enable early exit with overlap scheduler by @Funatiq in #8587
  • [TRTINFRA-7280][infra] Support enroot/pyxis clusters in multi-node SLURM and enable oci-hsg GB200 in post-merge by @mlefeb01 in #9117
  • [None][infra] Fix lock file generation script by @yuanjingx87 in #9180
  • [None][feat] Add TRTLLM_NIXL_KVCACHE_BACKEND environment variable for NIXL backend selection by @zackyoray in #9075
  • [None][chore] local imports for AutoDeploy in serve and bench by @lucaslie in #9199
  • [None][ci] split speculative test case into several small cases by @QiJune in #9209
  • [None][feat] Support Glm4MoeForCausalLM by @dmtri35 in #8256
  • [#8732][feat] Add ReLU2 to TRTLLM Cutlass MoE BF16 kernels by @galagam in #9191
  • [None][chore] Change trt-server to trtllm-server in OpenTelemetry readme by @StanleySun639 in #9173
  • [None][chore] benchmark refactor by @zerollzeng in #9207
  • [https://nvbugs/5652552][fix] add printing for llm args by @ruodil in #9205
  • [None][chore] fix a deepseekv3 error when debug mode is on by @reasonsolo in #9217
  • [None][fix] DeepSeek V3.2 indexer RoPE fix by @chang-l in #9232
  • [TRTLLM-8948][test] Add long bench case by @crazydemo in #9165
  • [None][refactor] decoding inputs, part 2 by @Funatiq in #5799
  • [TRTLLM-8949][test] Add rcca test case for eagle3 consistency check by @crazydemo in #9088
  • [TRTLLM-8136][feat] Dynamic draft length in spec decode (stage 1). by @zheyuf in #8194
  • [None][tests] Unwaive wide ep related tests by @kaiyux in #9204
  • [None][chore] Print device info in trtllm-bench report by @galagam in #8584
  • [TRTLLM-9295][fix] restore greedy sampling in _test_openai_chat_guided_decoding by @ixlmar in #9178
  • [None][feat] bias for FP4 TRT-LLM Gen MoE by @nekorobov in #9220
  • [None][feat] AutoDeploy: Perf improvement for small batch size by @nvchenghaoz in #9163
  • [#9152][fix] AutoDeploy fused_allreduce_residual_rmsnorm to support demollm mode by @MrGeva in #9197
  • [https://nvbugs/5590408][fix] Exclude num of draft tokens from mMaxSeqLenKv by @ziyixiong-nv in #9210
  • [None][chore] Update the Flux autodeploy example by @ajrasane in #8434
  • [TRTLLM-9287][infra] Use NIXL backend for accuracy tests by @bo-nv in #9247
  • [https://nvbugs/5649010][fix] increase status-checking interval to avoid instability by @reasonsolo in #9203
  • [TRTLLM-9286][feat] Integration of CuteDSL NVFP4 grouped GEMM by @syuoni in #8880
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #9193
  • [None][feat] Add ability to cancel disagg request if KV cache resources are exhausted by @pcastonguay in #9155
  • [#9137][feat] Factory sharding as default by @greg-kwasniewski1 in #9144
  • [None][fix] Update the default invalid value for deepseek mode of routing by @ChristinaZ in #9222
  • [#9098][feat] Simple sharding latent experts by @greg-kwasniewski1 in #9099
  • [TRTLLM-9050][test] add llama4 disagg case to cover kv cache overflow error by @crazydemo in #9172
  • [None][fix] logits device and shape issues in dynamic draft path by @jellysnack in #9079
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #9242
  • [None][feat] Update the indexer topK by @ChristinaZ in #9255
  • [None][infra] Waive failed cases for main branch on 11/17 by @EmmaQiaoCh in #9266
  • [None][doc] Update DS-R1 example doc by @jiahanc in #9231
  • [None][fix] Update GLM model accuracy test by @nvxuanyuc in #9286
  • [https://nvbugs/5456493][feat] add fp8 dense for sm120 by @CarstyYou in #9174
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #9289
  • [https://nvbugs/5661877][fix] fix test regression in TestBatchedSampling::test_samples by @ixlmar in #9215
  • [None][perf] Adjust select_alltoall_method_type. by @bobboli in #8950
  • [None][feature] AutoDeploy: tighter MoE UT thresholds by @nzmora-nvidia in #9195
  • [None][feat] add specdec to nemotron nas by @NVShreyas in #8985
  • [#9237][feat] enable iter stats in autodeploy by @NVShreyas in #9278
  • [None][fix] change logging for weight loading on unified memory by @farazkh80 in #9177
  • [None][chore] Waive tests timing out on main by @brb-nv in #9315
  • [None][fix] fix EPLB for DeepSeek-V3.2-Exp by @lfr-0531 in #9245
  • [#8476][chore] Update license by @karljang in #8807
  • [TRTLLM-7963][feat] Use CUDAGraph to improve the tuning accuracy for AutoTuner. by @hyukn in #9089
  • [None][chore] Prevent negative max_tokens passed into tllm request by @JunyiXu-nv in #9037
  • [TRTLLM-9247][infra] Upgrade NIXL to 0.7.1 by @bo-nv in #9055

Full Changelog: v1.2.0rc2...v1.2.0rc3
