Announcement Highlights
- **Model Support**
- **API**
- Add `trtllm_` prefix for exposed metrics (#8845)
- Return logprobs incrementally in torch backend (#8785)
- Enable n > 1 in OpenAI API with PyTorch backend (#8951)
- Support `json_schema` in `response_format` (#8934); see the request sketch after this list
- Add TRTLLM_NIXL_KVCACHE_BACKEND environment variable for NIXL backend selection (#9075)
- Prevent negative `max_tokens` passed into tllm request (#9037)
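
As a quick illustration of the API items above, here is a minimal sketch of a request against an OpenAI-compatible `trtllm-serve` endpoint; the base URL and model name are placeholders, and the exact accepted field shapes should be verified against the server documentation.

```python
# Hedged sketch: exercising n > 1 (#8951) and json_schema response_format (#8934)
# against an OpenAI-compatible endpoint. Base URL and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="placeholder-model",
    messages=[{"role": "user", "content": "Name one city, as JSON."}],
    n=2,  # n > 1 now works with the PyTorch backend (#8951)
    response_format={
        "type": "json_schema",  # json_schema is now accepted here (#8934)
        "json_schema": {
            "name": "city_answer",
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
)
for choice in resp.choices:
    print(choice.message.content)  # each of the n choices should satisfy the schema
```
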
- **Feature**
- Fuse QK down_proj with indexer K + weight_proj for FP4 ckpt (#8771)
- Add swapsMmaAb sparseMla kernels (#8913)
- Implement Deep Research with scaffolding (#8452)
- Add rope and uk-bgemm overlap for MLA generation (#8495)
- Add NUMA-aware CPU affinity autoconfig (#8805)
- Add custom indexer k cache scatter op (#8960)
- Allow env variable to specify spawn process IPC address (#8922)
- Implement sampling using FlashInfer.sampling (#8581)
- Enhance the overlap scheduler for two-model spec decoding (#8706)
- Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011)
- Unify MPI & Ray's req/response handling with RPC Client/Server (#8765)
- Use triton kernels for RocketKV prediction module (#8682)
- Support accuracy test and install from wheel (#9038)
- Add tree attention support for blackwell arch (#8975)
- Add simple optimizations for MTP 2-model (#9176)
- Enable early exit with overlap scheduler (#8587)
- Add dynamic draft length in spec decode (stage 1) (#8194)
- Add bias for FP4 TRT-LLM Gen MoE (#9220)
- Integrate CuteDSL NVFP4 grouped GEMM (#8880)
- Add ability to cancel disagg request if KV cache resources are exhausted (#9155)
- Make factory sharding the default (#9144)
- Enable simple sharding for latent experts (#9099)
- Update the indexer topK (#9255)
- Add fp8 dense for sm120 (#9174)
- Add specdec to nemotron nas (#8985)
- Use CUDAGraph to improve the tuning accuracy for AutoTuner (#9089)
- Add ReLU2 to TRTLLM Cutlass MoE BF16 kernels (#9191)
- Add pp_partition to customize each rank's layer number (#9003)
- Enable EPLB for trtllm-gen and cutlass backend (#8886)
- Add optimized trtllm-gen attention kernels on sm103 (#9081)
- Add MTP>1 support for DS-v3.2 (#9045)
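
Several of the items above surface through the Python LLM API's `speculative_config`. As a rough sketch of the MTP > 1 item (#9045), assuming the llmapi surface at this release; the checkpoint path is a placeholder and the config fields should be checked against the llmapi reference:

```python
# Hedged sketch: MTP speculative decoding with depth > 1 (#9045).
# The checkpoint path is a placeholder; verify config fields against the llmapi docs.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import MTPDecodingConfig

llm = LLM(
    model="/path/to/DeepSeek-V3.2-Exp",  # placeholder checkpoint path
    speculative_config=MTPDecodingConfig(
        num_nextn_predict_layers=2,  # MTP depth greater than one
    ),
)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```
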
- **Benchmark**
- Add Qwen3-Next to layer-wise benchmarks (#9065)
- Refactor benchmark infrastructure (#9207)
- Print device info in trtllm-bench report (#8584)
- Use torch.compile to fuse copy + layernorm within the LayerNorm module (#9052); a generic sketch follows this list
- Add torch.compile + multi-stream support for k-cache scatter and weight scaling (#8988)
- Adjust select_alltoall_method_type (#8950)
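
To make the torch.compile fusion item above concrete, here is a generic sketch of the pattern; this is not TRT-LLM's actual LayerNorm module, just the copy-plus-normalization shape that torch.compile can fuse into fewer kernels:

```python
# Generic sketch of the copy + layernorm fusion idea behind #9052; not the
# actual TRT-LLM module. torch.compile can fuse the copy into the norm kernel.
import torch

class CopyThenNorm(torch.nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = torch.nn.LayerNorm(hidden_size)

    def forward(self, x: torch.Tensor, residual_out: torch.Tensor) -> torch.Tensor:
        residual_out.copy_(x)  # stash pre-norm activations for the residual path
        return self.norm(x)

compiled = torch.compile(CopyThenNorm(4096))
x = torch.randn(8, 4096)
residual = torch.empty_like(x)
y = compiled(x, residual)
```
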
- **Documentation**
- Replace the relative links with absolute links in README.md (#8995)
- Update llama and llama4 example doc (#9048)
- Update doc/tests/chat_template for nano-v2-vlm (#8840)
- Add Mixed Precision Context and Generation section to Disagg (#8769)
- Add DeepSeek-V3.2-Exp document (#9141)
- Update docs for EPLB (#9166)
- Update the Flux autodeploy example (#8434)
- Update DS-R1 example doc (#9231)
- Update license (#8807)
- **Fix & Infra**
- Fix the logger once key issue and further compress log in AutoTuner (#8873)
- Fix disagg GPT-OSS test (#8870)
- Remove PyTorchConfig completely (#8856)
- Fix boost issue (#8996)
- Lock onnx version <1.20.0 and remove WAR for TRT 10.13 (#9006)
- Fix eagle3 accuracy issue on sm120 (#8944)
- Add customized topk and related unit tests for DSA (#8882)
- Improve type annotations on ResourceManager.get_resource_manager (#9013)
- Add sm103 to CutlassFP8RowwiseGemm (#9042)
- Add context manager to fix FakeTensorProp (#9047)
- Initialize HF modules in worker_main for models with trust_remote=true (#8931)
- Use async `send_requests_to_next_pp` (#9041)
- Display the GPU memory information in GiB unit (#9070)
- Add unit tests for TorchSampler batched sampling (#9012)
- Remove circular dependency between model engine and cuda graph runner (#7572)
- Fix precision issue due to KV layout mismatch for split/concat kernels (#6917)
- Clear indexer k cache reference before releasing CUDA memory (#9110)
- Disable UCC as WAR to MPI allgather issue before NGC PyTorch 25.12 upgrade (#9126)
- Fix KV cache manager test warnings (#9103)
- Fix the aux_stream in Llama4MinLatencyFusedMoE (#9035)
- Avoid `torch.compile` being applied multiple times (#9135)
- Upgrade tritonserver DLFW 25.10 (#8929)
- Make the sliced nvfp4 output contiguous (#9123)
- Update the attention layers counting for Qwen3-next (#9072)
- Fix the rank to access `all_rank_chunk_size_list` when chunked MoE is used (#8723)
- Fix missing `ActivationType` issue (#9171)
- Support enroot/pyxis clusters in multi-node SLURM and enable oci-hsg GB200 in post-merge (#9117)
- Fix lock file generation script (#9180)
- Fix a deepseekv3 error when debug mode is on (#9217)
- Fix DeepSeek V3.2 indexer RoPE (#9232)
- Exclude number of draft tokens from `mMaxSeqLenKv` (#9210)
- Upgrade NIXL to 0.7.1 (#9055)
- Fix EPLB for DeepSeek-V3.2-Exp (#9245)
- Log the LLM args for main branch (#9120, #9205)
- Update TRTLLM MoE cubins, reduce mxfp4 weight padding requirement, and tighten TMA bound (#9025)
- Upgrade precommit-hooks to v6.0.0 (#9097)
What's Changed
- [https://nvbugs/5623960][fix] Fix the logger once key issue and further compress log in AutoTuner. by @hyukn in #8873
- [None][infra] update github token name by @niukuo in #8907
- [https://nvbugs/5624367][fix] Fix disagg GPT-OSS test by @chuangz0 in #8870
- [https://nvbugs/5630345][chore] unwaive DS-v32 nvfp4 and fp8 tests by @lfr-0531 in #8887
- [TRTLLM-7251][test] Get submit eplb slots empty key work by @fredricz-20070104 in #8945
- [TRTLLM-8768][chore] Fuse QK down_proj with indexer K + weight_proj for FP4 ckpt by @chang-l in #8771
- [None][feat] add swapsMmaAb sparseMla kernels by @PerkzZheng in #8913
- [TRTLLM-8201][feat] Nemotron H MoE Sharding by @lucaslie in #8744
- [#8924][fix] Fix AutoDeploy pattern matcher for torch 2.9 by @Fridah-nv in #8920
- [https://nvbugs/5606166][fix] AutoDeploy: unwaive test for use tuples for cudagraph shape lookup by @lucaslie in #8957
- [None][feat] Deep Research Implemented with Scaffolding by @Boreas618 in #8452
- [None][infra] allow choosing the repo when generating lock files by @yuanjingx87 in #8659
- [None][feat] add waive by sm version by @xinhe-nv in #8928
- [None][feat] Add `trtllm_` prefix for exposed metrics by @nv-yilinf in #8845
- [TRTLLM-8803][feat] Add rope and uk-bgemm overlap for mla generation by @yunruis in #8495
- [https://nvbugs/5630345] [chore] skip deepseek-v3.2 fp8 kv tests on pre-Blackwell architectures by @lfr-0531 in #8973
- [None][chore] Use cached model in all ray tests by @shuyixiong in #8962
- [https://nvbugs/5498478][fix] Fix eagle3 fp8 kv target model + bf16 draft model + chunked prefill by @DylanChen-NV in #8910
- [TRTLLM-8814][feat] AutoDeploy: Use TRTLLM kernels for FP8 linear by @nvchenghaoz in #8820
- [https://nvbugs/5527655][feat] Add NUMA-aware CPU affinity autoconfig by @dhansen-nvidia in #8805
- [None][feat] AutoDeploy: Support Latent MOE for Nemotron by @nvchenghaoz in #8955
- [None][fix] Fix KV cache clearing with KV Connector API by @jthomson04 in #8750
- [https://nvbugs/5637012][fix] Bugfix when config is None for MLA by @chang-l in #8978
- [https://nvbugs/5606136][ci] Remove tests for deprecating models. by @SimengLiu-nv in #8926
- [None][feat] Return logprobs incrementally in torch backend by @dcaox in #8785
- [https://nvbugs/5636986][fix] Fix DeepGemmMoe get_buffer calls by @VALLIS-NERIA in #8939
- [None][fix] Switch AD AllReduce strategy to NCCL by @MrGeva in #8979
- [https://nvbugs/5633340][fix] kill processes properly after test by @reasonsolo in #8970
- [TRTLLM-9065][chore] remove PyTorchConfig completely by @QiJune in #8856
- [https://nvbugs/5508536][fix] Take Over (#8627): Reintroduce: Move stop_criteria to sample_async (#7041) by @stnie in #8794
- [None][fix] type annotations in fuse_input_embeds by @ixlmar in #8976
- [None][fix] add missing CLI option in multimodal example by @ixlmar in #8977
- [None][chore] Bump version to 1.2.0rc3 by @yiqingy0 in #9004
- [TRTLLM-9213][infra] Fix boost issue by @ZhanruiSunCh in #8996
- [https://nvbugs/5629790][chore] unwaive test. by @yuxianq in #8967
- [None][fix] Moving transfer timeout test to test_llm_pytorch, fixing broken kv transfer timeout by @pcastonguay in #8892
- [None][doc] Replace the relative links with absolute links in README.md. by @nv-guomingz in #8995
- [None][perf] Add custom indexer k cache scatter op by @chang-l in #8960
- [None][infra] Update allowed list 2025.11.06 by @yuanjingx87 in #8987
- [None][feat] Allow env variable to specify spawn process IPC address by @hvagadia in #8922
- [TRTLLM-8598][feat] enable n > 1 in OpenAI API with PyTorch backend by @ixlmar in #8951
- [TRTLLM-8999][infra] Reduce gb200 multi-node test stages by @EmmaQiaoCh in #8778
- [None][infra] Waive failed tests for main 11/07 by @EmmaQiaoCh in #9008
- [https://nvbugs/5637037][fix] Update unwaive list. by @bobboli in #9001
- [None][chore] Lock onnx version <1.20.0 and remove WAR for TRT 10.13 by @yiqingy0 in #9006
- [TRTLLM-9001][feat] add TP support for DeepSeek-V3.2 by @lfr-0531 in #8943
- [None][fix] fix eagle3 accuracy issue on sm120 by @byshiue in #8944
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #9030
- [None][feat] Add customized topk and related unit tests for DSA by @ChristinaZ in #8882
- [None][fix] Improve type annotations on ResourceManager.get_resource_manager by @ixlmar in #9013
- [https://nvbugs/5619396][fix] Add sm103 to CutlassFP8RowwiseGemm by @VALLIS-NERIA in #9042
- [https://nvbugs/5625972][fix] Add context manager to fix FakeTensorProp by @Fridah-nv in #9047
- [https://nvbugs/5644187][fix] Llava-Next MMMU bugfix and Phi4 test bugfix by @yechank-nvidia in #9034
- [https://nvbugs/5556998][fix] init_hf_modules in worker_main for models with trust_remote=true by @lancelly in #8931
- [None][chore] Clean up unused and confusing code in moe test by @dongfengy in #9019
- [None][chore] Relocate rlhf_utils.py by @shuyixiong in #8938
- [TRTLLM-9198][perf] Add torch.compile + multi-stream support for k-cache scatter and weight scaling by @chang-l in #8988
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #8998
- [None][infra] Waive failed tests on main 11/11 by @EmmaQiaoCh in #9058
- [None][infra] install mooncake in docker images by @bo-nv in #8447
- [None][doc] update llama and llama4 example doc by @jiahanc in #9048
- [#8763][feature] AutoDeploy: configurable dtype for caching by @lucaslie in #8812
- [https://nvbugs/5622938][fix] Use async send_requests_to_next_pp. by @yuxianq in #9041
- [None][chore] Remove duplicated waive test by @yiqingy0 in #9067
- [None][chore] Add tensorrt_llm/scripts to .gitignore by @elvischenv in #8895
- [None][ci] waive test_disaggregated_serving.py::TestQwen3_8B::test_auto_dtype[False] by @QiJune in #9069
- [None][infra] Only print and don't fail the check if there are duplicated items in waives.txt by @EmmaQiaoCh in #9068
- [https://nvbugs/5616189][fix] Make more cases use local cached models by @HuiGao-NV in #8935
- [TRTLLM-7723][feat] sampling using FlashInfer.sampling by @ixlmar in #8581
- [None][fix] Display the GPU memory information in GiB unit. by @nv-guomingz in #9070
- [TRTLLM-8377][test] unit tests for TorchSampler batched sampling by @ixlmar in #9012
- [None][fix] type annotation by @ixlmar in #9071
- [TRTLLM-8119][feat] Update doc/tests/chat_template for nano-v2-vlm by @Wanli-Jiang in #8840
- [None][feat] AutoDeploy: Perf improvement for mamba layers by @nvchenghaoz in #8991
- [TRTLLM-8521][chore] remove circular dependency between model engine and cuda graph runner by @QiJune in #7572
- [None][fix] AutoDeploy: update nano3 accuracy test by @lucaslie in #9061
- [TRTLLM-9259][perf] Use torch.compile to fuse copy + layernorm within the LayerNorm module by @chang-l in #9052
- [None][ci] run speculative unit tests serially by @QiJune in #9080
- [None][fix] Remove unnecessary attention workspace memory check by @jiaganc in #9064
- [TRTLLM-9018][infra] add mirror for Build-Docker-Images stage by @ZhanruiSunCh in #9063
- [None][infra] Waive a failed case of disaggregated/test_disaggregated.py by @EmmaQiaoCh in #9074
- [None][ci] waive some test cases of disaggregated serving by @QiJune in #9085
- [None] [doc] Add Mixed Precision Context and Generation section to Disagg by @timothygao8710 in #8769
- [https://nvbugs/5568991][test] Remove Phi-3 models by @yufeiwu-nv in #9066
- [TRTLLM-9175][test] ensure sampling is async by @ixlmar in #9076
- [TRTLLM-8540][feat] Add support for disagg in DSv3.2 by @Tabrizian in #8735
- [#9023][feat] reduce AD graph optimization time for non-participating passes by @nzmora-nvidia in #9024
- [None][feat] Add MTP>1 support for DS-v3.2 by @lfr-0531 in #9045
- [None][chore] Remove is_disaggregated param in executor request queue by @pcastonguay in #9049
- [https://nvbugs/5636912][fix] AutoDeploy: Unwaive the test by @nvchenghaoz in #9018
- [None][feat] Enable EPLB for trtllm-gen and cutlass backend by @dongxuy04 in #8886
- [None][fix] AutoDeploy: Use tmp folder for the load_moe_align by @nvchenghaoz in #9101
- [None][ci] waive test_disaggregated_serving.py::TestQwen3_8B::test_chunked_prefill by @QiJune in #9111
- [TRTLLM-9179][feat] add pp_partition to customize each rank's layer number by @dc3671 in #9003
- [TRTLLM-9212][chore] move MoeLoadBalancerConfig to llm_args.py by @QiJune in #9002
- [None][chore] Waive test_llm_rpc_streaming by @Superjomn in #9113
- [None] [infra] Update CODEOWNERS for pre-commit-config.yaml by @venkywonka in #9108
- [TRTLLM-9209][infra] Upgrade precommit-hooks to v6.0.0 by @cheshirekow in #9097
- [None][ci] Waive test_llm_rpc and test_llm_rpc_streaming by @Superjomn in #9118
- [#6507][fix] Fix precision issue due to KV layout mismatch for split/concat kernels by @ZhangGe6 in #6917
- [TRTLLM-8816][feat] add optimized trtllm-gen attention kernels on sm103 by @PerkzZheng in #9081
- [https://nvbugs/5640873][fix] Move thop tests to pre-merge by @HuiGao-NV in #9094
- [None][fix] Clear indexer k cache reference before releasing CUDA memory by @chang-l in #9110
- [None][test] add deepseek and qwen cases for rtx series by @ruodil in #8839
- [None][chore] Remove closed bugs by @xinhe-nv in #9114
- [None][fix] waive failed tests by @xinhe-nv in #9090
- [None][infra] Waive failed tests for main 11/13 by @EmmaQiaoCh in #9132
- [https://nvbugs/5633340][chore] waive test_auto_scaling.py::test_disagg_server_restart by @reasonsolo in #9131
- [None] [fix] Disable UCC as WAR to MPI allgather issue before NGC PyTorch 25.12 upgrade by @kaiyux in #9126
- [None][fixes] Add tool call parsing fixes and Qwen3 coder parser by @2ez4bz in #8817
- [TRTLLM-8084][feat] Enhance the overlap scheduler for two-model spec decoding by @ziyixiong-nv in #8706
- [None][fix] Fix KV cache manager test warnings by @Tabrizian in #9103
- [None][fix] Fix the aux_stream in Llama4MinLatencyFusedMoE by @jinyangyuan-nvidia in #9035
- [None][autodeploy] minor refactor to rmsnorm transforms by @Fridah-nv in #8657
- [None][autodeploy] fix weight extraction for graph based quantized checkpoints by @Fridah-nv in #9109
- [https://nvbugs/5652552][fix] Log the llm args for main branch by @leslie-fang25 in #9120
- [None][fix] support topk autotuner input for expert slot per group larger than 32 by @dongxuy04 in #9087
- [#8732][feat] Update TRTLLM Cutlass MoE kernels with ReLU2 by @nzmora-nvidia in #9011
- [TRTLLM-8988][feat] Unify MPI & Ray's req/response handling with RPC Client/Server by @hchings in #8765
- [None][chore] Support json_schema in response_format by @JunyiXu-nv in #8934
- [None][feat] Add Qwen3-Next to layer-wise benchmarks by @yuantailing in #9065
- [None] [feat] Use triton kernels for RocketKV prediction module by @heyuhhh in #8682
- [None][ci] waive test_disaggregated.py::test_disaggregated_mixed[TinyLlama-1.1B-Chat-v1.0] by @QiJune in #9162
- [None][feat] Autodeploy add triton configs and optimize mamba prefill by @suyoggupta in #9083
- [https://nvbugs/5631254][fix] avoid applying torch.compile multiple times by @reasonsolo in #9135
- [None][doc] Add DeepSeek-V3.2-Exp document by @lfr-0531 in #9141
- [None][doc] update docs for EPLB by @dongxuy04 in #9166
- [TRTLLM-9053][feat] Support accuracy test and install from wheel by @zerollzeng in #9038
- [#9102][feat] AutoDeploy: Support fp8 kv cache by @nvchenghaoz in #9107
- [None][ci] Waive unittest/_torch/sampler/test_torch_sampler.py::TestBatchedSampling by @yuanjingx87 in #9161
- [TRTLLM-9295][fix] unflake test_overlap_scheduler.py::test_overlap_scheduler_consis… by @ixlmar in #9146
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #9156
- [https://nvbugs/5629887][fix] Add missing device count guard for DSv32 multiGPU tests by @chang-l in #9159
- [None][infra] Lock generation pipeline update by @yuanjingx87 in #9084
- [None][infra] Fix metadata.json generated by lock file generation pipeline by @yuanjingx87 in #9179
- [None][infra] Update allowlist 2025.11.14 by @yuanjingx87 in #9183
- [TRTLLM-9079][infra] upgrade tritonserver DLFW 25.10 by @ZhanruiSunCh in #8929
- [None][chore] Add placement test for ray executor by @hchings in #9122
- [None][infra] Add trt-llm-kv-cache-manager-devs as code owner for appropriate files by @thorjohnsen in #9182
- [None][fix] Make the sliced nvfp4 output contiguous by @JadoTu in #9123
- [None][chore] Waive failing tests blocking pre-merge by @brb-nv in #9189
- [None][infra] Waive failed tests for main branch 11/15 by @EmmaQiaoCh in #9187
- [None][fix] Update the attention layers counting for Qwen3-next. by @nv-guomingz in #9072
- [TRTLLM-8778][feat] Add tree attention support for blackwell arch by @sunnyqgg in #8975
- [None][infra] Waive a failed case in pre-merge stage 11/16 by @EmmaQiaoCh in #9192
- [https://nvbugs/5613089][fix] Fix the rank to access all_rank_chunk_size_list when chunked MoE is used by @jinyangyuan-nvidia in #8723
- [None][feat] Update TRTLLM MoE cubins; reduce mxfp4 weight padding requirement; tighten TMA bound by @rosenrodt in #9025
- [None] [fix] Fix missing ActivationType issue by @kaiyux in #9171
- [TRTLLM-8000][infra] Catch error in merge waive list stage by @yiqingy0 in #7289
- [None][feat] Add simple optimizations for MTP 2-model by @mikeiovine in #9176
- [TRTLLM-8831][feat] Enable early exit with overlap scheduler by @Funatiq in #8587
- [TRTINFRA-7280][infra] Support enroot/pyxis clusters in multi-node SLURM and enable oci-hsg GB200 in post-merge by @mlefeb01 in #9117
- [None][infra] Fix lock file generation script by @yuanjingx87 in #9180
- [None][feat] Add TRTLLM_NIXL_KVCACHE_BACKEND environment variable for NIXL backend selection by @zackyoray in #9075
- [None][chore] local imports for AutoDeploy in serve and bench by @lucaslie in #9199
- [None][ci] split speculative test case into several small cases by @QiJune in #9209
- [None][feat] Support Glm4MoeForCausalLM by @dmtri35 in #8256
- [#8732][feat] Add ReLU2 to TRTLLM Cutlass MoE BF16 kernels by @galagam in #9191
- [None][chore] Change trt-server to trtllm-server in opentelemetry readme by @StanleySun639 in #9173
- [None][chore] benchmark refactor by @zerollzeng in #9207
- [https://nvbugs/5652552][fix] add printing for llm args by @ruodil in #9205
- [None][chore] fix a deepseekv3 error when debug mode is on by @reasonsolo in #9217
- [None][fix] DeepSeek V3.2 indexer RoPE fix by @chang-l in #9232
- [TRTLLM-8948][test] Add long bench case by @crazydemo in #9165
- [None][refactor] decoding inputs, part 2 by @Funatiq in #5799
- [TRTLLM-8949][test] Add rcca test case for eagle3 consistency check by @crazydemo in #9088
- [TRTLLM-8136][feat] Dynamic draft length in spec decode (stage 1). by @zheyuf in #8194
- [None] [tests] Unwaive wide ep related tests by @kaiyux in #9204
- [None][chore] Print device info in trtllm-bench report by @galagam in #8584
- [TRTLLM-9295][fix] restore greedy sampling in _test_openai_chat_guided_decoding by @ixlmar in #9178
- [None][feat] bias for FP4 TRT-LLM Gen MoE by @nekorobov in #9220
- [None][feat] AutoDeploy: Perf improvement for small batch size by @nvchenghaoz in #9163
- [#9152][fix] AutoDeploy fused_allreduce_residual_rmsnorm to support demollm mode by @MrGeva in #9197
- [https://nvbugs/5590408][fix] Exclude num of draft tokens from mMaxSeqLenKv by @ziyixiong-nv in #9210
- [None][chore] Update the Flux autodeploy example by @ajrasane in #8434
- [TRTLLM-9287][infra] Use NIXL backend for accuracy tests by @bo-nv in #9247
- [https://nvbugs/5649010][fix] increase status-checking interval to avoid instability by @reasonsolo in #9203
- [TRTLLM-9286][feat] Integration of CuteDSL NVFP4 grouped GEMM by @syuoni in #8880
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #9193
- [None][feat] Add ability to cancel disagg request if KV cache resources are exhausted by @pcastonguay in #9155
- [#9137][feat] Factory sharding as default by @greg-kwasniewski1 in #9144
- [None][fix] Update the default invalid value for deepseek mode of routing by @ChristinaZ in #9222
- [#9098][feat] Simple sharding latent experts by @greg-kwasniewski1 in #9099
- [TRTLLM-9050][test] add llama4 disagg case to cover kv cache overflow error by @crazydemo in #9172
- [None][fix] logits device and shape issues in dynamic draft path by @jellysnack in #9079
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #9242
- [None][feat] Update the indexer topK by @ChristinaZ in #9255
- [None][infra] Waive failed cases for main branch on 11/17 by @EmmaQiaoCh in #9266
- [None][doc] Update DS-R1 example doc by @jiahanc in #9231
- [None][fix] Update GLM model accuracy test by @nvxuanyuc in #9286
- [https://nvbugs/5456493][feat] add fp8 dense for sm120 by @CarstyYou in #9174
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #9289
- [https://nvbugs/5661877][fix] fix test regression in TestBatchedSampling::test_samples by @ixlmar in #9215
- [None][perf] Adjust select_alltoall_method_type. by @bobboli in #8950
- [None][feature] AutoDeploy: tighter MoE UT thresholds by @nzmora-nvidia in #9195
- [None][feat] add specdec to nemotron nas by @NVShreyas in #8985
- [#9237][feat] enable iter stats in autodeploy by @NVShreyas in #9278
- [None][fix] change logging for weight loading on unified memory by @farazkh80 in #9177
- [None][chore] Waive tests timing out on main by @brb-nv in #9315
- [None][fix] fix EPLB for DeepSeek-V3.2-Exp by @lfr-0531 in #9245
- [#8476][chore] Update license by @karljang in #8807
- [TRTLLM-7963][feat] Use CUDAGraph to improve the tuning accuracy for AutoTuner. by @hyukn in #9089
- [None][chore] Prevent negative `max_tokens` passed into tllm request by @JunyiXu-nv in #9037
- [TRTLLM-9247][infra] Upgrade NIXL to 0.7.1 by @bo-nv in #9055
New Contributors
- @JadoTu made their first contribution in #8526
- @Boreas618 made their first contribution in #8452
- @dhansen-nvidia made their first contribution in #8805
- @hvagadia made their first contribution in #8922
- @elvischenv made their first contribution in #8895
- @timothygao8710 made their first contribution in #8769
- @ZhangGe6 made their first contribution in #6917
- @heyuhhh made their first contribution in #8682
- @zackyoray made their first contribution in #9075
- @dmtri35 made their first contribution in #8256
Full Changelog: v1.2.0rc2...v1.2.0rc3