NVIDIA/TensorRT-LLM v1.2.0rc1

Pre-release

Announcement Highlights

  • Model Support

    • Add GPT-OSS Sm120/Sm121 support (#7937)
    • Fix: Disable DeepGEMM for Qwen3 MoE Attention layers (#8087)
    • Fix: Update is_post_quant_all2all_supported for MNNVL (#8355)
    • Support quantized model for nano-v2-vlm (#8304)
    • Fix: Address illegal access when scale is not provided in Llama3/4 (#7960)
    • Fix: Correct Qwen2.5-VL device_path error (#8057)
    • Add post-merge test for Seed-OSS-36B-Instruct (#8321)
    • Fix: Correct get_num_tokens_per_image for nano-v2-vlm (#8425)
    • Add Kimi multi-node test case (#8025)
  • API

    • Refine sampling strategy selection (BREAKING CHANGE) (#8132)
    • Add cache_salt in LLM.generate (#8317); see the cache_salt sketch after this list
    • Add input tensor pre-hook function API for tuning (#6924)
    • Support additional model outputs (#7206)
    • Clean create_py_executor API (#8412)
  • Benchmark

    • Add request timing breakdown option in benchmark_serving (#8128)
    • Fix bench_serving import error (#8296)
    • Update disagg benchmark configs (#8289)
    • Add multimodal data to dummy requests during memory profiling (#7539)
    • Save runtime report periodically (#8312)
    • Resolve sampling defaults in OpenAI API backend (#8121)
  • Feature

    • Add new orchestrator type: Ray (#7520)
    • Implement HTTP disagg-cluster management (#7869)
    • Add PDL support for more kernels (#7977)
    • Enable rejection sampling for CDL (#7731)
    • Add torch.compile support for CUDA Core GEMM op (#8261)
    • Support block-sparse attention in trtllm gen FMHA kernels (#8301)
    • Support out-of-window (OOW) block detach for SWA KV cache reuse (#7922)
    • Add factory TP sharding of quantized models (#8123)
    • Turn off speculative decoding when the rolling-average acceptance length drops below a threshold (#7283); see the acceptance-length sketch after this list
    • Enable VLM subgraphs and CUDA graph/compile in AutoDeploy (#8203)
    • Add sparse attention framework and RocketKV support (#8086)
    • Implement etcd storage for disagg cluster (#8210)
    • Export scale factors properly for W4A8 NVFP4/FP8 (#8180)
    • Reuse CUDA graph memory pool in normal forward flow (#8095)
    • Revise TileN-related routing calculation in MoE backend (#8148)
    • Develop DeepConf (#8362)
    • Support per-expert pre-quant scale factor for W4A8 AWQ MoE (PyTorch) (#7286)
    • Support cached tokens for OpenAI server (#7637)
    • Add fmha_v2 kernel for head_dim=80 and SM100 to support VLM (#8392)
    • Add topological graph helpers (#8457)
    • Enable CUDA graph support for KvConnectorWorker API (#8275)
    • Add chunked prefill support in AutoDeploy (#8158)
    • Set NIXL as the default cache transceiver backend (#7926)
    • Enable FP8 ContextMLA on GB300 (#8080)
    • Skip unnecessary CUDA graph capture (#8050)
    • Use device tensor index for MTP (#8062)
  • Documentation

    • Publish blog: Scaling Expert Parallelism in TensorRT LLM (Part 3) (#8323)
    • Refine deployment guide by renaming TRT-LLM to TensorRT LLM (#8214)
    • Document the role of d2t (#8174)
    • Add Qwen3-next doc and L0 test case (#8288)
    • Update AutoDeploy README: expert section on YAML configuration (#8370)
    • Update TPOT/ITL docs (#8378)
    • Add Ray orchestrator initial doc (#8373)
    • Add documentation for CUDA 12.9 (#8411)
    • Combine feature combination matrix documents (#8442)
    • Add ATTRIBUTIONS-{CPP,Python}.md and update wheels setup (#8438)
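
As a quick illustration of the new cache_salt option (#8317), here is a minimal sketch. It assumes cache_salt is accepted as a per-request keyword argument of LLM.generate and that requests with different salts do not share reused KV-cache blocks; the model path is a placeholder, and the exact parameter placement may differ in your build.

```python
# Minimal sketch of cache_salt-based KV-cache isolation (see #8317).
# Assumptions: cache_salt is a per-request keyword argument of LLM.generate,
# and requests with different salts never share reused cache blocks.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model path
params = SamplingParams(max_tokens=32)

prompt = "Summarize the v1.2.0rc1 release notes."

# Same salt -> eligible for prefix-cache reuse; different salt -> isolated.
out_a = llm.generate([prompt], params, cache_salt="tenant-a")
out_b = llm.generate([prompt], params, cache_salt="tenant-b")

print(out_a[0].outputs[0].text)
print(out_b[0].outputs[0].text)
```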

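The speculative-decoding auto-off feature (#7283) gates drafting on a rolling average of the acceptance length. The snippet below is a standalone illustrative sketch of that idea, not the TensorRT-LLM implementation: it records how many draft tokens were accepted each step and disables speculation once the windowed average falls below a threshold.

```python
# Illustrative sketch (not the TensorRT-LLM implementation) of the idea behind
# #7283: disable speculative decoding when the rolling-average acceptance
# length over recent steps drops below a threshold.
from collections import deque

class SpecDecodeGate:
    def __init__(self, threshold: float = 1.5, window: int = 4):
        self.threshold = threshold            # minimum average accepted draft tokens per step
        self.history = deque(maxlen=window)   # acceptance lengths of recent steps
        self.enabled = True

    def record(self, accepted_tokens: int) -> None:
        """Record how many draft tokens were accepted in one decoding step."""
        self.history.append(accepted_tokens)
        if len(self.history) == self.history.maxlen:
            avg = sum(self.history) / len(self.history)
            # Drafting costs more than it saves once acceptance is this low.
            self.enabled = avg >= self.threshold

# Usage: check gate.enabled before running the draft model for the next step.
gate = SpecDecodeGate(threshold=1.5, window=4)
for accepted in (3, 2, 0, 1, 0, 0):
    gate.record(accepted)
print(gate.enabled)  # False: the windowed average fell below 1.5
```

A real integration would also need a policy for when, if ever, to re-enable speculation; this sketch omits that for brevity.
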
What's Changed

  • [None][feat] AutoDeploy: Nemotron-H accuracy test by @lucaslie in #8133
  • [None][feat] AutoDeploy: graph/module inputs with kwargs instead of args by @lucaslie in #8137
  • [TRTLLM-7349][feat] Adding new orchestrator type -- ray by @joyang-nv in #7520
  • [None][autodeploy] small refactors on attention matching by @Fridah-nv in #8079
  • [#5255][autodeploy] Update FuseAllreduceResidualRMSNorm to use pattern matcher utility; remove fuse_collective by @Fridah-nv in #7545
  • [TRTLLM-8189][chore] enhance GenerationExecutor with RPC (part1) by @Superjomn in #5543
  • [https://nvbugs/5521949][fix] Re-enable test_bielik_11b_v2_2_instruct_multi_lora, fix its API use with pytorch flow LoRA by @amitz-nv in #8146
  • [None][fix] Adding docker folder to Dockerfile by @pcastonguay in #8138
  • [None][chore] fix llmargs conflict by @Superjomn in #8152
  • [TRTLLM-8413][chore] resolve sampling defaults in OpenAI API backend by @ixlmar in #8121
  • [None][chore] AutoDeploy: clean up accuracy test configs by @lucaslie in #8134
  • [None][fix] Eagle: Attention DP by @IzzyPutterman in #7939
  • [None][feat] GPT-OSS Sm120/Sm121 Support by @farazkh80 in #7937
  • [None][chore] Increase operations-per-run to 1000 for stale action by @karljang in #8162
  • [None] [test] Add B300 cases to CI by @VALLIS-NERIA in #8056
  • [None][infra] Skip failed cases for main by @EmmaQiaoCh in #8176
  • [None][fix] Fix MTP illegal memory access by @mikeiovine in #8161
  • [https://nvbugs/5488576][fix] Propagate disable_finalize_fusion config flag in WIDEEP MoE backend by @sklevtsov-nvidia in #8141
  • [None][test] add test-model-suites option in integration conftest.py by @ruodil in #8016
  • [https://nvbugs/5455140][fix] unwaive tests related to GB200 OOM by @lancelly in #8159
  • [https://nvbugs/5550283][fix] update test case to the latest MoE API by @xxi-nv in #8165
  • [TRTLLM-8414][chore] BREAKING CHANGE: refine sampling strategy selection by @ixlmar in #8132
  • [None][chore] Waive some tests failing on main post merge by @brb-nv in #8186
  • [https://nvbugs/5541545][fix] Remove test_llama4 by @mikeiovine in #8031
  • [https://nvbugs/5522746][fix] unwaive tests caused by node issues after rebooting by @lancelly in #8193
  • [None][fix] Restrict tinygemm use to certain SMs by @dongfengy in #8182
  • [None][ci] move some llama4 test cases to pre merge by @QiJune in #8189
  • [TRTLLM-7846][feat] Http disagg-cluster management implementation by @reasonsolo in #7869
  • [https://nvbugs/5516666][fix] unwaive some Qwen3 CI tests by @byshiue in #8130
  • [None][doc] Refine deployment guide by renaming TRT-LLM to TensorRT L… by @nv-guomingz in #8214
  • [None][ci] pin flashinfer-python version by @QiJune in #8217
  • [None][chore] Restore asserts in pytorch flow LoRA tests by @amitz-nv in #8227
  • [None][infra] Waive failed tests on main 10/09 by @EmmaQiaoCh in #8230
  • [TRTLLM-7769][chore] document the role of 'd2t' by @ixlmar in #8174
  • [https://nvbugs/5501820][fix] Add requirements for numba-cuda version to WAR mem corruption by @pengbowang-nv in #7992
  • [None][fix] Enable FP8 ContextMLA on GB300 by @longlee0622 in #8080
  • [None][chore] Remove closed bugs by @xinhe-nv in #8151
  • [None][chore] Print log with time for starting to load safetensor weights by @HuiGao-NV in #8218
  • [None][fix] Add failed cases into waives.txt by @xinhe-nv in #8229
  • [https://nvbugs/5547416][fix] unwaive no_cache test by @byshiue in #8213
  • [None][fix] add gc for test fixture by @xinhe-nv in #8220
  • [https://nvbugs/5558167][fix] update canceled_req_ids correctly for canceled requests by @QiJune in #8207
  • [None][fix] Add Lock to protect mReqeustToSession by @chuangz0 in #8085
  • [None][feat] Add request timing breakdown option in benchmark_serving by @nv-yilinf in #8128
  • [TRTLLM-6748][feat] add PDL support for more kernels by @dc3671 in #7977
  • [https://nvbugs/5534705][fix] Skip unnecessary CUDA graph capture by @ziyixiong-nv in #8050
  • [None][chore] Waive failing pre-merge test on main by @brb-nv in #8282
  • [None][infra] Remove WAR code for GH200 node by @ZhanruiSunCh in #8266
  • [TRTLLM-7384][feat] enable rejection sampling for CDL by @kris1025 in #7731
  • [None][infra] Skip failed cases for main branch by @EmmaQiaoCh in #8293
  • [None][fix] AD test_trtllm_bench to use small model config and skip loading weights by @MrGeva in #8149
  • [https://nvbugs/5510879][fix] Fix pytorch & TRT-python flows fused LoRA adapter modules weight split with TP>1 by @amitz-nv in #8063
  • [None][doc] Fix several invalid ref links in deployment guide sections. by @nv-guomingz in #8287
  • [None][doc] Add qwen3-next doc into deployment guide and test case into L0. by @nv-guomingz in #8288
  • [None][feat] Add torch compile support for cuda core GEMM OP by @DylanChen-NV in #8261
  • [None][fix] add timeout for llama4 by @xinhe-nv in #8254
  • [https://nvbugs/5503138] [fix] Remove compile warnings by @VALLIS-NERIA in #8167
  • [None][fix] Fix bench_serving import error by @nv-yilinf in #8296
  • [TRTLLM-8477][chore] Replace KvCacheConfigCpp with KvCacheConfig inside PyExecutor by @leslie-fang25 in #8259
  • [None][infra] Update comments for pre-merge GB200 multi-node testing stage by @EmmaQiaoCh in #8281
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #8290
  • [https://nvbugs/5441729][test] Fix test_modeling_llama_min_latency.py failures by @nvpohanh in #7478
  • [None][fix] Fix EventLoopShutdownError by @dcaox in #8260
  • [None][chore] Update disagg benchmark configs by @qiaoxj07 in #8289
  • [TRTLLM-8536][feat] Update trtllm gen fmha kernels to support block sparse attention by @lfr-0531 in #8301
  • [https://nvbugs/5521949][fix] Replace test_codellama_fp8_with_bf16_lora with test_llama_3_1_8b_fp8_with_bf16_lora by @amitz-nv in #8199
  • [TRTLLM-4517] [feat] Additional model outputs by @Funatiq in #7206
  • [None] [blog] Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary) by @kaiyux in #8323
  • [None] [doc] Update README by @kaiyux in #8326
  • [TLLM-6777][feature] Support SWA KV cache reuse OOW block detach by @eopXD in #7922
  • [None][fix] workaround for numexpr issue by @ixlmar in #8327
  • [None][fix] Avoid unnecessary concat in attn_output_gate case. by @yuxianq in #8094
  • [TRTLLM-6342][feat] Factory TP sharding of quantized models by @greg-kwasniewski1 in #8123
  • [TRTLLM-7412][feat] Turn off spec decode when the rolling average acceptance length drops below threshold. by @zheyuf in #7283
  • [None][feat] AutoDeploy: VLMs with subgraphs + cudagraph/compile by @lucaslie in #8203
  • [None][fix] Disable DeepGEMM for Qwen3 MoE Attention layers by @achartier in #8087
  • [None][fix] Fix dummy load format for key models. by @yuxianq in #7993
  • [None][infra] Pin numexpr in requirements.txt by @yuanjingx87 in #8343
  • [TRTLLM-8366][feat] add kimi multi nodes case by @xinhe-nv in #8025
  • [https://nvbugs/5542878][fix] Unwaive test by @2ez4bz in #8027
  • [None][feat] Move StreamGeneration to scaffolding main directory by @dcaox in #8347
  • [None][ci] waive several rpc tests by @Superjomn in #8349
  • [None][fix] Add lock for request_to_session in sendReadySingal by @chuangz0 in #8310
  • [https://nvbugs/5404000][fix] Ensure consistency between firstTokenTime and lastTokenTime by @achartier in #8294
  • [TRTLLM-8536][feat] Add the sparse attention framework and one use case--RocketKV support by @lfr-0531 in #8086
  • [TRTLLM-8507][fix] Fix ray resource cleanup and error handling in LoRA test by @shuyixiong in #8175
  • [https://nvbugs/5563469][fix] Temporarily disable test_nemotron_nano_8b_lora_torch in L0 due to Torch non-determinism by @moraxu in #8206
  • [None][chore] AutoDeploy: Update expert section on yaml configuration in README by @lucaslie in #8370
  • [None][fix] Fix is_post_quant_all2all_supported for MNNVL by @yuantailing in #8355
  • [TRTLLM-7846][feat] implement etcd storage for disagg cluster by @reasonsolo in #8210
  • [None][fix] Remove outdated test waives for GPTOSS by @dongfengy in #8183
  • [TRTLLM-7351][infra] Add isolate marker for L0 by @EmmaQiaoCh in #7497
  • [https://nvbugs/5547435][fix] Fix a merge conflict by @liji-nv in #8365
  • [None] [docs] Update TPOT/ITL docs by @kaiyux in #8378
  • [None][doc] Ray orchestrator initial doc by @hchings in #8373
  • [OMNIML-2336][feat] w4a8 nvfp4 fp8 exports scale factor properly by @sychen52 in #8180
  • [None][chore] set the default value of max_num_tokens explicitly by @QiJune in #8208
  • [None][chore] update torch_dtype -> dtype in 'transformers' by @ixlmar in #8263
  • [None][ci] move all llama4 test cases to post merge by @QiJune in #8387
  • [TRTLLM-8551][feat] add cache_salt in LLM.generate and refactor test_return_logits.py by @ixlmar in #8317
  • [TRTLLM-4501][feat] Add input tensor pre-hook function API for the tuning process. by @hyukn in #6924
  • [None] [chore] Add OSS compliance to CODEOWNERS by @venkywonka in #8375
  • [TRTLLM-8532][chore] clean warmup method of ModelEngine by @QiJune in #8264
  • [None][fix] Refactor triton paddings by @dongfengy in #6980
  • [None][feat] reuse cudagraph memory pool in normal forward flow by @HuiGao-NV in #8095
  • [None][fix] Fix cache buffer size for window by @chuangz0 in #8320
  • [https://nvbugs/5560921][fix] GenerationExecutor RPC by @Superjomn in #8209
  • [None][feat] Revise the calculation related to TileN in routing of MOE TRTLLM backend by @ChristinaZ in #8148
  • [TRTLLM-8579][feat] Support quantized model for nano-v2-vlm by @Wanli-Jiang in #8304
  • [https://nvbugs/5541494] [fix] Remove waivers by @VALLIS-NERIA in #8353
  • [None][feat] Dev DeepConf by @dcaox in #8362
  • [https://nvbugs/5378031] [feat] W4A8 AWQ MoE supports Per Expert Pre-quant Scale Factor for PyT backend by @yumin066 in #7286
  • [None][fix] Fix the error where checkpoint_dir is assigned as NONE wh… by @chinamaoge in #8401
  • [https://nvbugs/5583261][ci] waive test_fetch_responses_streaming_sync by @Superjomn in #8407
  • [None][chore] Isolate several intermittent cases by @HuiGao-NV in #8408
  • [https://nvbugs/5532789] [doc] Add documents about CUDA 12.9 by @VALLIS-NERIA in #8411
  • [TRTLLM-8638][fix] waive llama4 tests on H20 by @xinhe-nv in #8416
  • [None][feat] Support cached tokens for Openai server by @wjueyao in #7637
  • [https://nvbugs/5461761][fix] Unwaive eagle3 test by @sunnyqgg in #8363
  • [None][chore] Mass integration of release/1.1 by @mikeiovine in #8200
  • [TRTLLM-6780][fix] Add multimodal data to dummy requests during memory profiling by @johncalesp in #7539
  • [None][bug] Set NCCL_GRAPH_REGISTER to false to avoid hang by @Tabrizian in #8413
  • [None][infra] Fix for generate lockfile pipeline by @yuanjingx87 in #7820
  • [None][infra] Update CI allowed list 2025_10_15 by @yuanjingx87 in #8403
  • [https://nvbugs/5540138][fix] Fix shape error when duplicating kv. by @Tracin in #8390
  • [TRTLLM-8580][test] save runtime report periodically by @crazydemo in #8312
  • [None][test] Filter out all fp8 test case for A100. by @yufeiwu-nv in #8420
  • [None][chore] Combine two documents of feature combination matrix by @leslie-fang25 in #8442
  • [None][chore] Update commit msg for adding lock files by @chzblych in #8448
  • [None][test] Add post merge test for Seed-OSS-36B-Instruct by @zhhuang-nv in #8321
  • [None][fix] trtllm-gen regression in PR 8301 by @PerkzZheng in #8426
  • [TRTLLM-8638][fix] add waives tests by @xinhe-nv in #8445
  • [None][feat] Add fmha_v2 kernel for head_dim=80 and sm=100 to support VLM by @Wanli-Jiang in #8392
  • [None] [chore] Add ATTRIBUTIONS-{CPP,Python}.md + Update in wheels setup by @venkywonka in #8438
  • [TRTLLM-8201][feat] Topological graph helpers by @greg-kwasniewski1 in #8457
  • [None][chore] AutoDeploy: cleanup old inference optimizer configs by @h-guo18 in #8039
  • [TRTLLM-8683][chore] Migrate PluginConfig to Pydantic by @anish-shanbhag in #8277
  • [None][feat] Enable CUDA graph support for KvConnectorWorker API by @nv-kmcgill53 in #8275
  • [None][fix] Fix get_num_tokens_per_image for nano-v2-vlm by @Wanli-Jiang in #8425
  • [TRTLLM-8480][chore] clean create_py_executor API by @QiJune in #8412
  • [None][feat] AutoDeploy: chunked prefill support by @lucaslie in #8158
  • [None][chore] Waive failing transceiver test by @brb-nv in #8473
  • [None][fix] Fix KV event consumption by @jthomson04 in #6346
  • [None][infra] Waive test for main branch on 10/18 by @EmmaQiaoCh in #8472
  • [TRTLLM-7964][infra] Set nixl to default cache transceiver backend by @bo-nv in #7926
  • [None][infra] Skip a failed case in pre-merge for main on 10/19 by @EmmaQiaoCh in #8479

New Contributors

  • @joyang-nv made their first contribution in #7520
  • @longlee0622 made their first contribution in #8080
  • @shuyixiong made their first contribution in #8175
  • @yumin066 made their first contribution in #7286
  • @chinamaoge made their first contribution in #8401
  • @wjueyao made their first contribution in #7637
  • @h-guo18 made their first contribution in #8039
  • @anish-shanbhag made their first contribution in #8277
  • @nv-kmcgill53 made their first contribution in #8275

Full Changelog: v1.2.0rc0.post1...v1.2.0rc1
