NVIDIA/TensorRT-LLM v1.2.0rc1

Pre-release

Announcement Highlights

  • Model Support

    • Add GPT-OSS Sm120/Sm121 support (#7937)
    • Fix: Disable DeepGEMM for Qwen3 MoE Attention layers (#8087)
    • Fix: Update is_post_quant_all2all_supported for MNNVL (#8355)
    • Support quantized model for nano-v2-vlm (#8304)
    • Fix: Address illegal access when scale is not provided in Llama3/4 (#7960)
    • Fix: Correct Qwen2.5-VL device_path error (#8057)
    • Add post-merge test for Seed-OSS-36B-Instruct (#8321)
    • Fix: Correct get_num_tokens_per_image for nano-v2-vlm (#8425)
    • Add Kimi multi-node test case (#8025)
  • API

    • Refine sampling strategy selection (BREAKING CHANGE) (#8132)
    • Add cache_salt in LLM.generate (#8317); see the cache_salt sketch after this list
    • Add input tensor pre-hook function API for tuning (#6924)
    • Support additional model outputs (#7206)
    • Clean create_py_executor API (#8412)
  • Benchmark

    • Add request timing breakdown option in benchmark_serving (#8128)
    • Fix bench_serving import error (#8296)
    • Update disagg benchmark configs (#8289)
    • Add multimodal data to dummy requests during memory profiling (#7539)
    • Save runtime report periodically (#8312)
    • Resolve sampling defaults in OpenAI API backend (#8121)
  • Feature

    • Add new orchestrator type: Ray (#7520)
    • Implement HTTP disagg-cluster management (#7869)
    • Add PDL support for more kernels (#7977)
    • Enable rejection sampling for CDL (#7731)
    • Add torch.compile support for CUDA Core GEMM op (#8261)
    • Support block-sparse attention in trtllm gen FMHA kernels (#8301)
    • Support out-of-window (OOW) block detach for SWA KV cache reuse (#7922)
    • Add factory TP sharding of quantized models (#8123)
    • Turn off speculative decoding when the rolling-average acceptance length drops below a threshold (#7283); see the acceptance-length sketch after this list
    • Enable VLM subgraphs and CUDA graph/compile in AutoDeploy (#8203)
    • Add sparse attention framework and RocketKV support (#8086)
    • Implement etcd storage for disagg cluster (#8210)
    • Export scale factors properly for W4A8 NVFP4/FP8 (#8180)
    • Reuse CUDA graph memory pool in normal forward flow (#8095)
    • Revise TileN-related routing calculation in MoE backend (#8148)
    • Develop DeepConf (#8362)
    • Support per-expert pre-quant scale factor for W4A8 AWQ MoE (PyTorch) (#7286)
    • Support cached tokens for OpenAI server (#7637)
    • Add fmha_v2 kernel for head_dim=80 and SM100 to support VLM (#8392)
    • Add topological graph helpers (#8457)
    • Enable CUDA graph support for KvConnectorWorker API (#8275)
    • Add chunked prefill support in AutoDeploy (#8158)
    • Set NIXL as the default cache transceiver backend (#7926)
    • Enable FP8 ContextMLA on GB300 (#8080)
    • Skip unnecessary CUDA graph capture (#8050)
    • Use device tensor index for MTP (#8062)
  • Documentation

    • Publish blog: Scaling Expert Parallelism in TensorRT LLM (Part 3) (#8323)
    • Refine deployment guide by renaming TRT-LLM to TensorRT LLM (#8214)
    • Document the role of d2t (#8174)
    • Add Qwen3-next doc and L0 test case (#8288)
    • Update AutoDeploy README: expert section on YAML configuration (#8370)
    • Update TPOT/ITL docs (#8378)
    • Add Ray orchestrator initial doc (#8373)
    • Add documentation for CUDA 12.9 (#8411)
    • Combine feature combination matrix documents (#8442)
    • Add ATTRIBUTIONS-{CPP,Python}.md and update wheels setup (#8438)
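
As a quick illustration of the new cache_salt option (#8317), here is a minimal sketch. It assumes cache_salt is accepted as a per-request keyword argument of LLM.generate and that requests with different salts do not share reused KV-cache blocks; the model path is a placeholder, and the exact parameter placement may differ in your build.

```python
# Minimal sketch of cache_salt-based KV-cache isolation (see #8317).
# Assumptions: cache_salt is a per-request keyword argument of LLM.generate,
# and requests with different salts never share reused cache blocks.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model path
params = SamplingParams(max_tokens=32)

prompt = "Summarize the v1.2.0rc1 release notes."

# Same salt -> eligible for prefix-cache reuse; different salt -> isolated.
out_a = llm.generate([prompt], params, cache_salt="tenant-a")
out_b = llm.generate([prompt], params, cache_salt="tenant-b")

print(out_a[0].outputs[0].text)
print(out_b[0].outputs[0].text)
```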

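The speculative-decoding auto-off feature (#7283) gates drafting on a rolling average of the acceptance length. The snippet below is a standalone illustrative sketch of that idea, not the TensorRT-LLM implementation: it records how many draft tokens were accepted each step and disables speculation once the windowed average falls below a threshold.

```python
# Illustrative sketch (not the TensorRT-LLM implementation) of the idea behind
# #7283: disable speculative decoding when the rolling-average acceptance
# length over recent steps drops below a threshold.
from collections import deque

class SpecDecodeGate:
    def __init__(self, threshold: float = 1.5, window: int = 4):
        self.threshold = threshold            # minimum average accepted draft tokens per step
        self.history = deque(maxlen=window)   # acceptance lengths of recent steps
        self.enabled = True

    def record(self, accepted_tokens: int) -> None:
        """Record how many draft tokens were accepted in one decoding step."""
        self.history.append(accepted_tokens)
        if len(self.history) == self.history.maxlen:
            avg = sum(self.history) / len(self.history)
            # Drafting costs more than it saves once acceptance is this low.
            self.enabled = avg >= self.threshold

# Usage: check gate.enabled before running the draft model for the next step.
gate = SpecDecodeGate(threshold=1.5, window=4)
for accepted in (3, 2, 0, 1, 0, 0):
    gate.record(accepted)
print(gate.enabled)  # False: the windowed average fell below 1.5
```

A real integration would also need a policy for when, if ever, to re-enable speculation; this sketch omits that for brevity.
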
What's Changed

  • [None][feat] AutoDeploy: Nemotron-H accuracy test by @lucaslie in #8133
  • [None][feat] AutoDeploy: graph/module inputs with kwargs instead of args by @lucaslie in #8137
  • [TRTLLM-7349][feat] Adding new orchestrator type -- ray by @joyang-nv in #7520
  • [None][autodeploy] small refactors on attention matching by @Fridah-nv in #8079
  • [#5255][autodeploy] Update FuseAllreduceResidualRMSNorm to use pattern matcher utility; remove fuse_collective by @Fridah-nv in #7545
  • [TRTLLM-8189][chore] enhance GenerationExecutor with RPC (part1) by @Superjomn in #5543
  • [https://nvbugs/5521949][fix] Re-enable test_bielik_11b_v2_2_instruct_multi_lora, fix its API use with pytorch flow LoRA by @amitz-nv in #8146
  • [None][fix] Adding docker folder to Dockerfile by @pcastonguay in #8138
  • [None][chore] fix llmargs conflict by @Superjomn in #8152
  • [TRTLLM-8413][chore] resolve sampling defaults in OpenAI API backend by @ixlmar in #8121
  • [None][chore] AutoDeploy: clean up accuracy test configs by @lucaslie in #8134
  • [None][fix] Eagle: Attention DP by @IzzyPutterman in #7939
  • [None][feat] GPT-OSS Sm120/Sm121 Support by @farazkh80 in #7937
  • [None][chore] Increase operations-per-run to 1000 for stale action by @karljang in #8162
  • [None] [test] Add B300 cases to CI by @VALLIS-NERIA in #8056
  • [None][infra] Skip failed cases for main by @EmmaQiaoCh in #8176
  • [None][fix] Fix MTP illegal memory access by @mikeiovine in #8161
  • [https://nvbugs/5488576][fix] Propagate disable_finalize_fusion config flag in WIDEEP MoE backend by @sklevtsov-nvidia in #8141
  • [None][test] add test-model-suites option in integration conftest.py by @ruodil in #8016
  • [https://nvbugs/5455140][fix] unwaive tests related to GB200 OOM by @lancelly in #8159
  • [https://nvbugs/5550283][fix] update test case to the latest MoE API by @xxi-nv in #8165
  • [TRTLLM-8414][chore] BREAKING CHANGE: refine sampling strategy selection by @ixlmar in #8132
  • [None][chore] Waive some tests failing on main post merge by @brb-nv in #8186
  • [https://nvbugs/5541545][fix] Remove test_llama4 by @mikeiovine in #8031
  • [https://nvbugs/5522746][fix] unwaive tests caused by node issues after rebooting by @lancelly in #8193
  • [None][fix] Restrict tinygemm use to certain SMs by @dongfengy in #8182
  • [None][ci] move some llama4 test cases to pre merge by @QiJune in #8189
  • [TRTLLM-7846][feat] Http disagg-cluster management implementation by @reasonsolo in #7869
  • [https://nvbugs/5516666][fix] unwaive some Qwen3 CI tests by @byshiue in #8130
  • [None][doc] Refine deployment guide by renaming TRT-LLM to TensorRT L… by @nv-guomingz in #8214
  • [None][ci] pin flashinfer-python version by @QiJune in #8217
  • [None][chore] Restore asserts in pytorch flow LoRA tests by @amitz-nv in #8227
  • [None][infra] Waive failed tests on main 10/09 by @EmmaQiaoCh in #8230
  • [TRTLLM-7769][chore] document the role of 'd2t' by @ixlmar in #8174
  • [https://nvbugs/5501820][fix] Add requirements for numba-cuda version to WAR mem corruption by @pengbowang-nv in #7992
  • [None][fix] Enable FP8 ContextMLA on GB300 by @longlee0622 in #8080
  • [None][chore] Remove closed bugs by @xinhe-nv in #8151
  • [None][chore] Print log with time for starting to load safetensor weights by @HuiGao-NV in #8218
  • [None][fix] Add failed cases into waives.txt by @xinhe-nv in #8229
  • [https://nvbugs/5547416][fix] unwaive no_cache test by @byshiue in #8213
  • [None][fix] add gc for test fixture by @xinhe-nv in #8220
  • [https://nvbugs/5558167][fix] update canceled_req_ids correctly for canceled requests by @QiJune in #8207
  • [None][fix] Add Lock to protect mReqeustToSession by @chuangz0 in #8085
  • [None][feat] Add request timing breakdown option in benchmark_serving by @nv-yilinf in #8128
  • [TRTLLM-6748][feat] add PDL support for more kernels by @dc3671 in #7977
  • [https://nvbugs/5534705][fix] Skip unnecessary CUDA graph capture by @ziyixiong-nv in #8050
  • [None][chore] Waive failing pre-merge test on main by @brb-nv in #8282
  • [None][infra] Remove WAR code for GH200 node by @ZhanruiSunCh in #8266
  • [TRTLLM-7384][feat] enable rejection sampling for CDL by @kris1025 in #7731
  • [None][infra] Skip failed cases for main branch by @EmmaQiaoCh in #8293
  • [None][fix] AD test_trtllm_bench to use small model config and skip loading weights by @MrGeva in #8149
  • [https://nvbugs/5510879][fix] Fix pytorch & TRT-python flows fused LoRA adapter modules weight split with TP>1 by @amitz-nv in #8063
  • [None][doc] Fix several invalid ref links in deployment guide sections. by @nv-guomingz in #8287
  • [None][doc] Add qwen3-next doc into deployment guide and test case into L0. by @nv-guomingz in #8288
  • [None][feat] Add torch compile support for cuda core GEMM OP by @DylanChen-NV in #8261
  • [None][fix] add timeout for llama4 by @xinhe-nv in #8254
  • [https://nvbugs/5503138] [fix] Remove compile warnings by @VALLIS-NERIA in #8167
  • [None][fix] Fix bench_serving import error by @nv-yilinf in #8296
  • [TRTLLM-8477][chore] Replace KvCacheConfigCpp with KvCacheConfig inside PyExecutor by @leslie-fang25 in #8259
  • [None][infra] Update comments for pre-merge GB200 multi-node testing stage by @EmmaQiaoCh in #8281
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #8290
  • [https://nvbugs/5441729][test] Fix test_modeling_llama_min_latency.py failures by @nvpohanh in #7478
  • [None][fix] Fix EventLoopShutdownError by @dcaox in #8260
  • [None][chore] Update disagg benchmark configs by @qiaoxj07 in #8289
  • [TRTLLM-8536][feat] Update trtllm gen fmha kernels to support block sparse attention by @lfr-0531 in #8301
  • [https://nvbugs/5521949][fix] Replace test_codellama_fp8_with_bf16_lora with test_llama_3_1_8b_fp8_with_bf16_lora by @amitz-nv in #8199
  • [TRTLLM-4517] [feat] Additional model outputs by @Funatiq in #7206
  • [None] [blog] Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary) by @kaiyux in #8323
  • [None] [doc] Update README by @kaiyux in #8326
  • [TLLM-6777][feature] Support SWA KV cache reuse OOW block detach by @eopXD in #7922
  • [None][fix] workaround for numexpr issue by @ixlmar in #8327
  • [None][fix] Avoid unnecessary concat in attn_output_gate case. by @yuxianq in #8094
  • [TRTLLM-6342][feat] Factory TP sharding of quantized models by @greg-kwasniewski1 in #8123
  • [TRTLLM-7412][feat] Turn off spec decode when the rolling average acceptance length drops below threshold. by @zheyuf in #7283
  • [None][feat] AutoDeploy: VLMs with subgraphs + cudagraph/compile by @lucaslie in #8203
  • [None][fix] Disable DeepGEMM for Qwen3 MoE Attention layers by @achartier in #8087
  • [None][fix] Fix dummy load format for key models. by @yuxianq in #7993
  • [None][infra] Pin numexpr in requirements.txt by @yuanjingx87 in #8343
  • [TRTLLM-8366][feat] add kimi multi nodes case by @xinhe-nv in #8025
  • [https://nvbugs/5542878][fix] Unwaive test by @2ez4bz in #8027
  • [None][feat] Move StreamGeneration to scaffolding main directory by @dcaox in #8347
  • [None][ci] waive several rpc tests by @Superjomn in #8349
  • [None][fix] Add lock for request_to_session in sendReadySingal by @chuangz0 in #8310
  • [https://nvbugs/5404000][fix] Ensure consistency between firstTokenTime and lastTokenTime by @achartier in #8294
  • [TRTLLM-8536][feat] Add the sparse attention framework and one use case--RocketKV support by @lfr-0531 in #8086
  • [TRTLLM-8507][fix] Fix ray resource cleanup and error handling in LoRA test by @shuyixiong in #8175
  • [https://nvbugs/5563469][fix] Temporarily disable test_nemotron_nano_8b_lora_torch in L0 due to Torch non-determinism by @moraxu in #8206
  • [None][chore] AutoDeploy: Update expert section on yaml configuration in README by @lucaslie in #8370
  • [None][fix] Fix is_post_quant_all2all_supported for MNNVL by @yuantailing in #8355
  • [TRTLLM-7846][feat] implement etcd storage for disagg cluster by @reasonsolo in #8210
  • [None][fix] Remove outdated test waives for GPTOSS by @dongfengy in #8183
  • [TRTLLM-7351][infra] Add isolate marker for L0 by @EmmaQiaoCh in #7497
  • [https://nvbugs/5547435][fix] Fix a merge conflict by @liji-nv in #8365
  • [None] [docs] Update TPOT/ITL docs by @kaiyux in #8378
  • [None][doc] Ray orchestrator initial doc by @hchings in #8373
  • [OMNIML-2336][feat] w4a8 nvfp4 fp8 exports scale factor properly by @sychen52 in #8180
  • [None][chore] set the default value of max_num_tokens explicitly by @QiJune in #8208
  • [None][chore] update torch_dtype -> dtype in 'transformers' by @ixlmar in #8263
  • [None][ci] move all llama4 test cases to post merge by @QiJune in #8387
  • [TRTLLM-8551][feat] add cache_salt in LLM.generate and refactor test_return_logits.py by @ixlmar in #8317
  • [TRTLLM-4501][feat] Add input tensor pre-hook function API for the tuning process. by @hyukn in #6924
  • [None] [chore] Add OSS compliance to CODEOWNERS by @venkywonka in #8375
  • [TRTLLM-8532][chore] clean warmup method of ModelEngine by @QiJune in #8264
  • [None][fix] Refactor triton paddings by @dongfengy in #6980
  • [None][feat] reuse cudagraph memory pool in normal forward flow by @HuiGao-NV in #8095
  • [None][fix] Fix cache buffer size for window by @chuangz0 in #8320
  • [https://nvbugs/5560921][fix] GenerationExecutor RPC by @Superjomn in #8209
  • [None][feat] Revise the calculation related to TileN in routing of MOE TRTLLM backend by @ChristinaZ in #8148
  • [TRTLLM-8579][feat] Support quantized model for nano-v2-vlm by @Wanli-Jiang in #8304
  • [https://nvbugs/5541494] [fix] Remove waivers by @VALLIS-NERIA in #8353
  • [None][feat] Dev DeepConf by @dcaox in #8362
  • [https://nvbugs/5378031] [feat] W4A8 AWQ MoE supports Per Expert Pre-quant Scale Factor for PyT backend by @yumin066 in #7286
  • [None][fix] Fix the error where checkpoint_dir is assigned as NONE wh… by @chinamaoge in #8401
  • [https://nvbugs/5583261][ci] waive test_fetch_responses_streaming_sync by @Superjomn in #8407
  • [None][chore] Isolate several intermittent cases by @HuiGao-NV in #8408
  • [https://nvbugs/5532789] [doc] Add documents about CUDA 12.9 by @VALLIS-NERIA in #8411
  • [TRTLLM-8638][fix] waive llama4 tests on H20 by @xinhe-nv in #8416
  • [None][feat] Support cached tokens for Openai server by @wjueyao in #7637
  • [https://nvbugs/5461761][fix] Unwaive eagle3 test by @sunnyqgg in #8363
  • [None][chore] Mass integration of release/1.1 by @mikeiovine in #8200
  • [TRTLLM-6780][fix] Add multimodal data to dummy requests during memory profiling by @johncalesp in #7539
  • [None][bug] Set NCCL_GRAPH_REGISTER to false to avoid hang by @Tabrizian in #8413
  • [None][infra] Fix for generate lockfile pipeline by @yuanjingx87 in #7820
  • [None][infra] Update CI allowed list 2025_10_15 by @yuanjingx87 in #8403
  • [https://nvbugs/5540138][fix] Fix shape error when duplicating kv. by @Tracin in #8390
  • [TRTLLM-8580][test] save runtime report periodically by @crazydemo in #8312
  • [None][test] Filter out all fp8 test case for A100. by @yufeiwu-nv in #8420
  • [None][chore] Combine two documents of feature combination matrix by @leslie-fang25 in #8442
  • [None][chore] Update commit msg for adding lock files by @chzblych in #8448
  • [None][test] Add post merge test for Seed-OSS-36B-Instruct by @zhhuang-nv in #8321
  • [None][fix] trtllm-gen regression in PR 8301 by @PerkzZheng in #8426
  • [TRTLLM-8638][fix] add waives tests by @xinhe-nv in #8445
  • [None][feat] Add fmha_v2 kernel for head_dim=80 and sm=100 to support VLM by @Wanli-Jiang in #8392
  • [None] [chore] Add ATTRIBUTIONS-{CPP,Python}.md + Update in wheels setup by @venkywonka in #8438
  • [TRTLLM-8201][feat] Topological graph helpers by @greg-kwasniewski1 in #8457
  • [None][chore] AutoDeploy: cleanup old inference optimizer configs by @h-guo18 in #8039
  • [TRTLLM-8683][chore] Migrate PluginConfig to Pydantic by @anish-shanbhag in #8277
  • [None][feat] Enable CUDA graph support for KvConnectorWorker API by @nv-kmcgill53 in #8275
  • [None][fix] Fix get_num_tokens_per_image for nano-v2-vlm by @Wanli-Jiang in #8425
  • [TRTLLM-8480][chore] clean create_py_executor API by @QiJune in #8412
  • [None][feat] AutoDeploy: chunked prefill support by @lucaslie in #8158
  • [None][chore] Waive failing transceiver test by @brb-nv in #8473
  • [None][fix] Fix KV event consumption by @jthomson04 in #6346
  • [None][infra] Waive test for main branch on 10/18 by @EmmaQiaoCh in #8472
  • [TRTLLM-7964][infra] Set nixl to default cache transceiver backend by @bo-nv in #7926
  • [None][infra] Skip a failed case in pre-merge for main on 10/19 by @EmmaQiaoCh in #8479

New Contributors

  • @joyang-nv made their first contribution in #7520
  • @longlee0622 made their first contribution in #8080
  • @shuyixiong made their first contribution in #8175
  • @yumin066 made their first contribution in #7286
  • @chinamaoge made their first contribution in #8401
  • @wjueyao made their first contribution in #7637
  • @h-guo18 made their first contribution in #8039
  • @anish-shanbhag made their first contribution in #8277
  • @nv-kmcgill53 made their first contribution in #8275

Full Changelog: v1.2.0rc0.post1...v1.2.0rc1
