Announcement Highlights
Model Support
- Add GPT-OSS Sm120/Sm121 support (#7937)
- Fix: Disable DeepGEMM for Qwen3 MoE Attention layers (#8087)
- Fix: Update is_post_quant_all2all_supported for MNNVL (#8355)
- Support quantized model for nano-v2-vlm (#8304); see the usage sketch after this list
- Fix: Address illegal access when scale is not provided in Llama3/4 (#7960)
- Fix: Correct Qwen2.5-VL device_path error (#8057)
- Add post-merge test for Seed-OSS-36B-Instruct (#8321)
- Fix: Correct get_num_tokens_per_image for nano-v2-vlm (#8425)
- Add Kimi multi-node test case (#8025)
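The quantized nano-v2-vlm support above can be exercised through the LLM API. A minimal text-only sketch, assuming a locally available quantized checkpoint (the path, prompt, and sampling values are placeholders; image inputs follow the multimodal examples in the repository):

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder path to a quantized nano-v2-vlm checkpoint directory.
llm = LLM(model="/path/to/nano-v2-vlm-quantized")

# Text-only smoke test; multimodal inputs follow the repository's VLM examples.
outputs = llm.generate(
    ["Describe the scene in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```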
API
Benchmark
- Add request timing breakdown option in benchmark_serving (#8128)
- Fix bench_serving import error (#8296)
- Update disagg benchmark configs (#8289)
- Add multimodal data to dummy requests during memory profiling (#7539)
- Save runtime report periodically (#8312)
- Resolve sampling defaults in OpenAI API backend (#8121); see the sketch after this list
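Given the sampling-defaults change above, requests to the OpenAI-compatible server are best made with sampling parameters pinned explicitly rather than relying on backend defaults. A minimal sketch using the standard `openai` client against a `trtllm-serve` endpoint (host, port, and model name are placeholders):

```python
from openai import OpenAI

# trtllm-serve exposes an OpenAI-compatible endpoint; URL and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Give me three facts about GPUs."}],
    # Pin sampling explicitly instead of relying on server-side defaults.
    temperature=0.8,
    top_p=0.95,
    max_tokens=128,
)
print(resp.choices[0].message.content)
```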
Feature
- Add new orchestrator type: Ray (#7520)
- Implement HTTP disagg-cluster management (#7869)
- Add PDL support for more kernels (#7977)
- Enable rejection sampling for CDL (#7731)
- Add torch.compile support for CUDA Core GEMM op (#8261)
- Support block-sparse attention in trtllm gen FMHA kernels (#8301)
- Support out-of-window (OOW) block detach for SWA KV cache reuse (#7922)
- Add factory TP sharding of quantized models (#8123)
- Turn off speculative decode based on acceptance-length threshold (#7283)
- Enable VLM subgraphs and CUDA graph/compile in AutoDeploy (#8203)
- Add sparse attention framework and RocketKV support (#8086)
- Implement etcd storage for disagg cluster (#8210)
- Export scale factors properly for W4A8 NVFP4-FP8 (#8180)
- Reuse CUDA graph memory pool in normal forward flow (#8095)
- Revise TileN-related routing calculation in MoE backend (#8148)
- Add DeepConf support (#8362)
- Support per-expert pre-quant scale factor for W4A8 AWQ MoE (PyTorch) (#7286)
- Support cached tokens for the OpenAI server (#7637); see the sketch after this list
- Add fmha_v2 kernel for head_dim=80 and SM100 to support VLM (#8392)
- Add topological graph helpers (#8457)
- Enable CUDA graph support for KvConnectorWorker API (#8275)
- Add chunked prefill support in AutoDeploy (#8158)
- Set NIXL as the default cache transceiver backend (#7926)
- Enable FP8 ContextMLA on GB300 (#8080)
- Skip unnecessary CUDA graph capture (#8050)
- Use device tensor index for MTP (#8062)
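The cached-tokens feature above surfaces prefix-cache hits in the usage statistics of the OpenAI-compatible server. A minimal sketch, assuming the count is reported through the standard OpenAI `usage.prompt_tokens_details.cached_tokens` field (endpoint and model name are placeholders; the guards below tolerate deployments that do not populate the field):

```python
from openai import OpenAI

# Assumes a running trtllm-serve instance with KV cache reuse enabled; names are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

resp = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Summarize the release notes above."}],
    max_tokens=64,
)

usage = resp.usage
details = getattr(usage, "prompt_tokens_details", None)
cached = getattr(details, "cached_tokens", None) if details else None
print(f"prompt_tokens={usage.prompt_tokens}, cached_tokens={cached}")
```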
Documentation
- Publish blog: Scaling Expert Parallelism in TensorRT LLM (Part 3) (#8323)
- Refine deployment guide by renaming TRT-LLM to TensorRT LLM (#8214)
- Document the role of d2t (#8174)
- Add Qwen3-next doc and L0 test case (#8288)
- Update AutoDeploy README: expert section on YAML configuration (#8370)
- Update TPOT/ITL docs (#8378)
- Add Ray orchestrator initial doc (#8373)
- Add documentation for CUDA 12.9 (#8411)
- Combine feature combination matrix documents (#8442)
- Add ATTRIBUTIONS-{CPP,Python}.md and update wheels setup (#8438)
What's Changed
- [None][feat] AutoDeploy: Nemotron-H accuracy test by @lucaslie in #8133
- [None][feat] AutoDeploy: graph/module inputs with kwargs instead of args by @lucaslie in #8137
- [TRTLLM-7349][feat] Adding new orchestrator type -- ray by @joyang-nv in #7520
- [None][autodeploy] small refactors on attention matching by @Fridah-nv in #8079
- [#5255][autodeploy] Update FuseAllreduceResidualRMSNorm to use pattern matcher utility; remove fuse_collective by @Fridah-nv in #7545
- [TRTLLM-8189][chore] enhance GenerationExecutor with RPC (part1) by @Superjomn in #5543
- [https://nvbugs/5521949][fix] Re-enable test_bielik_11b_v2_2_instruct_multi_lora, fix its API use with pytorch flow LoRA by @amitz-nv in #8146
- [None][fix] Adding docker folder to Dockerfile by @pcastonguay in #8138
- [None][chore] fix llmargs conflict by @Superjomn in #8152
- [TRTLLM-8413][chore] resolve sampling defaults in OpenAI API backend by @ixlmar in #8121
- [None][chore] AutoDeploy: clean up accuracy test configs by @lucaslie in #8134
- [None][fix] Eagle: Attention DP by @IzzyPutterman in #7939
- [None][feat] GPT-OSS Sm120/Sm121 Support by @farazkh80 in #7937
- [None][chore] Increase operations-per-run to 1000 for stale action by @karljang in #8162
- [None] [test] Add B300 cases to CI by @VALLIS-NERIA in #8056
- [None][infra] Skip failed cases for main by @EmmaQiaoCh in #8176
- [None][fix] Fix MTP illegal memory access by @mikeiovine in #8161
- [https://nvbugs/5488576][fix] Propagate disable_finalize_fusion config flag in WIDEEP MoE backend by @sklevtsov-nvidia in #8141
- [None][test] add test-model-suites option in integration conftest.py by @ruodil in #8016
- [https://nvbugs/5455140][fix] unwaive tests related to GB200 OOM by @lancelly in #8159
- [https://nvbugs/5550283][fix] update test case to the latest MoE API by @xxi-nv in #8165
- [TRTLLM-8414][chore] BREAKING CHANGE: refine sampling strategy selection by @ixlmar in #8132
- [None][chore] Waive some tests failing on main post merge by @brb-nv in #8186
- [https://nvbugs/5541545][fix] Remove test_llama4 by @mikeiovine in #8031
- [https://nvbugs/5522746][fix] unwaive tests caused by node issues after rebooting by @lancelly in #8193
- [None][fix] Restrict tinygemm use to certain SMs by @dongfengy in #8182
- [None][ci] move some llama4 test cases to pre merge by @QiJune in #8189
- [TRTLLM-7846][feat] HTTP disagg-cluster management implementation by @reasonsolo in #7869
- [https://nvbugs/5516666][fix] unwaive some Qwen3 CI tests by @byshiue in #8130
- [None][doc] Refine deployment guide by renaming TRT-LLM to TensorRT L… by @nv-guomingz in #8214
- [None][ci] pin flashinfer-python version by @QiJune in #8217
- [None][chore] Restore asserts in pytorch flow LoRA tests by @amitz-nv in #8227
- [None][infra] Waive failed tests on main 10/09 by @EmmaQiaoCh in #8230
- [TRTLLM-7769][chore] document the role of 'd2t' by @ixlmar in #8174
- [https://nvbugs/5501820][fix] Add requirements for numba-cuda version to WAR mem corruption by @pengbowang-nv in #7992
- [None][fix] Enable FP8 ContextMLA on GB300 by @longlee0622 in #8080
- [None][chore] Remove closed bugs by @xinhe-nv in #8151
- [None][chore] Print log with time for starting to load safetensor weights by @HuiGao-NV in #8218
- [None][fix] Add failed cases into waives.txt by @xinhe-nv in #8229
- [https://nvbugs/5547416][fix] unwaive no_cache test by @byshiue in #8213
- [None][fix] add gc for test fixture by @xinhe-nv in #8220
- [https://nvbugs/5558167][fix] update canceled_req_ids correctly for canceled requests by @QiJune in #8207
- [None][fix] Add Lock to protect mRequestToSession by @chuangz0 in #8085
- [None][feat] Add request timing breakdown option in benchmark_serving by @nv-yilinf in #8128
- [TRTLLM-6748][feat] add PDL support for more kernels by @dc3671 in #7977
- [https://nvbugs/5534705][fix] Skip unnecessary CUDA graph capture by @ziyixiong-nv in #8050
- [None][chore] Waive failing pre-merge test on main by @brb-nv in #8282
- [None][infra] Remove WAR code for GH200 node by @ZhanruiSunCh in #8266
- [TRTLLM-7384][feat] enable rejection sampling for CDL by @kris1025 in #7731
- [None][infra] Skip failed cases for main branch by @EmmaQiaoCh in #8293
- [None][fix] AD test_trtllm_bench to use small model config and skip loading weights by @MrGeva in #8149
- [https://nvbugs/5510879][fix] Fix pytorch & TRT-python flows fused LoRA adapter modules weight split with TP>1 by @amitz-nv in #8063
- [None][doc] Fix several invalid ref links in deployment guide sections. by @nv-guomingz in #8287
- [None][doc] Add qwen3-next doc into deployment guide and test case into L0. by @nv-guomingz in #8288
- [None][feat] Add torch compile support for cuda core GEMM OP by @DylanChen-NV in #8261
- [None][fix] add timeout for llama4 by @xinhe-nv in #8254
- [https://nvbugs/5503138] [fix] Remove compile warnings by @VALLIS-NERIA in #8167
- [None][fix] Fix bench_serving import error by @nv-yilinf in #8296
- [TRTLLM-8477][chore] Replace KvCacheConfigCpp with KvCacheConfig inside PyExecutor by @leslie-fang25 in #8259
- [None][infra] Update comments for pre-merge GB200 multi-node testing stage by @EmmaQiaoCh in #8281
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #8290
- [https://nvbugs/5441729][test] Fix test_modeling_llama_min_latency.py failures by @nvpohanh in #7478
- [None][fix] Fix EventLoopShutdownError by @dcaox in #8260
- [None][chore] Update disagg benchmark configs by @qiaoxj07 in #8289
- [TRTLLM-8536][feat] Update trtllm gen fmha kernels to support block sparse attention by @lfr-0531 in #8301
- [https://nvbugs/5521949][fix] Replace test_codellama_fp8_with_bf16_lora with test_llama_3_1_8b_fp8_with_bf16_lora by @amitz-nv in #8199
- [TRTLLM-4517] [feat] Additional model outputs by @Funatiq in #7206
- [None] [blog] Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary) by @kaiyux in #8323
- [None] [doc] Update README by @kaiyux in #8326
- [TLLM-6777][feature] Support SWA KV cache reuse OOW block detach by @eopXD in #7922
- [None][fix] workaround for numexpr issue by @ixlmar in #8327
- [None][fix] Avoid unnecessary concat in attn_output_gate case. by @yuxianq in #8094
- [TRTLLM-6342][feat] Factory TP sharding of quantized models by @greg-kwasniewski1 in #8123
- [TRTLLM-7412][feat] Turn off spec decode when the rolling average acceptance length drops below threshold. by @zheyuf in #7283
- [None][feat] AutoDeploy: VLMs with subgraphs + cudagraph/compile by @lucaslie in #8203
- [None][fix] Disable DeepGEMM for Qwen3 MoE Attention layers by @achartier in #8087
- [None][fix] Fix dummy load format for key models. by @yuxianq in #7993
- [None][infra] Pin numexpr in requirements.txt by @yuanjingx87 in #8343
- [TRTLLM-8366][feat] add kimi multi nodes case by @xinhe-nv in #8025
- [https://nvbugs/5542878][fix] Unwaive test by @2ez4bz in #8027
- [None][feat] Move StreamGeneration to scaffolding main directory by @dcaox in #8347
- [None][ci] waive several rpc tests by @Superjomn in #8349
- [None][fix] Add lock for request_to_session in sendReadySignal by @chuangz0 in #8310
- [https://nvbugs/5404000][fix] Ensure consistency between firstTokenTime and lastTokenTime by @achartier in #8294
- [TRTLLM-8536][feat] Add the sparse attention framework and one use case--RocketKV support by @lfr-0531 in #8086
- [TRTLLM-8507][fix] Fix ray resource cleanup and error handling in LoRA test by @shuyixiong in #8175
- [https://nvbugs/5563469][fix] Temporarily disable test_nemotron_nano_8b_lora_torch in L0 due to Torch non-determinism by @moraxu in #8206
- [None][chore] AutoDeploy: Update expert section on yaml configuration in README by @lucaslie in #8370
- [None][fix] Fix is_post_quant_all2all_supported for MNNVL by @yuantailing in #8355
- [TRTLLM-7846][feat] implement etcd storage for disagg cluster by @reasonsolo in #8210
- [None][fix] Remove outdated test waives for GPTOSS by @dongfengy in #8183
- [TRTLLM-7351][infra] Add isolate marker for L0 by @EmmaQiaoCh in #7497
- [https://nvbugs/5547435][fix] Fix a merge conflict by @liji-nv in #8365
- [None] [docs] Update TPOT/ITL docs by @kaiyux in #8378
- [None][doc] Ray orchestrator initial doc by @hchings in #8373
- [OMNIML-2336][feat] w4a8 nvfp4 fp8 exports scale factor properly by @sychen52 in #8180
- [None][chore] set the default value of max_num_tokens explicitly by @QiJune in #8208
- [None][chore] update torch_dtype -> dtype in 'transformers' by @ixlmar in #8263
- [None][ci] move all llama4 test cases to post merge by @QiJune in #8387
- [TRTLLM-8551][feat] add cache_salt in LLM.generate and refactor test_return_logits.py by @ixlmar in #8317
- [TRTLLM-4501][feat] Add input tensor pre-hook function API for the tuning process. by @hyukn in #6924
- [None] [chore] Add OSS compliance to CODEOWNERS by @venkywonka in #8375
- [TRTLLM-8532][chore] clean warmup method of ModelEngine by @QiJune in #8264
- [None][fix] Refactor triton paddings by @dongfengy in #6980
- [None][feat] reuse cudagraph memory pool in normal forward flow by @HuiGao-NV in #8095
- [None][fix] Fix cache buffer size for window by @chuangz0 in #8320
- [https://nvbugs/5560921][fix] GenerationExecutor RPC by @Superjomn in #8209
- [None][feat] Revise the calculation related to TileN in routing of MOE TRTLLM backend by @ChristinaZ in #8148
- [TRTLLM-8579][feat] Support quantized model for nano-v2-vlm by @Wanli-Jiang in #8304
- [https://nvbugs/5541494] [fix] Remove waivers by @VALLIS-NERIA in #8353
- [None][feat] Dev DeepConf by @dcaox in #8362
- [https://nvbugs/5378031] [feat] W4A8 AWQ MoE supports Per Expert Pre-quant Scale Factor for PyT backend by @yumin066 in #7286
- [None][fix] Fix the error where checkpoint_dir is assigned as NONE wh… by @chinamaoge in #8401
- [https://nvbugs/5583261][ci] waive test_fetch_responses_streaming_sync by @Superjomn in #8407
- [None][chore] Isolate several intermittent cases by @HuiGao-NV in #8408
- [https://nvbugs/5532789] [doc] Add documents about CUDA 12.9 by @VALLIS-NERIA in #8411
- [TRTLLM-8638][fix] waive llama4 tests on H20 by @xinhe-nv in #8416
- [None][feat] Support cached tokens for Openai server by @wjueyao in #7637
- [https://nvbugs/5461761][fix] Unwaive eagle3 test by @sunnyqgg in #8363
- [None][chore] Mass integration of release/1.1 by @mikeiovine in #8200
- [TRTLLM-6780][fix] Add multimodal data to dummy requests during memory profiling by @johncalesp in #7539
- [None][bug] Set NCCL_GRAPH_REGISTER to false to avoid hang by @Tabrizian in #8413
- [None][infra] Fix for generate lockfile pipeline by @yuanjingx87 in #7820
- [None][infra] Update CI allowed list 2025_10_15 by @yuanjingx87 in #8403
- [https://nvbugs/5540138][fix] Fix shape error when duplicating kv. by @Tracin in #8390
- [TRTLLM-8580][test] save runtime report periodically by @crazydemo in #8312
- [None][test] Filter out all fp8 test case for A100. by @yufeiwu-nv in #8420
- [None][chore] Combine two documents of feature combination matrix by @leslie-fang25 in #8442
- [None][chore] Update commit msg for adding lock files by @chzblych in #8448
- [None][test] Add post merge test for Seed-OSS-36B-Instruct by @zhhuang-nv in #8321
- [None][fix] trtllm-gen regression in PR 8301 by @PerkzZheng in #8426
- [TRTLLM-8638][fix] add waives tests by @xinhe-nv in #8445
- [None][feat] Add fmha_v2 kernel for head_dim=80 and sm=100 to support VLM by @Wanli-Jiang in #8392
- [None] [chore] Add ATTRIBUTIONS-{CPP,Python}.md + Update in wheels setup by @venkywonka in #8438
- [TRTLLM-8201][feat] Topological graph helpers by @greg-kwasniewski1 in #8457
- [None][chore] AutoDeploy: cleanup old inference optimizer configs by @h-guo18 in #8039
- [TRTLLM-8683][chore] Migrate PluginConfig to Pydantic by @anish-shanbhag in #8277
- [None][feat] Enable CUDA graph support for KvConnectorWorker API by @nv-kmcgill53 in #8275
- [None][fix] Fix get_num_tokens_per_image for nano-v2-vlm by @Wanli-Jiang in #8425
- [TRTLLM-8480][chore] clean create_py_executor API by @QiJune in #8412
- [None][feat] AutoDeploy: chunked prefill support by @lucaslie in #8158
- [None][chore] Waive failing transceiver test by @brb-nv in #8473
- [None][fix] Fix KV event consumption by @jthomson04 in #6346
- [None][infra] Waive test for main branch on 10/18 by @EmmaQiaoCh in #8472
- [TRTLLM-7964][infra] Set nixl to default cache transceiver backend by @bo-nv in #7926
- [None][infra] Skip a failed case in pre-merge for main on 10/19 by @EmmaQiaoCh in #8479
New Contributors
- @joyang-nv made their first contribution in #7520
- @longlee0622 made their first contribution in #8080
- @shuyixiong made their first contribution in #8175
- @yumin066 made their first contribution in #7286
- @chinamaoge made their first contribution in #8401
- @wjueyao made their first contribution in #7637
- @h-guo18 made their first contribution in #8039
- @anish-shanbhag made their first contribution in #8277
- @nv-kmcgill53 made their first contribution in #8275
Full Changelog: v1.2.0rc0.post1...v1.2.0rc1