NVIDIA/TensorRT-LLM v1.1.0rc4

Pre-release

Announcement Highlights:

  • Model Support
    • Support phi-4 model in pytorch backend (#7371)
    • Support Aggregate mode for phi4-mm (#7521)
  • API
    • Implement basic functionality for the Responses API (#7341); a client sketch follows this list
    • Support multiple postprocess workers for chat completions API (#7508)
    • Report failing requests (#7060)
  • Benchmark
    • Test trtllm-serve with --extra_llm_api_options (#7492); a launch sketch follows this list
  • Feature
    • Add MoE support for dynamic cluster shapes and custom epilogue schedules (#6126)
    • Autotune TRT-LLM Gen MoE when using CUDA graphs (#7285)
    • Enable guided decoding with speculative decoding (part 2: one-model engine) (#6948); a usage sketch follows this list
    • Separate run_shape_prop as another graph utility (#7313)
    • MultiLayer Eagle (#7234)
    • Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec (#7481)
    • Add NVFP4 x FP8 (#6809)
    • Support hashing and KV cache reuse for videos (#7360)
    • Add MCTS and TOT tree-based inference controllers to Scaffolding (#7490)
    • Introduce QKNormRoPEAttention module (#6830)
    • AutoDeploy: flexible args for sequence interface + AD multi-modal input processor + llama4 VLM example (#7221)
    • Support KV cache salting for secure KV cache reuse (#7106)
    • trtllm-gen kernels support sm103 (#7570)
    • Move stop_criteria to sample_async (#7041)
    • KV cache transfer for uneven pp (#7117)
    • Update multimodal utility get_num_tokens_per_image for better generalization (#7544)
    • AutoDeploy: set torch recompile_limit based on cuda_graph_batch_sizes and refactored (#7219)
    • Add Request specific exception (#6931)
    • Add DeepSeek-v3-0324 e2e torch test (#7413)
    • Add 8-GPU test cases for RTX6000 (#7083)
    • Add gptoss 20g tests (#7361)
    • Nixl support for GDS (#5488)
    • CMake option to link statically with cublas/curand (#7178)
    • Extend VLM factory and add Mistral3 factory (#7583)
  • Documentation
    • Fix example in docstring (#7410)
    • Fix formatting error in Gemma3 readme (#7352)
    • Add note about trtllm-serve to the devel container (#7483)
    • Add GPT OSS Eagle3 blog (#7140)
    • 1.0 Documentation (#6696)
    • Update the KV cache documentation (#7549)
    • Rename TensorRT-LLM to TensorRT LLM (#7554)
    • Refine docs for accuracy evaluation of gpt-oss models (#7252)
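
The one-model guided-plus-speculative path above (#6948) is reachable through the LLM API. A minimal sketch only, assuming the current `LLM`, `EagleDecodingConfig`, and `GuidedDecodingParams` surface; the model paths, draft length, and JSON schema are placeholders, not values from this release:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import EagleDecodingConfig, GuidedDecodingParams

# Target and draft checkpoints are placeholders; point them at real models.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    guided_decoding_backend="xgrammar",
    speculative_config=EagleDecodingConfig(
        max_draft_len=3,
        speculative_model_dir="<eagle3-draft-checkpoint>",
        eagle3_one_model=True,  # the one-model engine path from #6948
    ),
)

# Constrain generation to JSON matching a schema while drafting speculatively.
schema = {"type": "object", "properties": {"answer": {"type": "string"}}}
params = SamplingParams(guided_decoding=GuidedDecodingParams(json=schema))
outputs = llm.generate(["Answer in JSON: what is TensorRT LLM?"], params)
print(outputs[0].outputs[0].text)
```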
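
Similarly, the serving changes (#7492, #7341) can be tried end to end. This is a sketch, not a verified recipe: the YAML keys mirror LLM API arguments, the model name and port are placeholders, and #7341 claims only basic Responses functionality:

```python
import yaml

# 1) Write an options file for `trtllm-serve --extra_llm_api_options`.
with open("extra_llm_api_options.yaml", "w") as f:
    yaml.safe_dump({"kv_cache_config": {"enable_block_reuse": True}}, f)

# 2) In a shell, launch the OpenAI-compatible server:
#    trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
#        --extra_llm_api_options extra_llm_api_options.yaml

# 3) Call the new Responses endpoint with the openai SDK.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    input="Say hello in one short sentence.",
)
print(resp.output_text)
```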

What's Changed

  • [https://nvbugs/5485430][fix] Copy the nanobind file when using precompiled package by @jiaganc in #7334
  • [None][infra] Using local variables in rerun function by @yiqingy0 in #7198
  • [None][ci] Correct docker args for GPU devices and remove some stale CI codes by @chzblych in #7417
  • [https://nvbugs/5476580][fix] unwaive test_nvfp4_4gpus by @Superjomn in #7454
  • [None][test] auto reuse torch empty cache on qa test by @crazydemo in #7421
  • [None][doc] fix example in docstring by @tomeras91 in #7410
  • [TRTLLM-6643][feat] Add DeepSeek-v3-0324 e2e torch test by @aalanwyr in #7413
  • [None][infra] waive test case failed on post-merge by @HuiGao-NV in #7471
  • [TRTLLM-7208][feat] Implement basic functionalities for Responses API by @JunyiXu-nv in #7341
  • [https://nvbugs/5453992][unwaive] Unwaive llama quickstart test by @peaceh-nv in #7242
  • [None][infra] Waive failed tests on main branch 0902 by @EmmaQiaoCh in #7482
  • [None][chore] Fix formatting error in Gemma3 readme by @karljang in #7352
  • [https://nvbugs/5470782][fix] Add specific test names for test_deepseek.py by @SimengLiu-nv in #7318
  • [https://nvbugs/5458798][fix] Disabled test_trtllm_bench_backend_comparison due to timeout by @MrGeva in #7397
  • [None][chore] Add note about trtllm-serve to the devel container by @MartinMarciniszyn in #7483
  • [None][chore] rm executor config in kv cache connector by @leslie-fang25 in #7372
  • [None][perf] Add MOE support for dynamic cluster shapes and custom epilogue schedules by @djns99 in #6126
  • [None][perf] Autotune TRT-LLM Gen MoE when using CUDA graphs by @jinyangyuan-nvidia in #7285
  • [TRTLLM-7261][feat] Support phi-4 model in pytorch backend by @Wanli-Jiang in #7371
  • [https://nvbugs/5480289][fix] release slot manager in mtp MTPHiddenStatesManager by @yweng0828 in #7340
  • [https://nvbugs/5488141][fix] Unwaive llama3 test_eagle3 by @mikeiovine in #7486
  • [https://nvbugs/5472947][fix] wait on isend handles before reusing buffers by @amukkara in #7462
  • [TRTLLM-7363][test] Add 8-GPU test cases for RTX6000 by @StanleySun639 in #7083
  • [https://nvbugs/5485593][fix] improve accuracy/test_disaggregated_serving.py by @reasonsolo in #7366
  • [None][doc] add GPT OSS Eagle3 blog by @IzzyPutterman in #7140
  • [None][fix] Fix KV cache recompute in draft_target spec decode by @mikeiovine in #7348
  • [TRTLLM-7028][feat] Enable guided decoding with speculative decoding (part 2: one-model engine) by @syuoni in #6948
  • [None][chore] Remove two unused parameters in create_py_executor by @leslie-fang25 in #7458
  • [#7222][autodeploy] Separate run_shape_prop as another graph utility by @Fridah-nv in #7313
  • [None][fix] Fix a numerical stability issue for XQA with spec dec by @lowsfer in #7114
  • [https://nvbugs/5470769][fix] fix disagg-serving accuracy test case by @reasonsolo in #7338
  • [TRTLLM-7876][test] Test trtllm-serve with --extra_llm_api_options by @StanleySun639 in #7492
  • [https://nvbugs/5485102][fix] Correctly set stride for piecewise outp… by @liji-nv in #7442
  • [TRTLLM-7442][model] Remove unnecessary D2H copies by @2ez4bz in #7273
  • [TRTLLM-6199][infra] Update for using open driver from BSL by @EmmaQiaoCh in #7430
  • [None][fix] Fix a typo in the Slurm CI codes by @chzblych in #7485
  • [TRTLLM-6342][fix] Fixed triggering BMM sharding by @greg-kwasniewski1 in #7389
  • [None][fix] fix hunyuan_moe init bug by @sorenwu in #7502
  • [None][chore] Bump version to 1.1.0rc4 by @yiqingy0 in #7525
  • [https://nvbugs/5485886][fix] Fix resource free of Eagle3ResourceManager by @kris1025 in #7437
  • [TRTLLM-6893][infra] Disable the x86 / SBSA build stage when running BuildDockerImage by @ZhanruiSunCh in #6729
  • [https://nvbugs/5477730][fix] Fix the alltoall case when tp_size larger than ep_size by @WeiHaocheng in #7331
  • [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #7521
  • [None][ci] set TORCHINDUCTOR_COMPILE_THREADS for thop/parallel tests by @QiJune in #7489
  • [None][test] update nim and full test list by @crazydemo in #7468
  • [None][feat] MultiLayer Eagle by @IzzyPutterman in #7234
  • [TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec by @syuoni in #7481
  • [OMNIML-2336][feat] Add NVFP4 x FP8 by @sychen52 in #6809
  • [https://nvbugs/5492485][fix] Use offline dataset from llm-models instead. by @yuxianq in #7435
  • [TRTLLM-7410][feat] Support hashing and KV cache reuse for videos by @chang-l in #7360
  • [https://nvbugs/5369366][fix] Report failing requests by @arekay in #7060
  • [None][feat] Add Request specific exception by @Shunkangz in #6931
  • [#3325][feat] Add MCTS and TOT tree-based inference controllers to Scaffolding by @therealnaveenkamal in #7490
  • [https://nvbugs/5483615][fix] Remove unnecessary assertion to let mai… by @liji-nv in #7441
  • [None][ci] remove unnecessary test_modeling_deepseek.py by @QiJune in #7542
  • [None][chore] Remove closed bugs by @xinhe-nv in #7408
  • [TRTLLM-6642][feat] add gptoss 20g tests by @xinhe-nv in #7361
  • [None][ci] Increase the number of retries in docker image generation by @chzblych in #7557
  • [None][infra] update nspect version by @niukuo in #7552
  • [https://nvbugs/5461761][fix] Remove the waiver by @ziyixiong-nv in #7476
  • [#6186][feat] Introduce QKNormRoPEAttention module by @Funatiq in #6830
  • [None][chore] Remove executor_config in create_py_executor_instance by @leslie-fang25 in #7463
  • [None][infra] Waive failed tests on main branch 0905 by @EmmaQiaoCh in #7564
  • [https://nvbugs/5453806][unwaive] Unwaive fp8 kvcache attention test by @peaceh-nv in #7243
  • [#6120][feat] AutoDeploy: flexible args for sequence interface + AD multi-modal input processor + llama4 VLM example by @lucaslie in #7221
  • [None][ci] Revert "[https://nvbugs/5461761][fix] Remove the waiver (#7476)" by @QiJune in #7584
  • [None][ci] move some test cases of DGX H100 to post merge by @QiJune in #7569
  • [None][ci] Improve SSH connection stability by @chzblych in #7567
  • [None][ci] Waive qwen3 test for accuracy bug in https://nvbugs/5505402 by @dominicshanshan in #7585
  • [None][fix] DeepSeek-R1 W4A8 weight loading issue; fixes regression from #6200 by @rosenrodt in #7123
  • [None][chore] share input_ids buffers among different cuda graphs by @QiJune in #7236
  • [TRTLLM-7398][feat] Support KV cache salting for secure KV cache reuse by @chang-l in #7106
  • [TRTLLM-4629][feat] Step 1: trtllm-gen kernels support sm103 by @VALLIS-NERIA in #7570
  • [TRTLLM-7440][fix] Split fused_input_embed to separate out host sync by @chang-l in #7280
  • [https://nvbugs/5502352][fix] Fix 2-model CDL path by @mikeiovine in #7543
  • [TRTLLM-5950][infra] Removing remaining turtle keywords from the code base by @EmmaQiaoCh in #7086
  • [https://nvbugs/5448767][fix] sync termination of requests across PP ranks by @raayandhar in #7455
  • [None][infra] Skip RTX Pro 6000 test stages due to HW are offline by @EmmaQiaoCh in #7592
  • [TRTLLM-7153] [feat] Move stop_criteria to sample_async by @netanel-haber in #7041
  • [None][ci] Block some nodes to avoid unstable network access by @chzblych in #7593
  • [None][fix] fixing the math on asymmetric tp+pp tests by @raayandhar in #7098
  • [TRTLLM-7187][fix] Build wheel with NIXL by @BatshevaBlack in #7472
  • [None][chore] expose tokens_per_block into KvCacheConfig by @Superjomn in #5911; see the sketch after this list
  • [None][docs] refine docs for accuracy evaluation of gpt-oss models by @binghanc in #7252
  • [TRTLLM-7779][feat] Support multiple postprocess workers for chat completions API by @JunyiXu-nv in #7508
  • [None][chore] Mass integration of release/1.0 - 3rd by @dominicshanshan in #7519
  • [https://nvbugs/5506683][fix] adjust the CI by @byshiue in #7604
  • [None][infra] Add back rtx-pro-6000 stages since the node is available by @EmmaQiaoCh in #7601
  • [None][feat] Update multimodal utility get_num_tokens_per_image for better generalization by @chang-l in #7544
  • [TRTLLM-6142][feat] Reland: set torch recompile_limit based on cuda_graph_batch_sizes and refactored by @MrGeva in #7219
  • [None][chore] remove executor config in instantiate sampler by @leslie-fang25 in #7516
  • [TRTLLM-7361][feat] KV cache transfer for uneven pp by @chuangz0 in #7117
  • [None][infra] Try to fix docker container failed to be killed issue by @yuanjingx87 in #7388
  • [None][fix] Add try-catch in stream generator by @zhanghaotong in #7467
  • [https://nvbugs/5481080][fix] Fix GPTOSS W4A16 reference by @dongfengy in #7323
  • [None][test] Skip eagle3 test by @Tabrizian in #7627
  • [https://nvbugs/5453709][fix] Remove transformers version limit in Qwen2VL by @Wanli-Jiang in #7152
  • [TRTLLM-5877][infra] Add fmha tests and auto trigger rules by @yiqingy0 in #6050
  • [None][chore] Mass integration of release/1.0 - 4th (release/1.0 doc change mainly) by @dominicshanshan in #7607
  • [None][feat] Nixl support for GDS by @tshmilnvidia in #5488
  • [TRTLLM-4366][infra] Don't call reinstall_rockylinux_cuda when the base CUDA image is up to date by @ZhanruiSunCh in #5980
  • [#6529][feat] CMake option to link statically with cublas/curand by @WilliamTambellini in #7178
  • [None][feat] Extend VLM factory and add Mistral3 factory by @2ez4bz in #7583
  • [None][fix] add the missing import raised by #7607 by @nv-guomingz in #7639
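
For #5911 above, a minimal sketch of the newly exposed knob, assuming `tokens_per_block` is accepted as a `KvCacheConfig` field as the PR title states; the value 32 and the model path are illustrative:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_cache_config=KvCacheConfig(
        tokens_per_block=32,       # KV cache block granularity (#5911)
        enable_block_reuse=True,   # prerequisite for the KV cache reuse features
    ),
)
outputs = llm.generate(["Hello, TensorRT LLM!"])
print(outputs[0].outputs[0].text)
```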

Full Changelog: v1.1.0rc3...v1.1.0rc4
