Announcement Highlights:
- Model Support
  - Support phi-4 model in PyTorch backend (#7371)
  - Support Aggregate mode for phi4-mm (#7521)
- API
  - Implement basic functionalities for Responses API (#7341)
  - Support multiple postprocess workers for chat completions API (#7508)
- Benchmark
  - Test trtllm-serve with --extra_llm_api_options (#7492) (see the configuration sketch after this list)
- Feature
  - Add MOE support for dynamic cluster shapes and custom epilogue schedules (#6126)
  - Autotune TRT-LLM Gen MoE when using CUDA graphs (#7285)
  - Enable guided decoding with speculative decoding (part 2: one-model engine) (#6948) (usage sketch after this list)
  - Separate run_shape_prop as another graph utility (#7313)
  - MultiLayer Eagle (#7234)
  - Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec (#7481)
  - Add NVFP4 x FP8 (#6809)
  - Support hashing and KV cache reuse for videos (#7360)
  - Add MCTS and TOT tree-based inference controllers to Scaffolding (#7490)
  - Introduce QKNormRoPEAttention module (#6830)
  - AutoDeploy: flexible args for sequence interface + AD multi-modal input processor + llama4 VLM example (#7221)
  - Support KV cache salting for secure KV cache reuse (#7106) (see the hashing sketch after this list)
  - trtllm-gen kernels support sm103 (#7570)
  - Move stop_criteria to sample_async (#7041)
  - KV cache transfer for uneven pp (#7117)
  - Update multimodal utility `get_num_tokens_per_image` for better generalization (#7544)
  - AutoDeploy: set torch recompile_limit based on cuda_graph_batch_sizes and refactored (#7219)
  - Add request-specific exception (#6931)
  - Add DeepSeek-v3-0324 e2e torch test (#7413)
  - Add 8-GPU test cases for RTX6000 (#7083)
  - Add gptoss 20g tests (#7361)
  - Nixl support for GDS (#5488)
  - CMake option to link statically with cublas/curand (#7178)
  - Extend VLM factory and add Mistral3 factory (#7583)
- Documentation
  - Fix example in docstring (#7410)
  - Fix formatting error in Gemma3 readme (#7352)
  - Add note about trtllm-serve to the devel container (#7483)
  - Add GPT OSS Eagle3 blog (#7140)
  - 1.0 Documentation (#6696)
  - Update KV cache part (#7549)
  - Rename TensorRT-LLM to TensorRT LLM (#7554)
  - Refine docs for accuracy evaluation of gpt-oss models (#7252)
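
For the trtllm-serve test above, a minimal sketch of how --extra_llm_api_options is commonly exercised: the flag points the server at a YAML file whose keys map onto LLM API arguments. The option values below are illustrative assumptions, not the test's exact configuration.

```python
# Illustrative sketch: build a YAML options file for trtllm-serve.
# The keys mirror LLM API arguments; the values here are assumptions.
import yaml

extra_options = {
    "kv_cache_config": {
        "enable_block_reuse": True,        # allow KV cache block reuse
        "free_gpu_memory_fraction": 0.85,  # fraction of free VRAM for KV cache
        "tokens_per_block": 32,            # exposed via KvCacheConfig (#5911)
    },
}
with open("extra_llm_api_options.yaml", "w") as f:
    yaml.safe_dump(extra_options, f)

# Launch the OpenAI-compatible server with the extra options:
#   trtllm-serve <model-checkpoint-or-HF-id> \
#       --extra_llm_api_options extra_llm_api_options.yaml
```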
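The guided-decoding-with-speculation work (#6948, #7481) targets the one-model engine path. Below is a hedged sketch of driving both through the LLM API; the class and field names (GuidedDecodingParams, EagleDecodingConfig, eagle3_one_model) reflect our reading of the current API and should be treated as assumptions rather than the exact interface these PRs ship.

```python
# Hedged sketch, assuming the LLM API exposes guided decoding and Eagle3
# speculation as below; names are assumptions, not verified against #6948.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import EagleDecodingConfig
from tensorrt_llm.sampling_params import GuidedDecodingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",       # illustrative target model
    guided_decoding_backend="xgrammar",
    speculative_config=EagleDecodingConfig(          # one-model Eagle3 engine
        max_draft_len=4,
        speculative_model_dir="<eagle3-draft-dir>",  # illustrative path
        eagle3_one_model=True,
    ),
)
out = llm.generate(
    "Return a JSON object with keys 'city' and 'population'.",
    SamplingParams(
        max_tokens=128,
        guided_decoding=GuidedDecodingParams(json_object=True),
    ),
)
print(out.outputs[0].text)
```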
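For KV cache salting (#7106), the idea is that the block hash used for reuse lookups mixes in a per-request salt, so only requests presenting the same salt can hit each other's cached blocks. A conceptual sketch of that hashing scheme, not TRT-LLM's actual implementation:

```python
# Conceptual sketch of salted KV block hashing (not TRT-LLM's actual code):
# two requests can only share cached blocks when their salts match.
import hashlib

def block_hash(parent_hash: bytes, token_ids: tuple[int, ...], salt: bytes) -> bytes:
    h = hashlib.sha256()
    h.update(salt)                      # per-tenant/request secret
    h.update(parent_hash)               # chain to the previous block's hash
    h.update(repr(token_ids).encode())  # tokens covered by this block
    return h.digest()

root = b"\x00" * 32
a = block_hash(root, (1, 2, 3, 4), salt=b"tenant-A")
b = block_hash(root, (1, 2, 3, 4), salt=b"tenant-B")
assert a != b  # same tokens, different salts -> no cross-tenant reuse
```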
What's Changed
- [https://nvbugs/5485430][fix] Copy the nanobind file when using precompiled package by @jiaganc in #7334
- [None][infra] Using local variables in rerun function by @yiqingy0 in #7198
- [None][ci] Correct docker args for GPU devices and remove some stale CI codes by @chzblych in #7417
- [https://nvbugs/5476580][fix] unwaive test_nvfp4_4gpus by @Superjomn in #7454
- [None][test] auto reuse torch empty cache on qa test by @crazydemo in #7421
- [None][doc] fix example in docstring by @tomeras91 in #7410
- [TRTLLM-6643][feat] Add DeepSeek-v3-0324 e2e torch test by @aalanwyr in #7413
- [None][infra] waive test case failed on post-merge by @HuiGao-NV in #7471
- [TRTLLM-7208][feat] Implement basic functionalities for Responses API by @JunyiXu-nv in #7341
- [https://nvbugs/5453992][unwaive] Unwaive llama quickstart test by @peaceh-nv in #7242
- [None][infra] Waive failed tests on main branch 0902 by @EmmaQiaoCh in #7482
- [None][chore] Fix formatting error in Gemma3 readme by @karljang in #7352
- [https://nvbugs/5470782][fix] Add specific test names for test_deepseek.py by @SimengLiu-nv in #7318
- [https://nvbugs/5458798][fix] Disabled test_trtllm_bench_backend_comparison due to timeout by @MrGeva in #7397
- [None][chore] Add note about trtllm-serve to the devel container by @MartinMarciniszyn in #7483
- [None][chore] rm executor config in kv cache connector by @leslie-fang25 in #7372
- [None][perf] Add MOE support for dynamic cluster shapes and custom epilogue … by @djns99 in #6126
- [None][perf] Autotune TRT-LLM Gen MoE when using CUDA graphs by @jinyangyuan-nvidia in #7285
- [TRTLLM-7261][feat] Support phi-4 model in pytorch backend by @Wanli-Jiang in #7371
- [https://nvbugs/5480289][fix] release slot manager in mtp MTPHiddenStatesManager by @yweng0828 in #7340
- [https://nvbugs/5488141][fix] Unwaive llama3 test_eagle3 by @mikeiovine in #7486
- [https://nvbugs/5472947][fix] wait on isend handles before reusing buffers by @amukkara in #7462
- [TRTLLM-7363][test] Add 8-GPU test cases for RTX6000 by @StanleySun639 in #7083
- [https://nvbugs/5485593][fix] improve accuracy/test_disaggregated_serving.py by @reasonsolo in #7366
- [None][doc] add GPT OSS Eagle3 blog by @IzzyPutterman in #7140
- [None][fix] Fix KV cache recompute in draft_target spec decode by @mikeiovine in #7348
- [TRTLLM-7028][feat] Enable guided decoding with speculative decoding (part 2: one-model engine) by @syuoni in #6948
- [None][chore] Remove two unused parameters in create_py_executor by @leslie-fang25 in #7458
- [#7222][autodeploy] Separate run_shape_prop as another graph utility by @Fridah-nv in #7313
- [None][fix] Fix a numerical stability issue for XQA with spec dec by @lowsfer in #7114
- [https://nvbugs/5470769][fix] fix disagg-serving accuracy test case by @reasonsolo in #7338
- [TRTLLM-7876][test] Test trtllm-serve with --extra_llm_api_options by @StanleySun639 in #7492
- [https://nvbugs/5485102][fix] Correctly set stride for piecewise outp… by @liji-nv in #7442
- [TRTLLM-7442][model] Remove unnecessary D2H copies by @2ez4bz in #7273
- [TRTLLM-6199][infra] Update for using open driver from BSL by @EmmaQiaoCh in #7430
- [None][fix] Fix a typo in the Slurm CI codes by @chzblych in #7485
- [TRTLLM-6342][fix] Fixed triggering BMM sharding by @greg-kwasniewski1 in #7389
- [None][fix] fix hunyuan_moe init bug by @sorenwu in #7502
- [None][chore] Bump version to 1.1.0rc4 by @yiqingy0 in #7525
- [https://nvbugs/5485886][fix] Fix resource free of Eagle3ResourceManager by @kris1025 in #7437
- [TRTLLM-6893][infra] Disable the x86 / SBSA build stage when run BuildDockerImage by @ZhanruiSunCh in #6729
- [https://nvbugs/5477730][fix] Fix the alltoall case when tp_size larger than ep_size by @WeiHaocheng in #7331
- [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #7521
- [None][ci] set TORCHINDUCTOR_COMPILE_THREADS for thop/parallel tests by @QiJune in #7489
- [None][test] update nim and full test list by @crazydemo in #7468
- [None][feat] MultiLayer Eagle by @IzzyPutterman in #7234
- [TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec by @syuoni in #7481
- [OMNIML-2336][feat] Add NVFP4 x FP8 by @sychen52 in #6809
- [https://nvbugs/5492485][fix] Use offline dataset from llm-models instead. by @yuxianq in #7435
- [TRTLLM-7410][feat] Support hashing and KV cache reuse for videos by @chang-l in #7360
- [https://nvbugs/5369366][fix] Report failing requests by @arekay in #7060
- [None][feat] Add Request specific exception by @Shunkangz in #6931
- [#3325][feat] Add MCTS and TOT tree-based inference controllers to Scaffolding by @therealnaveenkamal in #7490
- [https://nvbugs/5483615][fix] Remove unnecessary assertion to let mai… by @liji-nv in #7441
- [None][ci] remove unnecessary test_modeling_deepseek.py by @QiJune in #7542
- [None][chore] Remove closed bugs by @xinhe-nv in #7408
- [TRTLLM-6642][feat] add gptoss 20g tests by @xinhe-nv in #7361
- [None][ci] Increase the number of retries in docker image generation by @chzblych in #7557
- [None][infra] update nspect version by @niukuo in #7552
- [https://nvbugs/5461761][fix] Remove the waiver by @ziyixiong-nv in #7476
- [#6186][feat] Introduce QKNormRoPEAttention module by @Funatiq in #6830
- [None][chore] Remove executor_config in create_py_executor_instance by @leslie-fang25 in #7463
- [None][infra] Waive failed tests on main branch 0905 by @EmmaQiaoCh in #7564
- [https://nvbugs/5453806][unwaive] Unwaive fp8 kvcache attention test by @peaceh-nv in #7243
- [#6120][feat] AutoDeploy: flexible args for sequence interface + AD multi-modal input processor + llama4 VLM example by @lucaslie in #7221
- [None][ci] Revert "[https://nvbugs/5461761][fix] Remove the waiver (#7476)" by @QiJune in #7584
- [None][ci] move some test cases of DGX H100 to post merge by @QiJune in #7569
- [None][ci] Improve SSH connection stability by @chzblych in #7567
- [None][ci] Waive qwen3 test for accuracy bug in https://nvbugs/5505402 by @dominicshanshan in #7585
- [None][fix] DeepSeek-R1 W4A8 weight loading issue; fixes regression from #6200 by @rosenrodt in #7123
- [None][chore] share input_ids buffers among different cuda graphs by @QiJune in #7236
- [TRTLLM-7398][feat] Support KV cache salting for secure KV cache reuse by @chang-l in #7106
- [TRTLLM-4629][feat] Step 1: trtllm-gen kernels support sm103 by @VALLIS-NERIA in #7570
- [TRTLLM-7440][fix] Split `fused_input_embed` to separate out host sync by @chang-l in #7280
- [https://nvbugs/5502352][fix] Fix 2-model CDL path by @mikeiovine in #7543
- [TRTLLM-5950][infra] Removing remaining turtle keywords from the code base by @EmmaQiaoCh in #7086
- [https://nvbugs/5448767][fix] sync termination of requests across PP ranks by @raayandhar in #7455
- [None][infra] Skip RTX Pro 6000 test stages due to HW are offline by @EmmaQiaoCh in #7592
- [TRTLLM-7153][feat] Move stop_criteria to sample_async by @netanel-haber in #7041
- [None][ci] Block some nodes to avoid unstable network access by @chzblych in #7593
- [None][fix] fixing the math on asymmetric tp+pp tests by @raayandhar in #7098
- [TRTLLM-7187][fix] Build wheel with NIXL by @BatshevaBlack in #7472
- [None][chore] expose tokens_per_block into KvCacheConfig by @Superjomn in #5911
- [None][docs] refine docs for accuracy evaluation of gpt-oss models by @binghanc in #7252
- [TRTLLM-7779][feat] Support multiple postprocess workers for chat completions API by @JunyiXu-nv in #7508
- [None][chore] Mass integration of release/1.0 - 3rd by @dominicshanshan in #7519
- [https://nvbugs/5506683][fix] adjust the CI by @byshiue in #7604
- [None][infra] Add back rtx-pro-6000 stages since the node is available by @EmmaQiaoCh in #7601
- [None][feat] Update multimodal utility `get_num_tokens_per_image` for better generalization by @chang-l in #7544
- [TRTLLM-6142][feat] Reland: set torch recompile_limit based on cuda_graph_batch_sizes and refactored by @MrGeva in #7219
- [None][chore] remove executor config in instantiate sampler by @leslie-fang25 in #7516
- [TRTLLM-7361][feat] KV cache transfer for uneven pp by @chuangz0 in #7117
- [None][infra] Try to fix docker container failed to be killed issue by @yuanjingx87 in #7388
- [None][fix] Add try-catch in stream generator by @zhanghaotong in #7467
- [https://nvbugs/5481080][fix] Fix GPTOSS W4A16 reference by @dongfengy in #7323
- [None][test] Skip eagle3 test by @Tabrizian in #7627
- [https://nvbugs/5453709][fix] Remove transformers version limit in Qwen2VL by @Wanli-Jiang in #7152
- [TRTLLM-5877][infra] Add fmha tests and auto trigger rules by @yiqingy0 in #6050
- [None][chore] Mass integration of release/1.0 - 4th (release/1.0 doc change mainly) by @dominicshanshan in #7607
- [None][feat] Nixl support for GDS by @tshmilnvidia in #5488
- [TRTLLM-4366][infra] Don't call reinstall_rockylinux_cuda when the base CUDA image is up to date by @ZhanruiSunCh in #5980
- [#6529][feat] CMake option to link statically with cublas/curand by @WilliamTambellini in #7178
- [None][feat] Extend VLM factory and add Mistral3 factory by @2ez4bz in #7583
- [None][fix] add the missing import raised by #7607 by @nv-guomingz in #7639
New Contributors
- @sorenwu made their first contribution in #7502
- @sychen52 made their first contribution in #6809
- @therealnaveenkamal made their first contribution in #7490
- @binghanc made their first contribution in #7252
Full Changelog: v1.1.0rc3...v1.1.0rc4