NVIDIA/TensorRT-LLM v1.1.0rc4

Pre-release

Announcement Highlights:

  • Model Support
    • Support phi-4 model in pytorch backend (#7371)
    • Support Aggregate mode for phi4-mm (#7521)
  • API
    • Implement basic functionality for the Responses API (#7341); a client sketch follows this list
    • Support multiple postprocess workers for chat completions API (#7508)
    • Report failing requests (#7060)
  • Benchmark
    • Test trtllm-serve with --extra_llm_api_options (#7492); a launch sketch follows this list
  • Feature
    • Add MoE support for dynamic cluster shapes and custom epilogue schedules (#6126)
    • Autotune TRT-LLM Gen MoE when using CUDA graphs (#7285)
    • Enable guided decoding with speculative decoding (part 2: one-model engine) (#6948); a usage sketch follows this list
    • Separate run_shape_prop as another graph utility (#7313)
    • MultiLayer Eagle (#7234)
    • Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec (#7481)
    • Add NVFP4 x FP8 (#6809)
    • Support hashing and KV cache reuse for videos (#7360)
    • Add MCTS and TOT tree-based inference controllers to Scaffolding (#7490)
    • Introduce QKNormRoPEAttention module (#6830)
    • AutoDeploy: flexible args for sequence interface + AD multi-modal input processor + llama4 VLM example (#7221)
    • Support KV cache salting for secure KV cache reuse (#7106)
    • trtllm-gen kernels support sm103 (#7570)
    • Move stop_criteria to sample_async (#7041)
    • KV cache transfer for uneven pp (#7117)
    • Update multimodal utility get_num_tokens_per_image for better generalization (#7544)
    • AutoDeploy: set torch recompile_limit based on cuda_graph_batch_sizes and refactored (#7219)
    • Add Request specific exception (#6931)
    • Add DeepSeek-v3-0324 e2e torch test (#7413)
    • Add 8-GPU test cases for RTX6000 (#7083)
    • Add gptoss 20g tests (#7361)
    • Nixl support for GDS (#5488)
    • CMake option to link statically with cublas/curand (#7178)
    • Extend VLM factory and add Mistral3 factory (#7583)
  • Documentation
    • Fix example in docstring (#7410)
    • Fix formatting error in Gemma3 readme (#7352)
    • Add note about trtllm-serve to the devel container (#7483)
    • Add GPT OSS Eagle3 blog (#7140)
    • 1.0 Documentation (#6696)
    • Update the KV cache documentation (#7549)
    • Rename TensorRT-LLM to TensorRT LLM (#7554)
    • Refine docs for accuracy evaluation of gpt-oss models (#7252)
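
The one-model guided-plus-speculative path above (#6948) is reachable through the LLM API. A minimal sketch only, assuming the current `LLM`, `EagleDecodingConfig`, and `GuidedDecodingParams` surface; the model paths, draft length, and JSON schema are placeholders, not values from this release:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import EagleDecodingConfig, GuidedDecodingParams

# Target and draft checkpoints are placeholders; point them at real models.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    guided_decoding_backend="xgrammar",
    speculative_config=EagleDecodingConfig(
        max_draft_len=3,
        speculative_model_dir="<eagle3-draft-checkpoint>",
        eagle3_one_model=True,  # the one-model engine path from #6948
    ),
)

# Constrain generation to JSON matching a schema while drafting speculatively.
schema = {"type": "object", "properties": {"answer": {"type": "string"}}}
params = SamplingParams(guided_decoding=GuidedDecodingParams(json=schema))
outputs = llm.generate(["Answer in JSON: what is TensorRT LLM?"], params)
print(outputs[0].outputs[0].text)
```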
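
Similarly, the serving changes (#7492, #7341) can be tried end to end. This is a sketch, not a verified recipe: the YAML keys mirror LLM API arguments, the model name and port are placeholders, and #7341 claims only basic Responses functionality:

```python
import yaml

# 1) Write an options file for `trtllm-serve --extra_llm_api_options`.
with open("extra_llm_api_options.yaml", "w") as f:
    yaml.safe_dump({"kv_cache_config": {"enable_block_reuse": True}}, f)

# 2) In a shell, launch the OpenAI-compatible server:
#    trtllm-serve meta-llama/Llama-3.1-8B-Instruct \
#        --extra_llm_api_options extra_llm_api_options.yaml

# 3) Call the new Responses endpoint with the openai SDK.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.responses.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    input="Say hello in one short sentence.",
)
print(resp.output_text)
```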

What's Changed

  • [https://nvbugs/5485430][fix] Copy the nanobind file when using precompiled package by @jiaganc in #7334
  • [None][infra] Using local variables in rerun function by @yiqingy0 in #7198
  • [None][ci] Correct docker args for GPU devices and remove some stale CI codes by @chzblych in #7417
  • [https://nvbugs/5476580][fix] unwaive test_nvfp4_4gpus by @Superjomn in #7454
  • [None][test] auto reuse torch empty cache on qa test by @crazydemo in #7421
  • [None][doc] fix example in docstring by @tomeras91 in #7410
  • [TRTLLM-6643][feat] Add DeepSeek-v3-0324 e2e torch test by @aalanwyr in #7413
  • [None][infra] waive test case failed on post-merge by @HuiGao-NV in #7471
  • [TRTLLM-7208][feat] Implement basic functionalities for Responses API by @JunyiXu-nv in #7341
  • [https://nvbugs/5453992][unwaive] Unwaive llama quickstart test by @peaceh-nv in #7242
  • [None][infra] Waive failed tests on main branch 0902 by @EmmaQiaoCh in #7482
  • [None][chore] Fix formatting error in Gemma3 readme by @karljang in #7352
  • [https://nvbugs/5470782][fix] Add specific test names for test_deepseek.py by @SimengLiu-nv in #7318
  • [https://nvbugs/5458798][fix] Disabled test_trtllm_bench_backend_comparison due to timeout by @MrGeva in #7397
  • [None][chore] Add note about trtllm-serve to the devel container by @MartinMarciniszyn in #7483
  • [None][chore] rm executor config in kv cache connector by @leslie-fang25 in #7372
  • [None][perf] Add MOE support for dynamic cluster shapes and custom epilogue schedules by @djns99 in #6126
  • [None][perf] Autotune TRT-LLM Gen MoE when using CUDA graphs by @jinyangyuan-nvidia in #7285
  • [TRTLLM-7261][feat] Support phi-4 model in pytorch backend by @Wanli-Jiang in #7371
  • [https://nvbugs/5480289][fix] release slot manager in mtp MTPHiddenStatesManager by @yweng0828 in #7340
  • [https://nvbugs/5488141][fix] Unwaive llama3 test_eagle3 by @mikeiovine in #7486
  • [https://nvbugs/5472947][fix] wait on isend handles before reusing buffers by @amukkara in #7462
  • [TRTLLM-7363][test] Add 8-GPU test cases for RTX6000 by @StanleySun639 in #7083
  • [https://nvbugs/5485593][fix] improve accuracy/test_disaggregated_serving.py by @reasonsolo in #7366
  • [None][doc] add GPT OSS Eagle3 blog by @IzzyPutterman in #7140
  • [None][fix] Fix KV cache recompute in draft_target spec decode by @mikeiovine in #7348
  • [TRTLLM-7028][feat] Enable guided decoding with speculative decoding (part 2: one-model engine) by @syuoni in #6948
  • [None][chore] Remove two unused parameters in create_py_executor by @leslie-fang25 in #7458
  • [#7222][autodeploy] Separate run_shape_prop as another graph utility by @Fridah-nv in #7313
  • [None][fix] Fix a numerical stability issue for XQA with spec dec by @lowsfer in #7114
  • [https://nvbugs/5470769][fix] fix disagg-serving accuracy test case by @reasonsolo in #7338
  • [TRTLLM-7876][test] Test trtllm-serve with --extra_llm_api_options by @StanleySun639 in #7492
  • [https://nvbugs/5485102][fix] Correctly set stride for piecewise outp… by @liji-nv in #7442
  • [TRTLLM-7442][model] Remove unnecessary D2H copies by @2ez4bz in #7273
  • [TRTLLM-6199][infra] Update for using open driver from BSL by @EmmaQiaoCh in #7430
  • [None][fix] Fix a typo in the Slurm CI codes by @chzblych in #7485
  • [TRTLLM-6342][fix] Fixed triggering BMM sharding by @greg-kwasniewski1 in #7389
  • [None][fix] fix hunyuan_moe init bug by @sorenwu in #7502
  • [None][chore] Bump version to 1.1.0rc4 by @yiqingy0 in #7525
  • [https://nvbugs/5485886][fix] Fix resource free of Eagle3ResourceManager by @kris1025 in #7437
  • [TRTLLM-6893][infra] Disable the x86 / SBSA build stage when running BuildDockerImage by @ZhanruiSunCh in #6729
  • [https://nvbugs/5477730][fix] Fix the alltoall case when tp_size larger than ep_size by @WeiHaocheng in #7331
  • [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #7521
  • [None][ci] set TORCHINDUCTOR_COMPILE_THREADS for thop/parallel tests by @QiJune in #7489
  • [None][test] update nim and full test list by @crazydemo in #7468
  • [None][feat] MultiLayer Eagle by @IzzyPutterman in #7234
  • [TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec by @syuoni in #7481
  • [OMNIML-2336][feat] Add NVFP4 x FP8 by @sychen52 in #6809
  • [https://nvbugs/5492485][fix] Use offline dataset from llm-models instead. by @yuxianq in #7435
  • [TRTLLM-7410][feat] Support hashing and KV cache reuse for videos by @chang-l in #7360
  • [https://nvbugs/5369366][fix] Report failing requests by @arekay in #7060
  • [None][feat] Add Request specific exception by @Shunkangz in #6931
  • [#3325][feat] Add MCTS and TOT tree-based inference controllers to Scaffolding by @therealnaveenkamal in #7490
  • [https://nvbugs/5483615][fix] Remove unnecessary assertion to let mai… by @liji-nv in #7441
  • [None][ci] remove unnecessary test_modeling_deepseek.py by @QiJune in #7542
  • [None][chore] Remove closed bugs by @xinhe-nv in #7408
  • [TRTLLM-6642][feat] add gptoss 20g tests by @xinhe-nv in #7361
  • [None][ci] Increase the number of retries in docker image generation by @chzblych in #7557
  • [None][infra] update nspect version by @niukuo in #7552
  • [https://nvbugs/5461761][fix] Remove the waiver by @ziyixiong-nv in #7476
  • [#6186][feat] Introduce QKNormRoPEAttention module by @Funatiq in #6830
  • [None][chore] Remove executor_config in create_py_executor_instance by @leslie-fang25 in #7463
  • [None][infra] Waive failed tests on main branch 0905 by @EmmaQiaoCh in #7564
  • [https://nvbugs/5453806][unwaive] Unwaive fp8 kvcache attention test by @peaceh-nv in #7243
  • [#6120][feat] AutoDeploy: flexible args for sequence interface + AD multi-modal input processor + llama4 VLM example by @lucaslie in #7221
  • [None][ci] Revert "[https://nvbugs/5461761][fix] Remove the waiver (#7476)" by @QiJune in #7584
  • [None][ci] move some test cases of DGX H100 to post merge by @QiJune in #7569
  • [None][ci] Improve SSH connection stability by @chzblych in #7567
  • [None][ci] Waive qwen3 test for accuracy bug in https://nvbugs/5505402 by @dominicshanshan in #7585
  • [None][fix] DeepSeek-R1 W4A8 weight loading issue; fixes regression from #6200 by @rosenrodt in #7123
  • [None][chore] share input_ids buffers among different cuda graphs by @QiJune in #7236
  • [TRTLLM-7398][feat] Support KV cache salting for secure KV cache reuse by @chang-l in #7106
  • [TRTLLM-4629][feat] Step 1: trtllm-gen kernels support sm103 by @VALLIS-NERIA in #7570
  • [TRTLLM-7440][fix] Split fused_input_embed to separate out host sync by @chang-l in #7280
  • [https://nvbugs/5502352][fix] Fix 2-model CDL path by @mikeiovine in #7543
  • [TRTLLM-5950][infra] Removing remaining turtle keywords from the code base by @EmmaQiaoCh in #7086
  • [https://nvbugs/5448767][fix] sync termination of requests across PP ranks by @raayandhar in #7455
  • [None][infra] Skip RTX Pro 6000 test stages due to HW are offline by @EmmaQiaoCh in #7592
  • [TRTLLM-7153] [feat] Move stop_criteria to sample_async by @netanel-haber in #7041
  • [None][ci] Block some nodes to avoid unstable network access by @chzblych in #7593
  • [None][fix] fixing the math on asymmetric tp+pp tests by @raayandhar in #7098
  • [TRTLLM-7187][fix] Build wheel with NIXL by @BatshevaBlack in #7472
  • [None][chore] expose tokens_per_block into KvCacheConfig by @Superjomn in #5911; see the sketch after this list
  • [None][docs] refine docs for accuracy evaluation of gpt-oss models by @binghanc in #7252
  • [TRTLLM-7779][feat] Support multiple postprocess workers for chat completions API by @JunyiXu-nv in #7508
  • [None][chore] Mass integration of release/1.0 - 3rd by @dominicshanshan in #7519
  • [https://nvbugs/5506683][fix] adjust the CI by @byshiue in #7604
  • [None][infra] Add back rtx-pro-6000 stages since the node is available by @EmmaQiaoCh in #7601
  • [None][feat] Update multimodal utility get_num_tokens_per_image for better generalization by @chang-l in #7544
  • [TRTLLM-6142][feat] Reland: set torch recompile_limit based on cuda_graph_batch_sizes and refactored by @MrGeva in #7219
  • [None][chore] remove executor config in instantiate sampler by @leslie-fang25 in #7516
  • [TRTLLM-7361][feat] KV cache transfer for uneven pp by @chuangz0 in #7117
  • [None][infra] Try to fix docker container failed to be killed issue by @yuanjingx87 in #7388
  • [None][fix] Add try-catch in stream generator by @zhanghaotong in #7467
  • [https://nvbugs/5481080][fix] Fix GPTOSS W4A16 reference by @dongfengy in #7323
  • [None][test] Skip eagle3 test by @Tabrizian in #7627
  • [https://nvbugs/5453709][fix] Remove transformers version limit in Qwen2VL by @Wanli-Jiang in #7152
  • [TRTLLM-5877][infra] Add fmha tests and auto trigger rules by @yiqingy0 in #6050
  • [None][chore] Mass integration of release/1.0 - 4th (release/1.0 doc change mainly) by @dominicshanshan in #7607
  • [None][feat] Nixl support for GDS by @tshmilnvidia in #5488
  • [TRTLLM-4366][infra] Don't call reinstall_rockylinux_cuda when the base CUDA image is up to date by @ZhanruiSunCh in #5980
  • [#6529][feat] CMake option to link statically with cublas/curand by @WilliamTambellini in #7178
  • [None][feat] Extend VLM factory and add Mistral3 factory by @2ez4bz in #7583
  • [None][fix] add the missing import raised by #7607 by @nv-guomingz in #7639
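
For #5911 above, a minimal sketch of the newly exposed knob, assuming `tokens_per_block` is accepted as a `KvCacheConfig` field as the PR title states; the value 32 and the model path are illustrative:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_cache_config=KvCacheConfig(
        tokens_per_block=32,       # KV cache block granularity (#5911)
        enable_block_reuse=True,   # prerequisite for the KV cache reuse features
    ),
)
outputs = llm.generate(["Hello, TensorRT LLM!"])
print(outputs[0].outputs[0].text)
```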

Full Changelog: v1.1.0rc3...v1.1.0rc4
