NVIDIA/TensorRT-LLM v1.2.0rc0.post1

Pre-release

TensorRT-LLM Weekly Release Notes

  • Model Support

    • Support Qwen3 next (#7892)
  • API

    • Return top-k logprobs in the torch backend (#7976); a usage sketch follows this list
    • Add chunked return_generation_logits logic (#7831)
  • Benchmark

    • Lock GPU clocks in test_perf.py to reliably detect perf regressions (#8099)
    • Fix the KV cache size parsing in the test_perf.py AD backend (#8092)
    • Improve perf_metrics endpoint functionality (#8005)
  • Feature

    • AutoDeploy: Linear Attention Support (SSM + causal_conv + Bamba + Nemotron-H) (#8068)
    • Add W4A8 NVFP4 FP8 fused MoE (#7968)
    • Add ModelOPT INT4 awq fake quant support in AutoDeploy (#7770)
    • Save state first pass for speculative decoding (#7012)
    • Executor changes to support helix parallelism (#7972)
    • Support CUDA graph for DeepEP (#7514)
    • Integrate tinygemm2 for gpt-oss (#7916)
    • Support for cancelling requests with disaggregation (#8114)
    • Update TRT-LLM Gen MoE kernels (#7970)
    • AutoDeploy: dive deeper into token generation bugs + enable_block_reuse (#8108)
    • AutoDeploy add autotuning when capturing cudagraphs (#8120)
    • AutoDeploy: compiler backends based on nn.Module (#8126)
    • Add heuristics for lm head tp size when enable_lm_head_tp_in_adp=True (#7891)
    • Improve batched sampling perf for contiguous batches (#7908)
  • Documentation

    • Add more description on EXAONE usage (#8089)
  • Fix & Infra

    • Fix CUDA graph for Qwen2.5-VL (#8047)
    • Refine qwen3-next implementation (#8064)
    • Patched incorrect starcoder tp config (#8118)
    • Fix Qwen3 FP8 per-tensor when requesting TRTLLM-GEN MoE backend (#8075)
    • Fix TRT-python multi LoRA TP=2 test arguments (#8059)
    • Fix the non-determinism issue in the mm_encoder test (#8033)
    • Check connection to the etcd server in unit test (#8006)
    • Add MNNVL AlltoAll tests to pre-merge (#7466)
    • Add test cases into QA test list (#8081)
    • Avoid downloading Tiny llama from HF (#8071)
    • Fix OOM issue when dp padding is enabled (#8052)
    • Fix unwaiving disagg pp tests (#8069)
    • Fix shape propagation after TP sharding (#7912)
    • Fix patchelf version issue (#8112)
    • Fix device id assignment for some vision models (#8070)
    • Do not explicitly pass temperature=0 to select greedy sampling (#8110)
    • Fix access to new tokens in sampler (#7958)
    • Add install_tensorrt.sh script to the pip wheel (#8116)
    • Fix flaky unit test for dynamic spec decoding (#8129)
    • Minor cleanup and improvements (#7619)
    • Reserve an extra slot for padded batch (#7998)
    • Fix MTP 2-model (#8115)
    • Add LoRA Torch tests for the latest NIM model list (#6806)
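
A minimal usage sketch for the top-k logprobs item above (#7976). It assumes the PyTorch-backend `LLM` API and that `SamplingParams` accepts a `logprobs` count; the model name and parameter values are placeholders for illustration, not part of this release's changes.

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder checkpoint; any model supported by the torch backend should work here.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Assumption: `logprobs=N` requests the top-N per-token log-probabilities,
# the behavior referenced by #7976 for the torch backend.
params = SamplingParams(max_tokens=32, temperature=0.8, top_p=0.95, logprobs=5)

for output in llm.generate(["The capital of France is"], params):
    completion = output.outputs[0]
    print(completion.text)
    print(completion.logprobs)  # per-token top-k logprob entries, if returned
```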

What's Changed

  • [TRTLLM-7728][perf] improve batched sampling perf for contiguous batches by @ixlmar in #7908
  • [None][feat] Support Qwen3 next by @byshiue in #7892
  • [TRTLLM-8269][test] do not explicitly pass temperature=0 to select greedy sampling by @ixlmar in #7909
  • [None][fix] Fix TRT-python multi LoRA TP=2 test arguments by @amitz-nv in #8059
  • [https://nvbugs/5542867][fix] Fix the non-determinism issue in the mm_encoder test by @chang-l in #8033
  • [https://nvbugs/5538098][fix] Checking connection to etcd server in unit test by @pcastonguay in #8006
  • [TRTLLM-6741][fix] Add heuristics for lm head tp size when enable_lm_head_tp_in_adp=True by @Njuapp in #7891
  • [None][feat] Return topk logprobs in torch backend by @dcaox in #7976
  • [#4593][feat] AutoDeploy: Linear Attention Support (SSM + causal_conv + Bamba + Nemotron-H) by @lucaslie in #8068
  • [None] [test] Add MNNVL AlltoAll tests to pre-merge by @kaiyux in #7466
  • [TRTLLM-6239][feat] add test cases into QA test list by @xinhe-nv in #8081
  • [None][fix] Fix CUDA graph for Qwen2.5-VL by @yechank-nvidia in #8047
  • [None][chore] Bump version to 1.2.0rc1 by @yiqingy0 in #8097
  • [https://nvbugs/5547414][fix] avoid downloading Tiny llama from HF by @Tabrizian in #8071
  • [None][chore] Refine qwen3-next implementation. by @nv-guomingz in #8064
  • [TRTLLM-8269][fix] Revert "do not explicitly pass temperature=0 to select greedy sampling" by @ixlmar in #8103
  • [None][chore] Revert MNNVL alltoall MR by @brb-nv in #8106
  • [None][fix] : Fix OOM issue when dp padding is enabled by @peaceh-nv in #8052
  • [None][doc] Add more description on EXAONE example by @yechank-nvidia in #8089
  • [None][infra] Skip failed tests in post-merge for main by @EmmaQiaoCh in #8102
  • [https://nvbugs/5434320][fix] fix: Unwaiving disagg pp tests by @pcastonguay in #8069
  • [OMNIML-2336][feat] add W4A8 NVFP4 FP8 fused moe by @sychen52 in #7968
  • [TRTLLM-6342][bug] Fix shape propagation after TP sharding by @greg-kwasniewski1 in #7912
  • [TRTLLM-8031][feat] Add chunked return_generation_logits logic by @yibinl-nvidia in #7831
  • [#5860][feat] Add ModelOPT INT4 awq fake quant support in AutoDeploy by @Fridah-nv in #7770
  • [None][fix] fix patchelf version issue by @bo-nv in #8112
  • [None][feat] Save state first pass by @IzzyPutterman in #7012
  • [TRTLLM-7733][feat] Executor changes to support helix parallelism by @brb-nv in #7972
  • [https://nvbugs/5549081][fix] Fix device id assignment for some vision models by @chang-l in #8070
  • [#7588][feat] lock gpu clocks in test_perf.py to reliably detect perf regressions by @MrGeva in #8099
  • [TRTLLM-8269][test] do not explicitly pass temperature=0 to select greedy sampling by @ixlmar in #8110
  • [https://nvbugs/5556020][chore] waive test_eagle3 by @hchings in #8119
  • [TRTLLM-6589][feat] Support CUDA graph for DeepEP by @yifeizhang-c in #7514
  • [TRTLLM-7775][feat] Integrate tinygemm2 for gpt-oss by @dongfengy in #7916
  • [None][feat] Support for cancelling requests with disaggregation by @pcastonguay in #8114
  • [None][fix] Fix access to new tokens in sampler. by @dcampora in #7958
  • [None][chore] Adding install_tensorrt.sh script to pip wheel by @pcastonguay in #8116
  • [#7588][fix] fixed the kv cache size parsing in test_perf.py AD backend by @MrGeva in #8092
  • [TRTLLM-6342][bug] Patched incorrect starcoder tp config by @greg-kwasniewski1 in #8118
  • [None][feat] perf_metrics endpoint functionality improvement by @nv-yilinf in #8005
  • [None][feat] Update TRT-LLM Gen MoE kernels by @nekorobov in #7970
  • [https://nvbugs/5548098][fix] Fix flakey unit test for dynamic spec d… by @hchings in #8129
  • [None] [refactor] Minor cleanup and improvements by @Funatiq in #7619
  • [None][feat] AutoDeploy: dive deeper into token generation bugs + enable_block_reuse by @lucaslie in #8108
  • [None][fix] Fix Qwen3 FP8 per-tensor when requesting TRTLLM-GEN MoE backend by @achartier in #8075
  • [None][feat] AutoDeploy add autotuning when capturing cudagraphs by @suyoggupta in #8120
  • [https://nvbugs/5537878][fix] Reserve an extra slot for padded batch by @ziyixiong-nv in #7998
  • [None][feat] AutoDeploy: compiler backends based on nn.Module by @lucaslie in #8126
  • [None][fix] Fix MTP 2-model by @mikeiovine in #8115
  • [TRTLLM-6496][feat] Add LoRa Torch tests for the latest NIM model list by @moraxu in #6806
  • [None][chore] Bump version to 1.2.0rc0.post1 by @yiqingy0 in #8306

Full Changelog: v1.2.0rc0...v1.2.0rc0.post1
