Hi,
We are very pleased to announce the 0.12.0 release of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Supported LoRA for MoE models.
- The `ModelWeightsLoader` is enabled for LLaMA family models (experimental), see `docs/source/architecture/model-weights-loader.md`.
- Supported FP8 FMHA for the NVIDIA Ada Lovelace architecture.
- Supported GPT-J, Phi, Phi-3, Qwen, GPT, GLM, Baichuan, Falcon and Gemma models for the `LLM` class; see the quickstart sketch after this list.
- Supported FP8 OOTB MoE.
- Supported Starcoder2 SmoothQuant. (#1886)
- Supported ReDrafter speculative decoding, see the “ReDrafter” section in `docs/source/speculative_decoding.md`.
- Supported padding removal for BERT, thanks to the contribution from @Altair-Alpha in #1834.
- Added in-flight batching support for the GLM 10B model.
- Supported the `gelu_pytorch_tanh` activation function, thanks to the contribution from @ttim in #1897.
- Added a `chunk_length` parameter to Whisper, thanks to the contribution from @MahmoudAshraf97 in #1909.
- Added a `concurrency` argument for `gptManagerBenchmark`.
- The Executor API supports requests with different beam widths, see `docs/source/executor.md#sending-requests-with-different-beam-widths` and the beam-width sketch after this list.
- Added the `--fast_build` flag to the `trtllm-build` command (experimental).
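As a quick illustration of the expanded `LLM` class coverage above, here is a minimal generation sketch modeled on the LLM API examples. The model ID is a placeholder, and constructor details may differ between releases, so treat this as a sketch rather than the canonical usage.

```python
# Minimal sketch of the high-level LLM API with one of the newly supported
# model families. The model ID is a placeholder; any of GPT-J, Phi, Phi-3,
# Qwen, GPT, GLM, Baichuan, Falcon or Gemma should be loadable the same way.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-7B-Instruct")  # hypothetical Hugging Face model ID

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(["What is TensorRT-LLM?"], sampling_params):
    print(output.outputs[0].text)
```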
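And for the mixed-beam-width Executor feature, a rough sketch through the executor Python bindings. The engine directory is a placeholder, the engine is assumed to have been built with a large enough beam width, and the exact binding signatures should be checked against `docs/source/executor.md` and the bindings examples.

```python
# Sketch: two in-flight requests with different beam widths. Assumes an
# engine built to allow beams up to 4; names follow the executor Python
# bindings and may not match this release exactly.
import tensorrt_llm.bindings.executor as trtllm

executor = trtllm.Executor(
    "/path/to/engine_dir",               # placeholder engine directory
    trtllm.ModelType.DECODER_ONLY,
    trtllm.ExecutorConfig(max_beam_width=4),
)

# Each request carries its own SamplingConfig, so beam widths can differ.
requests = [
    trtllm.Request(input_token_ids=[1, 2, 3, 4], max_new_tokens=16,
                   sampling_config=trtllm.SamplingConfig(beam_width=1)),
    trtllm.Request(input_token_ids=[5, 6, 7, 8], max_new_tokens=16,
                   sampling_config=trtllm.SamplingConfig(beam_width=4)),
]

for req_id in [executor.enqueue_request(r) for r in requests]:
    for response in executor.await_responses(req_id):
        print(response.result.output_token_ids)
```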
API Changes
- [BREAKING CHANGE] `max_output_len` is removed from the `trtllm-build` command; to limit the sequence length at the engine build stage, specify `max_seq_len` instead (a build-API sketch follows this list).
- [BREAKING CHANGE] The `use_custom_all_reduce` argument is removed from `trtllm-build`.
- [BREAKING CHANGE] The `multi_block_mode` argument is moved from the build stage (`trtllm-build` and builder API) to the runtime.
- [BREAKING CHANGE] The build-time argument `context_fmha_fp32_acc` is moved to the runtime for decoder models.
- [BREAKING CHANGE] The `tp_size`, `pp_size` and `cp_size` arguments are removed from the `trtllm-build` command.
- The C++ batch manager API is deprecated in favor of the C++ `executor` API, and it will be removed in a future release of TensorRT-LLM.
- Added a version API to the C++ library; a `cpp/include/tensorrt_llm/executor/version.h` file will be generated.
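To make the `max_output_len` removal concrete, here is a rough equivalent using the Python build API. The checkpoint and engine paths are placeholders, and the exact `BuildConfig` field set should be verified against this release's documentation.

```python
# Sketch: capping the total sequence length (input + output tokens) at build
# time via max_seq_len, which replaces the removed max_output_len.
# Paths are placeholders.
from tensorrt_llm import BuildConfig, build
from tensorrt_llm.models import LLaMAForCausalLM

model = LLaMAForCausalLM.from_checkpoint("/path/to/trtllm_checkpoint")

build_config = BuildConfig(max_seq_len=4096)  # upper bound on the full sequence

engine = build(model, build_config)
engine.save("/path/to/engine_dir")
```

On the command line, the corresponding knob is `--max_seq_len` on `trtllm-build`.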
Model Updates
- Supported LLaMA 3.1 model.
- Supported Mamba-2 model.
- Supported EXAONE model, see `examples/exaone/README.md`.
- Supported Qwen 2 model.
- Supported GLM4 models, see `examples/chatglm/README.md`.
- Added LLaVa-1.6 (LLaVa-NeXT) multimodal support, see the “LLaVA, LLaVa-NeXT and VILA” section in `examples/multimodal/README.md`.
Fixed Issues
- Fixed wrong pad token for the CodeQwen models. (#1953)
- Fixed a typo in `cluster_infos` defined in `tensorrt_llm/auto_parallel/cluster_info.py`, thanks to the contribution from @saeyoonoh in #1987.
- Removed duplicated flags in the command at `docs/source/reference/troubleshooting.md`, thanks to the contribution from @hattizai in #1937.
- Fixed a segmentation fault in the TopP sampling layer, thanks to the contribution from @akhoroshev in #2039. (#2040)
- Fixed the failure when converting the checkpoint for Mistral Nemo model. (#1985)
- Propagated `exclude_modules` to weight-only quantization, thanks to the contribution from @fjosw in #2056.
- Fixed wrong links in README, thanks to the contribution from @Tayef-Shah in #2028.
- Fixed some typos in the documentation, thanks to the contribution from @lfz941 in #1939.
- Fixed the engine build failure when the deduced `max_seq_len` is not an integer. (#2018)
Infrastructure Changes
- Base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.07-py3`.
- Base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.07-py3`.
- The dependent TensorRT version is updated to 10.3.0.
- The dependent CUDA version is updated to 12.5.1.
- The dependent PyTorch version is updated to 2.4.0.
- The dependent ModelOpt version is updated to v0.15.0.
Known Issues
- On Windows, installation of TensorRT-LLM may succeed, but you might hit `OSError: exception: access violation reading 0x0000000000000000` when importing the library in Python. See Installing on Windows for workarounds.
Currently, there are two key branches in the project:
- The `rel` branch is the stable branch for TensorRT-LLM releases. It has been QA-ed and carefully tested.
- The `main` branch is the dev branch. It is more experimental.

We are updating the `main` branch regularly with new features, bug fixes, and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequency depends on your feedback.
Thanks,
The TensorRT-LLM Engineering Team