Hi,
We are very pleased to announce the 0.15.0 version of TensorRT-LLM. This update includes:
Key Features and Enhancements
- Added support for EAGLE. Refer to `examples/eagle/README.md`.
- Added functional support for GH200 systems.
- Added AutoQ (mixed precision) support.
- Added a `trtllm-serve` command to start a FastAPI-based server (see the client sketch after this list).
- Added FP8 support for Nemotron NAS 51B. Refer to `examples/nemotron_nas/README.md`.
- Added INT8 support for GPTQ quantization.
- Added TensorRT native support for INT8 Smooth Quantization.
- Added quantization support for the Exaone model. Refer to `examples/exaone/README.md`.
- Enabled Medusa for Qwen2 models. Refer to the "Medusa with Qwen2" section in `examples/medusa/README.md`.
- Optimized pipeline parallelism with ReduceScatter and AllGather for Mixtral models.
- Added support for the `Qwen2ForSequenceClassification` model architecture.
- Added Python plugin support to simplify plugin development. Refer to `examples/python_plugin/README.md`.
- Added support for different rank dimensions for LoRA modules when using the Hugging Face format. Thanks to @AlessioNetti for the contribution in #2366.
- Enabled embedding sharing by default. Refer to the "Embedding Parallelism, Embedding Sharing, and Look-Up Plugin" section in `docs/source/performance/perf-best-practices.md` for the conditions required for embedding sharing.
- Added support for per-token, per-channel FP8 (namely, row-wise FP8) on Ada.
- Extended the maximum supported `beam_width` to `256`.
- Added FP8 and INT8 SmoothQuant quantization support for the InternVL2-4B variant (LLM model only). Refer to `examples/multimodal/README.md`.
- Added support for prompt-lookup speculative decoding. Refer to `examples/prompt_lookup/README.md`.
- Integrated the QServe w4a8 per-group/per-channel quantization. Refer to the "w4aINT8 quantization (QServe)" section in `examples/llama/README.md`.
- Added a C++ example for fast logits using the `executor` API. Refer to the "executorExampleFastLogits" section in `examples/cpp/executor/README.md`.
- [BREAKING CHANGE] NVIDIA Volta GPU support is removed in this and future releases.
- Added the following enhancements to the LLM API (see the usage sketch after this list):
  - [BREAKING CHANGE] Moved the runtime initialization from the first invocation of `LLM.generate` to `LLM.__init__` for better generation performance without warmup.
  - Added `n` and `best_of` arguments to the `SamplingParams` class. These arguments enable returning multiple generations for a single request.
  - Added `ignore_eos`, `detokenize`, `skip_special_tokens`, `spaces_between_special_tokens`, and `truncate_prompt_tokens` arguments to the `SamplingParams` class. These arguments enable more control over tokenizer behavior.
  - Added support for incremental detokenization to improve detokenization performance for streaming generation.
  - Added the `enable_prompt_adapter` argument to the `LLM` class and the `prompt_adapter_request` argument to the `LLM.generate` method. These arguments enable prompt tuning.
- Added a `gpt_variant` argument to `examples/gpt/convert_checkpoint.py`. This enhancement enables checkpoint conversion with more GPT model variants. Thanks to @tonylek for the contribution in #2352.
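For the new `trtllm-serve` command, the following is a minimal client sketch. The host, port, route, and JSON fields are assumptions based on a typical OpenAI-style completions endpoint (an OpenAI API server is mentioned under API Changes below) and are not taken from these notes; consult the `trtllm-serve` documentation for the actual interface.

```python
# Hypothetical client for the FastAPI-based server started by `trtllm-serve`.
# The URL, route, and JSON fields below are assumptions in the OpenAI style,
# not values confirmed by this release announcement.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",  # assumed default host/port/route
    json={
        "model": "my-model",                 # placeholder model name
        "prompt": "Hello, TensorRT-LLM!",
        "max_tokens": 32,
    },
    timeout=60,
)
print(response.json())
```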
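The LLM API enhancements listed above can be combined as in the minimal sketch below. The argument names come from this announcement; the checkpoint path is a placeholder and `max_tokens` is assumed to be the output-length argument, so treat this as an illustration rather than a verified example.

```python
# Minimal sketch of the 0.15.0 LLM API enhancements described above.
from tensorrt_llm import LLM, SamplingParams

# [BREAKING CHANGE] Runtime initialization now happens in LLM.__init__,
# so the first generate() call no longer pays a warmup cost.
llm = LLM(model="path/to/model-checkpoint")  # placeholder path

params = SamplingParams(
    max_tokens=64,             # assumed name for the output-length limit
    best_of=4,                 # sample four candidates per request...
    n=2,                       # ...and return the two best generations
    ignore_eos=False,          # new tokenizer/decoding controls
    skip_special_tokens=True,
)

for output in llm.generate(["Explain KV caching in one sentence."], params):
    for candidate in output.outputs:
        print(candidate.text)
```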
API Changes
- [BREAKING CHANGE] Moved the `builder_force_num_profiles` flag of the `trtllm-build` command to the `BUILDER_FORCE_NUM_PROFILES` environment variable.
- [BREAKING CHANGE] Modified the defaults of the `BuildConfig` class so that they are aligned with the `trtllm-build` command.
- [BREAKING CHANGE] Removed the Python bindings of `GptManager`.
- [BREAKING CHANGE] `auto` is now the default value for the `--dtype` option in the quantization and checkpoint conversion scripts.
- [BREAKING CHANGE] Deprecated the `gptManager` API path in `gptManagerBenchmark`.
- [BREAKING CHANGE] Deprecated the `beam_width` and `num_return_sequences` arguments of the `SamplingParams` class in the LLM API. Use the `n`, `best_of`, and `use_beam_search` arguments instead (see the migration sketch after this list).
- Exposed the `--trust_remote_code` argument to the OpenAI API server. (#2357)
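For the deprecated `beam_width` and `num_return_sequences` arguments, the sketch below uses the replacement arguments named in the deprecation note; the exact one-to-one mapping is an assumption based on that wording, and `max_tokens` is again an assumed argument name.

```python
# Migration sketch for the deprecated SamplingParams arguments above.
from tensorrt_llm import SamplingParams

# Before (deprecated in 0.15.0):
#   params = SamplingParams(beam_width=4, num_return_sequences=2)

# After: beam search with four beams, returning the two best sequences.
# The mapping beam_width -> best_of and num_return_sequences -> n is an
# assumption based on the deprecation note, not a verified recipe.
params = SamplingParams(
    max_tokens=64,        # assumed name for the output-length limit
    use_beam_search=True,
    best_of=4,            # number of beams
    n=2,                  # number of sequences returned
)
```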
Model Updates
- Added support for the Llama 3.2 and Llama 3.2-Vision models. Refer to `examples/mllama/README.md` for more details on the Llama 3.2-Vision model.
- Added support for DeepSeek-V2. Refer to `examples/deepseek_v2/README.md`.
- Added support for Cohere Command R models. Refer to `examples/commandr/README.md`.
- Added support for Falcon 2. Refer to `examples/falcon/README.md`. Thanks to @puneeshkhanna for the contribution in #1926.
- Added support for InternVL2. Refer to `examples/multimodal/README.md`.
- Added support for the Qwen2-0.5B and Qwen2.5-1.5B models. (#2388)
- Added support for Minitron. Refer to `examples/nemotron`.
- Added a GPT variant: Granite (20B and 34B). Refer to the "GPT Variant - Granite" section in `examples/gpt/README.md`.
- Added support for the LLaVA-OneVision model. Refer to the "LLaVA, LLaVa-NeXT, LLaVA-OneVision and VILA" section in `examples/multimodal/README.md`.
Fixed Issues
- Fixed a slice error in the forward function. (#1480)
- Fixed an issue that occurred when building BERT. (#2373)
- Fixed an issue where the model was not loaded when building BERT. (#2379)
- Fixed the broken executor examples. (#2294)
- Fixed an issue where the `moeTopK()` kernel could not find the correct expert when the number of experts is not a power of two. Thanks to @dongjiyingdjy for reporting this bug.
- Fixed an assertion failure on `crossKvCacheFraction`. (#2419)
- Fixed an issue when using SmoothQuant to quantize the Qwen2 model. (#2370)
- Fixed a PDL typo in `docs/source/performance/perf-benchmarking.md`. Thanks to @MARD1NO for pointing it out in #2425.
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:24.10-py3`.
- The base Docker image for the TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:24.10-py3`.
- The dependent TensorRT version is updated to 10.6.
- The dependent CUDA version is updated to 12.6.2.
- The dependent PyTorch version is updated to 2.5.1.
- The dependent ModelOpt version is updated to 0.19 on the Linux platform, while 0.17 is still used on the Windows platform.
Documentation
- Added a copy button for code snippets in the documentation. (#2288)
We are updating the `main` branch regularly with new features, bug fixes, and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequency depends on your feedback.
Thanks,
The TensorRT-LLM Engineering Team