Hi,
We are very pleased to announce the 0.17.0 version of TensorRT-LLM. This update includes:
Model Support
- Added InternLM-XComposer2 support. Refer to the “InternLM-XComposer2” section in `examples/multimodal/README.md`.
Features
- Blackwell support
  - Added support for B200.
  - Added support for the GeForce RTX 50 series on Windows Subsystem for Linux (WSL) for a limited set of models.
  - Added NVFP4 GEMM support for Llama and Mixtral models.
  - Added NVFP4 support for the `LLM` API and the `trtllm-bench` command (see the NVFP4 sketch at the end of this section).
  - GB200 NVL is not fully supported.
  - Added a benchmark script to measure the performance benefits of KV cache host offloading, with expected runtime improvements on GH200.
- PyTorch workflow
  - The PyTorch workflow is an experimental feature in `tensorrt_llm._torch`. The following is a list of supported infrastructure, models, and features that can be used with the PyTorch workflow (see the PyTorch workflow sketch at the end of this section).
  - Added support for H100/H200/B200.
  - Added support for Llama models, Mixtral, Qwen, and VILA.
  - Added support for FP16/BF16/FP8/NVFP4 GEMMs and FP16/BF16/FP8 KV cache.
  - Added custom context and decoding attention kernels via PyTorch custom ops.
  - Added support for chunked context (disabled by default).
  - Added CUDA graph support for decoding only.
  - Added an overlap scheduler that overlaps input preparation with the model forward pass by decoding one extra token.
- Added FP8 context FMHA support for the W4A8 quantization workflow.
- Added ModelOpt quantized checkpoint support for the `LLM` API.
- Added `min_p` sampling support (see the sampling sketch at the end of this section). Refer to https://arxiv.org/pdf/2407.01082.
- Added FP8 support for encoder-decoder models. Refer to the “FP8 Post-Training Quantization” section in `examples/enc_dec/README.md`.
- Added up and gate projection fusion support for LoRA modules.
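For reference, here is a minimal sketch of driving the new NVFP4 path from the `LLM` API. The `QuantConfig`/`QuantAlgo.NVFP4` names, the `quant_config` argument, and the model path are assumptions based on the LLM API quantization interface rather than verbatim 0.17.0 documentation; adjust them to your installation.

```python
# Hypothetical sketch: NVFP4 quantization through the LLM API.
# Assumptions: QuantConfig/QuantAlgo are importable from tensorrt_llm.llmapi,
# QuantAlgo.NVFP4 exists in this release, and LLM accepts quant_config;
# the model path is a placeholder.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

quant_config = QuantConfig(quant_algo=QuantAlgo.NVFP4)
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quant_config=quant_config)

# Generate with default sampling settings and print the completions.
for output in llm.generate(["Explain NVFP4 quantization in one sentence."]):
    print(output.outputs[0].text)
```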
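The experimental PyTorch workflow lives under `tensorrt_llm._torch`. A minimal sketch follows; the `LLM` class name under that module and its `generate()` interface are assumptions modeled on the standard LLM API, and, being experimental, the interface may change.

```python
# Hypothetical sketch: the experimental PyTorch workflow.
# Assumption: the experimental backend exposes an LLM class under
# tensorrt_llm._torch with a generate() interface mirroring the standard LLM API.
from tensorrt_llm._torch import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model path
print(llm.generate(["Hello, my name is"])[0].outputs[0].text)
```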
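Finally, a short sketch of `min_p` sampling. It assumes `min_p` is exposed as a `SamplingParams` field; the exact parameter name is an assumption.

```python
# Hypothetical sketch: min_p sampling through the LLM API.
# Assumption: min_p is exposed as a SamplingParams field in this release.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model path

# min_p keeps only tokens whose probability is at least min_p times that of the
# most likely token, then renormalizes (see the linked paper).
params = SamplingParams(temperature=0.8, min_p=0.1)
print(llm.generate(["The capital of France is"], params)[0].outputs[0].text)
```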
API
- [BREAKING CHANGE] `paged_context_fmha` and `fp8_context_fmha` are enabled by default (a sketch of overriding these defaults follows this list).
- [BREAKING CHANGE] KV cache reuse is enabled automatically when `paged_context_fmha` is enabled.
- [BREAKING CHANGE] `tokens_per_block` is set to 32 by default.
- Added `--concurrency` support for the `throughput` subcommand of `trtllm-bench`.
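If your workflow depends on the previous defaults, they can still be overridden at build time. The sketch below is an assumption-laden illustration: the `BuildConfig`/`plugin_config` attribute names shown are not guaranteed to match this release exactly, so verify them against your installed version.

```python
# Hypothetical sketch: reverting the new build-time defaults through the LLM API.
# Assumptions: BuildConfig exposes plugin_config.use_paged_context_fmha and
# plugin_config.use_fp8_context_fmha under these exact names; verify locally.
from tensorrt_llm import LLM, BuildConfig

build_config = BuildConfig()
build_config.plugin_config.use_paged_context_fmha = False  # revert new default
build_config.plugin_config.use_fp8_context_fmha = False    # revert new default
# Note: disabling paged context FMHA also turns off automatic KV cache reuse.

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", build_config=build_config)
```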
Bug fixes
- Fixed an incorrect LoRA output dimension. Thanks to @akhoroshev for the contribution in #2484.
- Added the NVIDIA H200 GPU to the `cluster_key` for the auto parallelism feature. (#2552)
- Fixed a workspace size issue in the GPT attention plugin. Thanks to @AIDC-AI for the contribution.
- Fixed an assertion error in the LoRA plugin. (#2282)
We are updating the `main` branch regularly with new features, bug fixes, and performance optimizations. The `rel` branch will be updated less frequently, and the exact frequency will depend on your feedback.
Thanks,
The TensorRT-LLM Engineering Team