github NVIDIA/TensorRT-LLM v0.9.0
TensorRT-LLM 0.9.0 Release

9 months ago

Hi,

We are very pleased to announce the 0.9.0 version of TensorRT-LLM. It has been an intense effort, and we hope that it will enable you to easily deploy GPU-based inference for state-of-the-art LLMs. We want TensorRT-LLM to help you run those LLMs very fast.

This update includes:

  • Model Support
    • Support distil-whisper, thanks to the contribution from @Bhuvanesh09 in PR #1061
    • Support HuggingFace StarCoder2
    • Support VILA
    • Support Smaug-72B-v0.1
    • Migrate BLIP-2 examples to examples/multimodal
  • Features
    • [BREAKING CHANGE] TopP sampling optimization with deterministic AIR TopP algorithm is enabled by default
    • [BREAKING CHANGE] Support embedding sharing for Gemma
    • Add support for context chunking to work with KV cache reuse
    • Enable different rewind tokens per sequence for Medusa
    • BART LoRA support (limited to the Python runtime)
    • Enable multi-LoRA for BART LoRA
    • Support early_stopping=False in beam search for C++ Runtime
    • Add logits post processor to the batch manager (see docs/source/batch_manager.md#logits-post-processor-optional)
    • Support importing and converting HuggingFace Gemma checkpoints, thanks to the contribution from @mfuntowicz in #1147
    • Support loading Gemma from HuggingFace
    • Support auto parallelism planner for high-level API and unified builder workflow
    • Support running GptSession without OpenMPI #1220
    • Medusa IFB support
    • [Experimental] Support FP8 FMHA. Note that the performance is not yet optimal; we will keep optimizing it
    • More head sizes support for LLaMA-like models
      • Ampere (sm80, sm86), Ada (sm89), and Hopper (sm90) now all support head sizes [32, 40, 64, 80, 96, 104, 128, 160, 256].
    • OOTB functionality support
      • T5
      • Mixtral 8x7B
  • API
    • C++ executor API
      • Add Python bindings, see documentation and examples in examples/bindings
      • Add advanced and multi-GPU examples for Python binding of executor C++ API, see examples/bindings/README.md
      • Add documents for C++ executor API, see docs/source/executor.md
    • High-level API (refer to examples/high-level-api/README.md for guidance)
      • [BREAKING CHANGE] Reuse the QuantConfig used in the trtllm-build tool to support broader quantization features
      • Support accepting engines built by the trtllm-build command in the LLM() API
      • Add support for TensorRT-LLM checkpoint as model input
      • Refine SamplingConfig used in LLM.generate or LLM.generate_async APIs, with the support of beam search, a variety of penalties, and more features
      • Add support for the StreamingLLM feature; enable it by setting LLM(streaming_llm=...)
      • Migrate Mixtral to high level API and unified builder workflow
    • [BREAKING CHANGE] Refactored Qwen model to the unified build workflow, see examples/qwen/README.md for the latest commands
    • [BREAKING CHANGE] Move LLaMA convert checkpoint script from examples directory into the core library
    • [BREAKING CHANGE] Refactor GPT with unified building workflow, see examples/gpt/README.md for the latest commands
    • [BREAKING CHANGE] Removed all LoRA-related flags from the convert_checkpoint.py script and the checkpoint content, and moved them to the trtllm-build command, to better generalize the feature to more models
    • [BREAKING CHANGE] Removed the use_prompt_tuning flag and options from the convert_checkpoint.py script and the checkpoint content, to better generalize the feature to more models. Use trtllm-build --max_prompt_embedding_table_size instead.
    • [BREAKING CHANGE] Changed the trtllm-build --world_size flag to the --auto_parallel flag; the option is used for the auto parallel planner only.
    • [BREAKING CHANGE] AsyncLLMEngine is removed. The tensorrt_llm.GenerationExecutor class is refactored to work both when launched explicitly with mpirun at the application level and when given an MPI communicator created by mpi4py
    • [BREAKING CHANGE] examples/server is removed, see examples/app instead.
    • [BREAKING CHANGE] Remove LoRA related parameters from convert checkpoint scripts
    • [BREAKING CHANGE] Simplify Qwen convert checkpoint script
    • [BREAKING CHANGE] Remove model parameter from gptManagerBenchmark and gptSessionBenchmark
  • Bug fixes
    • Fix a weight-only quant bug for Whisper to make sure that the encoder_input_len_range is not 0, thanks to the contribution from @Eddie-Wang1120 in #992
    • Fix the issue that log probabilities in Python runtime are not returned #983
    • Multi-GPU fixes for multimodal examples #1003
    • Fix wrong end_id issue for Qwen #987
    • Fix a non-stopping generation issue #1118 #1123
    • Fix wrong link in examples/mixtral/README.md #1181
    • Fix LLaMA2-7B bad results when int8 kv cache and per-channel int8 weight only are enabled #967
    • Fix wrong head_size when importing Gemma model from HuggingFace Hub, thanks to the contribution from @mfuntowicz in #1148
    • Fix ChatGLM2-6B building failure on INT8 #1239
    • Fix wrong relative path in Baichuan documentation #1242
    • Fix wrong SamplingConfig tensors in ModelRunnerCpp #1183
    • Fix error when converting SmoothQuant LLaMA #1267
    • Fix the issue that examples/run.py only loads one line from --input_file
    • Fix the issue that ModelRunnerCpp does not transfer SamplingConfig tensor fields correctly #1183
  • Benchmark
    • Add emulated static batching in gptManagerBenchmark
    • Support arbitrary dataset from HuggingFace for C++ benchmarks, see “Prepare dataset” section in benchmarks/cpp/README.md
    • Add percentile latency report to gptManagerBenchmark
  • Performance
    • Optimize gptDecoderBatch to support batched sampling
    • Enable FMHA for models in BART, Whisper and NMT family
    • Remove router tensor parallelism to improve performance for MoE models, thanks to the contribution from @megha95 in #1091
    • Improve custom all-reduce kernel
  • Infra
    • Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.02-py3
    • Base Docker image for TensorRT-LLM backend is updated to nvcr.io/nvidia/tritonserver:24.02-py3
    • The dependent TensorRT version is updated to 9.3
    • The dependent PyTorch version is updated to 2.2
    • The dependent CUDA version is updated to 12.3.2 (a.k.a. 12.3 Update 2)
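
As an illustration of the head-size entry under Features, here is a minimal Python sketch that checks whether a model shape falls inside the listed head sizes. It assumes the usual relation head size = hidden size / number of attention heads. The names SUPPORTED_HEAD_SIZES, SUPPORTED_SM, head_size, and is_head_size_listed are hypothetical helpers written for this note; they are not part of the TensorRT-LLM API, and the check says nothing about other build constraints.

```python
# Values copied from the head-size entry in the release notes.
# All names below are hypothetical helpers, not TensorRT-LLM APIs.
SUPPORTED_HEAD_SIZES = (32, 40, 64, 80, 96, 104, 128, 160, 256)
SUPPORTED_SM = {
    "sm80": "Ampere",
    "sm86": "Ampere",
    "sm89": "Ada",
    "sm90": "Hopper",
}


def head_size(hidden_size: int, num_attention_heads: int) -> int:
    """Per-head dimension, assuming hidden_size splits evenly across heads."""
    if num_attention_heads <= 0 or hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return hidden_size // num_attention_heads


def is_head_size_listed(hidden_size: int, num_attention_heads: int, sm: str) -> bool:
    """True if both the architecture and the head size appear in the list above."""
    if sm not in SUPPORTED_SM:
        return False
    return head_size(hidden_size, num_attention_heads) in SUPPORTED_HEAD_SIZES


if __name__ == "__main__":
    # A 4096-wide model with 32 heads has head size 128.
    print(is_head_size_listed(4096, 32, "sm80"))
```

A check like this is only a quick pre-flight test against the list in these notes; the build tooling remains the authority on what a given release accepts.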

Currently, there are two key branches in the project:

  • The rel branch is the stable branch for the release of TensorRT-LLM. It has been QA-ed and carefully tested.
  • The main branch is the dev branch. It is more experimental.

We are updating the main branch regularly with new features, bug fixes and performance optimizations. The stable branch will be updated less frequently, and the exact frequency depends on your feedback.

Thanks,

The TensorRT-LLM Engineering Team
