NVIDIA Triton Inference Server

The NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.

What's New In 1.12.0

Add queuing policies for the dynamic batching scheduler. These policies are specified in the model configuration and allow each model to set a maximum queue size, timeouts, and priority levels for inference requests.
Support for large ONNX models where weights are stored in separate files.
Allow ONNX Runtime optimization level to be configured via the model configuration optimization setting.
Experimental Python client and server support for community standard GRPC inferencing API.
Add --min-supported-compute-capability flag to allow Triton Server to use older, unsupported GPUs.
Fix perf_client shared memory support. In some cases the shared-memory option did not work correctly, depending on the input and output tensor names. This issue is now resolved.
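As a rough illustration of the queuing policies mentioned in the list above, the following sketch writes a partial model configuration that sets a queue size limit, a timeout, and priority levels. The repository path and the model name example_model are hypothetical, the values are arbitrary, and the field names reflect the model configuration schema as best understood here; check them against the model_config.proto shipped with this release before relying on them.

```shell
set -eu

# Hypothetical repository layout and model name, chosen for illustration only.
MODEL_DIR="model_repository/example_model"
mkdir -p "$MODEL_DIR"

# Write a partial config.pbtxt showing only the scheduling section.
# Field names are assumed from the model configuration schema; verify them
# against model_config.proto for this release.
cat > "$MODEL_DIR/config.pbtxt" <<'EOF'
name: "example_model"
max_batch_size: 8
# platform, input and output sections are omitted from this sketch
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
  # Two priority levels; requests without an explicit priority use level 2.
  priority_levels: 2
  default_priority_level: 2
  default_queue_policy {
    # Reject requests that wait in the queue longer than 0.5 seconds.
    timeout_action: REJECT
    default_timeout_microseconds: 500000
    allow_timeout_override: true
    # Limit the number of requests held in the queue.
    max_queue_size: 16
  }
}
EOF

echo "wrote $MODEL_DIR/config.pbtxt"
```

A complete configuration would also describe the model's platform, inputs, and outputs; the fragment is only meant to show where the queue policy settings would sit inside the dynamic batching section.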

Known Issues

TensorRT reformat-free I/O is not supported.
Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.12.0_ubuntu1804.clients.tar.gz file. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.

Custom Backend SDK

Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.12.0_ubuntu1804.custombackend.tar.gz file. See the documentation section 'Building a Custom Backend' for more information on using these files.

Jetson Jetpack Support

An experimental release of Triton for the Developer Preview of JetPack 4.4 (https://developer.nvidia.com/embedded/jetpack) is provided in the attached file: v1.12.0-jetpack4.4dp.tgz. This experimental release supports the TensorFlow (1.15.2), TensorRT (7.1) and Custom backends as well as ensembles. GPU metrics, GCS storage and S3 storage are not supported.

The tar file contains the Triton executable and shared libraries and also the C++ and Python client libraries and examples.

Installation and Usage

The following dependencies must be installed before running Triton.

apt-get update && \
    apt-get install -y --no-install-recommends \
        software-properties-common \
        autoconf \
        automake \
        build-essential \
        cmake \
        git \
        libgoogle-glog0v5 \
        libre2-dev \
        libssl-dev \
        libtool \
        libboost-dev \
        libcurl4-openssl-dev \
        zlib1g-dev

Additionally, to run the clients, the following dependencies must be installed.

apt-get install -y --no-install-recommends \
        curl \
        libopencv-dev=3.2.0+dfsg-4ubuntu0.1 \
        libopencv-core-dev=3.2.0+dfsg-4ubuntu0.1 \
        pkg-config \
        python3 \
        python3-pip \
        python3-dev

python3 -m pip install --upgrade wheel setuptools
python3 -m pip install --upgrade grpcio-tools numpy pillow

The Python wheel for the Python client library is included in the tar file and can be installed by running the following command:

python3 -m pip install --upgrade clients/python/tensorrtserver-*.whl

Release 1.12.0 corresponds to NGC container 20.03.