github triton-inference-server/server v1.9.0
Release 1.9.0, corresponding to NGC container 19.12


NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.9.0

  • The model configuration now includes a model warmup option. This option provides the ability to tune and optimize the model before inference requests are received, avoiding initial inference delays. This option is especially useful for frameworks like TensorFlow that perform network optimization in response to the initial inference requests. Models can be warmed up with one or more synthetic or realistic workloads before they become ready in the server.
  • An enhanced sequence batcher now has multiple scheduling strategies. A new Oldest strategy integrates with the dynamic batcher to enable improved inference performance for models that don’t require all inference requests in a sequence to be routed to the same batch slot.
  • The perf_client now has an option to generate requests using a realistic Poisson distribution or a user-provided distribution.
  • A new repository API (available in the shared library API, HTTP, and GRPC) returns an index of all models available in the model repositories visible to the server. This index can be used to see which models are available for loading onto the server.
  • The server status API now also reports the timestamp of the last inference request received for each model.
  • Inference server tracing capabilities are now documented in the Optimization section of the User Guide. Tracing support is enhanced to provide traces for ensembles and their contained models.
  • A community contributed Dockerfile is now available to build the TensorRT Inference Server clients on CentOS.
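The warmup and sequence batcher options above are set in a model's config.pbtxt. The sketch below is illustrative only: the model name, tensor names, and dimensions are hypothetical, and exact field availability in this release should be checked against the model_config.proto shipped with 1.9.0.

```
# Hypothetical config.pbtxt sketch (names and dims are made up).
name: "my_model"
platform: "tensorflow_graphdef"
max_batch_size: 8
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 16 ]
  }
]

# Warm the model with one synthetic, zero-filled request before the
# server marks it ready, so the first real request does not pay the
# framework's optimization cost.
model_warmup [
  {
    name: "zero_warmup"
    batch_size: 1
    inputs {
      key: "INPUT0"
      value {
        data_type: TYPE_FP32
        dims: [ 16 ]
        zero_data: true
      }
    }
  }
]

# Select the new Oldest scheduling strategy for the sequence batcher,
# which works with the dynamic batcher instead of pinning every request
# in a sequence to the same batch slot.
sequence_batching {
  oldest {
    max_candidate_sequences: 4
  }
}
```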
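The perf_client distribution option mentioned above can be exercised from the command line. The flag names below are taken from later perf_analyzer documentation and are an assumption for this release; the model name and endpoint are placeholders.

```
# Sketch: drive a model with requests arriving at ~100 inferences/sec
# following a Poisson (rather than constant) inter-arrival distribution.
# Flag names may differ slightly in this release.
perf_client -m my_model -u localhost:8001 -i grpc \
    --request-rate-range 100 \
    --request-distribution poisson
```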

Known Issues

  • The beta of the custom backend API version 2 has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
    • The signature of the CustomGetNextInputV2Fn_t function adds the memory_type_id argument.
    • The signature of the CustomGetOutputV2Fn_t function adds the memory_type_id argument.
  • The beta of the inference server library API has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
    • The signature and operation of the TRTSERVER_ResponseAllocatorAllocFn_t function have changed. See src/core/trtserver.h for a description of the new behavior.
    • The signature of the TRTSERVER_InferenceRequestProviderSetInputData function adds the memory_type_id argument.
    • The signature of the TRTSERVER_InferenceResponseOutputData function adds the memory_type_id argument.
  • TensorRT reformat-free I/O is not supported.
  • Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.9.0_ubuntu1604.clients.tar.gz and v1.9.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.9.0_ubuntu1604.custombackend.tar.gz and v1.9.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.
