NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.2.0

Ensembling is now available. An ensemble represents a pipeline of one or more models and the connection of input and output tensors between those models. A single inference request to an ensemble will trigger the execution of the entire pipeline.
Added a Helm chart that deploys a single TensorRT Inference Server into a Kubernetes cluster.
The client Makefile now supports building for both Ubuntu 16.04 and Ubuntu 18.04. The Python wheel produced from the build is now compatible with both Python 2 and Python 3.
The perf_client application now has a --percentile flag that reports latency at the given percentile instead of the average latency (reporting the average remains the default). For example, using --percentile=99 causes perf_client to report the 99th percentile latency.
The perf_client application now has a -z option to use zero-valued input tensors instead of random values.
Improved error reporting of incorrect input/output tensor names for TensorRT models.
Added --allow-gpu-metrics option to enable/disable reporting of GPU metrics.
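
To make the ensembling item above more concrete, the following is a conceptual sketch in Python of a pipeline of models whose input and output tensors are connected by name. It is not the server's configuration format or API. The function run_ensemble, the two toy models, and all tensor names are hypothetical and exist only for illustration.

```python
def run_ensemble(steps, request_inputs):
    """Run each step in order, wiring tensors between models by name.

    Each step is (model_fn, input_map, output_map), where the maps translate
    a model's own tensor names to names in the shared ensemble tensor pool.
    All of this is a hypothetical illustration, not the server's API.
    """
    tensors = dict(request_inputs)  # pool of named tensors for this request
    for model_fn, input_map, output_map in steps:
        # Collect this model's inputs from the shared pool.
        model_inputs = {name: tensors[pool_name] for name, pool_name in input_map.items()}
        model_outputs = model_fn(model_inputs)
        # Publish this model's outputs back into the pool for later steps.
        for name, pool_name in output_map.items():
            tensors[pool_name] = model_outputs[name]
    return tensors


def preprocess(inputs):
    # Toy "model": scale raw byte values into the range [0, 1].
    return {"SCALED": [x / 255.0 for x in inputs["RAW"]]}


def classify(inputs):
    # Toy "model": return the index of the largest value.
    values = inputs["DATA"]
    return {"BEST_INDEX": values.index(max(values))}


steps = [
    (preprocess, {"RAW": "image"}, {"SCALED": "normalized"}),
    (classify, {"DATA": "normalized"}, {"BEST_INDEX": "label"}),
]

# One request triggers the whole pipeline.
result = run_ensemble(steps, {"image": [0, 255, 128]})
print(result["label"])
```

The point of the sketch is that the caller supplies only the first input and reads only the final output; the intermediate tensor is passed between the models inside the pipeline.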
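
The difference between average latency and a percentile latency, as selected by the --percentile flag above, can be illustrated with a small Python sketch. The release notes do not say how perf_client computes percentiles, so the nearest-rank method used here is an assumption, and the function names and sample values are hypothetical.

```python
import math


def percentile_latency(latencies_us, percentile):
    """Nearest-rank percentile of a list of latency samples.

    Assumption: nearest-rank is used for illustration only; perf_client
    may compute percentiles differently.
    """
    if not latencies_us:
        raise ValueError("need at least one latency sample")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(latencies_us)
    rank = math.ceil(percentile / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def average_latency(latencies_us):
    """Arithmetic mean of the latency samples."""
    return sum(latencies_us) / len(latencies_us)


# Hypothetical latency samples in microseconds, with one slow outlier.
samples = [900, 1000, 1100, 1000, 950, 1050, 1000, 980, 1020, 9000]

print("average:", average_latency(samples))
print("p50:", percentile_latency(samples, 50))
print("p99:", percentile_latency(samples, 99))
```

With these samples the single outlier raises the average well above the typical request latency, while the 99th percentile reports the outlier itself. This is why a tail percentile is often a more useful target than the average when tuning for worst-case response time.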

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.2.0_ubuntu1604.clients.tar.gz and v1.2.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.

Release 1.2.0 (triton-inference-server/server v1.2.0 on GitHub), corresponding to NGC container 19.05.
