github triton-inference-server/server v1.0.0
Release 1.0.0, corresponding to NGC container 19.03

NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
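
As an illustration of the HTTP endpoint, the minimal sketch below polls a running server for readiness and status from Python using the requests package. The port (8000) and the /api/health/ready and /api/status paths are assumed defaults for this release, not verified here; consult the documentation for the exact HTTP API.

    import requests

    # Assumed defaults for this release: HTTP service on port 8000 exposing
    # /api/health/ready and /api/status. Adjust for your deployment.
    SERVER_URL = "http://localhost:8000"

    # Readiness probe: HTTP 200 indicates the server is ready to serve requests.
    ready = requests.get(SERVER_URL + "/api/health/ready")
    print("server ready:", ready.status_code == 200)

    # Server status: reports the models being managed and their current state.
    status = requests.get(SERVER_URL + "/api/status")
    print(status.text)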

What's New In 1.0.0

  • 1.0.0 is the first GA (non-beta) release of the TensorRT Inference Server. See the README for information on backwards-compatibility guarantees for this and future releases.

  • Added support for stateful models and backends that require that multiple inference requests be routed to the same model instance/batch slot. The new sequence batcher provides scheduling and batching capabilities for this class of models (a toy sketch of the routing idea follows this list).

  • Added GRPC streaming protocol support for inference requests.

  • The HTTP front-end is now asynchronous to enable lower-latency and higher-throughput handling of inference requests.

  • Enhanced perf_client to support stateful models and backends.
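
To make the routing behavior of the new sequence batcher concrete, the toy sketch below pins every request that carries the same correlation ID to one batch slot until the sequence ends. It only illustrates the scheduling idea described above; the class and parameter names are invented for the example, and this is not the server's implementation.

    # Toy illustration of sequence-batcher routing: requests carrying the same
    # correlation ID are pinned to the same batch slot until the sequence ends.
    # This is a sketch of the idea, not the server's implementation.
    class ToySequenceBatcher:
        def __init__(self, num_slots):
            self.free_slots = list(range(num_slots))
            self.slot_for_sequence = {}  # correlation_id -> slot index

        def route(self, correlation_id, start=False, end=False):
            """Return the batch slot that should handle one request of a sequence."""
            if start:
                if not self.free_slots:
                    raise RuntimeError("no free slot; a real scheduler would queue the sequence")
                self.slot_for_sequence[correlation_id] = self.free_slots.pop()
            slot = self.slot_for_sequence[correlation_id]
            if end:
                self.free_slots.append(self.slot_for_sequence.pop(correlation_id))
            return slot

    batcher = ToySequenceBatcher(num_slots=2)
    print(batcher.route(42, start=True))   # sequence 42 claims a slot
    print(batcher.route(7, start=True))    # sequence 7 claims the other slot
    print(batcher.route(42, end=True))     # routed to the same slot as its first request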

Client Libraries and Examples

An Ubuntu 16.04 build of the client libraries and examples is included in this release in the attached v1.0.0.clients.tar.gz. See the documentation section 'Building the Client Libraries and Examples' for more information on using this file.
