NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
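As an illustration, a remote inference request from the Python client library might look like the following minimal sketch. It assumes the server's HTTP endpoint is at localhost:8000 and that a model named "simple" with two 16-element int32 inputs (INPUT0, INPUT1) and two outputs (OUTPUT0, OUTPUT1) is being served; the URL, model name, tensor names, and shapes are placeholders to adjust for your own deployment.

```python
import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

# Connect to the server's HTTP endpoint. The URL and model name below are
# placeholders -- substitute the address and model you have deployed.
ctx = InferContext("localhost:8000", ProtocolType.from_str("http"), "simple")

input0 = np.arange(16, dtype=np.int32)
input1 = np.ones(16, dtype=np.int32)

# Run a single inference (batch size 1) and request the raw output tensors.
result = ctx.run(
    {"INPUT0": (input0,), "INPUT1": (input1,)},
    {"OUTPUT0": InferContext.ResultFormat.RAW,
     "OUTPUT1": InferContext.ResultFormat.RAW},
    1)

# Each entry in the result dict is a list of arrays, one per batch item.
print(result["OUTPUT0"][0])
```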
What's New In 1.8.0
- Shared-memory support is expanded to include CUDA shared memory.
- Improved the efficiency of pinned memory used for ensemble models.
- The perf_client application has been improved with easier-to-use command-line arguments (while maintaining compatibility with existing arguments).
- Support for string tensors has been added to perf_client.
- Documentation contains a new “Optimization” section discussing some common optimization strategies and how to use perf_client to explore these strategies.
Deprecated Features
- The asynchronous inference API has been modified in the C++ and Python client libraries; a sketch of the resulting callback-based Python usage follows this list.
  - In the C++ library:
    - The non-callback version of the `AsyncRun` function was removed.
    - The `GetReadyAsyncRequest` function was removed.
    - The signature of the `GetAsyncRunResults` function was changed to remove the `is_ready` and `wait` arguments.
  - In the Python library:
    - The non-callback version of the `async_run` function was removed.
    - The `get_ready_async_request` function was removed.
    - The signature of the `get_async_run_results` function was changed to remove the `wait` argument.
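To make the change concrete, the following is a minimal sketch of the callback-only Python usage that remains after these removals. The callback signature (user data, inference context, request id) and the async_run argument order are assumptions based on the Python client examples shipped with this release; consult the client library documentation for the authoritative signatures. The server URL and model details are placeholders.

```python
import queue
from functools import partial

import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

# Completed request ids are handed to the callback; push them onto a queue
# so the main thread can collect results.
# NOTE: the (user_tag, infer_ctx, request_id) callback signature is an
# assumption based on the client examples -- verify against the docs.
done = queue.Queue()

def completion_callback(user_tag, infer_ctx, request_id):
    done.put((user_tag, request_id))

ctx = InferContext("localhost:8000", ProtocolType.from_str("http"), "simple")

input0 = np.arange(16, dtype=np.int32)
input1 = np.ones(16, dtype=np.int32)

# Only the callback form of async_run remains; the non-callback version
# was removed in this release.
ctx.async_run(partial(completion_callback, "request-0"),
              {"INPUT0": (input0,), "INPUT1": (input1,)},
              {"OUTPUT0": InferContext.ResultFormat.RAW},
              1)

# Wait for the callback to fire, then fetch the results. Note that
# get_async_run_results no longer takes a 'wait' argument.
tag, request_id = done.get()
result = ctx.get_async_run_results(request_id)
print(tag, result["OUTPUT0"][0])
```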
Known Issues
- The beta of the custom backend API version 2 has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
  - The signature of the `CustomGetNextInputV2Fn_t` function adds the `memory_type_id` argument.
  - The signature of the `CustomGetOutputV2Fn_t` function adds the `memory_type_id` argument.
- The beta of the inference server library API has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
  - The signature and operation of the `TRTSERVER_ResponseAllocatorAllocFn_t` function have changed. See `src/core/trtserver.h` for a description of the new behavior.
  - The signature of the `TRTSERVER_InferenceRequestProviderSetInputData` function adds the `memory_type_id` argument.
  - The signature of the `TRTSERVER_InferenceResponseOutputData` function adds the `memory_type_id` argument.
- TensorRT reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.8.0_ubuntu1604.clients.tar.gz and v1.8.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.
Custom Backend SDK
Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.8.0_ubuntu1604.custombackend.tar.gz and v1.8.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.