NVIDIA TensorRT Inference Server
The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.
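As an illustration, a remote inference request from the Python client library might look like the following minimal sketch. It assumes the server's HTTP endpoint is at localhost:8000 and that a model named "simple" with two 16-element int32 inputs (INPUT0, INPUT1) and two outputs (OUTPUT0, OUTPUT1) is being served; the URL, model name, tensor names, and shapes are placeholders to adjust for your own deployment.

```python
import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

# Connect to the server's HTTP endpoint. The URL and model name below are
# placeholders -- substitute the address and model you have deployed.
ctx = InferContext("localhost:8000", ProtocolType.from_str("http"), "simple")

input0 = np.arange(16, dtype=np.int32)
input1 = np.ones(16, dtype=np.int32)

# Run a single inference (batch size 1) and request the raw output tensors.
result = ctx.run(
    {"INPUT0": (input0,), "INPUT1": (input1,)},
    {"OUTPUT0": InferContext.ResultFormat.RAW,
     "OUTPUT1": InferContext.ResultFormat.RAW},
    1)

# Each entry in the result dict is a list of arrays, one per batch item.
print(result["OUTPUT0"][0])
```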
What's New In 1.8.0
- Shared-memory support is expanded to include CUDA shared memory.
- Improved the efficiency of pinned memory used for ensemble models.
- The perf_client application has been improved with easier-to-use command-line arguments (while maintaining compatibility with existing arguments).
- Support for string tensors has been added to perf_client.
- Documentation contains a new “Optimization” section discussing some common optimization strategies and how to use perf_client to explore these strategies.
Deprecated Features
- The asynchronous inference API has been modified in the C++ and Python client libraries; a sketch of the resulting callback-based Python usage follows this list.
  - In the C++ library:
    - The non-callback version of the `AsyncRun` function was removed.
    - The `GetReadyAsyncRequest` function was removed.
    - The signature of the `GetAsyncRunResults` function was changed to remove the `is_ready` and `wait` arguments.
  - In the Python library:
    - The non-callback version of the `async_run` function was removed.
    - The `get_ready_async_request` function was removed.
    - The signature of the `get_async_run_results` function was changed to remove the `wait` argument.
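To make the change concrete, the following is a minimal sketch of the callback-only Python usage that remains after these removals. The callback signature (user data, inference context, request id) and the async_run argument order are assumptions based on the Python client examples shipped with this release; consult the client library documentation for the authoritative signatures. The server URL and model details are placeholders.

```python
import queue
from functools import partial

import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

# Completed request ids are handed to the callback; push them onto a queue
# so the main thread can collect results.
# NOTE: the (user_tag, infer_ctx, request_id) callback signature is an
# assumption based on the client examples -- verify against the docs.
done = queue.Queue()

def completion_callback(user_tag, infer_ctx, request_id):
    done.put((user_tag, request_id))

ctx = InferContext("localhost:8000", ProtocolType.from_str("http"), "simple")

input0 = np.arange(16, dtype=np.int32)
input1 = np.ones(16, dtype=np.int32)

# Only the callback form of async_run remains; the non-callback version
# was removed in this release.
ctx.async_run(partial(completion_callback, "request-0"),
              {"INPUT0": (input0,), "INPUT1": (input1,)},
              {"OUTPUT0": InferContext.ResultFormat.RAW},
              1)

# Wait for the callback to fire, then fetch the results. Note that
# get_async_run_results no longer takes a 'wait' argument.
tag, request_id = done.get()
result = ctx.get_async_run_results(request_id)
print(tag, result["OUTPUT0"][0])
```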
Known Issues
- The beta of the custom backend API version 2 has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
  - The signature of the `CustomGetNextInputV2Fn_t` function adds the `memory_type_id` argument.
  - The signature of the `CustomGetOutputV2Fn_t` function adds the `memory_type_id` argument.
- The beta of the inference server library API has non-backwards compatible changes to enable complete support for input and output tensors in both CPU and GPU memory:
  - The signature and operation of the `TRTSERVER_ResponseAllocatorAllocFn_t` function have changed. See `src/core/trtserver.h` for a description of the new behavior.
  - The signature of the `TRTSERVER_InferenceRequestProviderSetInputData` function adds the `memory_type_id` argument.
  - The signature of the `TRTSERVER_InferenceResponseOutputData` function adds the `memory_type_id` argument.
- TensorRT reformat-free I/O is not supported.
- Some versions of Google Kubernetes Engine (GKE) contain a regression in the handling of LD_LIBRARY_PATH that prevents the inference server container from running correctly (see issue 141255952). Use a GKE 1.13 or earlier version or a GKE 1.14.6 or later version to avoid this issue.
Client Libraries and Examples
Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.8.0_ubuntu1604.clients.tar.gz and v1.8.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files. The client SDK is also available as an NGC container.
Custom Backend SDK
Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.8.0_ubuntu1604.custombackend.tar.gz and v1.8.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.