triton-inference-server/server v1.6.0
Release 1.6.0, corresponding to NGC container 19.09


NVIDIA TensorRT Inference Server

The NVIDIA TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server.

What's New In 1.6.0

  • Added TensorRT 6 support, which includes support for TensorRT dynamic
    shapes.

  • Shared memory support has been added as an alpha feature in this release.
    It allows input and output tensors to be communicated via shared memory
    instead of over the network. Currently only system (CPU) shared memory is
    supported (see the first sketch after this list).

  • Amazon S3 is now supported as a remote file system for model repositories.
    Use the s3:// prefix on model repository paths to reference S3 locations.

  • The inference server library API is available as a beta in this release.
    The library API allows you to link against libtrtserver.so so that you can
    include all of the inference server functionality directly in your
    application (see the second sketch after this list).

  • GRPC endpoint performance improvement. The inference server’s GRPC endpoint
    now uses significantly less memory while delivering higher performance.

  • The ensemble scheduler is now more flexible in allowing batching and
    non-batching models to be composed together in an ensemble.

  • The ensemble scheduler will now keep tensors in GPU memory between models
    when possible. Doing so significantly increases performance of some ensembles
    by avoiding copies to and from system memory.

  • The performance client, perf_client, now supports models with variable-sized
    input tensors.
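
The following is a rough, self-contained illustration of the system shared
memory flow: it creates a named POSIX shared-memory region and copies an
input tensor into it. The region name and tensor size are hypothetical, and
the registration and inference calls (which go through the client library's
shared memory API) are only indicated in comments; this is a sketch of the
idea, not the client API itself.

    // Minimal sketch (not the client API): place an input tensor in a named
    // system shared-memory region so that it can be passed to the server
    // without a network copy. Region name and size are illustrative.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstring>
    #include <iostream>
    #include <vector>

    int main() {
      const char* region_name = "/trtserver_input";  // hypothetical name
      std::vector<float> input(16, 1.0f);             // example input tensor
      const size_t byte_size = input.size() * sizeof(float);

      // Create and size the shared-memory region.
      int fd = shm_open(region_name, O_CREAT | O_RDWR, 0666);
      if (fd == -1) { std::perror("shm_open"); return 1; }
      if (ftruncate(fd, byte_size) == -1) { std::perror("ftruncate"); return 1; }

      // Map the region and copy the tensor data into it.
      void* base =
          mmap(nullptr, byte_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (base == MAP_FAILED) { std::perror("mmap"); return 1; }
      std::memcpy(base, input.data(), byte_size);

      // At this point the region would be registered with the inference
      // server through the client library's shared-memory API, and the
      // inference request would reference the registered region instead of
      // sending the tensor bytes over HTTP/GRPC. Those calls are omitted.
      std::cout << "wrote " << byte_size << " bytes to " << region_name << "\n";

      munmap(base, byte_size);
      close(fd);
      shm_unlink(region_name);  // clean up for this self-contained example
      return 0;
    }

On Linux this compiles with a C++ compiler and, on some systems, -lrt for the
POSIX shared-memory functions.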

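For the library API beta, the sketch below shows the general shape of creating
an in-process server through the C API in trtserver.h. The TRTSERVER_* names
and signatures, and the use of a local /models repository path, are
assumptions for illustration; check the trtserver.h header shipped with this
release for the exact API.

    // Minimal sketch, assuming a beta C API along the lines of trtserver.h
    // (TRTSERVER_* names and signatures are approximations; verify against
    // the shipped header).
    #include <iostream>
    #include "trtserver.h"

    int main() {
      TRTSERVER_ServerOptions* options = nullptr;
      TRTSERVER_Error* err = TRTSERVER_ServerOptionsNew(&options);
      if (err == nullptr) {
        // A local path is shown; an s3:// path may also be usable here given
        // the new S3 model-repository support (assumption, not verified).
        err = TRTSERVER_ServerOptionsSetModelRepositoryPath(options, "/models");
      }

      TRTSERVER_Server* server = nullptr;
      if (err == nullptr) {
        err = TRTSERVER_ServerNew(&server, options);
      }

      if (err != nullptr) {
        std::cerr << "failed to start server: " << TRTSERVER_ErrorMessage(err)
                  << "\n";
        TRTSERVER_ErrorDelete(err);
        return 1;
      }

      // Inference requests would be issued in-process through the library API
      // here, rather than through the HTTP or GRPC endpoints.

      TRTSERVER_ServerDelete(server);
      TRTSERVER_ServerOptionsDelete(options);
      return 0;
    }

A program like this would be compiled against trtserver.h and linked with
libtrtserver.so.
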
Known Issues

  • The ONNX Runtime backend could not be updated to the 0.5.0 release due to
    multiple performance and correctness issues with that release.

  • In TensorRT 6:

    • Reformat-free I/O is not supported.
    • Only models that have a single optimization profile are currently
      supported.

  • Google Kubernetes Engine (GKE) version 1.14 contains a regression in the
    handling of LD_LIBRARY_PATH that prevents the inference server container
    from running correctly (see issue 141255952). Use a GKE 1.13 or earlier
    version to avoid this issue.

Client Libraries and Examples

Ubuntu 16.04 and Ubuntu 18.04 builds of the client libraries and examples are included in this release in the attached v1.6.0_ubuntu1604.clients.tar.gz and v1.6.0_ubuntu1804.clients.tar.gz files. See the documentation section 'Building the Client Libraries and Examples' for more information on using these files.

Custom Backend SDK

Ubuntu 16.04 and Ubuntu 18.04 builds of the custom backend SDK are included in this release in the attached v1.6.0_ubuntu1604.custombackend.tar.gz and v1.6.0_ubuntu1804.custombackend.tar.gz files. See the documentation section 'Building a Custom Backend' for more information on using these files.
