Triton Inference Server
The Triton Inference Server provides a cloud inferencing solution optimized for both CPUs and GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton Server is also available as a shared library with an API that allows the full functionality of the server to be included directly in an application.
The client libraries and examples are available in this release exclusively via the Ubuntu 24.04–based NGC Container. The SDK container includes the client libraries and examples, Performance Analyzer, and Model Analyzer. See Getting the Client Libraries for more information.
The Triton TensorRT-LLM container image and base layers are updated for this release. Please refer to the support matrix and compatibility.md in the TensorRT-LLM backend repository for all dependency versions.
New Features and Improvements
ml_dtypes.bfloat16 arrays for BF16 input/output tensors and no longer casts to/from np.float32. Scripts that passed BF16 data as np.float32 arrays must switch to ml_dtypes.bfloat16.
S3 after an idle-connection timeout.
SageMaker endpoint.
/opt/tritonserver/ to stay root-owned, with the triton-server user accessing it via group/other permissions.
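For the BF16 item above, a script that used to hand BF16 data over as np.float32 would need to convert its arrays first. The following is a minimal sketch of that conversion, assuming the numpy and ml_dtypes packages are installed; the helper name to_bf16 and the sample values are hypothetical and not part of any Triton API.

```python
import ml_dtypes
import numpy as np


def to_bf16(array):
    """Convert a float array to ml_dtypes.bfloat16.

    The conversion is lossy in general: bfloat16 keeps the float32
    exponent range but far fewer significand bits.
    """
    return np.asarray(array).astype(ml_dtypes.bfloat16)


# Data as an older script might have prepared it (hypothetical values).
legacy = np.array([1.0, 2.5, -3.0], dtype=np.float32)

# Convert once, then pass the bfloat16 array wherever BF16 data is expected.
bf16 = to_bf16(legacy)
print(bf16.dtype, bf16.itemsize)
```

The sample values are exactly representable in bfloat16, so they survive the round trip unchanged. Arbitrary float32 data will generally be rounded.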
Known Issues
GF(2^m) with untrusted params) for many Linux binaries, in order to avoid exposure to known issues in OpenSSL 1.1.1.
tritonserver --allow-gpu-metrics false ....
is_non_linear_format_io:true for reformat-free tensors is not provided in the model configuration, the model may not load successfully.
ResponseSender goes out of scope or is properly cleaned up before unloading the model to guarantee that the unloading process executes correctly.
initialize step: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads. For the default model control mode, vLLM-related subprocesses are not killed after server shutdown. Related vLLM issue: vllm-project/vllm#6766. Please specify "distributed_executor_backend":"ray" in the model.json when deploying vLLM models with tensor parallelism > 1.
"config" : "<JSON>" instead of a custom configuration file in the following format: "file:configs/<model-config-name>.pbtxt" : "<base64-encoded-file-content>".
--disable-auto-complete-config.
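The "config" : "<JSON>" form mentioned in the known issues above can be illustrated with a short sketch that only builds the JSON body of a model load request, using the Python standard library. The helper name build_load_request, the model name, and the configuration values are hypothetical. The "file:1/model.onnx" key is shown as an assumed example of a file override with base64-encoded content, and no request is actually sent.

```python
import base64
import json


def build_load_request(model_config, files=None):
    """Build the JSON body for a model load request with a config override."""
    # The model configuration is passed as a JSON string under "config",
    # rather than as a "file:configs/<name>.pbtxt" entry.
    parameters = {"config": json.dumps(model_config)}
    # Optional file overrides (assumed here): raw bytes are base64-encoded
    # and stored under a "file:<path>" key.
    for path, content in (files or {}).items():
        parameters[f"file:{path}"] = base64.b64encode(content).decode("ascii")
    return {"parameters": parameters}


# Hypothetical model configuration, for illustration only.
config = {"name": "my_model", "backend": "onnxruntime", "max_batch_size": 8}
body = build_load_request(config, files={"1/model.onnx": b"fake-bytes"})
print(json.dumps(body, indent=2))
```

A body like this would typically be sent with a POST to the server's model load endpoint. The exact URL and client call depend on the deployment and are left out of the sketch.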
Client Libraries and Examples
Triton TRT-LLM Container Support Matrix
Dependency | Version
TensorRT-LLM | 1.2.1
TensorRT | See compatibility.md for the TensorRT version pinned to the 26.06 TRT-LLM container