New Features:
- Support added for running multiple models with the same engine when using the Elastic Scheduler.
- When using the Elastic Scheduler, the caller can now use the `num_streams` argument to tune the number of requests that are processed in parallel.
- Pipeline and annotation support added and generalized for transformers, yolov5, and torchvision.
- Documentation additions made for transformers, yolov5, torchvision, and serving that focus on model deployment for the given integrations.
- AWS SageMaker example created.
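
As a rough illustration of what the `num_streams` argument tunes, the sketch below caps how many requests are processed at once using a standard-library thread pool. It is a conceptual example only and does not reflect how the Elastic Scheduler is implemented; `run_requests` and `handle` are hypothetical names, not DeepSparse APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def run_requests(requests, handle, num_streams):
    """Process requests with at most num_streams running concurrently.

    Hypothetical helper for illustration; not part of DeepSparse.
    """
    if num_streams < 1:
        raise ValueError("num_streams must be at least 1")
    # The pool size is the upper bound on requests in flight at any moment.
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        # map() returns results in the same order as the input requests.
        return list(pool.map(handle, requests))
```

A larger value generally favors throughput under many concurrent requests, while a smaller value leaves more resources to each request; the right setting depends on the workload.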
Changes:
- Click added as a root dependency; it is now the preferred route for CLI invocation and argument management.
Performance:
- Inference performance has been improved for unstructured sparse quantized models on AVX2 and AVX-512 systems that do not support VNNI instructions. This includes up to 20% on BERT and 45% on ResNet-50.
Resolved Issues:
- Potential crashes no longer occur when a layer operates on a dataset larger than 2GB.
- Assertion error addressed for Reduce operations where the reduction axis is of length 1.
- Rare assertion failure addressed related to Tensor Columns.
- When running the DeepSparse Engine on a system with a non-uniform system topology, model compilation now properly terminates.
Known Issues:
- In rare cases, the engine may crash with an assertion failure during model compilation for a convolution with a 1x1 kernel with 2x2 convolution strides; hotfix forthcoming.
- The engine will crash with an assertion failure when setting the `num_streams` parameter to fewer than the number of NUMA nodes; hotfix forthcoming.
- In rare cases, the engine may enter an infinite loop when an operation has multiple inputs coming from the same source; hotfix forthcoming.
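
Until the hotfix is available, one possible way to avoid the `num_streams` crash listed above is to compare the requested value with the NUMA node count before creating the engine. The following is a minimal sketch, assuming a Linux system that exposes NUMA nodes as `nodeN` directories under `/sys/devices/system/node`; `count_numa_nodes` and `clamp_num_streams` are hypothetical helpers and not part of DeepSparse.

```python
import glob
import os
import re


def count_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Count NUMA nodes via Linux sysfs; assume 1 if none can be found."""
    nodes = [
        path
        for path in glob.glob(os.path.join(sysfs_root, "node*"))
        if re.fullmatch(r"node\d+", os.path.basename(path))
    ]
    return max(len(nodes), 1)


def clamp_num_streams(requested, numa_nodes):
    """Raise the requested stream count to at least the NUMA node count."""
    return max(requested, numa_nodes)


if __name__ == "__main__":
    # The clamped value would then be passed as the num_streams argument.
    print(clamp_num_streams(1, count_numa_nodes()))
```

On platforms without that sysfs layout the node count falls back to 1, so the requested value passes through unchanged.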