neuralmagic/deepsparse v0.2.0 on GitHub

New Features:

Dense convolutions on AVX2 systems were optimized, improving performance for many non-pruned networks. In particular, this results in a speed improvement for batch size 64 ResNet-50 of up to 28% on Intel AVX2 systems and up to 39% on AMD AVX2 systems.
Operations that shuffle activations in the engine were optimized, resulting in up to a 14% speed improvement for batch size 64 pruned quantized MobileNetV1.
Performance was improved for networks with large output arrays.

Known Issues:

In rare cases where a tensor used as the input or output of an operation is larger than 2GB, the engine can segfault. As a workaround, decrease the batch size.
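As a rough aid for choosing a smaller batch size, the arithmetic can be sketched as follows. This is a minimal sketch, not part of DeepSparse: the helper names are hypothetical, "2GB" is assumed to mean 2 GiB (2 * 1024**3 bytes), and elements are assumed to be 4-byte floats unless stated otherwise. It checks only the one tensor shape you pass in, so intermediate activations inside the network would need to be checked separately.

```python
import math

# Assumption: the "2GB" threshold is interpreted as 2 GiB.
TENSOR_LIMIT_BYTES = 2 * 1024**3


def tensor_bytes(batch_size, per_sample_shape, bytes_per_element=4):
    """Size in bytes of a tensor with shape (batch_size, *per_sample_shape)."""
    return batch_size * math.prod(per_sample_shape) * bytes_per_element


def largest_batch_within_limit(per_sample_shape, bytes_per_element=4,
                               limit=TENSOR_LIMIT_BYTES):
    """Largest batch size whose tensor does not exceed `limit` bytes."""
    per_sample = math.prod(per_sample_shape) * bytes_per_element
    return limit // per_sample


if __name__ == "__main__":
    # Hypothetical example: a float32 activation of shape (1024, 1024) per sample.
    shape = (1024, 1024)
    print(tensor_bytes(64, shape), largest_batch_within_limit(shape))
```

If the batch size in use is above the value returned for the largest tensor in the model, lowering it to that value or below is the workaround described above.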
In some cases, models that run complicated pre- or post-processing steps can reduce DeepSparse Engine performance by up to a factor of 10 because of hyperthreading, since two engine threads can end up running on the same physical core. To address this, try the following solutions in order of preference:
1. Enable thread binding.
If that does not improve performance, or you want to try additional options:
1. Use the numactl utility to prevent the process from running on hyperthreads.
2. Manually set the thread affinity in Python as follows:
```
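import os
from deepsparse.cpu import cpu_architecture

# Query the detected CPU (vendor string and physical core count).
ARCH = cpu_architecture()

# os.sched_setaffinity exists only on some Unix platforms such as Linux;
# a pid of 0 means the current process.
if ARCH.vendor == "GenuineIntel":
    # Restrict to logical CPUs 0..N-1 (assumes these map to distinct physical cores).
    os.sched_setaffinity(0, range(ARCH.num_physical_cores()))
elif ARCH.vendor == "AuthenticAMD":
    # Restrict to even-numbered logical CPUs (assumes hyperthread siblings are adjacent).
    os.sched_setaffinity(0, range(0, 2*ARCH.num_physical_cores(), 2))
else:
    raise RuntimeError(f"Unknown CPU vendor {ARCH.vendor}")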
```
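The snippet above relies on a vendor-specific numbering of logical CPUs. On Linux, a vendor-independent alternative is to read the hyperthread sibling groups from sysfs and keep one logical CPU per group. The following is a minimal sketch under that assumption: the helper names are hypothetical, it is not part of DeepSparse, and it assumes the standard `thread_siblings_list` files are present.

```python
import glob
import os
import re


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as "0,16" or "0-1,8-9" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists):
    """Given {cpu: sibling-list text}, keep the lowest-numbered CPU of each sibling group."""
    chosen = set()
    for text in sibling_lists.values():
        group = parse_cpu_list(text)
        if group:
            chosen.add(group[0])
    return chosen


def read_sibling_lists():
    """Read thread_siblings_list for every CPU that sysfs exposes (Linux only)."""
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    result = {}
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as handle:
            result[cpu] = handle.read()
    return result


def pin_to_physical_cores():
    """Restrict the current process to one logical CPU per physical core, if possible."""
    target = one_cpu_per_core(read_sibling_lists())
    if hasattr(os, "sched_setaffinity"):
        target &= os.sched_getaffinity(0)  # stay within the CPUs we are allowed to use
        if target:
            os.sched_setaffinity(0, target)
    return sorted(target)
```

Calling `pin_to_physical_cores()` before creating the engine would play the same role as the vendor-specific snippet, and `os.sched_getaffinity(0)` can be used afterwards to confirm which CPUs the process is restricted to.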