NVIDIA/DALI v0.25.0 on GitHub

Key Features and Enhancements

This DALI release includes the following key features and enhancements.

Added support for aarch64 Server Base System Architecture (#2110) - we provide a build for CUDA 11 that can be installed following the installation guide.
New operators:
- Normal Distribution GPU Operator (#2125)
- Video reader resize (#2097)
Improvements to ExternalSource Op:
- Added the no_copy option, which allows DALI to borrow a user's memory instead of copying it (#2024).
- Removed the redundant copy in the ExternalSource operator (#2124)
Reworked the Resize operator family, including video, channel-first, RoI, and multiple-type support (#2164) with the new Resize tutorial (#2189).
Bundled all python versions into one wheel (#2096).
- One DALI wheel can be used with all supported Python versions, including 3.5, 3.6, 3.7 and 3.8.
Improved error messages and added information about the Operator of origin (#2065).
Extended the following C APIs to copy output and input samples:
- daliOutputCopy (#2145) and daliOutputCopySamples (#2161, #2186).
- These APIs let you use the copy kernel and reduce the amount of copied memory. The copy kernel can also be used in ShareUserData (#2200).
Performance improvements:
- Arithmetic Ops GPU (#2137)
- Priorities in CPU thread pool allowing for better load balancing with uneven samples (#2092, #2102)
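
To illustrate why priorities in a thread pool help with uneven samples, here is a minimal Python simulation. It is not DALI's implementation: the function names are hypothetical, and it assumes that a task's priority equals its cost, so the most expensive samples are started first.

```python
import heapq


def makespan(costs, num_workers):
    """Simulate workers that pick up tasks in the given order as soon as
    they become free, and return the time at which the last one finishes."""
    workers = [0] * num_workers  # time at which each worker becomes free
    heapq.heapify(workers)
    for cost in costs:
        free_at = heapq.heappop(workers)  # earliest available worker
        heapq.heappush(workers, free_at + cost)
    return max(workers)


def schedule_by_priority(costs):
    """Assumption: priority == cost, so the largest samples go first."""
    return sorted(costs, reverse=True)


# One sample is much more expensive than the others.
sample_costs = [1, 1, 1, 1, 1, 1, 8]

fifo = makespan(sample_costs, num_workers=2)
prioritized = makespan(schedule_by_priority(sample_costs), num_workers=2)
print(fifo, prioritized)  # prints: 11 8
```

In submission order the large sample starts last and one worker sits idle while it finishes. Starting it first lets the small samples fill the other worker in the meantime.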

Bug fixes

Fix aarch64 builds that are still gcc 5.x based (#2099)
Fix conda build after the new build of libprotobuf was released (#2101)
Fix the lack of setting the right device in the ExternalSource (#2112)
Fix lack of a proper include to set CUDART_VERSION inside nvml.h and nvml_wrap.h (#2113)
Fix layout propagation in Gaussian Blur (#2118)
Fix layout propagation in Erase (#2133)
Fix TF dataset notebook (#2135)
Fix lack of MXNet plugin docs generation (#2146)
Fix TL3_RN50_convergence test for PaddlePaddle (#2159)
Work around a compiler bug that converts an instance call to a static call (#2162)
Fix the need to have a numpy installed when test_utils.py is just imported (#2166)
Fix missing layouts in operators (#2136)
Fix QNX build (#2199)

Improvements

Update to CUDA 11 GA toolkit (#2094)
Allow nvJPEG to pre-allocate pinned and device buffers during construction (#2081)
Add zero-copy to the ExternalSource operator (#2024)
Introduce priorities in ThreadPool (#2092)
Video reader resize (#2097)
Detect version of CUDA based on libcudart.so.* name (#2105)
Add Operator origin information to most errors (#2065)
Enhance Pad documentation (#2098)
Bundle all python versions into one wheel (#2096)
Use new nvmlDeviceGetCpuAffinityWithinScope API for thread binding (#2093)
Use new ThreadPool API to post work with priority (#2102)
TensorListView generalized reshape and reinterpret (#2108)
Update aarch64_linux build to Jetpack 4.4 and CUDA 10.2 (#2107)
Re-enable VP9 video tests after driver update (#2117)
Remove usage of future from DALI (#2119)
Remove redundant copy in ExternalSource operator (#2124)
Add more verbose info when HwDecoderUtilizationTest is skipped (#2106)
Per-stream/per-device object pool. (#2127)
Fix PaddlePaddle test broken by rarfile update not compatible with Python 3.5 (#2130)
Add missing and a partial check in linter for this include file. (#2131)
Add libprotobuf-static as DALI conda build dependency (#2132)
Auto apply dataset options (#1963)
Add an option to use a copy kernel to feed external input (#2122)
Adjust mel filter test to librosa change (#2144)
Add dependency to dali_kernels to dali lib (#2143)
Tune Arithmetic Op launch specification (#2137)
Add daliOutputCopy (#2145)
Reduce memory usage in VideoReaderResize test (#2149)
Normal Distribution GPU Operator (#2125)
Remove pinning of numba version as librosa 0.8.0 has been released (#2151)
Add an ability to suppress _iterator_deprecation_warning (#2154)
Span-of-arrays flattening + minor layout utils (#2156)
Remove deprecated use of ltrb in BboxRandomCrop (#2141)
Improve PyTorch and MXNet ExternalSource examples (#2147)
Enable DALI build and tests for SBSA (#2110)
Add --disable-mmap flag to RN50 data pipeline test (#2163)
Make TF dataset build for 2.3.0 (#2160)
Enforce recordio indices are not empty (#2157)
Add daliOutputCopySamples (#2161)
Use TIFFGetFieldDefaulted and remove warning about falling back to GenericImage decoder (#2153)
Add information about the faulty image to CreateImage invocation in nvjpeg_decoder_decoupled_api.h (#2174)
Add proper error handling where there are no valid sequences in the VideoReader (#2180)
Update instructions on how to run the ResNet50 example for PyTorch (#2170)
Add the possibility to skip individual samples when using daliOutputCopySamples (#2186)
Change DALI build command to use minor CUDA version as well (#2155)
Reworked Resize operator family - video, channel-first, RoI and multiple type support (#2164)
Move to Update 1 release of CUDA 11 toolkit (#2188)
Make the test deterministically pick video files. (#2190)
Resize tutorial (#2189)
Use copy kernel when making a contiguous batch during ShareUserData, if user requested it (#2200)

Breaking API changes

Remove deprecated use of ltrb in BboxRandomCrop (#2141)

Deprecated feature

Known issues:

The video loader operator requires that the key frames occur at a minimum every 10 to 15 frames of the video stream. If the key frames occur at a lesser frequency, then the returned frames may be out of sync.
The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
To use DALI with the TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that is used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
Due to known issues with Meltdown/Spectre mitigations, DALI performs best when run in Docker with escalated privileges, for example:
- privileged=yes in Extra Settings for AWS data points
- --privileged or --security-opt seccomp=unconfined for bare Docker
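
For the key-frame requirement of the video loader listed above, a small check like the following can flag streams that are likely to cause trouble. This is a sketch under assumptions: the helper names are hypothetical, the key-frame indices must be obtained separately (for example, from a tool that inspects the stream), and the default limit of 10 frames is the conservative end of the 10 to 15 range stated above.

```python
def max_keyframe_gap(keyframe_indices, num_frames):
    """Return the longest distance, in frames, without a new key frame."""
    if not keyframe_indices:
        return num_frames
    points = sorted(keyframe_indices) + [num_frames]
    gaps = [b - a for a, b in zip(points, points[1:])]
    # Frames that precede the first key frame also count as a gap.
    return max([points[0]] + gaps)


def keyframes_ok(keyframe_indices, num_frames, max_gap=10):
    """Assumption: max_gap=10 is the conservative limit; 15 is the loose one."""
    return max_keyframe_gap(keyframe_indices, num_frames) <= max_gap


# Hypothetical 30-frame stream with a key frame every 12 frames.
print(max_keyframe_gap([0, 12, 24], 30))          # prints: 12
print(keyframes_ok([0, 12, 24], 30))              # prints: False
print(keyframes_ok([0, 12, 24], 30, max_gap=15))  # prints: True
```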

Binary builds

Install via pip for CUDA 10
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda100==0.25.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda100==0.25.0

or for CUDA 11:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==0.25.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==0.25.0

Or use direct download links (CUDA 10.0):

Or use direct download links (CUDA 11.0):

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.0.28.tar.gz
