NVIDIA/DALI v1.20.0 on GitHub

Key Features and Enhancements

This DALI release includes the following key features and enhancements:

Added the fn.experimental.remap operator for generic geometric transformation of images and video (#4379, #4419, #4365, #4374, #4425).
Added MPEG4 support to the GPU video decoder (#4424, #4327).
Added the fn.experimental.inflate operator that enables decompression of LZ4 compressed input (#4366).
Added support for broadcasting in arithmetic operators (CPU and GPU) (#4348).
Added experimental split and merge operators for conditional execution (#4359, #4405, #4358).
Made the following optimizations in GPU operators:
- Optimized MelScale kernel (#4395).
- Optimizations in the GPU decoder (#4351).
- Simplified arithmetic GPU operator (#4411).
- Split reduction kernels (#4383).
- Avoiding copy from non-pinned memory in PreemphasisFilter operator (#4380).
- Refactored the ConvertTimeMajorSpectrogram kernel (#4389).
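
To illustrate the broadcasting support in arithmetic operators mentioned above, here is a minimal sketch of how a broadcast output shape can be derived. It assumes NumPy-style rules (shapes are aligned at the trailing dimension, and an extent of 1 stretches to match the other operand); the function name broadcast_shape is hypothetical and is not part of the DALI API. Consult the DALI documentation for the exact rules the operators apply.

```python
from itertools import zip_longest


def broadcast_shape(*shapes):
    """Return the shape obtained by broadcasting `shapes` together.

    Hypothetical helper (not a DALI function), assuming NumPy-style rules.
    """
    result = []
    # Compare extents starting from the trailing (innermost) dimension.
    # Shapes with fewer dimensions are padded with extent 1 on the left.
    for extents in zip_longest(*(reversed(s) for s in shapes), fillvalue=1):
        non_unit = {e for e in extents if e != 1}
        if len(non_unit) > 1:
            raise ValueError(f"shapes {shapes} cannot be broadcast together")
        result.append(non_unit.pop() if non_unit else 1)
    return tuple(reversed(result))


# A per-channel scale of shape (3,) applied to an HWC image of shape (480, 640, 3).
image_shape = (480, 640, 3)
scale_shape = (3,)
output_shape = broadcast_shape(image_shape, scale_shape)
```

Under these assumptions the per-channel operand is stretched along the height and width dimensions, so the result keeps the shape of the image.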

Fixed Issues

The following issues were fixed in this release:

Fixed TensorList copy synchronization issues (#4458, #4453).
Fixed an issue with hint grid size in OpticalFlow (#4443).
Fixed the ES synchronization issues in integrated memory devices (#4321, #4423).
Added a missing CUDA stream synchronization before cuvidUnmapVideoFrame in nvDecoder (#4426).
Fixed the pipeline initialization in Python after deserialization (#4350).
Fixed issues with serialization of functions in recent notebook versions (#4406).
Fixed an integration with new TF version by replacing Status::OK() with Status() in the TF plugin (#4442).

Improvements

Update dependencies 22/11 (#4427)
fn.experimental.remap optimizations (#4419)
Add mkv support (#4424)
Add inflate operator (#4366)
Include nvCOMP's license and notice in the acknowledgements (#4368)
Use numpy instead of naive loops in remap test. (#4425)
MelScale kernel optimization (#4395)
Optimize GPU decoder (#4351)
Simplify arithmetic operator GPU implementation (#4411)
Add CVE reporting guideline to the repo and readme (#4385)
Add internal Split and Merge operators (#4359)
Fix fstring usage for warning in pipeline (#4401)
Add fn.experimental.remap operator (#4379)
Divide expression_impl to avoid recompiling all ops when touching a detail in the impl (#4412)
Refactor ConvertTimeMajorSpectrogram kernel (#4389)
Remove documentation about data_layout argument for paddle and pytorch iterators (#4409)
Serialize failing global functions by value (#4406)
Limit the TF memory usage in test_dali_tf_dataset_shape.py tests (#4400)
Split reduction kernels (#4383)
Add convenient conversions from a list of arrays to DALI TensorList (#4391)
Add permute_in_place function with tests. (#4387)
Split cuda utils.h & fix includes (#4386)
Enable MPEG4 GPU decoding (#4327)
Update CUDA toolkit for Jetson build to 11.8 (#4376)
Remove TensorFlow 1.15 support from CUDA 11 (#4377)
Avoid copying from non-pinned memory in PreemphasisFilter operator (#4380)
Support broadcasting in arithmetic operators (CPU & GPU) (#4348)
Remove unnecessary reset in the PyTorch SSD example (#4373)
Remap kernel implementation with NPP (#4365)
Utils and prerequisites for NppRemapKernel implementation (#4374)
Extend DALIInterpType to_string (#4370)
Validate ROI in imgcodec (#4279)
Workspace unification (#4339)
Extend and relax TensorList sample APIs (#4358)
Remove the Pipeline/Executor completion callback APIs (#4345)

Bug Fixes

Fix H2H copy in HW NVJPEG. (#4458)
Fix an issue with improper hint grid size in OpticalFlow (#4443)
Enable support for full-swing videos (#4447)
Fix TensorList copy ordering issues (#4453)
Replace Status::OK() with Status() for TF plugin (#4442)
Adds a cuda stream synchronization before cuvidUnmapVideoFrame in nvDecoder (#4426)
Fix ES synchronization issues in integrated memory devices (#4321)
Fix debug build warnings in the inflate op (#4433)
Fix ExecutorSyncTest that runs the SimpleExecutor twice (#4432)
Fix setting pinned status of the tensor list in Python (#4431)
Pinned resource test fix: reset the device buffer on a proper stream. (#4428)
Fix libtiff CVEs (#4414)
Fix pinned resource test on integrated GPUs (#4423)
Fix builtin test - do not use operators lib (#4420)
Harden the code against ODR violations (#4421)
Unroll nested namespaces (#4415)
Add proper validation for empty batch in External Source (#4404)
Fix video decoder test for aarch64 (#4402)
Fix to enable leading underscore in op name (#4405)
Serialize failing global functions by value (#4406)
Add cuh files to linter (#4384)
Avoid reading out of bounds (#4398)
Fix namespace resolution for CUDA and STL math functions (#4378)
Fix unnecessary copy of the workspace object. (#4371)
Fix pipeline initialization in python after deserialization (#4350)
Fix misleading video example with timestamps (#4364)
Fix sanitizer build tests (#4367)

Breaking API changes

Removed the Pipeline/Executor completion callback APIs (#4345).
[C++ API] Workspace unification: C++ workspace is no longer templated with backend type (#4339).

Deprecated features

DALI will drop support for CUDA 10.2 in an upcoming release.

Known issues:

The GPU numpy reader might crash during the DALI process teardown with cufile 1.4.0.
The video loader operator requires that the key frames occur, at a minimum, every 10 to 15 frames of the video stream.
If the key frames occur less frequently than that, the returned frames might be out of sync.
Experimental VideoReaderDecoder does not support open GOP.
It will not report an error and might produce invalid frames. VideoReader uses a heuristic approach to detect open GOP and should work in most common cases.
The DALI TensorFlow plugin might not be compatible with TensorFlow versions 1.15.0 and later.
To use DALI with a TensorFlow version that does not have a prebuilt plugin binary shipped with DALI, make sure that the compiler that was used to build TensorFlow exists on the system during the plugin installation. (Depending on the particular version, you can use GCC 4.8.4, GCC 4.8.5, or GCC 5.4.)
In experimental debug and eager modes, the GPU external source is not properly synchronized with DALI internal streams.
As a workaround, you can manually synchronize the device before returning the data from the callback.
Due to some known issues with Meltdown/Spectre mitigations and DALI, DALI shows best performance when running in Docker with escalated privileges, for example:
- privileged=yes in Extra Settings for AWS data points
- --privileged or --security-opt seccomp=unconfined for bare Docker.
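
The workaround for the GPU external source in debug and eager modes (synchronizing the device before returning data from the callback) could be sketched as a small wrapper. This is only an illustration: synchronized, produce_batch, and fake_synchronize are hypothetical names, and the synchronization function is passed in so that the sketch runs without a GPU. In a real pipeline it would be replaced by a device-wide synchronization call from whichever GPU library produces the data, for example cupy.cuda.runtime.deviceSynchronize if the callback uses CuPy (an assumption, since these notes do not name a library).

```python
from functools import wraps


def synchronized(callback, synchronize):
    """Wrap `callback` so `synchronize()` runs before its result is returned.

    Hypothetical helper, not part of DALI.
    """
    @wraps(callback)
    def wrapper(*args, **kwargs):
        data = callback(*args, **kwargs)
        # Block until all pending device work has finished, so the
        # consumer never sees a partially written buffer.
        synchronize()
        return data
    return wrapper


# Stand-ins that record the call order; real code would produce GPU arrays
# and call an actual device synchronization function.
events = []


def produce_batch():
    events.append("produce")
    return [1, 2, 3]


def fake_synchronize():
    events.append("sync")


safe_callback = synchronized(produce_batch, fake_synchronize)
batch = safe_callback()
```

The wrapped callback can then be passed wherever the original callback was used, keeping the synchronization in one place.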

Binary builds

Install via pip for CUDA 10.2:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda102==1.20.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda102==1.20.0

or for CUDA 11:

The CUDA 11.0 build uses CUDA toolkit enhanced compatibility. It is built with the latest CUDA 11.x toolkit
but can run on the latest stable CUDA 11.0 capable drivers (450.80 or later).
Using the latest driver may enable additional functionality.
More details can be found in the enhanced CUDA compatibility guide.

pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-cuda110==1.20.0
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist/ nvidia-dali-tf-plugin-cuda110==1.20.0
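
After installing, one way to confirm which wheel and version ended up in the environment is to query the package metadata. The sketch below uses only the Python standard library; the helper names installed_version and find_dali are hypothetical, and the distribution names are taken from the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in the pip commands in these notes.
DALI_DISTRIBUTIONS = ("nvidia-dali-cuda110", "nvidia-dali-cuda102")


def installed_version(distribution):
    """Return the installed version string of `distribution`, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


def find_dali(distributions=DALI_DISTRIBUTIONS):
    """Return (distribution, version) for the first DALI wheel found, else None."""
    for name in distributions:
        found = installed_version(name)
        if found is not None:
            return name, found
    return None


result = find_dali()
if result is None:
    print("No DALI wheel found in this environment")
else:
    print(f"{result[0]} {result[1]} is installed")
```

If the reported version is not 1.20.0, the environment probably resolved a different release than the one pinned in the commands above.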

Or use direct download links (CUDA 10.2):

Or use direct download links (CUDA 11.0):

FFmpeg source code:

This software uses code of FFmpeg licensed under the LGPLv2.1 and its source can be downloaded here

Libsndfile source code:

https://developer.download.nvidia.com/compute/redist/nvidia-dali/libsndfile-1.1.0.tar.gz
