Highlights
TorchAudio 2.0 release includes:
- Data augmentation operators, e.g. convolution, additive noise, speed perturbation
- WavLM and XLS-R models and pre-trained pipelines
- Backend dispatcher powering revised info, load, save functions
- Dropped support of Python 3.7
- Added Python 3.11 support
[Beta] Data augmentation operators
The release adds several data augmentation operators under torchaudio.functional and torchaudio.transforms:
- torchaudio.functional.add_noise
- torchaudio.functional.convolve
- torchaudio.functional.deemphasis
- torchaudio.functional.fftconvolve
- torchaudio.functional.preemphasis
- torchaudio.functional.speed
- torchaudio.transforms.AddNoise
- torchaudio.transforms.Convolve
- torchaudio.transforms.Deemphasis
- torchaudio.transforms.FFTConvolve
- torchaudio.transforms.Preemphasis
- torchaudio.transforms.Speed
- torchaudio.transforms.SpeedPerturbation
The operators can be used to synthetically diversify training data to improve the generalizability of downstream models.
For usage details, please refer to the documentation for torchaudio.functional and torchaudio.transforms, and the tutorial “Audio Data Augmentation”.
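To make concrete what operators such as additive noise and pre-emphasis compute, here is a minimal pure-Python sketch. It is not TorchAudio's implementation and does not call the library: the functions below are plain-Python stand-ins written for illustration, and the 0.97 coefficient and the 440 Hz test tone are assumptions chosen for the example.

```python
import math
import random


def add_noise(signal, noise, snr_db):
    """Mix `noise` into `signal` so the result has the requested SNR in dB."""
    signal_energy = sum(s * s for s in signal)
    noise_energy = sum(n * n for n in noise)
    # Scale the noise so that 10 * log10(signal_energy / scaled_noise_energy) == snr_db.
    scale = math.sqrt(signal_energy / (noise_energy * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]


def preemphasis(signal, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n - 1]."""
    return [signal[0]] + [
        signal[i] - coeff * signal[i - 1] for i in range(1, len(signal))
    ]


def deemphasis(signal, coeff=0.97):
    """Invert pre-emphasis: y[n] = x[n] + coeff * y[n - 1]."""
    out = []
    previous = 0.0
    for x in signal:
        previous = x + coeff * previous
        out.append(previous)
    return out


# A 0.1 s, 440 Hz tone sampled at 16 kHz, plus Gaussian noise (illustrative values).
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in clean]

noisy = add_noise(clean, noise, snr_db=10.0)
restored = deemphasis(preemphasis(clean))  # round trip recovers the input
```

Because the noise is rescaled from the measured energies, the same noise clip can be reused at several SNR values to produce multiple training variants of one utterance.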
[Beta] WavLM and XLS-R models and pre-trained pipelines
The release adds two self-supervised learning models for speech and audio.
Besides the model architectures, torchaudio also supports corresponding pre-trained pipelines:
- torchaudio.pipelines.WAVLM_BASE
- torchaudio.pipelines.WAVLM_BASE_PLUS
- torchaudio.pipelines.WAVLM_LARGE
- torchaudio.pipelines.WAV2VEC_XLSR_300M
- torchaudio.pipelines.WAV2VEC_XLSR_1B
- torchaudio.pipelines.WAV2VEC_XLSR_2B
For usage details, please refer to the factory function and pre-trained pipelines documentation.
Backend dispatcher
Release 2.0 introduces new versions of the I/O functions torchaudio.info, torchaudio.load and torchaudio.save, backed by a dispatcher that selects one of the FFmpeg, SoX, and SoundFile backends, subject to library availability. Users can enable the new logic in Release 2.0 by setting the environment variable TORCHAUDIO_USE_BACKEND_DISPATCHER=1; the new logic will be enabled by default in Release 2.1.
# Fetch metadata using FFmpeg
metadata = torchaudio.info("test.wav", backend="ffmpeg")
# Load audio (with no backend parameter value provided, function prioritizes using FFmpeg if it is available)
waveform, rate = torchaudio.load("test.wav")
# Write audio using SoX
torchaudio.save("out.wav", waveform, rate, backend="sox")
Please see the documentation for torchaudio for more details.
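The selection logic a dispatcher of this kind performs can be sketched in a few lines. This is a simplified illustration, not TorchAudio's code: select_backend is a hypothetical helper, and while preferring FFmpeg follows the notes above, the relative order of SoX and SoundFile is an assumption.

```python
# Preference order used when no backend is requested. FFmpeg first follows the
# release notes; the relative order of SoX and SoundFile is an assumption.
PRIORITY = ("ffmpeg", "sox", "soundfile")


def select_backend(available, requested=None):
    """Return the backend name to use, given the set of usable backends."""
    if requested is not None:
        # An explicit request is honored only if that library is usable.
        if requested not in available:
            raise RuntimeError(f"Backend {requested!r} is not available")
        return requested
    # Otherwise fall back to the first usable backend in priority order.
    for name in PRIORITY:
        if name in available:
            return name
    raise RuntimeError("No audio backend is available")


default_choice = select_backend({"ffmpeg", "sox", "soundfile"})
without_ffmpeg = select_backend({"sox", "soundfile"})
explicit_choice = select_backend({"ffmpeg", "sox"}, requested="sox")
```

Raising on an unavailable explicit request, rather than silently falling back, is a design choice made for this sketch so that a misconfigured environment is noticed early.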
Backward-incompatible changes
- Dropped Python 3.7 support (#3020)
  Following upstream PyTorch (pytorch/pytorch#93155), support for Python 3.7 has been dropped.
- Default to "precise" seek in torchaudio.io.StreamReader.seek (#2737, #2841, #2915, #2916, #2970)
  Previously, the StreamReader.seek method sought to the key frame closest to the given timestamp. A new option, mode, has been added, which can switch the behavior to seeking to the frame of any type, including non-key frames, that is closest to the given timestamp. This behavior is now the default.
- Removed deprecated/unused/undocumented functions from datasets.utils (#2926, #2927)
  The following functions are removed from datasets.utils:
  - stream_url
  - download_url
  - validate_file
  - extract_archive
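The difference between key-frame seeking and precise seeking described above can be illustrated with a toy model. This sketch follows the wording of the note ("closest to the given timestamp") and is not how StreamReader or FFmpeg implement seeking; the seek helper, the frame representation, and the "key" mode name are assumptions made for the example.

```python
def seek(frames, timestamp, mode="precise"):
    """Return the index of the frame a seek would land on.

    `frames` is a list of (pts_seconds, is_key_frame) tuples sorted by pts.
    """
    if mode == "key":
        # Only key frames are valid landing points.
        candidates = [i for i, (_, is_key) in enumerate(frames) if is_key]
    elif mode == "precise":
        # Any frame, key or not, is a valid landing point.
        candidates = list(range(len(frames)))
    else:
        raise ValueError(f"Unknown mode: {mode!r}")
    return min(candidates, key=lambda i: abs(frames[i][0] - timestamp))


# A toy stream at 10 fps with a key frame every 5 frames.
frames = [(i / 10, i % 5 == 0) for i in range(20)]
key_index = seek(frames, 0.72, mode="key")
precise_index = seek(frames, 0.72, mode="precise")
```

With key frames half a second apart, the key-frame seek lands at 0.5 s while the precise seek lands at 0.7 s, which is why code that relied on the old default may observe different start positions after upgrading.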
Deprecations
Ops
- Deprecated 'onesided' init param for MelSpectrogram (#2797, #2799)
  torchaudio.transforms.MelSpectrogram assumes the onesided argument to be always True. The forward path fails if its value is False. Therefore this argument is deprecated. Users specifying this argument should stop specifying it.
- Deprecated "sinc_interpolation" and "kaiser_window" option values in favor of "sinc_interp_hann" and "sinc_interp_kaiser" (#2922)
  The valid values of the resampling_method argument of the resampling operations (torchaudio.transforms.Resample and torchaudio.functional.resample) have changed. "kaiser_window" is now "sinc_interp_kaiser" and "sinc_interpolation" is now "sinc_interp_hann". The old values will continue to work, but users are encouraged to update their code.
  For the reason behind this change, please refer to #2891.
- Deprecated sox initialization/shutdown public API functions (#3010)
  torchaudio.sox_effects.init_sox_effects and torchaudio.sox_effects.shutdown_sox_effects are deprecated. They were required to use libsox-related features, but have been called automatically since v0.6, and the initialization/shutdown mechanism has been moved elsewhere. These functions are now no-ops. Users can simply remove the calls to these functions.
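For codebases that pass resampling_method values through configuration files, the rename above can be handled in one place. The following is a sketch of such a shim for user code, not something TorchAudio provides; normalize_resampling_method is a hypothetical helper name.

```python
import warnings

# Old option value -> new option value, as listed in the deprecation note.
_RENAMED = {
    "sinc_interpolation": "sinc_interp_hann",
    "kaiser_window": "sinc_interp_kaiser",
}


def normalize_resampling_method(name):
    """Translate a deprecated `resampling_method` value to its new name."""
    if name in _RENAMED:
        new_name = _RENAMED[name]
        warnings.warn(
            f'"{name}" is deprecated; use "{new_name}" instead.',
            DeprecationWarning,
            stacklevel=2,
        )
        return new_name
    # New-style and unknown values pass through unchanged.
    return name
```

The returned value can then be passed as the resampling_method argument, so old configuration files keep working while the warning points at the call site that still uses the old name.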
Models
- Deprecated static binding of Flashlight-text based CTC decoder (#3055, #3089)
  Since v0.12, TorchAudio binary distributions have included the CTC decoder based on the flashlight-text project. In a future release, TorchAudio will switch to dynamic binding of the underlying CTC decoder implementation and stop shipping the core CTC decoder implementations. Users who would like to use the CTC decoder will need to install it separately from the upstream flashlight-text project. Other functionalities of TorchAudio will continue to work without flashlight-text.
  Note: The API and numerical behavior do not change.
  For more detail, please refer to #3088.
I/O
- Deprecated file-like object support in sox_io (#3033)
  In preparation for the switch to dynamically bound libsox, file-like object support in the sox_io backend has been deprecated. It will be removed in the 2.1 release in favor of the dispatcher. This deprecation affects the following functionalities:
  - I/O: torchaudio.load, torchaudio.info and torchaudio.save.
  - Effects: torchaudio.sox_effects.apply_effects_file and torchaudio.functional.apply_codec.
  For I/O, to continue using file-like objects, please use the new dispatcher mechanism.
  For effects, replacement functions will be added in the next release.
- Deprecated the use of Tensor as a container for byte string in StreamReader (#3086)
  torchaudio.io.StreamReader supports decoding media from byte strings contained in 1D tensors of torch.uint8 type. Using the torch.Tensor type as a container for byte strings is now deprecated. To pass byte strings, please wrap the string with io.BytesIO.

  Deprecated:
  data = b"..."
  src = torch.frombuffer(data, dtype=torch.uint8)
  StreamReader(src)

  Migration:
  data = b"..."
  src = io.BytesIO(data)
  StreamReader(src)
Bug Fixes
Ops
- Fixed contiguous error when backpropagating through torchaudio.functional.lfilter (#3080)
Pipelines
- Added layer normalization to wav2vec2 large+ pretrained models (#2873)
In self-supervised learning models such as Wav2Vec 2.0, HuBERT, or WavLM, layer normalization should be applied to waveforms if the convolutional feature extraction module uses layer normalization and is trained on a large-scale dataset. After adding layer normalization to those affected models, the Word Error Rate is significantly reduced.
Without the change in #2873, the WER results are:
Model | dev-clean | dev-other | test-clean | test-other |
---|---|---|---|---|
WAV2VEC2_ASR_LARGE_LV60K_10M | 10.59 | 15.62 | 9.58 | 16.33 |
WAV2VEC2_ASR_LARGE_LV60K_100H | 2.80 | 6.01 | 2.82 | 6.34 |
WAV2VEC2_ASR_LARGE_LV60K_960H | 2.36 | 4.43 | 2.41 | 4.96 |
HUBERT_ASR_LARGE | 1.85 | 3.46 | 2.09 | 3.89 |
HUBERT_ASR_XLARGE | 2.21 | 3.40 | 2.26 | 4.05 |
After applying layer normalization, the updated WER results are:
Model | dev-clean | dev-other | test-clean | test-other |
---|---|---|---|---|
WAV2VEC2_ASR_LARGE_LV60K_10M | 6.77 | 10.03 | 6.87 | 10.51 |
WAV2VEC2_ASR_LARGE_LV60K_100H | 2.19 | 4.55 | 2.32 | 4.64 |
WAV2VEC2_ASR_LARGE_LV60K_960H | 1.78 | 3.51 | 2.03 | 3.68 |
HUBERT_ASR_LARGE | 1.77 | 3.32 | 2.03 | 3.68 |
HUBERT_ASR_XLARGE | 1.73 | 2.72 | 1.90 | 3.16 |
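The waveform normalization referred to in this fix amounts to rescaling each input to zero mean and unit variance before it reaches the feature extractor. The following pure-Python sketch shows that computation only; it is not the code from #2873, and the layer_norm helper, the eps value, and the test signal are assumptions for illustration.

```python
import math


def layer_norm(waveform, eps=1e-5):
    """Normalize a waveform (list of floats) to zero mean and unit variance."""
    mean = sum(waveform) / len(waveform)
    variance = sum((x - mean) ** 2 for x in waveform) / len(waveform)
    # eps guards against division by zero for silent inputs.
    return [(x - mean) / math.sqrt(variance + eps) for x in waveform]


# A low-amplitude signal with a DC offset (illustrative values).
waveform = [0.1 * math.sin(0.1 * n) + 0.2 for n in range(1000)]
normalized = layer_norm(waveform)
```

A model trained on normalized inputs sees a very different amplitude distribution when fed raw waveforms, which is consistent with the gap between the two WER tables above.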
Recipe
- Fixed DDP training in HuBERT recipes (#3068)
  If shuffle is set to True in BucketizeBatchSampler, the seed is only the same for the first epoch. In later epochs, each BucketizeBatchSampler object generates a different shuffled iteration list, which may cause DDP training to hang forever if the lengths of the iteration lists differ across nodes. In the 2.0.0 release, the issue is fixed by using the same seed for the RNG on all nodes.
I/O
- Fixed signature mismatch on _fail_info_fileobj (#3032)
- Remove unnecessary AVFrame allocation (#3021)
  This fixes the memory leak reported in torchaudio.io.StreamReader.
New Features
Ops
- Added CUDA kernel for torchaudio.functional.lfilter (#3018)
- Added data augmentation ops (#2801, #2809, #2829, #2811, #2871, #2874, #2892, #2935, #2977, #3001, #3009, #3061, #3072)
  Introduces AddNoise, Convolve, FFTConvolve, Speed, SpeedPerturbation, Deemphasis, and Preemphasis in torchaudio.transforms, and add_noise, fftconvolve, convolve, speed, preemphasis, and deemphasis in torchaudio.functional.
Models
Pipelines
I/O
- Added rgb48le and CUDA p010 support (HDR/10bit) to StreamReader (#3023)
- Added fill_buffer method to torchaudio.io.StreamReader (#2954, #2971)
- Added buffer_chunk_size=-1 option to torchaudio.io.StreamReader (#2969)
  When buffer_chunk_size=-1, StreamReader does not drop any buffered frame. Together with the fill_buffer method, this is a recommended way to load the entire media.

  reader = StreamReader("video.mp4")
  reader.add_basic_audio_stream(buffer_chunk_size=-1)
  reader.add_basic_video_stream(buffer_chunk_size=-1)
  reader.fill_buffer()
  audio, video = reader.pop_chunks()
- Added PTS support to torchaudio.io.StreamReader (#2975)
  torchaudio.io.StreamReader now gives the PTS (presentation time stamp) of the media chunk it is returning. To maintain backward compatibility, the timestamp information is attached to the returned media chunk.

  reader = StreamReader(...)
  reader.add_basic_audio_stream(...)
  reader.add_basic_video_stream(...)
  for audio_chunk, video_chunk in reader.stream():
      # Fetch timestamp
      print(audio_chunk.pts)
      print(video_chunk.pts)
      # Chunks behave the same as torch.Tensor.
      audio_chunk.mean(dim=1)
- Added playback function torchaudio.io.play_audio (#3026, #3051)
  You can play audio with the torchaudio.io.play_audio function. (macOS only)
- Added new dispatcher (#3015, #3058, #3073)
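The effect of buffer_chunk_size=-1 mentioned in the StreamReader items above can be pictured with a toy buffer: a bounded buffer discards its oldest chunks once full, while an unbounded one keeps everything until it is popped. This is an illustrative model only, not StreamReader's internals; ChunkBuffer and its methods are hypothetical names.

```python
from collections import deque


class ChunkBuffer:
    """Toy chunk buffer; `buffer_chunk_size=-1` means never drop anything."""

    def __init__(self, buffer_chunk_size):
        # deque(maxlen=n) silently discards the oldest item once it is full;
        # maxlen=None makes the deque unbounded.
        maxlen = None if buffer_chunk_size == -1 else buffer_chunk_size
        self._chunks = deque(maxlen=maxlen)

    def push(self, chunk):
        self._chunks.append(chunk)

    def pop_all(self):
        chunks = list(self._chunks)
        self._chunks.clear()
        return chunks


bounded = ChunkBuffer(buffer_chunk_size=3)
unbounded = ChunkBuffer(buffer_chunk_size=-1)
for chunk in range(10):
    bounded.push(chunk)
    unbounded.push(chunk)

kept_bounded = bounded.pop_all()      # only the most recent chunks survive
kept_unbounded = unbounded.pop_all()  # every chunk survives
```

An unbounded buffer trades memory for completeness, so it suits loading a whole file in one call rather than long-running streams.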
Other
- Add utility functions to check information about FFmpeg (#2958, #3014)
  The following functions are added to torchaudio.utils.ffmpeg_utils, which can be used to query the dynamically linked FFmpeg libraries:
  - get_demuxers()
  - get_muxers()
  - get_audio_decoders()
  - get_audio_encoders()
  - get_video_decoders()
  - get_video_encoders()
  - get_input_devices()
  - get_output_devices()
  - get_input_protocols()
  - get_output_protocols()
  - get_build_config()
Recipes
- Add modularized SSL training recipe (#2876)
Improvements
I/O
- Refactor StreamReader/Writer implementation
  - Refactored StreamProcessor interface (#2791)
  - Refactored Buffer implementation (#2939, #2943, #2962, #2984, #2988)
  - Refactored AVFrame to Tensor conversions (#2940, #2946)
  - Refactored and optimized yuv420p and nv12 processing (#2945)
  - Abstracted away AVFormatContext from constructor (#3007)
  - Removed unused/redundant things (#2995)
  - Replaced torchaudio::ffmpeg namespace with torchaudio::io (#3013)
  - Merged pop_chunks implementations (#3002)
  - Cleaned up private methods (#3030)
  - Moved drain method to private (#2996)
- Added logging to torchaudio.io.StreamReader/Writer (#2878)
- Fixed the number of threads used by FilterGraph to 1 (#2985)
- Fixed the default number of threads used by decoder to 1 in torchaudio.io.StreamReader (#2949)
- Moved libsox integration from libtorchaudio to libtorchaudio_sox (#2929)
- Added query methods to FilterGraph (#2976)
Ops
- Added logging to MelSpectrogram and Spectrogram (#2861)
- Fixed filtering function fallback mechanism (#2953)
- Enabled log probs input for RNN-T loss (#2798)
- Refactored extension modules initialization (#2968)
- Updated the guard mechanism for FFmpeg-related features (#3028)
- Updated the guard mechanism for cuda_version (#2952)
Models
- Renamed generator to vocoder in HiFiGAN model and factory functions (#2955)
- Enforces contiguous tensor in CTC decoder (#3074)
Datasets
- Validates the input path in LibriMix dataset (#2944)
Documentation
- Fixed docs warnings for conformer w2v2 (#2900)
- Updated model documentation structure (#2902)
- Fixed document for MelScale and InverseMelScale (#2967)
- Updated highlighting in doc (#3000)
- Added installation / build instruction to doc (#3038)
- Redirect build instruction to official doc (#3053)
- Tweak docs around IO (#3064)
- Improved docstring about input path to LibriMix (#2937)
Recipes
- Simplify train step in Conformer RNN-T LibriSpeech recipe (#2981)
- Update WER results for CTC n-gram decoding (#3070)
- Update ssl example (#3060)
- Fix import bug in global_stats.py (#2858)
- Fixes examples/source_separation for WSJ0_2mix dataset (#2987)
Tutorials
- Added mel spectrogram visualization to Streaming ASR tutorial (#2974)
- Fixed mel spectrogram visualization in TTS tutorial (#2989)
- Updated data augmentation tutorial to use new operators (#3062)
- Fixed hybrid demucs tutorial for CUDA (#3017)
- Updated hardware accelerated video processing tutorial (#3050)
Builds
- Fixed USE_CUDA detection (#3005)
- Fixed USE_ROCM detection (#3008)
- Added M1 Conda builds (#2840)
- Added M1 Wheels builds (#2839)
- Added CUDA 11.8 builds (#2951)
- Switched CI to CUDA 11.7 from CUDA 11.6 (#3031, #3034)
- Added Python 3.11 support (#3039, #3071)
- Updated C++ standard to 17 (#2973)
Tests
- Fix integration test for WAV2VEC2_ASR_LARGE_LV60K_10M (#2910)
- Fix CI tests on gpu machines (#2982)
- Remove function input parameters from data aug functional tests (#3011)
- Reduce the sample rate of some tests (#2963)
Style
- Fix type of arguments in torchaudio.io classes (#2913)