pypi torchaudio 2.0.1
Torchaudio 2.0 Release Note

Highlights

TorchAudio 2.0 release includes:

  • Data augmentation operators, e.g. convolution, additive noise, speed perturbation
  • WavLM and XLS-R models and pre-trained pipelines
  • Backend dispatcher powering revised info, load, save functions
  • Dropped support of Python 3.7
  • Added Python 3.11 support

[Beta] Data augmentation operators

The release adds several data augmentation operators under torchaudio.functional and torchaudio.transforms:

  • torchaudio.functional.add_noise
  • torchaudio.functional.convolve
  • torchaudio.functional.deemphasis
  • torchaudio.functional.fftconvolve
  • torchaudio.functional.preemphasis
  • torchaudio.functional.speed
  • torchaudio.transforms.AddNoise
  • torchaudio.transforms.Convolve
  • torchaudio.transforms.Deemphasis
  • torchaudio.transforms.FFTConvolve
  • torchaudio.transforms.Preemphasis
  • torchaudio.transforms.Speed
  • torchaudio.transforms.SpeedPerturbation

The operators can be used to synthetically diversify training data to improve the generalizability of downstream models.

For usage details, please refer to the documentation for torchaudio.functional and torchaudio.transforms, and to the tutorial “Audio Data Augmentation”.
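
To make the idea behind additive-noise augmentation concrete, here is a minimal plain-Python sketch that scales a noise signal so the mixture reaches a target signal-to-noise ratio. It illustrates the concept only and is not torchaudio's implementation; the helper name mix_at_snr and the toy signals are hypothetical.

```python
import math


def mix_at_snr(signal, noise, snr_db):
    """Add `noise` to `signal`, scaled so the mix has the requested SNR in dB."""
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # Solve snr_db = 10 * log10(signal_power / (scale**2 * noise_power)) for scale.
    scale = math.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]


# Toy data: a 440 Hz tone sampled at 16 kHz and a deterministic stand-in for noise.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [((t * 7919) % 13 - 6) / 6 for t in range(1600)]
noisy = mix_at_snr(tone, noise, snr_db=10.0)
```

Lower snr_db values give a noisier mixture. In practice the library operators work on batched tensors, so please treat this only as a reading aid for the documentation.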

[Beta] WavLM and XLS-R models and pre-trained pipelines

The release adds two self-supervised learning models for speech and audio.

  • WavLM, which is robust to noise and reverberation.
  • XLS-R, which is trained on cross-lingual datasets.

Besides the model architectures, torchaudio also supports corresponding pre-trained pipelines:

  • torchaudio.pipelines.WAVLM_BASE
  • torchaudio.pipelines.WAVLM_BASE_PLUS
  • torchaudio.pipelines.WAVLM_LARGE
  • torchaudio.pipelines.WAV2VEC_XLSR_300M
  • torchaudio.pipelines.WAV2VEC_XLSR_1B
  • torchaudio.pipelines.WAV2VEC_XLSR_2B

For usage details, please refer to the documentation for the factory functions and the pre-trained pipelines.

Backend dispatcher

Release 2.0 introduces new versions of the I/O functions torchaudio.info, torchaudio.load and torchaudio.save, backed by a dispatcher that selects among the FFmpeg, SoX, and SoundFile backends, subject to library availability. In Release 2.0 the new logic is opt-in: set the environment variable TORCHAUDIO_USE_BACKEND_DISPATCHER=1 to enable it. It will be enabled by default in Release 2.1.

import torchaudio

# Fetch metadata using FFmpeg
metadata = torchaudio.info("test.wav", backend="ffmpeg")

# Load audio (with no backend parameter value provided, function prioritizes using FFmpeg if it is available)
waveform, rate = torchaudio.load("test.wav")

# Write audio using SoX
torchaudio.save("out.wav", waveform, rate, backend="sox")

Please see the documentation for torchaudio for more details.
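
As a small sketch of the opt-in step described above, the environment variable can be set from Python before torchaudio is loaded. This assumes the variable is read when torchaudio is first imported, which these notes do not state explicitly.

```python
import os

# Assumption: torchaudio reads this variable when it is first imported,
# so it has to be set before the import statement runs.
os.environ["TORCHAUDIO_USE_BACKEND_DISPATCHER"] = "1"

# import torchaudio  # uncomment in a real script; import only after the line above
```

Setting the variable in the shell that launches the script (for example, TORCHAUDIO_USE_BACKEND_DISPATCHER=1 python script.py) avoids any dependence on import order.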

Backward-incompatible changes

  • Dropped Python 3.7 support (#3020)
    Following the upstream PyTorch (pytorch/pytorch#93155), the support for Python 3.7 has been dropped.

  • Default to "precise" seek in torchaudio.io.StreamReader.seek (#2737, #2841, #2915, #2916, #2970)
    Previously, the StreamReader.seek method sought to the key frame closest to the given timestamp. A new option, mode, has been added. It can switch the behavior to seeking to the frame of any type, including non-key frames, that is closest to the given timestamp. This behavior is now the default.

  • Removed deprecated/unused/undocumented functions from datasets.utils (#2926, #2927)
    The following functions have been removed from datasets.utils:

    • stream_url
    • download_url
    • validate_file
    • extract_archive

Deprecations

Ops

  • Deprecated 'onesided' init param for MelSpectrogram (#2797, #2799)
    torchaudio.transforms.MelSpectrogram assumes the onesided argument to be always True. The forward path fails if its value is False. Therefore this argument is deprecated. Users specifying this argument should stop specifying it.

  • Deprecated "sinc_interpolation" and "kaiser_window" option value in favor of "sinc_interp_hann" and "sinc_interp_kaiser" (#2922)
    The valid values of resampling_method argument of resampling operations (torchaudio.transforms.Resample and torchaudio.functional.resample) are changed. "kaiser_window" is now "sinc_interp_kaiser" and "sinc_interpolation" is "sinc_interp_hann". The old values will continue to work, but users are encouraged to update their code.
    For the reason behind this change, please refer to #2891.

  • Deprecated sox initialization/shutdown public API functions (#3010)
    torchaudio.sox_effects.init_sox_effects and torchaudio.sox_effects.shutdown_sox_effects are deprecated. They were once required to use libsox-related features, but they have been called automatically since v0.6, and the initialization/shutdown mechanism has been moved elsewhere. These functions are now no-ops. Users can simply remove calls to these functions.
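
For code bases that pass resampling_method in many places, the rename described above can be handled with a small lookup table. The sketch below uses only the old and new names given in this note; the helper migrate_resampling_method is hypothetical and not part of torchaudio.

```python
# Old -> new names, taken from the deprecation note above.
RENAMED_RESAMPLING_METHODS = {
    "sinc_interpolation": "sinc_interp_hann",
    "kaiser_window": "sinc_interp_kaiser",
}


def migrate_resampling_method(name):
    """Return the 2.0 spelling of a resampling_method value (hypothetical helper)."""
    return RENAMED_RESAMPLING_METHODS.get(name, name)
```

Values that already use the new spelling pass through unchanged, so the helper is safe to apply to every call site.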

Models

  • Deprecated static binding of Flashlight-text based CTC decoder (#3055, #3089)
    Since v0.12, TorchAudio binary distributions have included the CTC decoder based on the flashlight-text project. In a future release, TorchAudio will switch to dynamic binding of the underlying CTC decoder implementation and stop shipping the core CTC decoder implementations. Users who would like to use the CTC decoder will need to install it separately from the upstream flashlight-text project. Other functionalities of TorchAudio will continue to work without flashlight-text.
    Note: The API and numerical behavior do not change.
    For more detail, please refer to #3088.

I/O

  • Deprecated file-like object support in sox_io (#3033)
    In preparation for switching to dynamically bound libsox, file-like object support in the sox_io backend has been deprecated. It will be removed in the 2.1 release in favor of the dispatcher. This deprecation affects the following functionalities:
    • I/O: torchaudio.load, torchaudio.info and torchaudio.save.
    • Effects: torchaudio.sox_effects.apply_effects_file and torchaudio.functional.apply_codec.
    For I/O, to continue using file-like objects, please use the new dispatcher mechanism.
    For effects, replacement functions will be added in the next release.
  • Deprecated the use of Tensor as a container for byte string in StreamReader (#3086)
    torchaudio.io.StreamReader supports decoding media from byte strings contained in 1D tensors of torch.uint8 type. Using torch.Tensor type as a container for byte string is now deprecated. To pass byte strings, please wrap the string with io.BytesIO.
    Deprecated:
        data = b"..."
        src = torch.frombuffer(data, dtype=torch.uint8)
        StreamReader(src)
    Migration:
        data = b"..."
        src = io.BytesIO(data)
        StreamReader(src)

Bug Fixes

Ops

  • Fixed contiguous error when backpropagating through torchaudio.functional.lfilter (#3080)

Pipelines

  • Added layer normalization to wav2vec2 large+ pretrained models (#2873)
    In self-supervised learning models such as Wav2Vec 2.0, HuBERT, or WavLM, layer normalization should be applied to waveforms if the convolutional feature extraction module uses layer normalization and is trained on a large-scale dataset. After adding layer normalization to those affected models, the Word Error Rate is significantly reduced.

Without the change in #2873, the WER results are:

Model                          dev-clean  dev-other  test-clean  test-other
WAV2VEC2_ASR_LARGE_LV60K_10M       10.59      15.62        9.58       16.33
WAV2VEC2_ASR_LARGE_LV60K_100H       2.80       6.01        2.82        6.34
WAV2VEC2_ASR_LARGE_LV60K_960H       2.36       4.43        2.41        4.96
HUBERT_ASR_LARGE                    1.85       3.46        2.09        3.89
HUBERT_ASR_XLARGE                   2.21       3.40        2.26        4.05

After applying layer normalization, the updated WER results are:

Model                          dev-clean  dev-other  test-clean  test-other
WAV2VEC2_ASR_LARGE_LV60K_10M        6.77      10.03        6.87       10.51
WAV2VEC2_ASR_LARGE_LV60K_100H       2.19       4.55        2.32        4.64
WAV2VEC2_ASR_LARGE_LV60K_960H       1.78       3.51        2.03        3.68
HUBERT_ASR_LARGE                    1.77       3.32        2.03        3.68
HUBERT_ASR_XLARGE                   1.73       2.72        1.90        3.16
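
When layer normalization is applied over an entire waveform, it amounts to rescaling the utterance to zero mean and unit variance. The plain-Python sketch below shows that operation under this assumption. The function name normalize_waveform, the eps value, and the toy input are illustrative and are not taken from the torchaudio pipelines.

```python
import math


def normalize_waveform(samples, eps=1e-5):
    """Scale a single utterance to zero mean and (approximately) unit variance."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    # eps keeps the division stable for near-silent inputs.
    return [(x - mean) / math.sqrt(variance + eps) for x in samples]


# Toy input: a quiet oscillation sitting on a DC offset.
waveform = [0.5 + 0.1 * math.sin(0.01 * t) for t in range(16000)]
normalized = normalize_waveform(waveform)
```

The DC offset and the overall level are removed, which is the kind of input mismatch that the affected models are sensitive to.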

Recipe

  • Fixed DDP training in HuBERT recipes (#3068)
    If shuffle is set to True in BucketizeBatchSampler, the seed is the same across nodes only for the first epoch. In later epochs, each BucketizeBatchSampler object generates a different shuffled iteration list, which may cause DDP training to hang forever if the lengths of the iteration lists differ across nodes. In the 2.0.0 release, the issue is fixed by using the same seed for the RNG on all nodes.
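
The effect of sharing a seed can be sketched with the standard library alone. This is a simplified illustration of the idea, not the recipe's BucketizeBatchSampler code; the function shuffled_buckets and the seed value are hypothetical.

```python
import random


def shuffled_buckets(buckets, seed):
    """Return a shuffled copy of `buckets` using a dedicated, seeded RNG."""
    rng = random.Random(seed)
    order = list(buckets)
    rng.shuffle(order)
    return order


buckets = list(range(8))
shared_seed = 1234  # hypothetical value; what matters is that every node uses the same one

# Each simulated node shuffles independently but derives its RNG from the same seed.
node_orders = [shuffled_buckets(buckets, shared_seed) for _node in range(4)]
```

Because every node produces the same order, the per-node iteration lists stay consistent, which avoids the mismatch in list lengths described above.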

IO

  • Fixed signature mismatch on _fail_info_fileobj (#3032)
  • Removed unnecessary AVFrame allocation (#3021)
    This fixes the memory leak reported in torchaudio.io.StreamReader.

New Features

Ops

  • Added CUDA kernel for torchaudio.functional.lfilter (#3018)
  • Added data augmentation ops (#2801, #2809, #2829, #2811, #2871, #2874, #2892, #2935, #2977, #3001, #3009, #3061, #3072)
    Introduces AddNoise, Convolve, FFTConvolve, Speed, SpeedPerturbation, Deemphasis, and Preemphasis in torchaudio.transforms, and add_noise, fftconvolve, convolve, speed, preemphasis, and deemphasis in torchaudio.functional.
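
As a conceptual illustration of the Preemphasis and Deemphasis pair, the sketch below implements the standard first-order filters in plain Python and shows that one undoes the other. It is not torchaudio's implementation; the coefficient 0.97 is a commonly used value and is an assumption here.

```python
def preemphasis(samples, coeff=0.97):
    """y[n] = x[n] - coeff * x[n-1]; the first sample is passed through."""
    if not samples:
        return []
    return [samples[0]] + [
        samples[n] - coeff * samples[n - 1] for n in range(1, len(samples))
    ]


def deemphasis(samples, coeff=0.97):
    """y[n] = x[n] + coeff * y[n-1]; the inverse of preemphasis."""
    out = []
    previous = 0.0
    for x in samples:
        previous = x + coeff * previous
        out.append(previous)
    return out


original = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
restored = deemphasis(preemphasis(original))
```

Pre-emphasis boosts high frequencies relative to low ones, and de-emphasis applies the inverse filter, so the round trip returns the input up to floating-point error.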

Models

Pipelines

  • Added WavLM bundles (#2833, #2895)
  • Added pre-trained pipelines for XLS-R models (#2978)

I/O

  • Added rgb48le and CUDA p010 support (HDR/10bit) to StreamReader (#3023)
  • Added fill_buffer method to torchaudio.io.StreamReader (#2954, #2971)
  • Added buffer_chunk_size=-1 option to torchaudio.io.StreamReader (#2969)
    When buffer_chunk_size=-1, StreamReader does not drop any buffered frame. Together with the fill_buffer method, this is a recommended way to load the entire media.
    from torchaudio.io import StreamReader

    reader = StreamReader("video.mp4")
    reader.add_basic_audio_stream(buffer_chunk_size=-1)
    reader.add_basic_video_stream(buffer_chunk_size=-1)
    reader.fill_buffer()
    audio, video = reader.pop_chunks()
  • Added PTS support to torchaudio.io.StreamReader (#2975)
    torchaudio.io.StreamReader now gives the PTS (presentation timestamp) of the media chunk it returns. To maintain backward compatibility, the timestamp information is attached to the returned media chunk.
    reader = StreamReader(...)
    reader.add_basic_audio_stream(...)
    reader.add_basic_video_stream(...)
    for audio_chunk, video_chunk in reader.stream():
        # Fetch timestamp
        print(audio_chunk.pts)
        print(video_chunk.pts)
        # Chunks behave the same as torch.Tensor.
        audio_chunk.mean(dim=1)
  • Added playback function torchaudio.io.play_audio (#3026, #3051)
    You can play audio with the torchaudio.io.play_audio function. (macOS only)
  • Added new dispatcher (#3015, #3058, #3073)

Other

  • Added utility functions to check information about FFmpeg (#2958, #3014)
    The following functions are added to torchaudio.utils.ffmpeg_utils. They can be used to query the dynamically linked FFmpeg libraries.
    • get_demuxers()
    • get_muxers()
    • get_audio_decoders()
    • get_audio_encoders()
    • get_video_decoders()
    • get_video_encoders()
    • get_input_devices()
    • get_output_devices()
    • get_input_protocols()
    • get_output_protocols()
    • get_build_config()

Recipes

  • Add modularized SSL training recipe (#2876)

Improvements

I/O

  • Refactor StreamReader/Writer implementation

    • Refactored StreamProcessor interface (#2791)
    • Refactored Buffer implementation (#2939, #2943, #2962, #2984, #2988)
    • Refactored AVFrame to Tensor conversions (#2940, #2946)
    • Refactored and optimized yuv420p and nv12 processing (#2945)
    • Abstracted away AVFormatContext from constructor (#3007)
    • Removed unused/redundant things (#2995)
    • Replaced torchaudio::ffmpeg namespace with torchaudio::io (#3013)
    • Merged pop_chunks implementations (#3002)
    • Cleaned up private methods (#3030)
    • Moved drain method to private (#2996)
  • Added logging to torchaudio.io.StreamReader/Writer (#2878)

  • Fixed the number of threads used by FilterGraph to 1 (#2985)

  • Fixed the default number of threads used by the decoder to 1 in torchaudio.io.StreamReader (#2949)

  • Moved libsox integration from libtorchaudio to libtorchaudio_sox (#2929)

  • Added query methods to FilterGraph (#2976)

Ops

  • Added logging to MelSpectrogram and Spectrogram (#2861)
  • Fixed filtering function fallback mechanism (#2953)
  • Enabled log probs input for RNN-T loss (#2798)
  • Refactored extension modules initialization (#2968)
  • Updated the guard mechanism for FFmpeg-related features (#3028)
  • Updated the guard mechanism for cuda_version (#2952)

Models

  • Renamed generator to vocoder in HiFiGAN model and factory functions (#2955)
  • Enforced contiguous tensor in CTC decoder (#3074)

Datasets

  • Validated the input path in LibriMix dataset (#2944)

Documentation

  • Fixed docs warnings for conformer w2v2 (#2900)
  • Updated model documentation structure (#2902)
  • Fixed document for MelScale and InverseMelScale (#2967)
  • Updated highlighting in doc (#3000)
  • Added installation / build instruction to doc (#3038)
  • Redirect build instruction to official doc (#3053)
  • Tweak docs around IO (#3064)
  • Improved docstring about input path to LibriMix (#2937)

Recipes

  • Simplify train step in Conformer RNN-T LibriSpeech recipe (#2981)
  • Update WER results for CTC n-gram decoding (#3070)
  • Update ssl example (#3060)
  • Fixed import bug in global_stats.py (#2858)
  • Fixed examples/source_separation for WSJ0_2mix dataset (#2987)

Tutorials

  • Added mel spectrogram visualization to Streaming ASR tutorial (#2974)
  • Fixed mel spectrogram visualization in TTS tutorial (#2989)
  • Updated data augmentation tutorial to use new operators (#3062)
  • Fixed hybrid demucs tutorial for CUDA (#3017)
  • Updated hardware accelerated video processing tutorial (#3050)

Builds

  • Fixed USE_CUDA detection (#3005)
  • Fixed USE_ROCM detection (#3008)
  • Added M1 Conda builds (#2840)
  • Added M1 Wheels builds (#2839)
  • Added CUDA 11.8 builds (#2951)
  • Switched CI to CUDA 11.7 from CUDA 11.6 (#3031, #3034)
  • Added python 3.11 support (#3039, #3071)
  • Updated C++ standard to 17 (#2973)

Tests

  • Fix integration test for WAV2VEC2_ASR_LARGE_LV60K_10M (#2910)
  • Fix CI tests on gpu machines (#2982)
  • Remove function input parameters from data aug functional tests (#3011)
  • Reduce the sample rate of some tests (#2963)

Style

  • Fix type of arguments in torchaudio.io classes (#2913)
