torchaudio 0.13.0 Release Notes

Highlights

The TorchAudio 0.13.0 release includes:

  • Source separation models and pre-trained bundles (Hybrid Demucs, ConvTasNet)
  • New datasets and metadata mode for the SUPERB benchmark
  • Custom language model support for CTC beam search decoding
  • StreamWriter for audio and video encoding

[Beta] Source Separation Models and Bundles

Hybrid Demucs is a music source separation model that uses both spectrogram and time-domain features. It demonstrated state-of-the-art performance in the Sony Music DeMixing Challenge (https://arxiv.org/abs/2111.03600).

The TorchAudio v0.13 release includes the following features:

  • MUSDB_HQ Dataset, which is used in Hybrid Demucs training (docs)
  • Hybrid Demucs model architecture (docs)
  • Three factory functions suitable for different sample rate ranges
  • Pre-trained pipelines (docs) and tutorial

SDR results (dB) of the pre-trained pipelines on the MUSDB-HQ test set

| Pipeline | All | Drums | Bass | Other | Vocals |
|---|---|---|---|---|---|
| HDEMUCS_HIGH_MUSDB* | 6.42 | 7.76 | 6.51 | 4.47 | 6.93 |
| HDEMUCS_HIGH_MUSDB_PLUS** | 9.37 | 11.38 | 10.53 | 7.24 | 8.32 |

* Trained on the training data of MUSDB-HQ dataset.
** Trained on both training and test sets of MUSDB-HQ and 150 extra songs from an internal database that were specifically produced for Meta.

Special thanks to @adefossez for the guidance.

The ConvTasNet model architecture was added in TorchAudio 0.7.0. It was the first source separation model to outperform the oracle ideal ratio mask. This release adds a pre-trained pipeline trained within TorchAudio on the Libri2Mix dataset. The pipeline achieves an SDR improvement of 15.6 dB and an Si-SNR improvement of 15.3 dB on the Libri2Mix test set.
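
Both pre-trained source separation pipelines are exposed as bundles under torchaudio.pipelines. The sketch below is a minimal, non-authoritative example using the Hybrid Demucs bundle; the bundle attributes (get_model, sample_rate) and the model's sources list are assumed to follow the pattern shown in the 0.13 documentation, and the normalization and chunked fade-overlap used in the tutorial are omitted here.

```python
import torch
import torchaudio
from torchaudio.pipelines import HDEMUCS_HIGH_MUSDB_PLUS

# Load the pre-trained Hybrid Demucs pipeline.
bundle = HDEMUCS_HIGH_MUSDB_PLUS
model = bundle.get_model()
model.eval()

# Load a stereo mixture and resample it to the rate the pipeline expects (44.1 kHz).
waveform, sample_rate = torchaudio.load("mixture.wav")  # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

# The model expects a batch dimension: (batch, channels, frames).
with torch.inference_mode():
    separated = model(waveform.unsqueeze(0))  # (batch, num_sources, channels, frames)

# Map each separated track to its source name (assumed: drums, bass, other, vocals).
for name, track in zip(model.sources, separated.squeeze(0)):
    torchaudio.save(f"{name}.wav", track.cpu(), bundle.sample_rate)
```

The ConvTasNet pipeline follows the same bundle pattern, but operates on 8 kHz mixtures from Libri2Mix.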

[Beta] Datasets and Metadata Mode for SUPERB Benchmarks

With the addition of four new audio-related datasets, there is now support for all downstream tasks in version 1 of the SUPERB benchmark. Furthermore, these datasets support metadata mode through a get_metadata function, which enables faster dataset iteration or preprocessing without the need to load or store waveforms.

The individual datasets that gained metadata support are listed under Datasets in the New Features section below.
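
As a rough illustration, the sketch below uses the LibriSpeech dataset's get_metadata method (added in #2653). The exact fields returned are an assumption based on the regular __getitem__ output, with a file path in place of the decoded waveform; check the dataset docs for the authoritative tuple.

```python
import torchaudio

# Instantiate the dataset as usual; "data" must already contain the downloaded subset.
dataset = torchaudio.datasets.LIBRISPEECH(root="data", url="test-clean", download=False)

# Regular mode decodes the audio file for every item.
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]

# Metadata mode returns the file path instead of the waveform, so no audio is decoded.
# (Field order shown here is an assumption; see the dataset documentation.)
filepath, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset.get_metadata(0)

print(filepath, sample_rate, transcript[:30])
```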

[Beta] Custom Language Model support in CTC Beam Search Decoding

In release 0.12, TorchAudio introduced a CTC beam search decoder with KenLM language model support. This release adds support for creating custom Python language models that are compatible with the decoder, using the torchaudio.models.decoder.CTCDecoderLM wrapper.
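
A minimal sketch of the wrapper interface, loosely following the custom-LM example added in #2762; the method names (start, score, finish) and the CTCDecoderLMState type come from the decoder documentation, while the toy zero-scoring model itself is purely illustrative.

```python
from torchaudio.models.decoder import CTCDecoderLM, CTCDecoderLMState


class ZeroLM(CTCDecoderLM):
    """Toy language model that assigns score 0.0 to every token (uniform LM)."""

    def start(self, start_with_nothing: bool):
        # Return a fresh root state at the beginning of each decoding run.
        return CTCDecoderLMState()

    def score(self, state: CTCDecoderLMState, usr_token_idx: int):
        # Return the child state for the token and the score of the transition.
        out_state = state.child(usr_token_idx)
        return out_state, 0.0

    def finish(self, state: CTCDecoderLMState):
        # Score assigned when a hypothesis ends.
        return state, 0.0
```

The resulting object can then be passed as the lm argument of torchaudio.models.decoder.ctc_decoder.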

[Beta] StreamWriter

torchaudio.io.StreamWriter is a class for encoding media, including audio and video. It supports a wide variety of codecs, chunk-by-chunk encoding, and GPU encoding.
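
A minimal sketch of chunk-by-chunk audio encoding with StreamWriter, based on the API described in the StreamWriter tutorial; the input tensor here is synthetic.

```python
import math
import torch
from torchaudio.io import StreamWriter

sample_rate = 16000
# One second of a synthetic 440 Hz tone, shape (frames, channels), float32 samples.
t = torch.arange(sample_rate, dtype=torch.float32) / sample_rate
chunk = torch.sin(2 * math.pi * 440 * t).unsqueeze(1)

# Encode to a WAV file; the container/codec is inferred from the destination path.
writer = StreamWriter(dst="output.wav")
writer.add_audio_stream(sample_rate=sample_rate, num_channels=1)

with writer.open():
    # Streaming use cases call write_audio_chunk repeatedly, chunk by chunk.
    writer.write_audio_chunk(0, chunk)
```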

Backward-incompatible changes

  • [BC-breaking] Fix momentum in transforms.GriffinLim (#2568)
    The GriffinLim implementations in transforms and functional used the momentum parameter differently, resulting in inconsistent results between the two implementations. The transforms.GriffinLim usage of momentum is updated to resolve this discrepancy.
  • Make torchaudio.info decode audio to compute num_frames if it is not found in metadata (#2740).
    In such cases, torchaudio.info may now return non-zero values for num_frames.
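    A small illustration of the second item, assuming a compressed file (a hypothetical MP3) whose header does not carry the frame count:

```python
import torchaudio

# For formats whose metadata lacks the frame count, torchaudio.info now decodes
# the audio to compute it, so num_frames can be non-zero where it used to be 0.
metadata = torchaudio.info("example.mp3")  # hypothetical file path
print(metadata.sample_rate, metadata.num_channels, metadata.num_frames)
```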

Bug Fixes

  • Fix random Gaussian generation (#2639)
    torchaudio.compliance.kaldi.fbank with the dither option produced output different from Kaldi because it used a skewed, rather than Gaussian, distribution for dither. This release updates the implementation to correctly use a random Gaussian distribution instead.
  • Update download link for speech commands (#2777)
    The previous download link for SpeechCommands v2 did not include data for the valid and test sets, resulting in errors when trying to use those subsets. The download link is updated to correctly download the whole dataset.

New Features

IO

  • Add metadata to source stream info (#2461, #2464)
  • Add utility function to fetch FFmpeg library versions (#2467)
  • Add YUV444P support to StreamReader (#2516)
  • Add StreamWriter (#2628, #2648, #2505)
  • Support in-memory decoding via Tensor wrapper in StreamReader (#2694)
  • Add StreamReader Tensor Binding to src (#2699)
  • Add StreamWriter media device/streaming tutorial (#2708)
  • Add StreamWriter tutorial (#2698)

Ops

  • Add ITU-R BS.1770-4 loudness recommendation (#2472)
  • Add convolution operator (#2602)
  • Add additive noise function (#2608)

Models

  • Hybrid Demucs model implementation (#2506)
  • Docstring change for Hybrid Demucs (#2542, #2570)
  • Add NNLM support to CTC Decoder (#2528, #2658)
  • Move hybrid demucs model out of prototype (#2668)
  • Move conv_tasnet_base doc out of prototype (#2675)
  • Add custom lm example to decoder tutorial (#2762)

Pipelines

  • Add SourceSeparationBundle to prototype (#2440, #2559)
  • Adding pipeline changes, factory functions to HDemucs (#2547, #2565)
  • Create tutorial for HDemucs (#2572)
  • Add HDEMUCS_HIGH_MUSDB (#2601)
  • Move SourceSeparationBundle and pre-trained ConvTasNet pipeline into Beta (#2669)
  • Move Hybrid Demucs pipeline to beta (#2673)
  • Update description of HDemucs pipelines

Datasets

  • Add fluent speech commands (#2480, #2510)
  • Add musdb dataset and tests (#2484)
  • Add VoxCeleb1 dataset (#2349)
  • Add metadata function for LibriSpeech (#2653)
  • Add Speech Commands metadata function (#2687)
  • Add metadata mode for various datasets (#2697)
  • Add IEMOCAP dataset (#2732)
  • Add Snips Dataset (#2738)
  • Add metadata for Librimix (#2751)
  • Add file name to returned item in Snips dataset (#2775)
  • Update IEMOCAP variants and labels (#2778)

Improvements

IO

Ops

  • Speed up resample with kernel generation modification (#2553, #2561)
    The kernel generation for resampling is optimized in this release. The tables below compare this release against the previous one for the torchaudio.functional.resample function with the sinc resampling method, measured on a float32 tensor with two channels and one second duration. A minimal usage sketch of the resample call follows this list.

CPU (sample rates in Hz)

| torchaudio version | 8k → 16k | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
|---|---|---|---|---|
| 0.13 | 0.256 | 0.549 | 0.769 | 0.820 |
| 0.12 | 0.386 | 0.534 | 31.8 | 12.1 |

CUDA (sample rates in Hz)

| torchaudio version | 8k → 16k | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
|---|---|---|---|---|
| 0.13 | 0.332 | 0.336 | 0.345 | 0.381 |
| 0.12 | 0.524 | 0.334 | 64.4 | 22.8 |
  • Add normalization parameter on spectrogram and inverse spectrogram (#2554)
  • Replace assert with raise for ops (#2579, #2599)
  • Replace CHECK_ by TORCH_CHECK_ (#2582)
  • Fix argument validation in TorchAudio filtering (#2609)
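
For reference, a minimal sketch of calling the benchmarked function; the resampling_method value shown is assumed to be the 0.13-era name of the sinc method (it is also the default).

```python
import torch
import torchaudio.functional as F

waveform = torch.randn(2, 16000)  # two channels, one second at 16 kHz, float32

# Upsample 16 kHz -> 44.1 kHz with the sinc interpolation kernel.
resampled = F.resample(
    waveform,
    orig_freq=16000,
    new_freq=44100,
    resampling_method="sinc_interpolation",
)
print(resampled.shape)  # (2, 44100)
```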

Models

  • Switch to flashlight decoder from upstream (#2557)
  • Add dimension and shape check (#2563)
  • Replace assert with raise in models (#2578, #2590)
  • Migrate CTC decoder code (#2580)
  • Enable CTC decoder in Windows (#2587)

Datasets

  • Replace assert with raise in datasets (#2571)
  • Add unit test for LibriMix dataset (#2659)
  • Add gtzan download note (#2763)

Tutorials

  • Tweak tutorials (#2630, #2733)
  • Update ASR inference tutorial (#2631)
  • Update and fix tutorials (#2661, #2701)
  • Introduce IO section to getting started tutorials (#2703)
  • Update HW video processing tutorial (#2739)
  • Update tutorial author information (#2764)
  • Fix typos in tacotron2 tutorial (#2761)
  • Fix fading in hybrid demucs tutorial (#2771)
  • Fix leaking matplotlib figure (#2769)
  • Update resampling tutorial (#2773)

Recipes

  • Use lazy import for joblib (#2498)
  • Revise LibriSpeech Conformer RNN-T recipe (#2535)
  • Fix bug in Conformer RNN-T recipe (#2611)
  • Replace bg_iterator in examples (#2645)
  • Remove obsolete examples (#2655)
  • Fix LibriSpeech Conformer RNN-T eval script (#2666)
  • Replace IValue::toString()->string() with IValue::toStringRef() (#2700)
  • Improve wav2vec2/hubert model for pre-training (#2716)
  • Improve hubert recipe for pre-training and fine-tuning (#2744)

WER improvement on LibriSpeech dev and test sets

| Subset | Viterbi (v0.12) | Viterbi (v0.13) | KenLM (v0.12) | KenLM (v0.13) |
|---|---|---|---|---|
| dev-clean | 10.7 | 10.9 | 4.4 | 4.2 |
| dev-other | 18.3 | 17.5 | 9.7 | 9.4 |
| test-clean | 10.8 | 10.9 | 4.4 | 4.4 |
| test-other | 18.5 | 17.8 | 10.1 | 9.5 |

Documentation

Examples

  • Add example for Vol transform (#2597)
  • Add example for Vad transform (#2598)
  • Add example for SlidingWindowCmn transform (#2600)
  • Add example for MelScale transform (#2616)
  • Add example for AmplitudeToDB transform (#2615)
  • Add example for InverseMelScale transform (#2635)
  • Add example for MFCC transform (#2637)
  • Add example for LFCC transform (#2640)
  • Add example for Loudness transform (#2641)

Other

  • Remove CTC decoder prototype message (#2459)
  • Fix docstring (#2540)
  • Dataset docstring change (#2575)
  • Fix typo - "dimension" (#2596)
  • Add note for lexicon free decoder output (#2603)
  • Fix stylecheck (#2606)
  • Fix dataset docs parsing issue with extra spaces (#2607)
  • Remove outdated doc (#2617)
  • Use double quotes for string in functional and transforms (#2618)
  • Fix doc warning (#2627)
  • Update README.md (#2633)
  • Sphinx-gallery updates (#2629, #2638, #2736, #2678, #2679)
  • Tweak documentation (#2656)
  • Consolidate bibliography / reference (#2676)
  • Tweak badge link URL generation (#2677)
  • Adopt :autosummary: in torchaudio docs (#2664, #2681, #2683, #2684, #2693, #2689, #2690, #2692)
  • Update sox info docstring to account for mp3 frame count handling (#2742)
  • Fix HuBERT docstring (#2746)
  • Fix CTCDecoder doc (#2766)
  • Fix torchaudio.backend doc (#2781)

Build/CI
