torchaudio 0.13.0 Release Notes

Highlights

The TorchAudio 0.13.0 release includes:

  • Source separation models and pre-trained bundles (Hybrid Demucs, ConvTasNet)
  • New datasets and metadata mode for the SUPERB benchmark
  • Custom language model support for CTC beam search decoding
  • StreamWriter for audio and video encoding

[Beta] Source Separation Models and Bundles

Hybrid Demucs is a music source separation model that uses both spectrogram and time-domain features. It demonstrated state-of-the-art performance in the Sony Music DeMixing Challenge (https://arxiv.org/abs/2111.03600).

The TorchAudio v0.13 release includes the following features:

  • MUSDB_HQ Dataset, which is used in Hybrid Demucs training (docs)
  • Hybrid Demucs model architecture (docs)
  • Three factory functions suitable for different sample rate ranges
  • Pre-trained pipelines (docs) and tutorial

SDR results (dB) of the pre-trained pipelines on the MUSDB-HQ test set

| Pipeline | All | Drums | Bass | Other | Vocals |
|---|---|---|---|---|---|
| HDEMUCS_HIGH_MUSDB* | 6.42 | 7.76 | 6.51 | 4.47 | 6.93 |
| HDEMUCS_HIGH_MUSDB_PLUS** | 9.37 | 11.38 | 10.53 | 7.24 | 8.32 |

* Trained on the training data of MUSDB-HQ dataset.
** Trained on both training and test sets of MUSDB-HQ and 150 extra songs from an internal database that were specifically produced for Meta.

Special thanks to @adefossez for the guidance.

The ConvTasNet model architecture was added in TorchAudio 0.7.0. It was the first source separation model to outperform the oracle ideal ratio mask. This release adds a pre-trained pipeline trained within TorchAudio on the Libri2Mix dataset. The pipeline achieves an SDR improvement of 15.6 dB and an Si-SNR improvement of 15.3 dB on the Libri2Mix test set.
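
Both pre-trained source separation pipelines are exposed as bundles under torchaudio.pipelines. The sketch below is a minimal, non-authoritative example using the Hybrid Demucs bundle; the bundle attributes (get_model, sample_rate) and the model's sources list are assumed to follow the pattern shown in the 0.13 documentation, and the normalization and chunked fade-overlap used in the tutorial are omitted here.

```python
import torch
import torchaudio
from torchaudio.pipelines import HDEMUCS_HIGH_MUSDB_PLUS

# Load the pre-trained Hybrid Demucs pipeline.
bundle = HDEMUCS_HIGH_MUSDB_PLUS
model = bundle.get_model()
model.eval()

# Load a stereo mixture and resample it to the rate the pipeline expects (44.1 kHz).
waveform, sample_rate = torchaudio.load("mixture.wav")  # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

# The model expects a batch dimension: (batch, channels, frames).
with torch.inference_mode():
    separated = model(waveform.unsqueeze(0))  # (batch, num_sources, channels, frames)

# Map each separated track to its source name (assumed: drums, bass, other, vocals).
for name, track in zip(model.sources, separated.squeeze(0)):
    torchaudio.save(f"{name}.wav", track.cpu(), bundle.sample_rate)
```

The ConvTasNet pipeline follows the same bundle pattern, but operates on 8 kHz mixtures from Libri2Mix.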

[Beta] Datasets and Metadata Mode for SUPERB Benchmarks

With the addition of four new audio-related datasets, there is now support for all downstream tasks in version 1 of the SUPERB benchmark. Furthermore, these datasets support metadata mode through a get_metadata function, which enables faster dataset iteration or preprocessing without the need to load or store waveforms.

The individual datasets that gained metadata support are listed under Datasets in the New Features section below.
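
As a rough illustration, the sketch below uses the LibriSpeech dataset's get_metadata method (added in #2653). The exact fields returned are an assumption based on the regular __getitem__ output, with a file path in place of the decoded waveform; check the dataset docs for the authoritative tuple.

```python
import torchaudio

# Instantiate the dataset as usual; "data" must already contain the downloaded subset.
dataset = torchaudio.datasets.LIBRISPEECH(root="data", url="test-clean", download=False)

# Regular mode decodes the audio file for every item.
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]

# Metadata mode returns the file path instead of the waveform, so no audio is decoded.
# (Field order shown here is an assumption; see the dataset documentation.)
filepath, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset.get_metadata(0)

print(filepath, sample_rate, transcript[:30])
```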

[Beta] Custom Language Model support in CTC Beam Search Decoding

In release 0.12, TorchAudio introduced a CTC beam search decoder with KenLM language model support. This release adds support for creating custom Python language models that are compatible with the decoder, using the torchaudio.models.decoder.CTCDecoderLM wrapper.
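
A minimal sketch of the wrapper interface, loosely following the custom-LM example added in #2762; the method names (start, score, finish) and the CTCDecoderLMState type come from the decoder documentation, while the toy zero-scoring model itself is purely illustrative.

```python
from torchaudio.models.decoder import CTCDecoderLM, CTCDecoderLMState


class ZeroLM(CTCDecoderLM):
    """Toy language model that assigns score 0.0 to every token (uniform LM)."""

    def start(self, start_with_nothing: bool):
        # Return a fresh root state at the beginning of each decoding run.
        return CTCDecoderLMState()

    def score(self, state: CTCDecoderLMState, usr_token_idx: int):
        # Return the child state for the token and the score of the transition.
        out_state = state.child(usr_token_idx)
        return out_state, 0.0

    def finish(self, state: CTCDecoderLMState):
        # Score assigned when a hypothesis ends.
        return state, 0.0
```

The resulting object can then be passed as the lm argument of torchaudio.models.decoder.ctc_decoder.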

[Beta] StreamWriter

torchaudio.io.StreamWriter is a class for encoding media, including audio and video. It supports a wide variety of codecs, chunk-by-chunk encoding, and GPU encoding.
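
A minimal sketch of chunk-by-chunk audio encoding with StreamWriter, based on the API described in the StreamWriter tutorial; the input tensor here is synthetic.

```python
import math
import torch
from torchaudio.io import StreamWriter

sample_rate = 16000
# One second of a synthetic 440 Hz tone, shape (frames, channels), float32 samples.
t = torch.arange(sample_rate, dtype=torch.float32) / sample_rate
chunk = torch.sin(2 * math.pi * 440 * t).unsqueeze(1)

# Encode to a WAV file; the container/codec is inferred from the destination path.
writer = StreamWriter(dst="output.wav")
writer.add_audio_stream(sample_rate=sample_rate, num_channels=1)

with writer.open():
    # Streaming use cases call write_audio_chunk repeatedly, chunk by chunk.
    writer.write_audio_chunk(0, chunk)
```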

Backward-incompatible changes

  • [BC-breaking] Fix momentum in transforms.GriffinLim (#2568)
    The GriffinLim implementations in transforms and functional used the momentum parameter differently, resulting in inconsistent results between the two implementations. The transforms.GriffinLim usage of momentum is updated to resolve this discrepancy.
  • Make torchaudio.info decode audio to compute num_frames if it is not found in metadata (#2740).
    In such cases, torchaudio.info may now return non-zero values for num_frames.
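    A small illustration of the second item, assuming a compressed file (a hypothetical MP3) whose header does not carry the frame count:

```python
import torchaudio

# For formats whose metadata lacks the frame count, torchaudio.info now decodes
# the audio to compute it, so num_frames can be non-zero where it used to be 0.
metadata = torchaudio.info("example.mp3")  # hypothetical file path
print(metadata.sample_rate, metadata.num_channels, metadata.num_frames)
```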

Bug Fixes

  • Fix random Gaussian generation (#2639)
    torchaudio.compliance.kaldi.fbank with the dither option produced output different from Kaldi because it used a skewed, rather than Gaussian, distribution for dither. This release updates the implementation to correctly use a random Gaussian distribution instead.
  • Update download link for speech commands (#2777)
    The previous download link for SpeechCommands v2 did not include data for the valid and test sets, resulting in errors when trying to use those subsets. The download link is updated to correctly download the whole dataset.

New Features

IO

  • Add metadata to source stream info (#2461, #2464)
  • Add utility function to fetch FFmpeg library versions (#2467)
  • Add YUV444P support to StreamReader (#2516)
  • Add StreamWriter (#2628, #2648, #2505)
  • Support in-memory decoding via Tensor wrapper in StreamReader (#2694)
  • Add StreamReader Tensor Binding to src (#2699)
  • Add StreamWriter media device/streaming tutorial (#2708)
  • Add StreamWriter tutorial (#2698)

Ops

  • Add ITU-R BS.1770-4 loudness recommendation (#2472)
  • Add convolution operator (#2602)
  • Add additive noise function (#2608)

Models

  • Hybrid Demucs model implementation (#2506)
  • Docstring change for Hybrid Demucs (#2542, #2570)
  • Add NNLM support to CTC Decoder (#2528, #2658)
  • Move hybrid demucs model out of prototype (#2668)
  • Move conv_tasnet_base doc out of prototype (#2675)
  • Add custom lm example to decoder tutorial (#2762)

Pipelines

  • Add SourceSeparationBundle to prototype (#2440, #2559)
  • Adding pipeline changes, factory functions to HDemucs (#2547, #2565)
  • Create tutorial for HDemucs (#2572)
  • Add HDEMUCS_HIGH_MUSDB (#2601)
  • Move SourceSeparationBundle and pre-trained ConvTasNet pipeline into Beta (#2669)
  • Move Hybrid Demucs pipeline to beta (#2673)
  • Update description of HDemucs pipelines

Datasets

  • Add fluent speech commands (#2480, #2510)
  • Add musdb dataset and tests (#2484)
  • Add VoxCeleb1 dataset (#2349)
  • Add metadata function for LibriSpeech (#2653)
  • Add Speech Commands metadata function (#2687)
  • Add metadata mode for various datasets (#2697)
  • Add IEMOCAP dataset (#2732)
  • Add Snips Dataset (#2738)
  • Add metadata for Librimix (#2751)
  • Add file name to returned item in Snips dataset (#2775)
  • Update IEMOCAP variants and labels (#2778)

Improvements

IO

Ops

  • Speed up resample with kernel generation modification (#2553, #2561)
    The kernel generation for resampling is optimized in this release. The tables below compare this release against the previous one for the torchaudio.functional.resample function with the sinc resampling method, measured on a float32 tensor with two channels and one second duration. A minimal usage sketch of the resample call follows this list.

CPU (sample rates in Hz)

| torchaudio version | 8k → 16k | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
|---|---|---|---|---|
| 0.13 | 0.256 | 0.549 | 0.769 | 0.820 |
| 0.12 | 0.386 | 0.534 | 31.8 | 12.1 |

CUDA (sample rates in Hz)

| torchaudio version | 8k → 16k | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
|---|---|---|---|---|
| 0.13 | 0.332 | 0.336 | 0.345 | 0.381 |
| 0.12 | 0.524 | 0.334 | 64.4 | 22.8 |
  • Add normalization parameter on spectrogram and inverse spectrogram (#2554)
  • Replace assert with raise for ops (#2579, #2599)
  • Replace CHECK_ by TORCH_CHECK_ (#2582)
  • Fix argument validation in TorchAudio filtering (#2609)
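
For reference, a minimal sketch of calling the benchmarked function; the resampling_method value shown is assumed to be the 0.13-era name of the sinc method (it is also the default).

```python
import torch
import torchaudio.functional as F

waveform = torch.randn(2, 16000)  # two channels, one second at 16 kHz, float32

# Upsample 16 kHz -> 44.1 kHz with the sinc interpolation kernel.
resampled = F.resample(
    waveform,
    orig_freq=16000,
    new_freq=44100,
    resampling_method="sinc_interpolation",
)
print(resampled.shape)  # (2, 44100)
```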

Models

  • Switch to flashlight decoder from upstream (#2557)
  • Add dimension and shape check (#2563)
  • Replace assert with raise in models (#2578, #2590)
  • Migrate CTC decoder code (#2580)
  • Enable CTC decoder in Windows (#2587)

Datasets

  • Replace assert with raise in datasets (#2571)
  • Add unit test for LibriMix dataset (#2659)
  • Add gtzan download note (#2763)

Tutorials

  • Tweak tutorials (#2630, #2733)
  • Update ASR inference tutorial (#2631)
  • Update and fix tutorials (#2661, #2701)
  • Introduce IO section to getting started tutorials (#2703)
  • Update HW video processing tutorial (#2739)
  • Update tutorial author information (#2764)
  • Fix typos in tacotron2 tutorial (#2761)
  • Fix fading in hybrid demucs tutorial (#2771)
  • Fix leaking matplotlib figure (#2769)
  • Update resampling tutorial (#2773)

Recipes

  • Use lazy import for joblib (#2498)
  • Revise LibriSpeech Conformer RNN-T recipe (#2535)
  • Fix bug in Conformer RNN-T recipe (#2611)
  • Replace bg_iterator in examples (#2645)
  • Remove obsolete examples (#2655)
  • Fix LibriSpeech Conformer RNN-T eval script (#2666)
  • Replace IValue::toString()->string() with IValue::toStringRef() (#2700)
  • Improve wav2vec2/hubert model for pre-training (#2716)
  • Improve hubert recipe for pre-training and fine-tuning (#2744)

WER improvement on LibriSpeech dev and test sets

| Subset | Viterbi (v0.12) | Viterbi (v0.13) | KenLM (v0.12) | KenLM (v0.13) |
|---|---|---|---|---|
| dev-clean | 10.7 | 10.9 | 4.4 | 4.2 |
| dev-other | 18.3 | 17.5 | 9.7 | 9.4 |
| test-clean | 10.8 | 10.9 | 4.4 | 4.4 |
| test-other | 18.5 | 17.8 | 10.1 | 9.5 |

Documentation

Examples

  • Add example for Vol transform (#2597)
  • Add example for Vad transform (#2598)
  • Add example for SlidingWindowCmn transform (#2600)
  • Add example for MelScale transform (#2616)
  • Add example for AmplitudeToDB transform (#2615)
  • Add example for InverseMelScale transform (#2635)
  • Add example for MFCC transform (#2637)
  • Add example for LFCC transform (#2640)
  • Add example for Loudness transform (#2641)

Other

  • Remove CTC decoder prototype message (#2459)
  • Fix docstring (#2540)
  • Dataset docstring change (#2575)
  • Fix typo - "dimension" (#2596)
  • Add note for lexicon free decoder output (#2603)
  • Fix stylecheck (#2606)
  • Fix dataset docs parsing issue with extra spaces (#2607)
  • Remove outdated doc (#2617)
  • Use double quotes for string in functional and transforms (#2618)
  • Fix doc warning (#2627)
  • Update README.md (#2633)
  • Sphinx-gallery updates (#2629, #2638, #2736, #2678, #2679)
  • Tweak documentation (#2656)
  • Consolidate bibliography / reference (#2676)
  • Tweak badge link URL generation (#2677)
  • Adopt :autosummary: in torchaudio docs (#2664, #2681, #2683, #2684, #2693, #2689, #2690, #2692)
  • Update sox info docstring to account for mp3 frame count handling (#2742)
  • Fix HuBERT docstring (#2746)
  • Fix CTCDecoder doc (#2766)
  • Fix torchaudio.backend doc (#2781)

Build/CI
