Highlights
TorchAudio 0.13.0 release includes:
- Source separation models and pre-trained bundles (Hybrid Demucs, ConvTasNet)
- New datasets and metadata mode for the SUPERB benchmark
- Custom language model support for CTC beam search decoding
- StreamWriter for audio and video encoding
[Beta] Source Separation Models and Bundles
Hybrid Demucs is a music source separation model that uses both spectrogram and time domain features. It has demonstrated state-of-the-art performance in the Sony Music DeMixing Challenge. (citation: https://arxiv.org/abs/2111.03600)
The TorchAudio v0.13 release includes the following features:
- MUSDB_HQ Dataset, which is used in Hybrid Demucs training (docs)
- Hybrid Demucs model architecture (docs)
- Three factory functions suitable for different sample rate ranges
- Pre-trained pipelines (docs) and tutorial
SDR Results of pre-trained pipelines on MUSDB-HQ test set
Pipeline | All | Drums | Bass | Other | Vocals |
---|---|---|---|---|---|
HDEMUCS_HIGH_MUSDB* | 6.42 | 7.76 | 6.51 | 4.47 | 6.93 |
HDEMUCS_HIGH_MUSDB_PLUS** | 9.37 | 11.38 | 10.53 | 7.24 | 8.32 |
* Trained on the training data of MUSDB-HQ dataset.
** Trained on both training and test sets of MUSDB-HQ and 150 extra songs from an internal database that were specifically produced for Meta.
Special thanks to @adefossez for the guidance.
The ConvTasNet model architecture was added in TorchAudio 0.7.0. It is the first source separation model that outperforms the oracle ideal ratio mask. This release adds a pre-trained pipeline that is trained within TorchAudio on the Libri2Mix dataset. The pipeline achieves 15.6 dB SDR improvement and 15.3 dB Si-SNR improvement on the Libri2Mix test set.
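For readers unfamiliar with the Si-SNR metric quoted above, the following is a minimal, library-free sketch of the commonly used definition (zero-mean signals, projection of the estimate onto the reference). It is not TorchAudio's evaluation code, and the function name si_snr is an illustrative choice. The "improvement" figure is usually taken as the Si-SNR of the separated output minus the Si-SNR of the unprocessed mixture, both measured against the same reference.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)
    # Remove the mean so that a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to obtain the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Whatever is left after removing the target component counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)


reference = [1.0, -1.0, 1.0, -1.0]
error = [0.1, 0.1, -0.1, -0.1]  # orthogonal to the reference
estimate = [r + e for r, e in zip(reference, error)]
score = si_snr(estimate, reference)
scaled_score = si_snr([3.0 * x for x in estimate], reference)
```

Rescaling the estimate leaves the score unchanged, which is the property that distinguishes Si-SNR from a plain SNR.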
[Beta] Datasets and Metadata Mode for SUPERB Benchmarks
With the addition of four new audio-related datasets, there is now support for all downstream tasks in version 1 of the SUPERB benchmark. Furthermore, these datasets support metadata mode through a get_metadata function, which enables faster dataset iteration or preprocessing without the need to load or store waveforms.
Datasets with metadata functionality:
- LIBRISPEECH (docs)
- LibriMix (docs)
- QUESST14 (docs)
- SPEECHCOMMANDS (docs)
- (new) FluentSpeechCommands (docs)
- (new) Snips (docs)
- (new) IEMOCAP (docs)
- (new) VoxCeleb1 (Identification, Verification)
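The idea behind metadata mode can be illustrated with a small, self-contained sketch. ToyDataset below is a hypothetical class written for this example, not a TorchAudio dataset, and the fields it returns are assumptions chosen for illustration; the actual return values of get_metadata differ per dataset and are listed in each dataset's documentation. The point is only that the metadata path never reads the audio samples.

```python
import os
import tempfile
import wave


class ToyDataset:
    """Hypothetical dataset (not a torchaudio class): one WAV file and one label per item."""

    def __init__(self, root, items):
        self.root = root
        self.items = items  # list of (relative_path, label)

    def get_metadata(self, n):
        # Cheap path: read only the WAV header, never the samples.
        relpath, label = self.items[n]
        with wave.open(os.path.join(self.root, relpath), "rb") as f:
            sample_rate = f.getframerate()
        return relpath, sample_rate, label

    def __getitem__(self, n):
        # Expensive path: same metadata, plus the raw audio bytes.
        relpath, sample_rate, label = self.get_metadata(n)
        with wave.open(os.path.join(self.root, relpath), "rb") as f:
            frames = f.readframes(f.getnframes())
        return frames, sample_rate, label

    def __len__(self):
        return len(self.items)


def write_silence(path, sample_rate=16000, num_frames=160):
    """Write a short mono 16-bit WAV file of silence for the demo."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sample_rate)
        f.writeframes(b"\x00\x00" * num_frames)


with tempfile.TemporaryDirectory() as root:
    write_silence(os.path.join(root, "a.wav"))
    dataset = ToyDataset(root, [("a.wav", "yes")])
    metadata = dataset.get_metadata(0)
    frames, rate, label = dataset[0]
```

Preprocessing steps that only need paths and labels (building manifests, filtering by speaker, computing splits) can iterate over get_metadata and skip decoding entirely.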
[Beta] Custom Language Model support in CTC Beam Search Decoding
In release 0.12, TorchAudio released a CTC beam search decoder with KenLM language model support. This release adds the ability to create custom Python language models that are compatible with the decoder, using the torchaudio.models.decoder.CTCDecoderLM wrapper.
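The sketch below shows the general shape of such a language model without importing TorchAudio. It assumes the wrapper asks subclasses for three methods, start, score, and finish, and that states form a tree keyed by token index; ToyLMState and UnigramLM are hypothetical stand-ins written for this example, so the CTCDecoderLM documentation remains the reference for exact signatures and state types.

```python
import math


class ToyLMState:
    """Hypothetical stand-in for the decoder's LM state: one node per token prefix."""

    def __init__(self):
        self._children = {}

    def child(self, token_index):
        # Reuse the same child node when the same token is scored twice.
        if token_index not in self._children:
            self._children[token_index] = ToyLMState()
        return self._children[token_index]


class UnigramLM:
    """Scores each token independently of its history (illustration only)."""

    def __init__(self, probabilities, sil_index):
        self.log_probs = [math.log(p) for p in probabilities]
        self.sil_index = sil_index

    def start(self, start_with_nothing=False):
        # Return the root state from which every hypothesis is extended.
        return ToyLMState()

    def score(self, state, token_index):
        # Return the next state and the log-probability of the new token.
        return state.child(token_index), self.log_probs[token_index]

    def finish(self, state):
        # Close a hypothesis by scoring the end-of-sequence (silence) token.
        return self.score(state, self.sil_index)


lm = UnigramLM([0.5, 0.25, 0.25], sil_index=0)
root = lm.start()
state, token_score = lm.score(root, 1)
_, end_score = lm.finish(state)
```

A real custom language model would replace the unigram lookup with a call into a neural or n-gram model and would typically cache per-state outputs, since beam search revisits the same prefixes many times.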
[Beta] StreamWriter
torchaudio.io.StreamWriter is a class for encoding media, including audio and video. It can handle a wide variety of codecs, chunk-by-chunk encoding, and GPU encoding.
Backward-incompatible changes
- [BC-breaking] Fix momentum in transforms.GriffinLim (#2568)
The GriffinLim implementations in transforms and functional used the momentum parameter differently, resulting in inconsistent results between the two implementations. The transforms.GriffinLim usage of momentum is updated to resolve this discrepancy.
- Make torchaudio.info decode audio to compute num_frames if it is not found in metadata (#2740).
In such cases, torchaudio.info may now return non-zero values for num_frames.
Bug Fixes
- Fix random Gaussian generation (#2639)
torchaudio.compliance.kaldi.fbank with the dither option produced a different output from Kaldi because it used a skewed, rather than Gaussian, distribution for dither. This release updates it to correctly use a random Gaussian instead.
- Update download link for speech commands (#2777)
The previous download link for SpeechCommands v2 did not include data for the valid and test sets, resulting in errors when trying to use those subsets. The download link is updated to correctly download the whole dataset.
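To make the dither fix above concrete, here is a minimal sketch of Gaussian dithering, assuming the usual formulation in which each sample receives independent zero-mean, unit-variance Gaussian noise scaled by the dither constant. The helper add_gaussian_dither is a hypothetical name for this example and is not the code used inside torchaudio.compliance.kaldi.fbank.

```python
import random


def add_gaussian_dither(samples, dither, rng=None):
    """Return a copy of samples with Gaussian noise of scale `dither` added."""
    if rng is None:
        rng = random.Random()
    # Each sample gets its own draw from a standard normal distribution.
    return [x + rng.gauss(0.0, 1.0) * dither for x in samples]


rng = random.Random(0)
silence = [0.0] * 10000
dithered = add_gaussian_dither(silence, dither=1.0, rng=rng)
mean = sum(dithered) / len(dithered)
```

Because the noise is symmetric around zero, the mean of the dithered signal stays close to the mean of the input, which a skewed distribution would not guarantee.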
New Features
IO
- Add metadata to source stream info (#2461, #2464)
- Add utility function to fetch FFmpeg library versions (#2467)
- Add YUV444P support to StreamReader (#2516)
- Add StreamWriter (#2628, #2648, #2505)
- Support in-memory decoding via Tensor wrapper in StreamReader (#2694)
- Add StreamReader Tensor Binding to src (#2699)
- Add StreamWriter media device/streaming tutorial (#2708)
- Add StreamWriter tutorial (#2698)
Ops
- Add ITU-R BS.1770-4 loudness recommendation (#2472)
- Add convolution operator (#2602)
- Add additive noise function (#2608)
Models
- Hybrid Demucs model implementation (#2506)
- Docstring change for Hybrid Demucs (#2542, #2570)
- Add NNLM support to CTC Decoder (#2528, #2658)
- Move hybrid demucs model out of prototype (#2668)
- Move conv_tasnet_base doc out of prototype (#2675)
- Add custom lm example to decoder tutorial (#2762)
Pipelines
- Add SourceSeparationBundle to prototype (#2440, #2559)
- Adding pipeline changes, factory functions to HDemucs (#2547, #2565)
- Create tutorial for HDemucs (#2572)
- Add HDEMUCS_HIGH_MUSDB (#2601)
- Move SourceSeparationBundle and pre-trained ConvTasNet pipeline into Beta (#2669)
- Move Hybrid Demucs pipeline to beta (#2673)
- Update description of HDemucs pipelines
Datasets
- Add fluent speech commands (#2480, #2510)
- Add musdb dataset and tests (#2484)
- Add VoxCeleb1 dataset (#2349)
- Add metadata function for LibriSpeech (#2653)
- Add Speech Commands metadata function (#2687)
- Add metadata mode for various datasets (#2697)
- Add IEMOCAP dataset (#2732)
- Add Snips Dataset (#2738)
- Add metadata for Librimix (#2751)
- Add file name to returned item in Snips dataset (#2775)
- Update IEMOCAP variants and labels (#2778)
Improvements
IO
- Replace runtime_error exception with TORCH_CHECK (#2550, #2551, #2592)
- Refactor StreamReader (#2507, #2508, #2512, #2530, #2531, #2533, #2534)
- Refactor sox C++ (#2636, #2663)
- Delay the import of kaldi_io (#2573)
Ops
- Speed up resample with kernel generation modification (#2553, #2561)
The kernel generation for resampling is optimized in this release. The following table illustrates the performance improvements from the previous release for the torchaudio.functional.resample function using the sinc resampling method, on a float32 tensor with two channels and one second duration.
CPU
torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
---|---|---|---|---|
0.13 | 0.256 | 0.549 | 0.769 | 0.820 |
0.12 | 0.386 | 0.534 | 31.8 | 12.1 |
CUDA
torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
---|---|---|---|---|
0.13 | 0.332 | 0.336 | 0.345 | 0.381 |
0.12 | 0.524 | 0.334 | 64.4 | 22.8 |
- Add normalization parameter on spectrogram and inverse spectrogram (#2554)
- Replace assert with raise for ops (#2579, #2599)
- Replace CHECK_ by TORCH_CHECK_ (#2582)
- Fix argument validation in TorchAudio filtering (#2609)
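In the resampling tables above, the two conversions that improved most are the ones between 16 kHz and 44.1 kHz. One plausible reading, which is an assumption and not something these notes state, is that a sinc resampler needs one kernel phase per step of the reduced rate ratio, so ratios that do not reduce to small integers make kernel generation expensive. The sketch below only computes those reduced ratios; reduced_ratio is an illustrative helper, not a TorchAudio function.

```python
from math import gcd


def reduced_ratio(orig_freq, new_freq):
    """Divide both sample rates by their greatest common divisor."""
    g = gcd(orig_freq, new_freq)
    return orig_freq // g, new_freq // g


# The four conversions benchmarked in the tables.
conversions = [(8000, 16000), (16000, 8000), (16000, 44100), (44100, 16000)]
ratios = {pair: reduced_ratio(*pair) for pair in conversions}

for (orig_freq, new_freq), (p, q) in ratios.items():
    print(f"{orig_freq} -> {new_freq}: reduced ratio {p}:{q}")
```

The 8 kHz and 16 kHz pairs reduce to 1:2 and 2:1, while the 44.1 kHz pairs reduce only to 160:441 and 441:160, which matches the columns where 0.12 was slowest.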
Models
- Switch to flashlight decoder from upstream (#2557)
- Add dimension and shape check (#2563)
- Replace assert with raise in models (#2578, #2590)
- Migrate CTC decoder code (#2580)
- Enable CTC decoder in Windows (#2587)
Datasets
- Replace assert with raise in datasets (#2571)
- Add unit test for LibriMix dataset (#2659)
- Add gtzan download note (#2763)
Tutorials
- Tweak tutorials (#2630, #2733)
- Update ASR inference tutorial (#2631)
- Update and fix tutorials (#2661, #2701)
- Introduce IO section to getting started tutorials (#2703)
- Update HW video processing tutorial (#2739)
- Update tutorial author information (#2764)
- Fix typos in tacotron2 tutorial (#2761)
- Fix fading in hybrid demucs tutorial (#2771)
- Fix leaking matplotlib figure (#2769)
- Update resampling tutorial (#2773)
Recipes
- Use lazy import for joblib (#2498)
- Revise LibriSpeech Conformer RNN-T recipe (#2535)
- Fix bug in Conformer RNN-T recipe (#2611)
- Replace bg_iterator in examples (#2645)
- Remove obsolete examples (#2655)
- Fix LibriSpeech Conformer RNN-T eval script (#2666)
- Replace IValue::toString()->string() with IValue::toStringRef() (#2700)
- Improve wav2vec2/hubert model for pre-training (#2716)
- Improve hubert recipe for pre-training and fine-tuning (#2744)
WER improvement on LibriSpeech dev and test sets
Subset | Viterbi (v0.12) | Viterbi (v0.13) | KenLM (v0.12) | KenLM (v0.13) |
---|---|---|---|---|
dev-clean | 10.7 | 10.9 | 4.4 | 4.2 |
dev-other | 18.3 | 17.5 | 9.7 | 9.4 |
test-clean | 10.8 | 10.9 | 4.4 | 4.4 |
test-other | 18.5 | 17.8 | 10.1 | 9.5 |
Documentation
Examples
- Add example for Vol transform (#2597)
- Add example for Vad transform (#2598)
- Add example for SlidingWindowCmn transform (#2600)
- Add example for MelScale transform (#2616)
- Add example for AmplitudeToDB transform (#2615)
- Add example for InverseMelScale transform (#2635)
- Add example for MFCC transform (#2637)
- Add example for LFCC transform (#2640)
- Add example for Loudness transform (#2641)
Other
- Remove CTC decoder prototype message (#2459)
- Fix docstring (#2540)
- Dataset docstring change (#2575)
- Fix typo - "dimension" (#2596)
- Add note for lexicon free decoder output (#2603)
- Fix stylecheck (#2606)
- Fix dataset docs parsing issue with extra spaces (#2607)
- Remove outdated doc (#2617)
- Use double quotes for string in functional and transforms (#2618)
- Fix doc warning (#2627)
- Update README.md (#2633)
- Sphinx-gallery updates (#2629, #2638, #2736, #2678, #2679)
- Tweak documentation (#2656)
- Consolidate bibliography / reference (#2676)
- Tweak badge link URL generation (#2677)
- Adopt :autosummary: in torchaudio docs (#2664, #2681, #2683, #2684, #2693, #2689, #2690, #2692)
- Update sox info docstring to account for mp3 frame count handling (#2742)
- Fix HuBERT docstring (#2746)
- Fix CTCDecoder doc (#2766)
- Fix torchaudio.backend doc (#2781)
Build/CI
- Simplify the requirements to minimum runtime dependencies (#2313)
- Bump version to 0.13 (#2460)
- Add tagged builds to torchaudio (#2471)
- Update config.guess to the latest (#2479)
- Pin MKL to 2020.04 (#2486)
- Integration test fix deleting temporary directory (#2569)
- Refactor cmake (#2585)
- Introducing pytorch-cuda metapackage (#2612)
- Move xcode to 14 from 12.5 (#2622)
- Update nightly wheels to ROCm5.2 (#2672)
- Lint updates (#2389, #2487)
- M1 build updates (#2473, #2474, #2496, #2674)
- CUDA-related updates: versions, builds, and checks (#2501, #2623, #2670, #2707, #2710, #2721, #2724)
- Release-related updates (#2489, #2492, #2495, #2759)
- Fix Anaconda upload (#2581, #2621)
- Fix windows python 3.8 loading path (#2735, #2747)