torchaudio 0.10.0 Release Note
Highlights
torchaudio 0.10.0 release includes:
- New models (Tacotron2, HuBERT) and datasets (CMUDict, LibriMix)
- Pretrained model support for ASR (Wav2Vec2, HuBERT) and TTS (WaveRNN, Tacotron2)
- New operations (RNN Transducer loss, MVDR beamforming, PitchShift, etc)
- CUDA-enabled binaries
[Beta] Wav2Vec2 / HuBERT Models and Pretrained Weights
HuBERT model architectures (“base”, “large” and “extra large” configurations) are added. In addition, support is added for pretrained weights from wav2vec 2.0, Unsupervised Cross-lingual Representation Learning and HuBERT.
These pretrained weights can be used for feature extraction and downstream task adaptation.
>>> import torchaudio
>>>
>>> # Build the model and load pretrained weight.
>>> model = torchaudio.pipelines.HUBERT_BASE.get_model()
>>> # Perform feature extraction.
>>> features, lengths = model.extract_features(waveforms)
>>> # Pass the features to downstream task
>>> ...
Some of the pretrained weights are fine-tuned for ASR tasks. The following example illustrates how to use these weights and access associated information, such as labels, which can be used in subsequent CTC decoding steps. (Note: torchaudio does not provide a CTC decoding mechanism.)
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.HUBERT_ASR_LARGE
>>>
>>> # Build the model and load pretrained weight.
>>> model = bundle.get_model()
Downloading:
100%|███████████████████████████████| 1.18G/1.18G [00:17<00:00, 73.8MB/s]
>>> # Check the corresponding labels of the output.
>>> labels = bundle.get_labels()
>>> print(labels)
('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')
>>>
>>> # Infer the label probability distribution
>>> waveform, sample_rate = torchaudio.load('hello-world.wav')
>>>
>>> emissions, _ = model(waveform)
>>>
>>> # Pass emission to (hypothetical) decoder
>>> transcripts = ctc_decode(emissions, labels)
>>> print(transcripts[0])
HELLO WORLD
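The ctc_decode call in the example above is hypothetical. As a minimal sketch of what such a decoder might do, the following greedy (best-path) decoder works in plain Python on per-frame score lists. The function name, the blank=0 index and the use of '|' as the word separator are assumptions made for illustration; check them against the label set of the bundle you use.

```python
def greedy_ctc_decode(emission, labels, blank=0, word_sep="|"):
    """Best-path CTC decoding for one utterance.

    emission: sequence of frames, each a sequence of scores (one per label).
    """
    # 1. Pick the highest-scoring label index in every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in emission]
    # 2. Collapse runs of the same index into a single symbol.
    collapsed = [idx for n, idx in enumerate(best) if n == 0 or idx != best[n - 1]]
    # 3. Drop blanks and map the remaining indices to characters.
    text = "".join(labels[idx] for idx in collapsed if idx != blank)
    return text.replace(word_sep, " ").strip()


# Toy input: 4 labels, 6 frames (assumed layout: index 0 is the blank).
toy_labels = ("<blank>", "|", "H", "I")
toy_emission = [
    [0.1, 0.0, 0.8, 0.1],  # H
    [0.1, 0.0, 0.8, 0.1],  # H (repeat, collapsed)
    [0.9, 0.0, 0.1, 0.0],  # blank
    [0.0, 0.1, 0.1, 0.8],  # I
    [0.1, 0.8, 0.0, 0.1],  # | (word boundary)
    [0.9, 0.0, 0.1, 0.0],  # blank
]
decoded = greedy_ctc_decode(toy_emission, toy_labels)
```

With a tensor of shape (batch, frame, label), one utterance could be passed as emissions[0].tolist(). Greedy decoding ignores any language model, so a beam-search decoder would usually give better transcripts.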
[Beta] Tacotron2 and TTS Pipeline
A new model architecture, Tacotron2, is added, alongside several pretrained weights for TTS (text-to-speech). Because these TTS pipelines are composed of multiple models and specific data processing, the notion of a bundle is introduced to make the associated objects easy to use. Bundles provide a common access point for creating a pipeline with a set of pretrained weights. They are available under the torchaudio.pipelines module.
The following example illustrates a TTS pipeline where two models (Tacotron2 and WaveRNN) are used together.
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
>>>
>>> # Build text processor, Tacotron2 and vocoder (WaveRNN) model
>>> processor = bundle.get_text_preprocessor()
>>> tacotron2 = bundle.get_tacotron2()
Downloading:
100%|███████████████████████████████| 107M/107M [00:01<00:00, 87.9MB/s]
>>> vocoder = bundle.get_vocoder()
Downloading:
100%|███████████████████████████████| 16.7M/16.7M [00:00<00:00, 78.1MB/s]
>>>
>>> text = "Hello World!"
>>>
>>> # Encode text
>>> input, lengths = processor(text)
>>>
>>> # Generate (mel-scale) spectrogram
>>> specgram, lengths, _ = tacotron2.infer(input, lengths)
>>>
>>> # Convert spectrogram to waveform
>>> waveforms, lengths = vocoder(specgram, lengths)
>>>
>>> # Save audio
>>> torchaudio.save('hello-world.wav', waveforms, vocoder.sample_rate)
[Beta] RNN Transducer Loss
The loss function used in the RNN transducer architecture, which is widely used for speech recognition tasks, is added. The loss function (torchaudio.functional.rnnt_loss or torchaudio.transforms.RNNTLoss) supports float16 and float32 logits, has autograd and torchscript support, and can be run on both CPU and GPU. The GPU path uses a custom CUDA kernel implementation for improved performance.
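To make concrete what an RNN transducer loss measures, here is a small pure-Python reference for the forward (alpha) recursion over the alignment lattice. It is a sketch for illustration only, not torchaudio's kernel: the names rnnt_nll and logsumexp2 are made up here, and the function takes already-normalized log-probabilities for a single utterance.

```python
import math


def logsumexp2(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_nll(log_probs, targets, blank):
    """Negative log-likelihood of `targets` under an RNN transducer lattice.

    log_probs[t][u][v]: log-probability of symbol v at time step t after
    u target symbols have been emitted; shape (T, U + 1, V).
    """
    T, U = len(log_probs), len(targets)
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive by emitting blank at (t - 1, u)
                alpha[t][u] = logsumexp2(
                    alpha[t][u], alpha[t - 1][u] + log_probs[t - 1][u][blank])
            if u > 0:  # arrive by emitting the u-th target at (t, u - 1)
                alpha[t][u] = logsumexp2(
                    alpha[t][u],
                    alpha[t][u - 1] + log_probs[t][u - 1][targets[u - 1]])
    # Every alignment ends with a final blank from the last lattice node.
    return -(alpha[T - 1][U] + log_probs[T - 1][U][blank])


# Toy check: uniform distribution over V symbols.
T, U, V = 2, 1, 3
uniform = [[[math.log(1.0 / V)] * V for _ in range(U + 1)] for _ in range(T)]
loss = rnnt_nll(uniform, targets=[1], blank=0)
```

In the toy case there are two valid alignments, each made of three emissions with probability 1/3, so the loss equals -log(2/27).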
[Beta] MVDR Beamforming
This release adds support for MVDR beamforming on multi-channel audio using Time-Frequency masks. Three solutions are available (ref_channel, stv_evd, stv_power), and both single-channel and multi-channel masks are supported (multi-channel masks are averaged inside the method). An online option recursively updates the parameters for streaming audio.
Please refer to the MVDR tutorial.
GPU Build
This release adds GPU builds that support custom CUDA kernels in torchaudio, such as the one used for RNN transducer loss. Following this change, torchaudio’s binary distribution now includes CPU-only versions and CUDA-enabled versions. To use the CUDA-enabled binaries, the installed PyTorch must also be a CUDA-enabled build.
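Before moving inputs to the GPU, it can help to confirm that the installed PyTorch actually sees a CUDA device. The following is a minimal sketch; the helper name cuda_ready is made up for this example, and it only relies on torch.cuda.is_available().

```python
import importlib.util


def cuda_ready():
    """Return True only if PyTorch is importable and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed at all
    import torch

    return torch.cuda.is_available()


# Fall back to the CPU when no CUDA-enabled PyTorch is present.
device = "cuda" if cuda_ready() else "cpu"
```

Tensors and modules can then be moved with .to(device), so the same script runs on both CPU-only and CUDA-enabled installations.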
Additional Features
torchaudio.functional.lfilter now supports batch processing and multiple filters. Additional operations, including pitch shift, LFCC, and inverse spectrogram, are now supported in this release. The datasets CMUDict and LibriMix are added as well.
Backward Incompatible Changes
I/O
- Default to PCM_16 for flac on soundfile backend (#1604)
  - When saving FLAC format with the “soundfile” backend, PCM_24 (the previous default) could cause warping. The default has been changed to PCM_16, which does not suffer from this.
Ops
- Default to native complex type when returning raw spectrogram (#1549)
  - When power=None, torchaudio.functional.spectrogram and torchaudio.transforms.Spectrogram now default to return_complex=True, which returns a Tensor of native complex type (such as torch.cfloat and torch.cdouble). To use a pseudo complex type, pass the resulting tensor to torch.view_as_real.
- Remove deprecated kaldi.resample_waveform (#1555)
  - Please use torchaudio.functional.resample.
- Replace waveform with specgram in SlidingWindowCmn (#1859)
  - The argument name was corrected to specgram.
- Ensure integer input frequencies for resample (#1857)
  - Sampling rates were previously cast to integers silently in the resampling implementation. Resampling now requires integer sampling rate inputs to ensure the expected resampling quality.
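For code that computes sampling rates as floats, the integer requirement for resampling can be handled with a small guard before the call. This is a sketch only: the helper name as_integer_rate is hypothetical and is not part of torchaudio.

```python
def as_integer_rate(rate):
    """Return `rate` as an int, refusing values that would lose precision."""
    if int(rate) != rate:
        raise ValueError(f"sample rate must be a whole number, got {rate!r}")
    return int(rate)


orig_freq = as_integer_rate(44100.0)  # whole-valued float is accepted
new_freq = as_integer_rate(16000)
```

The resulting integers can then be passed as the original and target frequencies to torchaudio.functional.resample. Raising on fractional rates makes the precision loss visible instead of truncating it silently.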
Wav2Vec2
- Update extract_features of Wav2Vec2Model (#1776)
  - The previous implementation returned outputs from convolutional feature extractors. To match the behavior of the original fairseq implementation, the method was changed to return the outputs of the intermediate transformer layers. To achieve the original behavior, please use Wav2Vec2Model.feature_extractor().
- Move fine-tune specific module out of wav2vec2 encoder (#1782)
  - The internal structure of Wav2Vec2Model was updated. The Wav2Vec2Model.encoder.read_out module is moved to Wav2Vec2Model.aux. If you have a serialized state dict, please replace the key encoder.read_out with aux.
- Updated wav2vec2 factory functions for more customizability (#1783, #1804, #1830)
  - The signatures of the wav2vec2 factory functions have changed. The num_out parameter has been renamed to aux_num_out, and other parameters are added before it. Please update the code from wav2vec2_base(num_out) to wav2vec2_base(aux_num_out=num_out).
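The state dict migration for the encoder.read_out to aux move can be sketched as a key rename. The function name migrate_wav2vec2_keys and the example keys are illustrative assumptions; inspect the keys of your own checkpoint before relying on the prefixes used here.

```python
def migrate_wav2vec2_keys(state_dict):
    """Return a copy of `state_dict` with `encoder.read_out.*` keys renamed to `aux.*`."""
    old_prefix, new_prefix = "encoder.read_out.", "aux."
    migrated = {}
    for key, value in state_dict.items():
        if key.startswith(old_prefix):
            key = new_prefix + key[len(old_prefix):]
        migrated[key] = value
    return migrated


# Stand-in values; a real state dict maps these keys to tensors.
old_state = {
    "encoder.read_out.weight": "W",
    "encoder.read_out.bias": "b",
    "encoder.transformer.layer_norm.weight": "g",  # illustrative key, left untouched
}
new_state = migrate_wav2vec2_keys(old_state)
```

The function builds a new dictionary rather than mutating its input, so the original checkpoint stays available if the migrated one fails to load.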
Deprecations
- Add melscale_fbanks and deprecate create_fb_matrix (#1653)
  - As linear_fbanks is introduced, create_fb_matrix is renamed to melscale_fbanks. The original create_fb_matrix is now deprecated. Please use melscale_fbanks.
- Deprecate VCTK dataset (#1810)
  - This dataset has been taken down and is no longer available. Please use the VCTK_092 dataset.
- Deprecate data utils (#1809)
  - bg_iterator and diskcache_iterator are known to not improve the throughput of data loaders. Please cease their usage.
New Features
Models
Tacotron2
- Add Tacotron2 model (#1621, #1647, #1844)
- Add Tacotron2 loss function (#1764)
- Add Tacotron2 inference method (#1648, #1839, #1849)
- Add phoneme text preprocessing for Tacotron2 (#1668)
- Move Tacotron2 out of prototype (#1714)
HuBERT
Pretrained Weights and Pipelines
- Add pretrained weights for wavernn (#1612)
- Add Tacotron2 pretrained models (#1693)
- Add HUBERT pretrained weights (#1821, #1824)
- Add pretrained weights from wav2vec2.0 and XLSR papers (#1827)
- Add customization support to wav2vec2 labels (#1834)
- Default pretrained weights to eval mode (#1843)
- Move wav2vec2 pretrained models to pipelines module (#1876)
- Add TTS bundle/pipelines (#1872)
- Fix vocoder interface (#1895)
- Fix Phonemizer download (#1897)
RNN Transducer Loss
- Add reduction parameter for RNNT loss (#1590)
- Rename RNNT loss C++ parameters (#1602)
- Rename transducer to RNNT (#1603)
- Remove gradient variable from RNNT loss Python code (#1616)
- Remove reuse_logits_for_grads option for RNNT loss (#1610)
- Remove fused_log_softmax option from RNNT loss (#1615)
- RNNT loss resolve null gradient (#1707)
- Move RNNT loss out of prototype (#1711)
MVDR Beamforming
- Add MVDR module to example (#1709)
- Add normalization to steering vector solutions in MVDR Module (#1765)
- Move MVDR and PSD modules to transforms (#1771)
- Add MVDR beamforming tutorial to example directory (#1768)
Ops
- Add edit_distance (#1601)
- Add PitchShift to functional and transform (#1629)
- Add LFCC feature to transforms (#1611)
- Add InverseSpectrogram to transforms and functional (#1652)
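For readers unfamiliar with the metric behind edit_distance, the following pure-Python sketch computes the Levenshtein distance between two sequences and derives a word error rate from it. It illustrates the general technique and is not torchaudio's implementation; the variable names are made up for this example.

```python
def edit_distance(seq1, seq2):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of seq1 and seq2[:j].
    prev = list(range(len(seq2) + 1))
    for i, a in enumerate(seq1, start=1):
        curr = [i]
        for j, b in enumerate(seq2, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


reference = "hello world again".split()
hypothesis = "hello word".split()
errors = edit_distance(reference, hypothesis)
word_error_rate = errors / len(reference)
```

Here one substitution and one deletion separate the hypothesis from the reference, so the word error rate is 2/3.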
Datasets
Improvements
I/O
- Make buffer size for function info configurable (#1634)
Ops
- Replace deprecated AutoNonVariableTypeMode (#1583)
- Remove lazy behavior from MelScale (#1636)
- Simplify axis value checks (#1501)
- Use at::parallel_for in lfilter core loop (#1557)
- Add filterbanks support to lfilter (#1587)
- Add batch support to lfilter (#1638)
- Use integer rates in pitch shift resample (#1861)
Models
- Rename infer method to forward for WaveRNNInferenceWrapper (#1650)
- Refactor WaveRNN infer and move it to the codebase (#1704)
- Make the core wav2vec2 factory function public (#1829)
- Refactor WaveRNNInferenceWrapper (#1845)
- Store n_bits in WaveRNN (#1847)
- Replace custom padding with torch’s native impl (#1846)
- Avoid concatenation in loop (#1850)
- Add lengths param to WaveRNN.infer (#1851)
- Add sample rate to wav2vec2 bundle (#1878)
- Remove factory functions of Tacotron2 and WaveRNN (#1874)
Datasets
- Fix encoding of CMUDict data reading (#1665)
- Rename utterance to transcript in datasets (#1841)
- Clean up constructor of CMUDict (#1852)
Performance
- Refactor transforms.Fade on GPU computation (#1871)
CUDA
| Version / Tensor shape | [1,4,8000] | [1,4,16000] | [1,4,32000] |
|---|---|---|---|
| 0.10 | 119 | 120 | 123 |
| 0.9 | 160 | 184 | 240 |
Unit: msec
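As a quick worked example, the speedup implied by the timings in the table above can be computed as old time divided by new time. The variable names are illustrative.

```python
# Timings in milliseconds, copied from the table above.
shapes = ["[1,4,8000]", "[1,4,16000]", "[1,4,32000]"]
msec_0_10 = [119, 120, 123]
msec_0_9 = [160, 184, 240]

# Speedup = old time / new time, per tensor shape.
speedups = {
    shape: round(old / new, 2)
    for shape, old, new in zip(shapes, msec_0_9, msec_0_10)
}
```

This gives roughly 1.34x, 1.53x and 1.95x for the three shapes, so the gain grows with the input length.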
Examples
- Add text preprocessing utilities for TTS pipeline (#1639)
- Replace simple_ctc with Python greedy decoder (#1558)
- Add an inference example for WaveRNN (#1637)
- Refactor coding style for WaveRNN example (#1663)
- Add style checks on example files on CI (#1667)
- Add Tacotron2 training script (#1642)
- Add an inference example for Tacotron2 (#1654)
- Fix Tacotron2 inference example (#1716)
- Fix WaveRNN training example (#1740)
- Training recipe for ConvTasNet on Libri2Mix dataset (#1757)
Build
- Update skipIfNoCuda decorator and force GPU tests in GPU CIs (#1559)
- Temporarily pin nightly version on Linux/macOS CPU unittest (#1598)
- Temporarily pin nightly version on Linux GPU unittest (#1606)
- Revert CI hot fix (#1614)
- Expose USE_CUDA in build (#1609)
- Pin MKL to 2021.2.0 (#1655)
- Simplify extension initialization (#1649)
- Synchronize extension initialization mechanism with fbcode (#1682)
- Ensure we’re propagating BUILD_VERSION (#1697)
- Guard Kaldi’s version generation (#1715)
- Update sphinx to 3.5.4 (#1685)
- Default to BUILD_SOX=1 in non-Windows systems (#1725)
- Add CUDA install step to Win Packaging jobs (#1732)
- setup.py should parse TORCH_CUDA_ARCH_LIST (#1733)
- Simplify the extension initialization process (#1734)
- Fix CUDA build logic for _torchaudio.so (#1737)
- Enable Linux wheel/conda GPU package builds (#1730)
- Increase no_output_timeout to 20m for WinConda (#1738)
- Build torchaudio for 11.3 as well (#1747)
- Upload wheels to respective folders (#1751)
- Extract PyBind11 feature implementations (#1739)
- Update the way to access libsox global config (#1755)
- Fix ROCM build error (#1729)
- Fix compile warnings (#1762)
- Migrate CircleCI docker image (#1767)
- Split extension into custom impl and Python wrapper libraries (#1752)
- Put libtorchaudio in lib directory (#1773)
- Update win gpu image from previous to stable (#1786)
- Set libtorch audio suffix as pyd on Windows (#1788)
- Fix build on Windows with CUDA (#1787)
- Enable audio windows cuda tests (#1777)
- Set release and base PyTorch version (#1816)
- Exclude prototype if it is in release (#1870)
- Log prototype exclusion (#1882)
- Update prototype exclusion (#1885)
- Remove alpha from version number (#1901)
Testing
- Migrate resample tests from kaldi to functional (#1520)
- Add autograd gradcheck test for RNN transducer loss (#1532)
- Fix HF wav2vec2 test (#1585)
- Update unit test CUDA to 10.2 (#1605)
- Fix CircleCI unittest environment
- Remove skipIfRocm from test_fileobj_flac in soundfile.save_test (#1626)
- MFCC test refactor (#1618)
- Refactor RNNT Loss Unit Tests (#1630)
- Reduce sample rate to avoid test time out (#1640)
- Refactor text preprocessing tests in Tacotron2 example (#1635)
- Move test initialization logic to dedicated directory (#1680)
- Update pitch shift batch consistency test (#1700)
- Refactor scripting in test (#1727)
- Update the version of fairseq used for testing (#1745)
- Put output tensor on proper device in get_whitenoise (#1744)
- Refactor batch consistency test in transforms (#1772)
- Tweak test name by appending factory function name (#1780)
- Enable audio windows cuda tests (#1777)
- Skip hubert_asr_xlarge TS test on Windows (#1800)
- Skip hubert_xlarge TS test on Windows (#1807)
Others
- Remove unused files (#1588)
- Remove residuals for removed modules (#1599)
- Remove torchscript bc test references (#1623)
- Remove torchaudio._internal.fft module (#1631)
Misc
- Rename master branch to main (#1649)
- Fix Python spacing (#1670)
- Lint fix (#1726)
- Add .gitattributes (#1731)
- Style fixes (#1766)
- Update reference from master to main elsewhere (#1784)
Bug Fixes
Documentation
- README Updates
  - Update README (#1544)
  - Remove NumPy dependency from README (#1582)
  - Fix typos and sentence structure in README.md (#1633)
  - Update and move convention section to CONTRIBUTING.md (#1635)
  - Remove unnecessary README (#1728)
  - Add link to TTS colab example to README (#1748)
  - Fix typo in source separation README (#1774)
- Docstring Changes
  - Set removal version of pseudo complex support (#1553)
  - Update docs (#1584)
  - Add return type in doc for RNNT loss (#1591)
  - Improve RNNT loss docstrings (#1642)
  - Add documentation for CMUDict’s property (#1683)
  - Refactor lfilter docs (#1698)
  - Standardize optional types in docstrings (#1746)
  - Fix return type of wav2vec2 model (#1790)
  - Add equations to MVDR docstring (#1789)
  - Standardize tensor shapes format in docs (#1838)
  - Add license to pre-trained model doc (#1836)
  - Update Tacotron2 docs (#1840)
  - Fix PitchShift docstring (#1866)
  - Update descriptions of lengths parameters (#1890)
  - Standardization and minor fixes (#1892)
  - Update models/pipelines doc (#1894)
- Docs formatting
  - Remove override CSS (#1554)
  - Add prototype.tacotron2 page to docs (#1695)
  - Add doc for InverseSpectrogram (#1706)
  - Add sections to transforms docs (#1720)
  - Add edit_distance to documentation with a new category Metric (#1743)
  - Fix model subsections (#1775)
  - List all the pre-trained models on right bar (#1828)
  - Put pretrained weights to subsection (#1879)
- Examples (see #1564)
  - Add example code for Resample (#1644)
  - Fix examples in transforms (#1646)
  - Add example for ComplexNorm (#1658)
  - Add example for MuLawEncoding (#1586)
  - Add example for Spectrogram (#1566)
  - Add example for GriffinLim (#1671)
  - Add example for MuLawDecoding (#1684)
  - Add example for Fade transform (#1719)
  - Update RNNT loss docs and add example (#1835)
  - Add SpecAugment figure/citation (#1887)
  - Add filter bank figures (#1891)
- Add filter bank figures (#1891)