pytorch/text v0.13.0


Highlights

In this release, we enriched our library with additional datasets and tokenizers while making improvements to our existing build system, documentation, and components.

  • Added datasets for all 9 GLUE benchmark tasks (#1710): CoLA, MRPC, QQP, STS-B, SST-2, MNLI, QNLI, RTE, WNLI
  • Added support for BERTTokenizer
  • Created native C++ binaries using a CMake-based build system (#1644)

Datasets

We increased the number of datasets in TorchText from 22 to 30 by adding the remaining 8 datasets from the GLUE benchmark (SST-2 was already supported). The complete list of GLUE datasets is as follows:

  • CoLA (paper): Single sentence binary classification acceptability task
  • SST-2 (paper): Single sentence binary classification sentiment task
  • MRPC (paper): Dual sentence binary classification paraphrase task
  • QQP: Dual sentence binary classification paraphrase task
  • STS-B (paper): Dual sentence regression task scoring sentence similarity with a float value
  • MNLI (paper): Dual sentence ternary classification NLI task
  • QNLI (paper): Dual sentence binary classification QA/NLI task
  • RTE (paper): Dual sentence binary classification NLI task
  • WNLI (paper): Dual sentence binary classification coreference and NLI tasks

The datasets supported by TorchText use datapipes from the TorchData project, which is still in Beta status. This means that the datapipes API is subject to change without deprecation cycles. In particular, we expect a lot of the current idioms to change with the eventual release of DataLoaderV2 from torchdata. For more details, refer to https://pytorch.org/text/stable/datasets.html
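
A minimal sketch of loading one of these datasets (assuming torchtext 0.13 with torchdata installed; SST-2 yields raw (text, label) pairs per the datasets documentation):

```python
# Sketch: loading a GLUE dataset as a datapipe.
from torch.utils.data import DataLoader
from torchtext.datasets import SST2

train_datapipe = SST2(split="train")

# Each example is a raw (text, label) pair.
text, label = next(iter(train_datapipe))

# Datapipes compose with the standard DataLoader; shuffling and sharding
# datapipes were added to the datasets in this release (#1729).
train_loader = DataLoader(train_datapipe, batch_size=8, shuffle=True)
```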

Tokenizers

TorchText has extended support for TorchScriptable tokenizers by adding the WordPiece tokenizer used in BERT. It is one of the most commonly used algorithms for splitting input text into sub-word units, and was introduced in Japanese and Korean Voice Search (Schuster et al., 2012).

TorchScriptability support allows users to embed BERT text pre-processing natively in C++ without needing a Python runtime. Since TorchText now supports a CMake build system that natively links TorchText binaries with application code, users can easily integrate BERT tokenizers for deployment needs.
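
As a short sketch of the documented usage (the vocabulary URL below is the standard bert-base-uncased vocab file):

```python
# Sketch: WordPiece tokenization with the scriptable BERTTokenizer.
import torch
from torchtext.transforms import BERTTokenizer

VOCAB_FILE = "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt"
tokenizer = BERTTokenizer(vocab_path=VOCAB_FILE, do_lower_case=True, return_tokens=True)

tokenizer("Hello World, How are you!")      # single sentence input
tokenizer(["Hello world", "How are you!"])  # batch input (#1745)

# Scripting removes the Python dependency: the serialized module can be
# loaded and executed from C++ for deployment.
scripted_tokenizer = torch.jit.script(tokenizer)
scripted_tokenizer.save("bert_tokenizer.pt")
```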

For usage details, please refer to the corresponding documentation.

CMake Build System

TorchText has migrated its build system for C++ extensions and third-party libraries from PyTorch's CppExtension module to CMake. This allows end users to integrate TorchText C++ binaries into their applications without a dependency on libpython, and therefore to use TorchText operators in non-Python environments.

Refer to GitHub issue #1644 for more details.

Backward Incompatible Changes

The RobertaModelBundle class, introduced in the 0.12 release, which provides pre-trained RoBERTa/XLM-R models and can build custom models with a similar architecture, has been renamed to RobertaBundle (#1653).
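
Only the name changed; usage stays the same. A minimal sketch with one of the pre-trained bundles, which are instances of RobertaBundle:

```python
# Sketch: encoding a batch with a pre-trained RobertaBundle (XLM-R base).
import torchtext.functional as F
from torchtext.models import XLMR_BASE_ENCODER

model = XLMR_BASE_ENCODER.get_model()
transform = XLMR_BASE_ENCODER.transform()

input_batch = ["Hello world", "How are you!"]
model_input = F.to_tensor(transform(input_batch), padding_value=1)
output = model(model_input)  # contextual embeddings for each token
```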

The default caching location (cache_dir) has been changed from os.path.expanduser("~/.torchtext/cache") to os.path.expanduser("~/.cache/torch/text"). Furthermore, the default root directory for datasets is now cache_dir/datasets (#1740). Users can control the default cache location via the TORCH_HOME environment variable (#1741).
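
A sketch of the new behavior (the override path below is hypothetical):

```python
# Sketch: controlling the cache location via TORCH_HOME (#1741).
import os

# Hypothetical override; must be set before datasets are instantiated.
# With TORCH_HOME unset, the cache root resolves to ~/.cache/torch/text.
os.environ["TORCH_HOME"] = "/data/torch_cache"

from torchtext.datasets import SST2
train_datapipe = SST2(split="train")  # data lands under cache_dir/datasets
```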

New Features

Models

  • [fbsync] BetterTransformer support for TorchText (#1690) (#1694)
  • [fbsync] Killed to_better by having native load_from_state_dict and init (#1695)
  • [fbsync] Removed unneeded modules after using nn.Module for BetterTransformer (#1696)
  • [fbsync] Replaced TransformerEncoder in TorchText with better transformer (#1703)

Transforms, Tokenizers, Ops

  • Added pad transform and string-to-int transform (#1683); see the sketch after this list
  • Added support for Scriptable BERT tokenizer (#1707)
  • Added support for batch input in BERT Tokenizer with perf benchmark (#1745)
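
A minimal sketch of the new transforms, assuming the constructor signatures from the 0.13 documentation:

```python
# Sketch: PadTransform and StrToIntTransform (#1683).
import torch
from torchtext.transforms import PadTransform, StrToIntTransform

# Pad a tensor of token ids out to a fixed length.
pad = PadTransform(max_length=6, pad_value=0)
print(pad(torch.tensor([101, 2054, 2003, 102])))
# tensor([ 101, 2054, 2003,  102,    0,    0])

# Convert string tokens to integers, for a sequence or a batch.
to_int = StrToIntTransform()
print(to_int(["1", "2", "3"]))      # [1, 2, 3]
print(to_int([["1", "2"], ["3"]]))  # [[1, 2], [3]]
```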

Datasets

Support for the GLUE benchmark's datasets was added (#1710); see the Datasets section above for the complete list.

Others

  • Prepared datasets for new encoding kwarg. (#1616)
  • Added shuffle and sharding datapipes to datasets (#1729)
  • For Datasets, refactored local functions to be global so that they can be pickled (#1726)
  • Updated TorchData DataPipe API usages (#1663)
  • Replaced lambda functions with regular functions in all datasets (#1718)

CMake Build System

  • [CMake 1/3] Updated C++ includes to use imports relative to root directory (#1666)
  • [CMake 2/3] Added CMake build to TorchText to create a single `_torchtext` library (#1673)
  • [CMake 3/3] Split source files with Python dependency into a separate library (#1660)

Improvements

Features

  • [BC-breaking] Renamed Roberta Bundle (#1635)
  • Modified CLIPTokenizer to either infer number of merges from encoder json or take it in constructor (#1622)
  • Provided option to return split tokens (#1698)
  • Updated dataset code to avoid creating multiple iterators from a DataPipe (#1708)

Testing

  • Added unicode generation to IWSLT tests (followup to #1608) (#1642)
  • Added macOS unit tests on CircleCI (#1672)
  • Added parameterized dataset pickling tests (#1732)
  • Added test to compare encoder inference on input with and without padding (#1770)
  • Added test for shuffle before shard (#1738)
  • Added more test coverage (#1653)
  • Enabled model testing in FBCode (#1720)
  • Fixed Windows builds with Python 3.10 by getting rid of ssize_t (#1627)
  • Built and tested Python 3.10 (#1625)
  • Ensured we build correctly against the release branch (#1790)
  • Removed caching artifacts for datasets and fixed them for vectors (#1674)
  • Installed torchdata from nightly release in CI (#1664)
  • Added M1 tagged build for TorchText (#1776)
  • Refactored TorchText version handling and added the first version of M1 builds (#1773)
  • Removed MACOSX_DEPLOYMENT_TARGET (#1728)

Examples

  • Added data pipelines for Roberta pre-processing (#1637)
  • Updated sst2 tutorial to replace lambda usage (#1722)

Documentation

  • Removed _add_docstring_header decorator from amazon review polarity (#1611)
  • Added missing quotation marks to CLIPTokenizer docs (#1610)
  • Updated README around installing LTS version (#1665)
  • Added contributing guidelines for third party and custom C++ operators (#1742)
  • Added recommendations regarding use of datapipes for multi-processing, shuffling, DDP, etc. (#1755)
  • Fixed roberta bundle example doc (#1648)
  • Updated doc conf (#1634)
  • Removed install instructions (#1641)
  • Updated README (#1652)
  • Updated requirements (#1675)
  • Fixed typo sharing -> sharding (#1787)
  • Fixed docs build (#1730)
  • Replaced git+git with git+https in requirements.txt (#1658)
  • Added header info for BERT tokenizer (#1754)
  • Fixed docstring for Tokenizers (#1739)
  • Fixed doc js initialization (#1736)
  • Added missing type hints (#1782)
  • Fixed SentencePiece Tokenizer doc-string (#1706)

Bug fixes

  • Fixed missed mask arg in TorchText transformer (#1758)
  • Fixed bug in RTE and WNLI testing (#1759)
  • Fixed bug in QNLI dataset and corresponding test (#1760)
  • Fixed STSB and WikiTexts tests (#1737)
  • Fixed smoke tests for linux (#1687)
  • Removed redundant dataname in test_shuffle_shard_wrapper (#1733)
  • Fixed non-deterministic test failures for IWSLT (#1699)
  • Fixed typo in nightly branch ref (#1783)
  • Fixed windows utils test (#1761)
  • Fixed test utils (#1757)
  • Fixed pad transform test (#1688)
  • Resolved issues in #1653 + sanitize test names generated by nested_params (#1667)
  • Fixed mock tests due to change in datasets directory (#1749)
  • Deleted prints in test_qqp.py (#1734)
  • Fixed logger issue (#1656)

Others

  • Pinned Jinja2 version to fix broken doc build (#1669)
  • Fixed formatting for all files using pre-commit (#1670)
  • Pinned setuptools to 58.0.4 on Windows (#1746)
  • Added post install script for pywin32 (#1748)
  • Pinned Utf8proc version (#1771)
  • Removed models from experimental (#1643)
  • Cleaned examples folder (#1647)
  • Cleaned stale code (#1654)
  • Took TORCH_HOME env variable into account while setting the cache dir (#1741)
  • Updated download hooks and datasets to import HttpReader and GDriveReader from download hooks (#1657)
  • Added Model benchmark (#1697)
  • Changed root directory for datasets (#1740)
  • Used _get_torch_home standard utility from torch hub (#1752)
  • Removed ticks (``) from the url under is_module_available (#1753)
  • Prepared repo for auto-formatters (#1546)
  • Fixed flake8 issues introduced from adding auto formatter (#1617)
