pytorch/text v0.13.0


Highlights

In this release, we enriched our library with additional datasets and tokenizers while making improvements to our existing build system, documentation, and components.

  • Added datasets for all 9 GLUE benchmark tasks (#1710): CoLA, MRPC, QQP, STS-B, SST-2, MNLI, QNLI, RTE, WNLI
  • Added support for BERTTokenizer
  • Created native C++ binaries using a CMake-based build system (#1644)

Datasets

We increased the number of datasets in TorchText from 22 to 30 by adding the remaining 8 datasets from the GLUE benchmark (SST-2 was already supported). The complete list of GLUE datasets is as follows:

  • CoLA (paper): Single sentence binary classification acceptability task
  • SST-2 (paper): Single sentence binary classification sentiment task
  • MRPC (paper): Dual sentence binary classification paraphrase task
  • QQP: Dual sentence binary classification paraphrase task
  • STS-B (paper): Dual sentence regression task scoring sentence similarity with a float value
  • MNLI (paper): Dual sentence ternary classification NLI task
  • QNLI (paper): Dual sentence binary classification QA/NLI task
  • RTE (paper): Dual sentence binary classification NLI task
  • WNLI (paper): Dual sentence binary classification coreference and NLI tasks

The datasets supported by TorchText use datapipes from the TorchData project, which is still in Beta status. This means that the datapipes API is subject to change without deprecation cycles. In particular, we expect a lot of the current idioms to change with the eventual release of DataLoaderV2 from torchdata. For more details, refer to https://pytorch.org/text/stable/datasets.html
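
A minimal sketch of loading one of these datasets (assuming torchtext 0.13 with torchdata installed; SST-2 yields raw (text, label) pairs per the datasets documentation):

```python
# Sketch: loading a GLUE dataset as a datapipe.
from torch.utils.data import DataLoader
from torchtext.datasets import SST2

train_datapipe = SST2(split="train")

# Each example is a raw (text, label) pair.
text, label = next(iter(train_datapipe))

# Datapipes compose with the standard DataLoader; shuffling and sharding
# datapipes were added to the datasets in this release (#1729).
train_loader = DataLoader(train_datapipe, batch_size=8, shuffle=True)
```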

Tokenizers

TorchText has extended support for TorchScriptable tokenizers by adding the WordPiece tokenizer used in BERT. It is one of the most commonly used algorithms for splitting input text into sub-word units, and was introduced in Japanese and Korean Voice Search (Schuster et al., 2012).

TorchScriptability support allows users to embed BERT text pre-processing natively in C++ without needing a Python runtime. Since TorchText now supports a CMake build system that natively links TorchText binaries with application code, users can easily integrate BERT tokenizers for deployment needs.
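
As a short sketch of the documented usage (the vocabulary URL below is the standard bert-base-uncased vocab file):

```python
# Sketch: WordPiece tokenization with the scriptable BERTTokenizer.
import torch
from torchtext.transforms import BERTTokenizer

VOCAB_FILE = "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt"
tokenizer = BERTTokenizer(vocab_path=VOCAB_FILE, do_lower_case=True, return_tokens=True)

tokenizer("Hello World, How are you!")      # single sentence input
tokenizer(["Hello world", "How are you!"])  # batch input (#1745)

# Scripting removes the Python dependency: the serialized module can be
# loaded and executed from C++ for deployment.
scripted_tokenizer = torch.jit.script(tokenizer)
scripted_tokenizer.save("bert_tokenizer.pt")
```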

For usage details, please refer to the corresponding documentation.

CMake Build System

TorchText has migrated its build system for C++ extensions and third-party libraries from PyTorch's CppExtension module to CMake. This allows end users to integrate TorchText C++ binaries into their applications without a dependency on libpython, and therefore to use TorchText operators in non-Python environments.

Refer to GitHub issue #1644 for more details.

Backward Incompatible Changes

The RobertaModelBundle class, introduced in the 0.12 release, which provides pre-trained RoBERTa/XLM-R models and can build custom models with a similar architecture, has been renamed to RobertaBundle (#1653).
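
Only the name changed; usage stays the same. A minimal sketch with one of the pre-trained bundles, which are instances of RobertaBundle:

```python
# Sketch: encoding a batch with a pre-trained RobertaBundle (XLM-R base).
import torchtext.functional as F
from torchtext.models import XLMR_BASE_ENCODER

model = XLMR_BASE_ENCODER.get_model()
transform = XLMR_BASE_ENCODER.transform()

input_batch = ["Hello world", "How are you!"]
model_input = F.to_tensor(transform(input_batch), padding_value=1)
output = model(model_input)  # contextual embeddings for each token
```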

The default caching location (cache_dir) has been changed from os.path.expanduser("~/.torchtext/cache") to os.path.expanduser("~/.cache/torch/text"). Furthermore, the default root directory for datasets is now cache_dir/datasets (#1740). Users can control the default cache location via the TORCH_HOME environment variable (#1741).
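
A sketch of the new behavior (the override path below is hypothetical):

```python
# Sketch: controlling the cache location via TORCH_HOME (#1741).
import os

# Hypothetical override; must be set before datasets are instantiated.
# With TORCH_HOME unset, the cache root resolves to ~/.cache/torch/text.
os.environ["TORCH_HOME"] = "/data/torch_cache"

from torchtext.datasets import SST2
train_datapipe = SST2(split="train")  # data lands under cache_dir/datasets
```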

New Features

Models

  • [fbsync] BetterTransformer support for TorchText (#1690) (#1694)
  • [fbsync] Killed to_better by having native load_from_state_dict and init (#1695)
  • [fbsync] Removed unneeded modules after using nn.Module for BetterTransformer (#1696)
  • [fbsync] Replaced TransformerEncoder in TorchText with better transformer (#1703)

Transforms, Tokenizers, Ops

  • Added pad transform and string-to-int transform (#1683); see the sketch after this list
  • Added support for Scriptable BERT tokenizer (#1707)
  • Added support for batch input in BERT Tokenizer with perf benchmark (#1745)
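
A minimal sketch of the new transforms, assuming the constructor signatures from the 0.13 documentation:

```python
# Sketch: PadTransform and StrToIntTransform (#1683).
import torch
from torchtext.transforms import PadTransform, StrToIntTransform

# Pad a tensor of token ids out to a fixed length.
pad = PadTransform(max_length=6, pad_value=0)
print(pad(torch.tensor([101, 2054, 2003, 102])))
# tensor([ 101, 2054, 2003,  102,    0,    0])

# Convert string tokens to integers, for a sequence or a batch.
to_int = StrToIntTransform()
print(to_int(["1", "2", "3"]))      # [1, 2, 3]
print(to_int([["1", "2"], ["3"]]))  # [[1, 2], [3]]
```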

Datasets

Support for the GLUE benchmark's datasets was added (#1710); see the Datasets section above for the complete list.

Others

  • Prepared datasets for new encoding kwarg. (#1616)
  • Added shuffle and sharding datapipes to datasets (#1729)
  • For Datasets, refactored local functions to be global so that they can be pickled (#1726)
  • Updated TorchData DataPipe API usages (#1663)
  • Replaced lambda functions with regular functions in all datasets (#1718)

CMake Build System

  • [CMake 1/3] Updated C++ includes to use imports relative to root directory (#1666)
  • [CMake 2/3] Added CMake build to TorchText to create a single `_torchtext` library (#1673)
  • [CMake 3/3] Split source files with Python dependency into a separate library (#1660)

Improvements

Features

  • [BC-breaking] Renamed Roberta Bundle (#1635)
  • Modified CLIPTokenizer to either infer number of merges from encoder json or take it in constructor (#1622)
  • Provided option to return split tokens (#1698)
  • Updated dataset code to avoid creating multiple iterators from a DataPipe (#1708)

Testing

  • Added unicode generation to IWSLT tests (followup to #1608) (#1642)
  • Added macOS unit tests on CircleCI (#1672)
  • Added parameterized dataset pickling tests (#1732)
  • Added test to compare encoder inference on input with and without padding (#1770)
  • Added test for shuffle before shard (#1738)
  • Added more test coverage (#1653)
  • Enabled model testing in FBCode (#1720)
  • Fixed Windows builds with Python 3.10 by getting rid of ssize_t (#1627)
  • Built and tested Python 3.10 (#1625)
  • Ensured we build correctly against the release branch (#1790)
  • Removed caching artifacts for datasets and fixed them for vectors (#1674)
  • Installed torchdata from nightly release in CI (#1664)
  • Added M1 tagged build for TorchText (#1776)
  • Refactored TorchText version handling and added the first version of M1 builds (#1773)
  • Removed MACOSX_DEPLOYMENT_TARGET (#1728)

Examples

  • Added data pipelines for Roberta pre-processing (#1637)
  • Updated sst2 tutorial to replace lambda usage (#1722)

Documentation

  • Removed _add_docstring_header decorator from amazon review polarity (#1611)
  • Added missing quotation marks to CLIPTokenizer docs (#1610)
  • Updated README around installing LTS version (#1665)
  • Added contributing guidelines for third party and custom C++ operators (#1742)
  • Added recommendations regarding use of datapipes for multi-processing, shuffling, DDP, etc. (#1755)
  • Fixed roberta bundle example doc (#1648)
  • Updated doc conf (#1634)
  • Removed install instructions (#1641)
  • Updated README (#1652)
  • Updated requirements (#1675)
  • Fixed typo sharing -> sharding (#1787)
  • Fixed docs build (#1730)
  • Replaced git+git with git+https in requirements.txt (#1658)
  • Added header info for BERT tokenizer (#1754)
  • Fixed docstring for Tokenizers (#1739)
  • Fixed doc js initialization (#1736)
  • Added missing type hints (#1782)
  • Fixed SentencePiece Tokenizer doc-string (#1706)

Bug fixes

  • Fixed missed mask arg in TorchText transformer (#1758)
  • Fixed bug in RTE and WNLI testing (#1759)
  • Fixed bug in QNLI dataset and corresponding test (#1760)
  • Fixed STSB and WikiTexts tests (#1737)
  • Fixed smoke tests for linux (#1687)
  • Removed redundant dataname in test_shuffle_shard_wrapper (#1733)
  • Fixed non-deterministic test failures for IWSLT (#1699)
  • Fixed typo in nightly branch ref (#1783)
  • Fixed windows utils test (#1761)
  • Fixed test utils (#1757)
  • Fixed pad transform test (#1688)
  • Resolved issues in #1653 + sanitize test names generated by nested_params (#1667)
  • Fixed mock tests due to change in datasets directory (#1749)
  • Deleted prints in test_qqp.py (#1734)
  • Fixed logger issue (#1656)

Others

  • Pinned Jinja2 version to fix broken doc build (#1669)
  • Fixed formatting for all files using pre-commit (#1670)
  • Pinned setuptools to 58.0.4 on Windows (#1746)
  • Added post install script for pywin32 (#1748)
  • Pinned Utf8proc version (#1771)
  • Removed models from experimental (#1643)
  • Cleaned examples folder (#1647)
  • Cleaned stale code (#1654)
  • Took TORCH_HOME env variable into account while setting the cache dir (#1741)
  • Updated download hooks and datasets to import HttpReader and GDriveReader from download hooks (#1657)
  • Added Model benchmark (#1697)
  • Changed root directory for datasets (#1740)
  • Used _get_torch_home standard utility from torch hub (#1752)
  • Removed ticks (``) from the url under is_module_available (#1753)
  • Prepared repo for auto-formatters (#1546)
  • Fixed flake8 issues introduced from adding auto formatter (#1617)
