pytorch/text v0.9.0-rc5
Torchtext 0.9.0 Release Notes


Highlights

In this release, we’re updating torchtext’s datasets to be compatible with the PyTorch DataLoader and deprecating torchtext’s own data-loading abstractions. We have published a full review of the legacy code and the new datasets in pytorch/text #664. The new datasets are simple string-by-string iterators over the data, rather than the previous custom set of abstractions such as Field. To ease the migration, the legacy datasets and abstractions have been moved into a new legacy folder, where they will remain for two more releases. For guidance on migrating from the legacy abstractions to the modern PyTorch data utilities, please refer to our migration guide (link).

The following raw text datasets are available as replacements for the legacy datasets. These datasets are iterators that yield the raw text data line by line. To use them in NLP workflows, please refer to the end-to-end tutorial for the text classification task (link).

  • Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
  • Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
  • Sequence tagging: UDPOS, CoNLL2000Chunking
  • Translation: IWSLT2016, IWSLT2017
  • Question answering: SQuAD1, SQuAD2
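The new datasets are plain iterators that yield one example per line of raw text, so they can be fed directly to a torch.utils.data.DataLoader. The sketch below is not torchtext code: `fake_ag_news` is a hypothetical stand-in that mimics how a text classification dataset such as torchtext.datasets.AG_NEWS yields (label, text) pairs.

```python
# Minimal sketch of the raw-dataset pattern (hypothetical stand-in for
# torchtext.datasets.AG_NEWS): a plain iterator yielding one example per line.
def fake_ag_news(lines):
    """Yield (label, text) pairs, one per raw line."""
    for line in lines:
        label, _, text = line.partition(",")
        yield int(label), text.strip()

raw = [
    "3,Wall St. Bears Claw Back Into the Black",
    "4,Carlyle Looks Toward Commercial Aerospace",
]
examples = list(fake_ag_news(raw))
```

With the real library, the equivalent entry point should be along the lines of `train_iter = torchtext.datasets.AG_NEWS(split='train')`, whose iterator can be passed straight to a DataLoader.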

We also added Python 3.9 support in this release.

Backwards Incompatible Changes

Current users of the legacy code will experience BC breakage, as we have retired the legacy code (#1172, #1181, #1183). The legacy components have been placed in the torchtext.legacy.data folder as follows:

  • torchtext.data.Pipeline -> torchtext.legacy.data.Pipeline
  • torchtext.data.Batch -> torchtext.legacy.data.Batch
  • torchtext.data.Example -> torchtext.legacy.data.Example
  • torchtext.data.Field -> torchtext.legacy.data.Field
  • torchtext.data.Iterator -> torchtext.legacy.data.Iterator
  • torchtext.data.Dataset -> torchtext.legacy.data.Dataset

This means all features are still available, but within torchtext.legacy instead of torchtext.
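In most cases the migration is therefore a one-line import change (`from torchtext.data import Field` becomes `from torchtext.legacy.data import Field`). The helper below is an illustrative sketch, not part of torchtext, of how a codebase could span both sides of the move:

```python
def import_legacy_data():
    """Return torchtext's legacy `data` module on either side of the move.

    Illustrative helper only; it is not part of torchtext itself.
    """
    try:
        from torchtext.legacy import data  # torchtext >= 0.9.0
    except ImportError:
        try:
            from torchtext import data  # torchtext <= 0.8.x
        except ImportError:
            data = None  # torchtext is not installed
    return data

legacy_data = import_legacy_data()
```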

Table 1: Summary of the legacy datasets and their replacements in the 0.9.0 release

| Category | Legacy | 0.9.0 release |
| --- | --- | --- |
| Language Modeling | torchtext.legacy.datasets.WikiText2 | torchtext.datasets.WikiText2 |
| | torchtext.legacy.datasets.WikiText103 | torchtext.datasets.WikiText103 |
| | torchtext.legacy.datasets.PennTreebank | torchtext.datasets.PennTreebank |
| | torchtext.legacy.datasets.EnWik9 | torchtext.datasets.EnWik9 |
| Text Classification | torchtext.legacy.datasets.AG_NEWS | torchtext.datasets.AG_NEWS |
| | torchtext.legacy.datasets.SogouNews | torchtext.datasets.SogouNews |
| | torchtext.legacy.datasets.DBpedia | torchtext.datasets.DBpedia |
| | torchtext.legacy.datasets.YelpReviewPolarity | torchtext.datasets.YelpReviewPolarity |
| | torchtext.legacy.datasets.YelpReviewFull | torchtext.datasets.YelpReviewFull |
| | torchtext.legacy.datasets.YahooAnswers | torchtext.datasets.YahooAnswers |
| | torchtext.legacy.datasets.AmazonReviewPolarity | torchtext.datasets.AmazonReviewPolarity |
| | torchtext.legacy.datasets.AmazonReviewFull | torchtext.datasets.AmazonReviewFull |
| | torchtext.legacy.datasets.IMDB | torchtext.datasets.IMDB |
| | torchtext.legacy.datasets.SST | deferred |
| | torchtext.legacy.datasets.TREC | deferred |
| Sequence Tagging | torchtext.legacy.datasets.UDPOS | torchtext.datasets.UDPOS |
| | torchtext.legacy.datasets.CoNLL2000Chunking | torchtext.datasets.CoNLL2000Chunking |
| Translation | torchtext.legacy.datasets.WMT14 | deferred |
| | torchtext.legacy.datasets.Multi30k | deferred |
| | torchtext.legacy.datasets.IWSLT | torchtext.datasets.IWSLT2016, torchtext.datasets.IWSLT2017 |
| Natural Language Inference | torchtext.legacy.datasets.XNLI | deferred |
| | torchtext.legacy.datasets.SNLI | deferred |
| | torchtext.legacy.datasets.MultiNLI | deferred |
| Question Answering | torchtext.legacy.datasets.BABI20 | deferred |
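The replacement datasets select their data via a `split` argument (renamed from `data_select`, see Improvements below) that accepts either a single string or a tuple of strings. A rough sketch of how such normalization can be implemented as a decorator, under the assumption that each dataset is built by a per-split function (all names here are hypothetical, not torchtext's actual internals):

```python
import functools

def wrap_split_argument(splits=("train", "test")):
    """Hypothetical decorator: normalize a str-or-tuple `split` argument."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(split=splits):
            if isinstance(split, str):
                return fn(split)  # single split -> single dataset
            return tuple(fn(s) for s in split)  # tuple -> tuple of datasets
        return wrapper
    return decorator

@wrap_split_argument()
def fake_dataset(split):
    # Stand-in for building one dataset split.
    return f"dataset({split})"
```

Calling `fake_dataset("train")` returns a single dataset, while `fake_dataset(("train", "test"))` returns one per requested split.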

Improvements

  • Enable importing metrics/utils/functional from torchtext.legacy.data (#1229)
  • Set up a daily caching mechanism with the master job (#1219)
  • Make the functions in datasets_utils.py private (#1224)
  • Resolve the download folder for some raw datasets (#1213)
  • Store the hash of the extracted CoNLL2000Chunking files so the extraction step will be skipped if the extracted files are detected (#1204)
  • Fix the total number of lines in doc strings of the datasets (#1200)
  • Extend CI tests to cover all the datasets (#1197, #1201, #1171)
  • Document the number of lines in the dataset splits (#1196)
  • Add hashes to skip the slow extraction if the extracted files are available (#1195)
  • Use decorator to loop over the split argument in the datasets (#1194)
  • Remove offset option from torchtext.datasets, and move torchtext.datasets.common to torchtext.data.dataset_utils (#1188, #1145)
  • Remove the step to clean up the cache in test_iwslt() (#1192)
  • Split IWSLT dataset into IWSLT2016 and IWSLT2017 dataset and re-organize the parameters in the constructors (#1191, #1209)
  • Move the prototype datasets in torchtext.experimental.datasets.raw folder to torchtext.datasets folder (#1182, #1202, #1207, #1211, #1212)
  • Add a decorator add_docstring_header() to generate docstring (#1185)
  • Add the EnWik9 dataset (#1184)
  • Avoid unnecessary downloads and extraction for some raw datasets, and add more logging (#1178)
  • Split raw datasets into individual files (#1156, #1173, #1174, #1175, #1176)
  • Extend the unittest coverage for all the raw datasets (#1157, #1149)
  • Define the relative path of the datasets in the download_from_url() func and skip unnecessary download if the downloaded files are detected (#1158, #1155)
  • Add MD5 and NUM_LINES as the meta information in the __init__ file of torchtext.datasets folder (#1155)
  • Standardize the text dataset doc strings and argument order. (#1151)
  • Report the “exceeds quota” error for the datasets using Google drive links (#1150)
  • Add support for the string-typed split values to the text datasets (#1147)
  • Rename the argument data_select to split in the dataset constructors (#1143)
  • Add Python 3.9 support across Linux, MacOS, and Windows platforms (#1139)
  • Switch to the new URL for the IWSLT dataset (#1115)
  • Extend the language shortcuts in the torchtext.data.utils.get_tokenizer func to the full names when spaCy tokenizers are loaded (#1140)
  • Fix broken CI tests due to the spaCy 3.0 release (#1138)
  • Pass an embedding layer to the constructor of the BertModel class in the BERT example (#1135)
  • Fix test warnings by switching to assertEqual() in PyTorch TestCase class (#1086)
  • Improve CircleCI tests and conda package (#1128, #1121, #1120, #1106)
  • Simplify TorchScript registration by adopting TORCH_LIBRARY_FRAGMENT macro (#1102)
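Several of the bullets above (#1195, #1204, #1155) describe the same caching idea: record the MD5 hash of a downloaded or extracted file so a later run can skip the slow step when the file on disk already matches. A minimal sketch of that check (the helper names are hypothetical, not torchtext's actual internals):

```python
import hashlib
import os

def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_extraction(path, expected_md5):
    """True unless `path` already exists with the expected content."""
    return not (os.path.exists(path) and md5_of(path) == expected_md5)
```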

Bug Fixes

  • Fix the total number of returned lines in the setup_iter() func of RawTextIterableDataset (#1142)

Docs

  • Add number of classes to doc strings for text classification data (#1230)
  • Remove Lato font for pytorch/text website (#1227)
  • Add the migration tutorial (#1203, #1216, #1222)
  • Remove the legacy examples on pytorch/text website (#1206)
  • Update README file for 0.9.0 release (#1198)
  • Add CI check to detect undocumented parameters (#1167)
  • Add a static text link for the package version in the doc website (#1161)
  • Fix sphinx warnings and turn warnings into errors (#1163)
  • Add the text datasets to torchtext website (#1153)
  • Add the constructor document for IMDB and SST datasets (#1118)
  • Fix typos in the README file (#1089)
  • Rename "Arguments" to "Args" in the doc strings (#1110)
  • Build docs and push to gh-pages on nightly basis (#1105, #1111, #1112)
