Highlights
In this release, we’re updating torchtext’s datasets to be compatible with the PyTorch DataLoader, and deprecating torchtext’s own data-loading abstractions. We have published a full review of the legacy code and the new datasets in pytorch/text #664. The new datasets are simple string-by-string iterators over the data, replacing the previous custom abstractions such as Field. The legacy datasets and abstractions have been moved into a new legacy folder to ease the migration, and will remain there for two more releases. For guidance on migrating from the legacy abstractions to modern PyTorch data utilities, please refer to our migration guide (link).
The following raw text datasets are available as replacements for the legacy datasets. These datasets are iterators that yield the raw text data line-by-line. To apply these datasets in NLP workflows, please refer to the end-to-end tutorial for the text classification task (link); a short usage sketch also follows the list below.
- Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
- Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
- Sequence tagging: UDPOS, CoNLL2000Chunking
- Translation: IWSLT2016, IWSLT2017
- Question answer: SQuAD1, SQuAD2
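To give a concrete picture, here is a minimal sketch (assuming the 0.9.0 API; the `collate_batch` helper is a hypothetical placeholder for user-defined tokenization) of iterating a raw dataset and composing it with the standard DataLoader:

```python
# Minimal sketch, assuming the 0.9.0 API: the new datasets are plain
# iterators over raw strings, so they compose with the standard
# torch.utils.data.DataLoader instead of the legacy Field/Iterator stack.
from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS

# AG_NEWS yields (label, raw_text) tuples line-by-line.
train_iter = AG_NEWS(split='train')
label, text = next(iter(train_iter))
print(label, text[:60])

# Tokenization and numericalization now live in a user-supplied
# collate_fn (a placeholder here) rather than in Field objects.
def collate_batch(batch):
    labels, texts = zip(*batch)
    return list(labels), list(texts)

loader = DataLoader(AG_NEWS(split='train'), batch_size=8,
                    collate_fn=collate_batch)
```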
We also add Python 3.9 support in this release.
Backwards Incompatible Changes
Current users of the legacy code will experience BC breakage, as we have retired the legacy code (#1172, #1181, #1183). The legacy components have been moved into the torchtext.legacy.data folder as follows:
- `torchtext.data.Pipeline` -> `torchtext.legacy.data.Pipeline`
- `torchtext.data.Batch` -> `torchtext.legacy.data.Batch`
- `torchtext.data.Example` -> `torchtext.legacy.data.Example`
- `torchtext.data.Field` -> `torchtext.legacy.data.Field`
- `torchtext.data.Iterator` -> `torchtext.legacy.data.Iterator`
- `torchtext.data.Dataset` -> `torchtext.legacy.data.Dataset`
This means that all features are still available, but within `torchtext.legacy` instead of `torchtext`.
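For most users the migration is a one-line import change; a minimal sketch:

```python
# Minimal sketch of the import change. Pre-0.9.0 code such as
#   from torchtext.data import Field, Example, Dataset, Iterator
# becomes:
from torchtext.legacy.data import Field, Example, Dataset, Iterator

# The legacy API itself is unchanged, so existing Field definitions
# keep working as before.
TEXT = Field(sequential=True, lower=True)
```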
Table 1: Summary of the legacy datasets and the replacements in 0.9.0 release
| Category | Legacy | 0.9.0 release |
|---|---|---|
| Language Modeling | torchtext.legacy.datasets.WikiText2 | torchtext.datasets.WikiText2 |
| | torchtext.legacy.datasets.WikiText103 | torchtext.datasets.WikiText103 |
| | torchtext.legacy.datasets.PennTreebank | torchtext.datasets.PennTreebank |
| | torchtext.legacy.datasets.EnWik9 | torchtext.datasets.EnWik9 |
| Text Classification | torchtext.legacy.datasets.AG_NEWS | torchtext.datasets.AG_NEWS |
| | torchtext.legacy.datasets.SogouNews | torchtext.datasets.SogouNews |
| | torchtext.legacy.datasets.DBpedia | torchtext.datasets.DBpedia |
| | torchtext.legacy.datasets.YelpReviewPolarity | torchtext.datasets.YelpReviewPolarity |
| | torchtext.legacy.datasets.YelpReviewFull | torchtext.datasets.YelpReviewFull |
| | torchtext.legacy.datasets.YahooAnswers | torchtext.datasets.YahooAnswers |
| | torchtext.legacy.datasets.AmazonReviewPolarity | torchtext.datasets.AmazonReviewPolarity |
| | torchtext.legacy.datasets.AmazonReviewFull | torchtext.datasets.AmazonReviewFull |
| | torchtext.legacy.datasets.IMDB | torchtext.datasets.IMDB |
| | torchtext.legacy.datasets.SST | deferred |
| | torchtext.legacy.datasets.TREC | deferred |
| Sequence Tagging | torchtext.legacy.datasets.UDPOS | torchtext.datasets.UDPOS |
| | torchtext.legacy.datasets.CoNLL2000Chunking | torchtext.datasets.CoNLL2000Chunking |
| Translation | torchtext.legacy.datasets.WMT14 | deferred |
| | torchtext.legacy.datasets.Multi30k | deferred |
| | torchtext.legacy.datasets.IWSLT | torchtext.datasets.IWSLT2016, torchtext.datasets.IWSLT2017 |
| Natural Language Inference | torchtext.legacy.datasets.XNLI | deferred |
| | torchtext.legacy.datasets.SNLI | deferred |
| | torchtext.legacy.datasets.MultiNLI | deferred |
| Question Answer | torchtext.legacy.datasets.BABI20 | deferred |
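Note that the single legacy IWSLT dataset maps to two year-specific datasets. A minimal sketch of the new constructors (argument names follow the 0.9.0 re-organization in #1191/#1209; check the docs for exact defaults):

```python
# Sketch, assuming the 0.9.0 constructors: the legacy IWSLT dataset is
# replaced by year-specific datasets that yield raw (source, target)
# sentence pairs.
from torchtext.datasets import IWSLT2016, IWSLT2017

train_iter, valid_iter, test_iter = IWSLT2016(language_pair=('de', 'en'))
src, tgt = next(iter(train_iter))
```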
Improvements
- Enable importing `metrics`/`utils`/`functional` from `torchtext.legacy.data` (#1229)
- Set up a daily caching mechanism with the Master job (#1219)
- Make the functions in `datasets_utils.py` private (#1224)
- Resolve the download folder for some raw datasets (#1213)
- Store the hash of the extracted CoNLL2000Chunking files so the extraction step will be skipped if the extracted files are detected (#1204)
- Fix the total number of lines in doc strings of the datasets (#1200)
- Extend CI tests to cover all the datasets (#1197, #1201, #1171)
- Document the number of lines in the dataset splits (#1196)
- Add hashes to skip the slow extraction if the extracted files are available (#1195)
- Use decorator to loop over the split argument in the datasets (#1194)
- Remove the offset option from `torchtext.datasets`, and move `torchtext.datasets.common` to `torchtext.data.dataset_utils` (#1188, #1145)
- Remove the step to clean up the cache in `test_iwslt()` (#1192)
- Split the IWSLT dataset into IWSLT2016 and IWSLT2017 datasets and re-organize the parameters in the constructors (#1191, #1209)
- Move the prototype datasets in the `torchtext.experimental.datasets.raw` folder to the `torchtext.datasets` folder (#1182, #1202, #1207, #1211, #1212)
- Add a decorator `add_docstring_header()` to generate docstrings (#1185)
- Add the EnWik9 dataset (#1184)
- Avoid unnecessary downloads and extraction for some raw datasets, and add more logging (#1178)
- Split raw datasets into individual files (#1156, #1173, #1174, #1175, #1176)
- Extend the unittest coverage for all the raw datasets (#1157, #1149)
- Define the relative path of the datasets in the `download_from_url()` func and skip unnecessary downloads if the downloaded files are detected (#1158, #1155)
- Add `MD5` and `NUM_LINES` as the meta information in the `__init__` file of the `torchtext.datasets` folder (#1155)
- Standardize the text dataset doc strings and argument order (#1151)
- Report the “exceeds quota” error for the datasets using Google Drive links (#1150)
- Add support for string-typed split values in the text datasets (#1147)
- Rename the dataset constructor argument from `data_select` to `split` (#1143); see the usage sketch after this list
- Add Python 3.9 support across Linux, MacOS, and Windows platforms (#1139)
- Switch to the new URL for the IWSLT dataset (#1115)
- Extend the language shortcut in the `torchtext.data.utils.get_tokenizer` func to the full name when spaCy tokenizers are loaded (#1140)
- Fix broken CI tests due to the spaCy 3.0 release (#1138)
- Pass an embedding layer to the constructor of the BertModel class in the BERT example (#1135)
- Fix test warnings by switching to `assertEqual()` in the PyTorch TestCase class (#1086)
- Improve CircleCI tests and conda package (#1128, #1121, #1120, #1106)
- Simplify TorchScript registration by adopting the `TORCH_LIBRARY_FRAGMENT` macro (#1102)
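As referenced above, a short sketch of the renamed `split` argument (#1143) and its string-typed values (#1147), assuming the 0.9.0 API:

```python
# Sketch of the split argument (formerly data_select): it accepts a
# tuple of split names or a single string.
from torchtext.datasets import WikiText2

train_iter, valid_iter = WikiText2(split=('train', 'valid'))  # tuple -> one iterator per split
test_iter = WikiText2(split='test')                           # string -> a single iterator
```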
Bug Fixes
- Fix the total number of returned lines in the `setup_iter()` func in `RawTextIterableDataset` (#1142)
Docs
- Add the number of classes to the doc strings of the text classification datasets (#1230)
- Remove the Lato font from the `pytorch/text` website (#1227)
- Add the migration tutorial (#1203, #1216, #1222)
- Remove the legacy examples on pytorch/text website (#1206)
- Update README file for 0.9.0 release (#1198)
- Add CI check to detect undocumented parameters (#1167)
- Add a static text link for the package version in the doc website (#1161)
- Fix Sphinx warnings and turn warnings into errors (#1163)
- Add the text datasets to torchtext website (#1153)
- Add the constructor document for IMDB and SST datasets (#1118)
- Fix typos in the README file (#1089)
- Rename "Arguments" to "Args" in the doc strings (#1110)
- Build docs and push to gh-pages on a nightly basis (#1105, #1111, #1112)