Highlights
In this release, we’re updating torchtext’s datasets to be compatible with the PyTorch DataLoader, and deprecating torchtext’s own data-loading abstractions. We have published a full review of the legacy code and the new datasets in pytorch/text #664. The new datasets are simple string-by-string iterators over the data, replacing the previous custom abstractions such as Field. The legacy datasets and abstractions have been moved into a new legacy folder to ease the migration, and will remain there for two more releases. For guidance on migrating from the legacy abstractions to modern PyTorch data utilities, please refer to our migration guide (link).
The following raw text datasets are available as replacements for the legacy datasets. These datasets are iterators that yield the raw text data line-by-line. To apply these datasets in NLP workflows, please refer to the end-to-end tutorial for the text classification task (link); a short usage sketch also follows the list below.
- Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
- Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
- Sequence tagging: UDPOS, CoNLL2000Chunking
- Translation: IWSLT2016, IWSLT2017
- Question answer: SQuAD1, SQuAD2
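To give a concrete picture, here is a minimal sketch (assuming the 0.9.0 API; the `collate_batch` helper is a hypothetical placeholder for user-defined tokenization) of iterating a raw dataset and composing it with the standard DataLoader:

```python
# Minimal sketch, assuming the 0.9.0 API: the new datasets are plain
# iterators over raw strings, so they compose with the standard
# torch.utils.data.DataLoader instead of the legacy Field/Iterator stack.
from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS

# AG_NEWS yields (label, raw_text) tuples line-by-line.
train_iter = AG_NEWS(split='train')
label, text = next(iter(train_iter))
print(label, text[:60])

# Tokenization and numericalization now live in a user-supplied
# collate_fn (a placeholder here) rather than in Field objects.
def collate_batch(batch):
    labels, texts = zip(*batch)
    return list(labels), list(texts)

loader = DataLoader(AG_NEWS(split='train'), batch_size=8,
                    collate_fn=collate_batch)
```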
We also add Python 3.9 support in this release.
Backwards Incompatible Changes
Current users of the legacy code will experience BC breakage, as we have retired the legacy code (#1172, #1181, #1183). The legacy components have been moved into the torchtext.legacy.data folder as follows:
- `torchtext.data.Pipeline` -> `torchtext.legacy.data.Pipeline`
- `torchtext.data.Batch` -> `torchtext.legacy.data.Batch`
- `torchtext.data.Example` -> `torchtext.legacy.data.Example`
- `torchtext.data.Field` -> `torchtext.legacy.data.Field`
- `torchtext.data.Iterator` -> `torchtext.legacy.data.Iterator`
- `torchtext.data.Dataset` -> `torchtext.legacy.data.Dataset`
This means that all features are still available, but within `torchtext.legacy` instead of `torchtext`.
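For most users the migration is a one-line import change; a minimal sketch:

```python
# Minimal sketch of the import change. Pre-0.9.0 code such as
#   from torchtext.data import Field, Example, Dataset, Iterator
# becomes:
from torchtext.legacy.data import Field, Example, Dataset, Iterator

# The legacy API itself is unchanged, so existing Field definitions
# keep working as before.
TEXT = Field(sequential=True, lower=True)
```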
Table 1: Summary of the legacy datasets and the replacements in 0.9.0 release
| Category | Legacy | 0.9.0 release |
|---|---|---|
| Language Modeling | torchtext.legacy.datasets.WikiText2 | torchtext.datasets.WikiText2 |
| | torchtext.legacy.datasets.WikiText103 | torchtext.datasets.WikiText103 |
| | torchtext.legacy.datasets.PennTreebank | torchtext.datasets.PennTreebank |
| | torchtext.legacy.datasets.EnWik9 | torchtext.datasets.EnWik9 |
| Text Classification | torchtext.legacy.datasets.AG_NEWS | torchtext.datasets.AG_NEWS |
| | torchtext.legacy.datasets.SogouNews | torchtext.datasets.SogouNews |
| | torchtext.legacy.datasets.DBpedia | torchtext.datasets.DBpedia |
| | torchtext.legacy.datasets.YelpReviewPolarity | torchtext.datasets.YelpReviewPolarity |
| | torchtext.legacy.datasets.YelpReviewFull | torchtext.datasets.YelpReviewFull |
| | torchtext.legacy.datasets.YahooAnswers | torchtext.datasets.YahooAnswers |
| | torchtext.legacy.datasets.AmazonReviewPolarity | torchtext.datasets.AmazonReviewPolarity |
| | torchtext.legacy.datasets.AmazonReviewFull | torchtext.datasets.AmazonReviewFull |
| | torchtext.legacy.datasets.IMDB | torchtext.datasets.IMDB |
| | torchtext.legacy.datasets.SST | deferred |
| | torchtext.legacy.datasets.TREC | deferred |
| Sequence Tagging | torchtext.legacy.datasets.UDPOS | torchtext.datasets.UDPOS |
| | torchtext.legacy.datasets.CoNLL2000Chunking | torchtext.datasets.CoNLL2000Chunking |
| Translation | torchtext.legacy.datasets.WMT14 | deferred |
| | torchtext.legacy.datasets.Multi30k | deferred |
| | torchtext.legacy.datasets.IWSLT | torchtext.datasets.IWSLT2016, torchtext.datasets.IWSLT2017 |
| Natural Language Inference | torchtext.legacy.datasets.XNLI | deferred |
| | torchtext.legacy.datasets.SNLI | deferred |
| | torchtext.legacy.datasets.MultiNLI | deferred |
| Question Answer | torchtext.legacy.datasets.BABI20 | deferred |
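Note that the single legacy IWSLT dataset maps to two year-specific datasets. A minimal sketch of the new constructors (argument names follow the 0.9.0 re-organization in #1191/#1209; check the docs for exact defaults):

```python
# Sketch, assuming the 0.9.0 constructors: the legacy IWSLT dataset is
# replaced by year-specific datasets that yield raw (source, target)
# sentence pairs.
from torchtext.datasets import IWSLT2016, IWSLT2017

train_iter, valid_iter, test_iter = IWSLT2016(language_pair=('de', 'en'))
src, tgt = next(iter(train_iter))
```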
Improvements
- Enable importing `metrics`/`utils`/`functional` from `torchtext.legacy.data` (#1229)
- Set up a daily caching mechanism with the Master job (#1219)
- Make the functions in `datasets_utils.py` private (#1224)
- Resolve the download folder for some raw datasets (#1213)
- Store the hash of the extracted CoNLL2000Chunking files so the extraction step will be skipped if the extracted files are detected (#1204)
- Fix the total number of lines in doc strings of the datasets (#1200)
- Extend CI tests to cover all the datasets (#1197, #1201, #1171)
- Document the number of lines in the dataset splits (#1196)
- Add hashes to skip the slow extraction if the extracted files are available (#1195)
- Use decorator to loop over the split argument in the datasets (#1194)
- Remove the offset option from `torchtext.datasets`, and move `torchtext.datasets.common` to `torchtext.data.dataset_utils` (#1188, #1145)
- Remove the step to clean up the cache in `test_iwslt()` (#1192)
- Split the IWSLT dataset into IWSLT2016 and IWSLT2017 datasets and re-organize the parameters in the constructors (#1191, #1209)
- Move the prototype datasets in the `torchtext.experimental.datasets.raw` folder to the `torchtext.datasets` folder (#1182, #1202, #1207, #1211, #1212)
- Add a decorator `add_docstring_header()` to generate docstrings (#1185)
- Add the EnWik9 dataset (#1184)
- Avoid unnecessary downloads and extraction for some raw datasets, and add more logging (#1178)
- Split raw datasets into individual files (#1156, #1173, #1174, #1175, #1176)
- Extend the unittest coverage for all the raw datasets (#1157, #1149)
- Define the relative path of the datasets in the `download_from_url()` func and skip unnecessary downloads if the downloaded files are detected (#1158, #1155)
- Add `MD5` and `NUM_LINES` as the meta information in the `__init__` file of the `torchtext.datasets` folder (#1155)
- Standardize the text dataset doc strings and argument order (#1151)
- Report the “exceeds quota” error for the datasets using Google Drive links (#1150)
- Add support for string-typed split values in the text datasets (#1147)
- Rename the dataset constructor argument from `data_select` to `split` (#1143); see the usage sketch after this list
- Add Python 3.9 support across Linux, MacOS, and Windows platforms (#1139)
- Switch to the new URL for the IWSLT dataset (#1115)
- Extend the language shortcut in the `torchtext.data.utils.get_tokenizer` func to the full name when spaCy tokenizers are loaded (#1140)
- Fix broken CI tests due to the spaCy 3.0 release (#1138)
- Pass an embedding layer to the constructor of the BertModel class in the BERT example (#1135)
- Fix test warnings by switching to `assertEqual()` in the PyTorch TestCase class (#1086)
- Improve CircleCI tests and conda package (#1128, #1121, #1120, #1106)
- Simplify TorchScript registration by adopting the `TORCH_LIBRARY_FRAGMENT` macro (#1102)
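As referenced above, a short sketch of the renamed `split` argument (#1143) and its string-typed values (#1147), assuming the 0.9.0 API:

```python
# Sketch of the split argument (formerly data_select): it accepts a
# tuple of split names or a single string.
from torchtext.datasets import WikiText2

train_iter, valid_iter = WikiText2(split=('train', 'valid'))  # tuple -> one iterator per split
test_iter = WikiText2(split='test')                           # string -> a single iterator
```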
Bug Fixes
- Fix the total number of returned lines in the `setup_iter()` func in `RawTextIterableDataset` (#1142)
Docs
- Add the number of classes to the doc strings of the text classification datasets (#1230)
- Remove the Lato font from the `pytorch/text` website (#1227)
- Add the migration tutorial (#1203, #1216, #1222)
- Remove the legacy examples on pytorch/text website (#1206)
- Update README file for 0.9.0 release (#1198)
- Add CI check to detect undocumented parameters (#1167)
- Add a static text link for the package version in the doc website (#1161)
- Fix Sphinx warnings and turn warnings into errors (#1163)
- Add the text datasets to torchtext website (#1153)
- Add the constructor document for IMDB and SST datasets (#1118)
- Fix typos in the README file (#1089)
- Rename "Arguments" to "Args" in the doc strings (#1110)
- Build docs and push to gh-pages on a nightly basis (#1105, #1111, #1112)