Highlights
We simplify the current torchtext dataset library by leveraging existing utilities in the PyTorch core library (`DataLoader`, `Sampler`), and separate the tokenizer, vocabulary, and data processing functionals so that users can build their own data processing pipelines from basic building blocks.
[Experimental] New abstraction for torchtext dataset
The torchtext v0.5.0 release officially introduces a new abstraction for datasets. Based on user feedback, the new abstraction solves several issues existing in torchtext, including:
- Several components and functionals are unclear and difficult to adopt. For example, the `Field` class couples tokenizer, vocabulary, split, batching and sampling, padding, and numericalization together. The current `Field` class works like a "black box", and users are confused about what's going on within the class. Instead, those components should be divided into several basic building blocks. This is more consistent with the PyTorch core library, where users build models and pipelines from orthogonal components.
- Incompatibility with the PyTorch core library, like `DataLoader` and `Sampler` in `torch.utils.data`. Some custom modules/functions in torchtext (e.g. `Iterator`, `Batch`, `splits`) should be replaced by the corresponding modules in `torch.utils.data`.
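The "orthogonal building blocks" idea can be illustrated in plain Python: a tokenizer, a vocabulary, and a numericalizer are independent, composable functions rather than one monolithic class. The names below are hypothetical, for illustration only, not the torchtext API:

```python
from collections import Counter

def tokenizer(text):
    """A trivial whitespace tokenizer (illustrative only)."""
    return text.lower().split()

def build_vocab(texts, unk_token="<unk>"):
    """Map each token to an integer id; id 0 is reserved for unknowns."""
    counter = Counter(tok for t in texts for tok in tokenizer(t))
    vocab = {unk_token: 0}
    for tok, _ in counter.most_common():
        vocab[tok] = len(vocab)
    return vocab

def numericalize(text, vocab):
    """Convert a string into a list of token ids, mapping unknowns to 0."""
    return [vocab.get(tok, 0) for tok in tokenizer(text)]

texts = ["the cat sat", "the dog sat"]
vocab = build_vocab(texts)
ids = numericalize("the bird sat", vocab)
```

Because each step is a plain function, any one of them can be swapped out (a different tokenizer, a pretrained vocabulary) without touching the others, which is the design direction described above.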
We have re-written several datasets in `torchtext.experimental.datasets` using the new abstraction. The old versions of the datasets are still available in `torchtext.datasets`, and the new datasets are opt-in. We expect to replace the legacy datasets with the experimental ones in the future. Torchtext users are welcome to send feedback to issue [#664]
- Re-write Sentiment Analysis dataset [#651]
  - IMDB
- Re-write Language Modeling datasets [#624, #661], including
  - WikiText2
  - WikiText103
  - PennTreebank
SentencePiece binding
The SentencePiece binding provides an effective way to solve the open-vocabulary problem in NLP tasks. The binding now supports two segmentation algorithms, byte-pair encoding (BPE) and the unigram language model. It trains a subword model directly from raw text data, which can then be used to tokenize a corpus and convert the tokens into PyTorch tensors [#597]
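As a rough illustration of what BPE learns (a toy sketch of the algorithm's core step, not SentencePiece's actual implementation): it repeatedly merges the most frequent adjacent pair of symbols across the corpus, so frequent character sequences become single subword units:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words (words are symbol tuples)."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Word frequencies, with each word split into characters.
words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
for _ in range(2):  # learn two merges: ("l","o") -> "lo", then ("lo","w") -> "low"
    words = merge_pair(words, most_frequent_pair(words))
```

The unigram language model algorithm mentioned above takes a different route (pruning a large seed vocabulary by likelihood), but both produce a subword vocabulary that keeps rare and unseen words representable.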
Backward compatibility
- This is the last release with support for Python 2
- Change the default ngrams value to 1 in text classification datasets [#663]
- Temporarily remove the unit test `test_get_tokenizer_moses` from CI tests. It needs to be pushed back once the issue related to the Moses tokenizer is resolved. [#588]
We would like to thank the open source community, who continues to send pull requests for new features and bug-fixes.
New Features
- Add unsupervised learning dataset EnWik9, containing the first 10^9 bytes of enwiki-20060303-pages-articles.xml [#610]
- Add several generators to build pipelines for text preprocessing [#624, #610, #597]
- Add Bilingual Evaluation Understudy (BLEU) metric for the translation task in `torchtext.data.metrics` [#627]
- Add Cross-Lingual NLI Corpus (XNLI) dataset [#613]
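BLEU combines clipped n-gram precisions with a brevity penalty. A simplified single-reference sketch of the metric (not the torchtext implementation, which handles multiple references and corpus-level aggregation):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)

hyp = "the cat sat on the mat".split()
score = bleu(hyp, hyp)  # identical sentences score 1.0
```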
Improvements
- Improve the `download_from_url` and `extract_archive` functions. The `extract_archive` function now supports .zip files. The `download_from_url` function now explicitly gets the filename from the URL instead of from the URL header. This allows downloading from a non-Google-Drive link [#602]
- Add a legal disclaimer for torchtext datasets [#590]
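Deriving the filename from the URL itself, rather than from a response header, can be done entirely with the standard library. A minimal sketch of the idea (a hypothetical helper, not torchtext's code):

```python
import os
from urllib.parse import urlparse

def filename_from_url(url):
    """Take the last path component of the URL as the filename.
    urlparse strips the query string, so '?dl=1'-style suffixes are ignored."""
    path = urlparse(url).path
    return os.path.basename(path)

name = filename_from_url("https://example.com/data/wikitext-2-v1.zip")
```

Unlike header-based filenames, this works for any plain HTTP link that ends in the file's name, which is why it enables non-Google-Drive downloads.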
- Add installation command to Travis [#585]
- Some improvements in the example torchtext/examples/text_classification [#580] [#578] [#576]
- Fix and improve docs [#603] [#598] [#594] [#577] [#662]
- Add Code of Conduct document [#638]
- Add Contributing document [#637]
Bug Fixes
- Fix a backward compatibility issue in the `Vocab` class. The old version of torchtext doesn't have the `unk_index` attribute in `Vocab`. To avoid breaking BC, the `__setstate__` function now checks whether the `unk_index` attribute exists in the vocab object [#591]
- Resolve an overflow error by decreasing the maxInt value, which is used to check `csv.field_size_limit` in `unicode_csv_reader` [#584]
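The pattern behind the `Vocab` fix: when unpickling an object saved by an older version of a class, `__setstate__` can supply a default for attributes the old class lacked. A simplified sketch of the technique (not torchtext's actual class):

```python
class Vocab:
    def __init__(self, stoi, unk_index=0):
        self.stoi = stoi
        self.unk_index = unk_index

    def __setstate__(self, state):
        # Objects pickled by an older version have no 'unk_index' entry;
        # fall back to a default instead of failing later on attribute access.
        if "unk_index" not in state:
            state["unk_index"] = 0
        self.__dict__.update(state)

# Simulate restoring state saved by the old class (no 'unk_index' key).
old_state = {"stoi": {"hello": 1}}
restored = Vocab.__new__(Vocab)
restored.__setstate__(old_state)
```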
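The overflow fix follows a common workaround pattern: `csv.field_size_limit` rejects values that don't fit in a C long on some platforms, so the limit is decreased from `sys.maxsize` until the call succeeds. A sketch of the general technique:

```python
import csv
import sys

max_int = sys.maxsize
while True:
    try:
        csv.field_size_limit(max_int)
        break
    except OverflowError:
        # On some platforms sys.maxsize exceeds the C long that
        # csv.field_size_limit accepts; shrink until it fits.
        max_int = int(max_int / 10)
```

After the loop, the limit is set to the largest tried value the platform accepts, which lets `unicode_csv_reader` handle very large CSV fields without crashing.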