Highlights
We simplify the current torchtext dataset library by leveraging existing utilities in the PyTorch core library (`DataLoader`, `Sampler`), and separate the tokenizer, vocabulary, and data processing functionals so that users can build their own data processing pipelines from basic building blocks.
[Experimental] New abstraction for torchtext dataset
The torchtext v0.5.0 release officially introduces a new abstraction for datasets. Based on user feedback, the new abstraction solves several issues existing in torchtext, including:
- Several components and functionals are unclear and difficult to adopt. For example, the `Field` class couples tokenizer, vocabulary, split, batching and sampling, padding, and numericalization together. The current `Field` class works like a "black box", and users are confused about what's going on within the class. Instead, those components should be divided into several basic building blocks. This is more consistent with the PyTorch core library, where users build models and pipelines from orthogonal components.
- Incompatibility with the PyTorch core library, like `DataLoader` and `Sampler` in `torch.utils.data`. Some custom modules/functions in torchtext (e.g. `Iterator`, `Batch`, `splits`) should be replaced by the corresponding modules in `torch.utils.data`.
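The "orthogonal building blocks" idea can be illustrated in plain Python: a tokenizer, a vocabulary, and a numericalizer are independent, composable functions rather than one monolithic class. The names below are hypothetical, for illustration only, not the torchtext API:

```python
from collections import Counter

def tokenizer(text):
    """A trivial whitespace tokenizer (illustrative only)."""
    return text.lower().split()

def build_vocab(texts, unk_token="<unk>"):
    """Map each token to an integer id; id 0 is reserved for unknowns."""
    counter = Counter(tok for t in texts for tok in tokenizer(t))
    vocab = {unk_token: 0}
    for tok, _ in counter.most_common():
        vocab[tok] = len(vocab)
    return vocab

def numericalize(text, vocab):
    """Convert a string into a list of token ids, mapping unknowns to 0."""
    return [vocab.get(tok, 0) for tok in tokenizer(text)]

texts = ["the cat sat", "the dog sat"]
vocab = build_vocab(texts)
ids = numericalize("the bird sat", vocab)
```

Because each step is a plain function, any one of them can be swapped out (a different tokenizer, a pretrained vocabulary) without touching the others, which is the design direction described above.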
We have re-written several datasets in `torchtext.experimental.datasets` using the new abstraction. The old versions of the datasets are still available in `torchtext.datasets`, and the new datasets are opt-in. We expect to replace the legacy datasets with the experimental ones in the future. Torchtext users are welcome to send feedback to issue [#664]
- Re-write Sentiment Analysis dataset [#651]
  - IMDB
- Re-write Language Modeling datasets [#624, #661], including
  - WikiText2
  - WikiText103
  - PennTreebank
SentencePiece binding
The SentencePiece binding provides an effective way to solve the open-vocabulary problem in NLP tasks. The binding now supports two segmentation algorithms, byte-pair encoding (BPE) and the unigram language model. It trains a subword model directly from raw text data, which can then be used to tokenize a corpus and convert the tokens into PyTorch tensors [#597]
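As a rough illustration of what BPE learns (a toy sketch of the algorithm's core step, not SentencePiece's actual implementation): it repeatedly merges the most frequent adjacent pair of symbols across the corpus, so frequent character sequences become single subword units:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words (words are symbol tuples)."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Word frequencies, with each word split into characters.
words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
for _ in range(2):  # learn two merges: ("l","o") -> "lo", then ("lo","w") -> "low"
    words = merge_pair(words, most_frequent_pair(words))
```

The unigram language model algorithm mentioned above takes a different route (pruning a large seed vocabulary by likelihood), but both produce a subword vocabulary that keeps rare and unseen words representable.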
Backward compatibility
- This is the last release with support for Python 2
- Change the default ngrams value to 1 in text classification datasets [#663]
- Temporarily remove the unit test `test_get_tokenizer_moses` from CI tests. It needs to be pushed back once the issue related to the Moses tokenizer is resolved. [#588]
We would like to thank the open source community, who continues to send pull requests for new features and bug-fixes.
New Features
- Add unsupervised learning dataset EnWik9, containing the first 10^9 bytes of enwiki-20060303-pages-articles.xml [#610]
- Add several generators to build pipelines for text preprocessing [#624, #610, #597]
- Add Bilingual Evaluation Understudy (BLEU) metric for the translation task in `torchtext.data.metrics` [#627]
- Add Cross-Lingual NLI Corpus (XNLI) dataset [#613]
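BLEU combines clipped n-gram precisions with a brevity penalty. A simplified single-reference sketch of the metric (not the torchtext implementation, which handles multiple references and corpus-level aggregation):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)

hyp = "the cat sat on the mat".split()
score = bleu(hyp, hyp)  # identical sentences score 1.0
```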
Improvements
- Improve the `download_from_url` and `extract_archive` functions. The `extract_archive` function now supports .zip files. The `download_from_url` function now explicitly gets the filename from the URL instead of from the URL header. This allows downloading from a non-Google-Drive link [#602]
- Add a legal disclaimer for torchtext datasets [#590]
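Deriving the filename from the URL itself, rather than from a response header, can be done entirely with the standard library. A minimal sketch of the idea (a hypothetical helper, not torchtext's code):

```python
import os
from urllib.parse import urlparse

def filename_from_url(url):
    """Take the last path component of the URL as the filename.
    urlparse strips the query string, so '?dl=1'-style suffixes are ignored."""
    path = urlparse(url).path
    return os.path.basename(path)

name = filename_from_url("https://example.com/data/wikitext-2-v1.zip")
```

Unlike header-based filenames, this works for any plain HTTP link that ends in the file's name, which is why it enables non-Google-Drive downloads.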
- Add installation command to Travis [#585]
- Some improvements in the example torchtext/examples/text_classification [#580] [#578] [#576]
- Fix and improve docs [#603] [#598] [#594] [#577] [#662]
- Add Code of Conduct document [#638]
- Add Contributing document [#637]
Bug Fixes
- Fix a backward compatibility issue in the `Vocab` class. The old version of torchtext doesn't have the `unk_index` attribute in `Vocab`. To avoid breaking BC, the `__setstate__` function now checks whether the `unk_index` attribute exists in the vocab object [#591]
- Resolve an overflow error by decreasing the maxInt value, which is used to check `csv.field_size_limit` in `unicode_csv_reader` [#584]
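The pattern behind the `Vocab` fix: when unpickling an object saved by an older version of a class, `__setstate__` can supply a default for attributes the old class lacked. A simplified sketch of the technique (not torchtext's actual class):

```python
class Vocab:
    def __init__(self, stoi, unk_index=0):
        self.stoi = stoi
        self.unk_index = unk_index

    def __setstate__(self, state):
        # Objects pickled by an older version have no 'unk_index' entry;
        # fall back to a default instead of failing later on attribute access.
        if "unk_index" not in state:
            state["unk_index"] = 0
        self.__dict__.update(state)

# Simulate restoring state saved by the old class (no 'unk_index' key).
old_state = {"stoi": {"hello": 1}}
restored = Vocab.__new__(Vocab)
restored.__setstate__(old_state)
```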
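The overflow fix follows a common workaround pattern: `csv.field_size_limit` rejects values that don't fit in a C long on some platforms, so the limit is decreased from `sys.maxsize` until the call succeeds. A sketch of the general technique:

```python
import csv
import sys

max_int = sys.maxsize
while True:
    try:
        csv.field_size_limit(max_int)
        break
    except OverflowError:
        # On some platforms sys.maxsize exceeds the C long that
        # csv.field_size_limit accepts; shrink until it fits.
        max_int = int(max_int / 10)
```

After the loop, the limit is set to the largest tried value the platform accepts, which lets `unicode_csv_reader` handle very large CSV fields without crashing.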