Breaking changes:
- By default, examples are now sorted within a batch by decreasing sequence length (#95, #139). This is required for use of PyTorch PackedSequences, and it can be flexibly overridden with a Dataset constructor flag.
- The unknown token is now included as part of specials and can be overridden or removed in the Field constructor (part of #107).
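
The within-batch ordering described in the first item can be sketched in plain Python. This is only an illustration of the ordering, not torchtext's implementation; the helper name `sort_batch_by_length` and the toy batch are hypothetical.

```python
def sort_batch_by_length(batch):
    """Order tokenized examples longest-first and return their lengths.

    Hypothetical helper: it only illustrates the decreasing-length
    ordering that packed sequences rely on.
    """
    ordered = sorted(batch, key=len, reverse=True)
    lengths = [len(example) for example in ordered]
    return ordered, lengths


# Toy batch of tokenized examples with different lengths.
batch = [["a", "b"], ["a", "b", "c", "d"], ["a"]]
ordered, lengths = sort_batch_by_length(batch)
print(lengths)  # [4, 2, 1]
```

The lengths list is returned alongside the examples because packing a padded batch generally needs both the data and the per-example lengths.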
New features:
- New word vector API with classes for GloVe and FastText; string descriptors are still accepted for backwards compatibility (#94, #102, #115, #120, thanks @nelson-liu and @bmccann!)
- Reversible tokenization (#107). Introduces a new Field subclass, ReversibleField, with a .reverse method that detokenizes. All implementations of ReversibleField should guarantee that the tokenization+detokenization round-trip is idempotent; torchtext provides wrappers for the revtok tokenizer and subword segmenter that satisfy this property.
- Skip header line in CSV/TSV loading (#146)
- RawFields that represent any data type without processing (#147, thanks @kylegao91!)
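
The round-trip property asked of reversible tokenization can be illustrated with a toy tokenizer. This sketch is not revtok and does not use torchtext; `tokenize` and `detokenize` are hypothetical stand-ins that keep whitespace as tokens, so that joining the tokens restores the input exactly.

```python
import re

# Every character is whitespace, a word character, or neither, so these
# three alternatives together cover the whole input with no gaps.
_PIECE = re.compile(r"\s+|\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, punctuation, and whitespace tokens (toy example)."""
    return _PIECE.findall(text)


def detokenize(tokens):
    """Invert tokenize by concatenating the tokens unchanged."""
    return "".join(tokens)


text = "Hello, world!  It's 2017."
tokens = tokenize(text)
assert detokenize(tokens) == text  # the round trip restores the input
```

A real reversible tokenizer would typically encode spacing more compactly than emitting whitespace tokens; the point here is only the check that detokenizing the tokens gives back the original string.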
New datasets:
- TREC (#92, thanks @bmccann!)
- IMDb (#93, thanks @bmccann!)
- Multi30k (#116, thanks @bmccann!)
- IWSLT (#126, #128, thanks @bmccann!)
- WMT14 (#138)
Bugfixes:
- Fix pretrained word vector loading (#99, thanks @matt-peters!)
- Fix JSON loader silently ignoring requested columns not present in the file (#105, thanks @nelson-liu!)
- Many fixes for Python 2, especially surrounding Unicode (#105, #112, #135, #153, thanks @nelson-liu!)
- Fix Pipeline.call behavior (#113, thanks @nelson-liu!)
- Fix README example (#134, thanks @czhang99!)
- Fix WikiText2 loader (#138)
- Fix typo in MT loader (#142, thanks @sivareddyg!)
- Fix Example.fromlist behavior on non-strings (#145)
- Update test set URL for Multi30k (#149)
- Fix SNLI data loader (#150, thanks @sivareddyg!)
- Fix language modeling iterator (#151)
- Remove transpose as a side effect of Field.reverse (#155)