github huggingface/transformers v3.1.0
Pegasus, DPR, self-documented outputs, new pipelines and MT support


Pegasus, mBART, DPR, self-documented outputs and new pipelines

Pegasus

The Pegasus model from PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization by Jingqing Zhang, Yao Zhao, Mohammad Saleh, Peter J. Liu, was added to the library in PyTorch.

Model implemented as a collaboration between Jingqing Zhang and @sshleifer in #6340

  • PegasusForConditionalGeneration (torch version) #6340
  • Add Pegasus fine-tuning script #6811 (warning: very slow)
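
Below is a minimal summarization sketch; the google/pegasus-xsum checkpoint and the example text are illustrative assumptions, not part of this release note.

# Minimal Pegasus summarization sketch (checkpoint name is an illustrative assumption)
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

text = "PG&E scheduled the blackouts in response to forecasts for high winds amid dry conditions."
batch = tokenizer([text], truncation=True, padding="longest", return_tensors="pt")
summary_ids = model.generate(**batch)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))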

DPR

The DPR model from Dense Passage Retrieval for Open-Domain Question Answering by Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih was added to the library in PyTorch.
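
As a rough usage sketch (the checkpoint name below is an assumption for illustration), the question encoder maps a query to a dense vector that can be compared against encoded passages:

# Minimal DPR question-encoding sketch (checkpoint name is an illustrative assumption)
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
model = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

inputs = tokenizer("What is the capital of France?", return_tensors="pt")
question_embedding = model(**inputs)[0]  # pooled dense vector used for passage retrieval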

DeeBERT

The DeeBERT model from DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference by Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, Jimmy Lin has been added to the examples/ folder alongside its training script, in PyTorch.

  • Add DeeBERT (entropy-based early exiting for *BERT) #5477 (@ji-xin)

Self-documented outputs

As well as returning tuples, PyTorch and TensorFlow models can now return an appropriate subclass of ModelOutput. A ModelOutput is a dataclass containing everything a model returns, which allows for easier inspection and for self-documenting model outputs.

Models return tuples by default, and return self-documented outputs if the return_dict configuration flag is set to True or if the return_dict=True keyword argument is passed to the forward/call method.

Summary of the behavior:

# The new outputs are opt-in, you have to activate them explicitly with `return_dict=True`
# Either at instantiation
model = BertForSequenceClassification.from_pretrained('bert-base-cased', return_dict=True)
# Or when calling the model
outputs = model(**inputs, return_dict=True)

# You can access the elements of the outputs with
# (1) named attributes
loss = outputs.loss
logits = outputs.logits

# (2) their names as strings like a dict
loss = outputs["loss"]
logits = outputs["logits"]

# (3) their index as integers or slices in the pre-3.1.0 outputs tuples
loss = outputs[0]
logits = outputs[1]
loss, logits = outputs[:2]

# One **breaking behavior** of these new outputs (which is the reason you have to opt in to use them):
# Iterating over the outputs now returns the names (keys) instead of the values:
print([element for element in outputs])
>>> ['loss', 'logits']
# Thus you cannot unpack the output like pre-3.1.0 (you get the string names instead of the values):
# (But you can query a slice like indicated in (3) above)
loss_key, logits_key = outputs

Encoder-Decoder framework

The encoder-decoder framework has been enhanced to allow more encoder-decoder model combinations, e.g. Bert2Bert, Bert2GPT2, Roberta2Roberta, Longformer2Roberta, and more.
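
For illustration, here is a minimal Bert2Bert sketch; the bert-base-uncased checkpoints are assumptions, and any compatible encoder/decoder checkpoints can be combined.

# Minimal Bert2Bert sketch using the encoder-decoder framework
from transformers import BertTokenizer, EncoderDecoderModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
# for a quick forward pass we simply reuse the encoder inputs as decoder inputs
outputs = model(input_ids=inputs["input_ids"], decoder_input_ids=inputs["input_ids"])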

TensorFlow as a first-class citizen

As we continue working towards making TensorFlow a first-class citizen, we keep improving the TensorFlow API and models.

Machine Translation

MarianMTModel

  • en-zh and 357 other checkpoints for machine translation were added from the Helsinki-NLP group's Tatoeba Project (@sshleifer + @jorgtied). There are now > 1300 supported pairs for machine translation.
  • Marian converter updates #6342 (@sshleifer)
  • Marian distill scripts + integration test #6799 (@sshleifer)
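
A minimal en→zh translation sketch, assuming the Helsinki-NLP/opus-mt-en-zh checkpoint mentioned above:

# Minimal MarianMT translation sketch (en -> zh)
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["The weather is nice today."], return_tensors="pt", padding=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))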

mBART

The mBART model from Multilingual Denoising Pre-training for Neural Machine Translation can now be accessed through MBartForConditionalGeneration.
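
A minimal en→ro sketch, assuming the facebook/mbart-large-en-ro checkpoint:

# Minimal mBART translation sketch (checkpoint name is an illustrative assumption)
from transformers import MBartForConditionalGeneration, MBartTokenizer

model_name = "facebook/mbart-large-en-ro"
tokenizer = MBartTokenizer.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

batch = tokenizer.prepare_seq2seq_batch(["UN Chief Says There Is No Military Solution in Syria"], return_tensors="pt")
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))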

examples/seq2seq

  • examples/seq2seq/finetune.py supports --task translation
  • All sequence-to-sequence tokenizers (T5, Bart, Marian, Pegasus) expose a prepare_seq2seq_batch method that makes batches for sequence-to-sequence training.
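
A rough sketch of prepare_seq2seq_batch for building a training batch; the exact keys of the returned dict vary by tokenizer and version, so they are only indicated in comments, and the checkpoint name is an illustrative assumption.

# Sketch of prepare_seq2seq_batch for training data
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
batch = tokenizer.prepare_seq2seq_batch(
    src_texts=["I like machine translation."],
    tgt_texts=["我喜欢机器翻译。"],
    return_tensors="pt",
)
# the batch holds the encoder inputs (input ids, attention mask) plus the encoded
# targets, ready to be passed to a seq2seq model during fine-tuning
print(batch.keys())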

PRs:

New documentation

Several new documentation pages have been added and older documentation has been tweaked to be more accurate and understandable. An "Open in Colab" button has been added to the tutorial pages.

Trainer updates

New additions to the Trainer

  • Added data collator for permutation (XLNet) language modeling and related calls #5522 (@shngt)
  • Trainer support for iterabledataset #5834 (@Pradhy729)
  • Adding PaddingDataCollator #6442 (@sgugger)
  • Add hyperparameter search to Trainer #6576 (@sgugger)
  • [examples] Add trainer support for question-answering #4829 (@patil-suraj)
  • Adds comet_ml to the list of auto-experiment loggers #6176 (@dsblank)
  • Dataset and DataCollator for BERT Next Sentence Prediction (NSP) task #6644 (@HuangLianzhe)
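
As a rough sketch of the new hyperparameter search (optuna or Ray Tune must be installed; train_dataset and eval_dataset are placeholders assumed to exist):

# Minimal Trainer.hyperparameter_search sketch (datasets are placeholders)
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

def model_init():
    # the Trainer re-instantiates the model at the start of every trial
    return BertForSequenceClassification.from_pretrained("bert-base-cased")

trainer = Trainer(
    args=TrainingArguments(output_dir="hp_search"),
    model_init=model_init,
    train_dataset=train_dataset,  # placeholder: your encoded training set
    eval_dataset=eval_dataset,    # placeholder: your encoded evaluation set
)

best_run = trainer.hyperparameter_search(n_trials=10, direction="minimize")
print(best_run.hyperparameters)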

New models & model architectures

The following model architectures have been added to the library

Regression testing on TPU & TPU CI

Thanks to @zcain117 we now have access to TPU CI for the PyTorch/xla framework. This enables regression testing on the TPU aspects of the Trainer, and offers very simple regression testing on model training performance.

New pipelines

New pipelines have been added:

Community notebooks

Centralized logging

Logging is now centralized. The library offers methods to manage the verbosity level of all loggers used by the library; see the logging documentation for details.
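
For example, the verbosity of every logger in the library can be set in one place:

# Sketch of the centralized logging controls
import transformers

transformers.logging.set_verbosity_info()   # show info-level messages from all library loggers
transformers.logging.set_verbosity_error()  # silence everything below error level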

Bug fixes and improvements
