Longformer
- Longformer (@ibeltagy)
- Longformer for QA (@patil-suraj + @patrickvonplaten)
- Longformer fast tokenizer (@patil-suraj)
- Longformer for sequence classification (@patil-suraj)
- Longformer for token classification (@patil-suraj)
- Longformer for multiple choice (@patrickvonplaten)
- More user-friendly handling of the global attention mask vs. the local attention mask (@patrickvonplaten); see the sketch after this list
- Fix Longformer attention mask type casting when using APEX (@peskotivesgeroff)
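As a rough illustration of the more user-friendly interface, here is a minimal sketch (not taken from the release itself; the checkpoint choice and the decision to put global attention on the first token are just examples): global attention is passed as its own `global_attention_mask` argument instead of being folded into the regular attention mask.

```python
# Minimal sketch: local attention everywhere, global attention on the
# first token only (a common choice for classification-style tasks).
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("A very long document ...", return_tensors="pt")

# 0 = local (sliding-window) attention, 1 = global attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # give the <s> token global attention

outputs = model(**inputs, global_attention_mask=global_attention_mask)
```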
New community notebooks!
- Long Sequence Modeling with Reformer (@patrickvonplaten)
- Fine-tune BART for summarization (@ohmeow)
- Fine-tune a pre-trained Transformer on anyone's tweets (@borisdayma, @lavanyashukla)
- A step-by-step guide to tracking Hugging Face model performance with wandb (@jxmorris12, @lavanyashukla)
- Fine-tune Longformer for QA (@patil-suraj)
- Pretrain Longformer (@ibeltagy)
- Fine-tune T5 for sentiment span extraction (@enzoampil)
URLs to model weights are no longer hardcoded (@julien-c)
Archive maps were dictionaries linking pre-trained models to their S3 URLs. Since the arrival of the model hub, these have become obsolete.
⚠️ This PR is breaking for the following models: BART, Flaubert, bert-japanese, bert-base-finnish, bert-base-dutch. ⚠️
Those models now have to be instantiated with their full model ID (see the sketch after this list):
"cl-tohoku/bert-base-japanese"
"cl-tohoku/bert-base-japanese-whole-word-masking"
"cl-tohoku/bert-base-japanese-char"
"cl-tohoku/bert-base-japanese-char-whole-word-masking"
"TurkuNLP/bert-base-finnish-cased-v1"
"TurkuNLP/bert-base-finnish-uncased-v1"
"wietsedv/bert-base-dutch-cased"
"flaubert/flaubert_small_cased"
"flaubert/flaubert_base_uncased"
"flaubert/flaubert_base_cased"
"flaubert/flaubert_large_cased"all variants of "facebook/bart"
Update: ⚠️ This PR is also breaking for ALBERT from TensorFlow. See issue #4806 for discussion and resolution. ⚠️
Fixes and improvements
- Fix `convert_token_type_ids_from_sequences` for fast tokenizers (@n1t0, #4503)
- Fixed the default tokenizer of the summarization pipeline (@sshleifer, #4506)
- The `max_len` attribute is now more robust, and warns the user about deprecation (@mfuntowicz, #4528)
- Added type hints to `modeling_utils.py` (@bglearning, #3911)
- MMBT model now has `nn.Module` as a superclass (@shoarora, #4533)
- Fixing tokenization of `extra_id` symbols in the T5 tokenizer (@mansimov, #4353)
- Slow GPU tests run daily (@julien-c, #4465)
- Removed PyTorch artifacts in TensorFlow XLNet implementation (@ZhuBaohe, #4410)
- Fixed the T5 Cross Attention Position Bias (@ZhuBaohe, #4499)
- The `transformers-cli` is now cross-platform (@BramVanroy, #4131; @patrickvonplaten, #4614)
- GPT-2, CTRL: Accept `input_ids` and `past` of variable length (@patrickvonplaten, #4581)
- Added back `--do_lower_case` to SQuAD examples
- Correct framework test requirement for language generation tests (@sshleifer, #4616)
- Fix `add_special_tokens` on fast tokenizers (@n1t0, #4531)
- MNLI & SST-2 bugs were fixed (@stdcoutzyx, #4546)
- Fixed BERT example for NSP and multiple choice (@siboehm, #3953)
- Encoder/decoder: fix initialization and save/load bug (@patrickvonplaten, #4680)
- Fix ONNX export input name ordering (@RensDimmendaal, #4641)
- Configuration: ensure that `id2label` always takes precedence over `num_labels` (@julien-c, direct commit to `master`); see the sketch at the end of this list
- Make docstring match argument (@sgugger, #4711)
- Specify PyTorch versions for examples (@LysandreJik, #4710)
- Override `get_vocab` for fast tokenizers (@mfuntowicz, #4717)
- Tokenizer should not add special tokens for text generation (@patrickvonplaten, #4686)
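Regarding the `id2label` precedence fix above, a minimal sketch (the label names are hypothetical) of the intended behavior: when `id2label` is set on a config, the number of labels is derived from it rather than from a separately supplied `num_labels`.

```python
# Minimal sketch with hypothetical labels: num_labels is derived
# from id2label when id2label is provided.
from transformers import BertConfig

config = BertConfig(id2label={0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"})
print(config.num_labels)  # -> 3, taken from id2label, not the default of 2
```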