Reformer (@patrickvonplaten)
- Added a new model, Reformer (https://arxiv.org/abs/2001.04451), to the library. The original Trax code (https://github.com/google/trax/tree/master/trax/models/reformer) was translated to PyTorch.
- Reformer uses chunked attention and reversible layers to model sequences as long as 500,000 tokens.
- Reformer is currently available as a causal language model and will soon also be available as an encoder-only ("BERT"-like) model.
- Two pretrained checkpoints have been uploaded: https://huggingface.co/models?search=google%2Freformer
- https://huggingface.co/google/reformer-enwik8 is the first character-level language model in the library.
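As a quick illustration, here is a minimal hedged sketch of loading one of the uploaded checkpoints as a causal LM and sampling from it (the prompt and generation settings are illustrative, not from the release):

```python
# Minimal sketch: load an uploaded Reformer checkpoint as a causal LM and
# sample from it. Assumes a transformers version that includes Reformer and
# the sentencepiece package.
from transformers import ReformerModelWithLMHead, ReformerTokenizer

tokenizer = ReformerTokenizer.from_pretrained("google/reformer-crime-and-punishment")
model = ReformerModelWithLMHead.from_pretrained("google/reformer-crime-and-punishment")

input_ids = tokenizer.encode("A few months later", return_tensors="pt")
generated = model.generate(input_ids, do_sample=True, max_length=100)
print(tokenizer.decode(generated[0]))
```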
Additional architectures
- The `ElectraForSequenceClassification` model was added by @liuzzi
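Below is a hedged usage sketch of the new head (the checkpoint and `num_labels` are illustrative assumptions):

```python
# Hedged sketch: sequence classification with the newly added ELECTRA head.
# Checkpoint and label count are illustrative, not from the release notes.
import torch
from transformers import ElectraForSequenceClassification, ElectraTokenizer

tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
model = ElectraForSequenceClassification.from_pretrained(
    "google/electra-small-discriminator", num_labels=2
)

inputs = tokenizer.encode("This release looks great!", return_tensors="pt")
with torch.no_grad():
    logits = model(inputs)[0]  # models return tuples here; logits come first
print(logits.argmax(dim=-1))
```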
Trainer tweaks and fixes (@LysandreJik, @julien-c)
TPU (@LysandreJik):
- Model saving, as well as optimizer and scheduler saving mid-training, was hanging
- Fixed the optimizer weight updates
Trainer (@julien-c)
- Fixed `nn.DataParallel` compatibility for PyTorch v1.5.0
- Distributed evaluation: `SequentialDistributedSampler` + gather all results (see the sketch after this list)
- Move model to correct device
- Map optimizer to correct device after loading from checkpoint (@shaoyent)
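For illustration, a minimal sketch of the gather step used in distributed evaluation (the helper name and padding handling are assumptions, not necessarily the Trainer's exact internals): each process evaluates a contiguous shard of the dataset, then results are all-gathered in rank order and truncated back to the true dataset length.

```python
# Hedged sketch of gathering per-process evaluation results after sharding
# the dataset with a sequential distributed sampler.
import torch
import torch.distributed as dist

def distributed_concat(tensor: torch.Tensor, num_total_examples: int) -> torch.Tensor:
    # All-gather the shard from every process; shards have equal size because
    # the sampler pads the dataset to be evenly divisible by the world size.
    output_tensors = [tensor.clone() for _ in range(dist.get_world_size())]
    dist.all_gather(output_tensors, tensor)
    concat = torch.cat(output_tensors, dim=0)
    # Drop the padding samples to recover the original dataset length.
    return concat[:num_total_examples]
```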
QOL: Tokenization, Pipelines
- New method for all tokenizers: `tokenizer.decode_batch`, to decode an entire batch (@sshleifer)
- The NER pipeline now returns entity groups (@enzoampil)
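A short hedged sketch of both additions (the checkpoint and the `grouped_entities` flag are assumptions on our part):

```python
# Hedged sketch of the two quality-of-life additions; checkpoint choice and
# the grouped_entities flag are illustrative assumptions.
from transformers import AutoTokenizer, pipeline

# Decode a whole batch of encoded sequences in a single call.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded = tokenizer.batch_encode_plus(["Hello world!", "How are you?"])
print(tokenizer.decode_batch(encoded["input_ids"]))

# NER pipeline returning grouped entities instead of per-token predictions.
ner = pipeline("ner", grouped_entities=True)
print(ner("Hugging Face Inc. is based in New York City."))
```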
ONNX Conversion script (@mfuntowicz)
- Added a conversion script to convert both PyTorch and TensorFlow models to ONNX.
- Added a notebook explaining how it works
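For reference, a hedged sketch of invoking the conversion programmatically (the module path, `convert` signature, and checkpoint are assumptions; the script and notebook are the authoritative usage):

```python
# Hedged sketch: convert a PyTorch checkpoint to ONNX with the new script's
# convert() helper. Exact signature may differ; see the notebook.
from transformers.convert_graph_to_onnx import convert

convert(
    framework="pt",                      # "pt" = PyTorch, "tf" = TensorFlow
    model="bert-base-cased",             # illustrative checkpoint
    output="onnx/bert-base-cased.onnx",  # destination for the ONNX graph
    opset=11,                            # ONNX opset version
)
```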
Community notebooks
We've started adding community notebooks to the repository. Three notebooks have made their way into our codebase.
Predict stage for GLUE task, easy submit to gluebenchmark.com
- Adds a predict stage for GLUE tasks and generates result files that can be submitted to gluebenchmark.com (@stdcoutzyx)
Fixes and improvements
- Support flake8 3.8 (@julien-c)
- Tests are now faster thanks to smaller dummy models (@sshleifer)
- Fixed the eval loss in the trainer (@patil-suraj)
- Fixed the `p_mask` in SQuAD pre-processing (@LysandreJik)
- GitHub Actions PyTorch tests are no longer pinned to `torch==1.4.0` (@mfuntowicz)
- Fixed the multiple-choice script with overflowing tokens (@LysandreJik)
- Allow for `None` values in `GradientAccumulator` (@jarednielsen, improved by @jplu)
- MBart tokenizer saving/loading was fixed (@Mehrad0711)
- TF generation: fixed an issue with batch generation of outputs with different lengths (@patrickvonplaten)
- Fixed FP16 support in the T5 model (@patrickvonplaten)
- `run_language_modeling` fix: actually use the `overwrite_cache` argument (@borisdayma)
- Better, version-compatible way to get the learning rate in the trainer (@rakeshchada)
- Fixed the slow tests that were failing on GPU (@sshleifer, @patrickvonplaten, @LysandreJik)
- ONNX conversion tokenizer fix (@RensDimmendaal)
- Correct TF formatting to exclude LayerNorms from weight decay (@oliverastrand)
- Removed a deprecation warning (@Colanim)
- Fixed a missing `no_grad` in the second pruning pass in `run_bertology` (@TobiasLee)