Better backward-compatibility for tokenizers following v3.0.0 refactoring
Version v3.0.0 included a refactoring of the tokenizers' backend to allow a simpler and more flexible user-facing API.
This refactoring was conducted with a particular focus on keeping backward compatibility with the v2.X encoding, truncation, and padding API, but it still led to two breaking changes that could have been avoided.
This patch brings back better backward compatibility by implementing the following updates:
- the `prepare_for_model` method is now publicly exposed again for both slow and fast tokenizers, with an API compatible with both the v2.X truncation/padding API and the v3.0 recommended API (see the sketch after this list).
- the truncation strategy now defaults again to `longest_first` instead of `only_first`.
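
As a quick illustration, here is a minimal sketch of the re-exposed `prepare_for_model` method together with the restored `longest_first` truncation default; the checkpoint name and the input strings are placeholders chosen for this example, not taken from the release notes.

```python
# Minimal sketch of the re-exposed prepare_for_model method.
# "bert-base-uncased" and the example strings are assumptions for illustration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# prepare_for_model operates on token ids, so tokenize and convert first.
question_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize("Who wrote it?"))
context_ids = tokenizer.convert_tokens_to_ids(
    tokenizer.tokenize("The novel was written by an anonymous author in 1850.")
)

# With truncation enabled and no explicit strategy, the strategy defaults
# again to "longest_first": the longer of the two sequences is trimmed
# token by token until the pair fits within max_length.
encoded = tokenizer.prepare_for_model(
    question_ids,
    pair_ids=context_ids,
    max_length=16,
    truncation=True,
)
print(tokenizer.decode(encoded["input_ids"]))
```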
Bug fixes and improvements:
- Better support for TransfoXL tokenizer when using TextGenerationPipeline #5465 (@TevenLeScao)
- Fix use of mems in Transformer-XL generations #4826 (@tommccoy1)
- Fixing a bug in the NER pipeline which led to discarding the last identified entity #5439 (@mfuntowicz and @enzoampil)
- Better QA pipelines #5429 (@mfuntowicz)
- Add Question-Answering and MLM heads to the Reformer model #5433 (@patrickvonplaten)
- Refactoring the Longformer #5219 (@patrickvonplaten)
- Various fixes on tokenizers and tests (@sshleifer)
- Many improvements to the doc and tutorials (@sgugger)
- Fix TensorFlow dataset generator in run_glue #4881 (@jplu)
- Update Bertabs example to work again #5355 (@MichaelJanz)
- Move GenerationMixin to separate file #5254 (@yjernite)