✨ Major features and improvements
- Update the token exception handling mechanism to allow arbitrary functions to be used as token exception matchers.
- Improve how tokenizer exceptions for English contractions and punctuation are generated.
- Update language data for Hungarian and Swedish tokenization.
- Update to use Thinc v6 to prepare for spaCy v2.0.
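
As a rough illustration of the first item above: a token exception matcher can be thought of as any callable that takes a string and reports whether it should be kept as a single token. The sketch below does not use spaCy's tokenizer. `simple_tokenize`, `is_dotted_abbreviation` and the `token_match` parameter name are hypothetical and exist only for this example; the real tokenizer's interface and splitting rules may differ.

```python
import re

# Hypothetical matcher: dotted abbreviations such as "e.g." or "U.S."
_DOTTED = re.compile(r"(?:[A-Za-z]\.){2,}")


def is_dotted_abbreviation(string):
    """Return True if the whole string looks like a dotted abbreviation."""
    return _DOTTED.fullmatch(string) is not None


def simple_tokenize(text, token_match=None):
    """Toy tokenizer (not spaCy's): split on whitespace, then split off
    trailing punctuation unless token_match accepts the chunk as-is."""
    tokens = []
    for chunk in text.split():
        # The matcher is an arbitrary function; a truthy result means
        # "keep this chunk as one token".
        if token_match is not None and token_match(chunk):
            tokens.append(chunk)
            continue
        # Otherwise peel trailing punctuation off one character at a time.
        suffixes = []
        while len(chunk) > 1 and chunk[-1] in ".,!?":
            suffixes.append(chunk[-1])
            chunk = chunk[:-1]
        tokens.append(chunk)
        tokens.extend(reversed(suffixes))
    return tokens


print(simple_tokenize("Use e.g. spaCy."))
# ['Use', 'e.g', '.', 'spaCy', '.']
print(simple_tokenize("Use e.g. spaCy.", token_match=is_dotted_abbreviation))
# ['Use', 'e.g.', 'spaCy', '.']
```

Because the matcher is just a function, it can wrap a regular expression, a set lookup or any other check without changes to the tokenizer loop.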
🔴 Bug fixes
- Fix issue #326: Tokenizer is now more consistent and handles abbreviations correctly.
- Fix issue #344: Tokenizer now handles URLs correctly.
- Fix issue #483: Period after two or more uppercase letters is split off in tokenizer exceptions.
- Fix issue #631: Add `richcmp` method to `Token`.
- Fix issue #718: Contractions with `She` are now handled correctly.
- Fix issue #736: Times are now tokenized with correct string values.
- Fix issue #743: `Token` is now hashable.
- Fix issue #744: `were` and `Were` are now excluded correctly from contractions.
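
To make the comparison and hashing fixes (#631, #743) more concrete, here is a minimal pure-Python sketch of what comparable and hashable token objects allow. `SketchToken` is a hypothetical stand-in, not spaCy's `Token`, and comparing tokens by document and position is an assumption made for this example. In plain Python, a class that defines `__eq__` without `__hash__` becomes unhashable, so the two methods are defined together.

```python
from functools import total_ordering


@total_ordering
class SketchToken:
    """Hypothetical stand-in for a token: a position within a document."""

    def __init__(self, doc_id, i, text):
        self.doc_id = doc_id  # assumed identifier of the parent document
        self.i = i            # index of the token within that document
        self.text = text

    def _key(self):
        # Assumption: identity is (document, position), not the text.
        return (self.doc_id, self.i)

    def __eq__(self, other):
        if not isinstance(other, SketchToken):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, SketchToken):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        # Hash on the same key used for equality, so equal tokens
        # always hash equally.
        return hash(self._key())


a = SketchToken(0, 1, "were")
b = SketchToken(0, 1, "were")
c = SketchToken(0, 2, "n't")

print(a == b)          # True
print(a < c)           # True
print(len({a, b, c}))  # 2
```

With both behaviours in place, tokens can be sorted, compared and used as set members or dictionary keys.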
📋 Tests
- Modernise and reorganise all tests and remove model dependencies where possible.
- Improve test speed to ~20s for basic tests (from previously >80s) and ~100s including models (from previously >200s).
- Add fixtures for spaCy components and test utilities, e.g. to create a `Doc` object manually.
- Add documentation for tests to explain conventions and organisation.
👥 Contributors
Thanks to @oroszgy, @magnusburton, @guyrosin and @danielhers for the pull requests!