✨ Major features and improvements
- NEW: Alpha support for Dutch tokenization.
- Reorganise and improve format of language data.
- Add shared tag map, entity rules, emoticons and punctuation to language data.
- Convert entity rules, morphological rules and lemmatization rules from JSON to Python.
- Update language data for English, German, Spanish, French, Italian and Portuguese.
🔴 Bug fixes
- Fix issue #649: Update and reorganise stop lists.
- Fix issue #672: Make `token.ent_iob_` return unicode.
- Fix issue #674: Add missing lemmas for contracted forms of "be" to `TOKENIZER_EXCEPTIONS`.
- Fix issue #683: `Morphology` class now supplies tag map value for the special space tag if it's missing.
- Fix issue #684: Ensure `spacy.en.English()` loads the Glove vector data if available. Previously, this was inconsistent with the behaviour of `spacy.load('en')`.
- Fix issue #685: Expand `TOKENIZER_EXCEPTIONS` with unicode apostrophe (`’`).
- Fix issue #689: Correct typo in `STOP_WORDS`.
- Fix issue #691: Add tokenizer exceptions for "gonna" and "Gonna".
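
To illustrate the kind of change behind issues #685 and #691, here is a minimal pure-Python sketch of a tokenizer exception table being expanded with unicode-apostrophe variants. It does not use spaCy itself: the `ORTH` and `LEMMA` keys are plain-string stand-ins, the table entries are illustrative, and `add_unicode_apostrophes` is a hypothetical helper, not a function from the library.

```python
# Stand-ins for attribute keys (assumption: plain strings, not spaCy symbols).
ORTH = "orth"
LEMMA = "lemma"

# Illustrative exception table: each surface string maps to the list of
# tokens it should be split into. The entries are examples only.
TOKENIZER_EXCEPTIONS = {
    "gonna": [{ORTH: "gon", LEMMA: "go"}, {ORTH: "na", LEMMA: "to"}],
    "Gonna": [{ORTH: "Gon", LEMMA: "go"}, {ORTH: "na", LEMMA: "to"}],
    "I'm": [{ORTH: "I"}, {ORTH: "'m", LEMMA: "be"}],
    "you're": [{ORTH: "you"}, {ORTH: "'re", LEMMA: "be"}],
}


def add_unicode_apostrophes(exceptions):
    """Return a copy of `exceptions` that also covers the U+2019 apostrophe.

    Hypothetical helper: for every key containing an ASCII apostrophe, add a
    variant key (and matching ORTH values) using the unicode apostrophe.
    """
    expanded = dict(exceptions)
    for string, tokens in exceptions.items():
        if "'" not in string:
            continue
        variant = string.replace("'", "\u2019")
        expanded[variant] = [
            {
                # Only the surface form changes; lemmas stay as they are.
                key: value.replace("'", "\u2019") if key == ORTH else value
                for key, value in token.items()
            }
            for token in tokens
        ]
    return expanded


expanded = add_unicode_apostrophes(TOKENIZER_EXCEPTIONS)
```

With a table like this, both `I'm` and `I’m` resolve to the same split and the same lemma for the contracted "be", while entries without an apostrophe, such as "gonna", are left unchanged.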
⚠️ Backwards incompatibilities
No changes to the public, documented API, but the previously undocumented language data and model initialisation processes have been refactored and reorganised. If you were relying on the `bin/init_model.py` script, see the new spaCy Developer Resources repo. Code that references internals of the `spacy.en` or `spacy.de` packages should also be reviewed before updating to this version.
📖 Documentation and examples
- NEW: "Adding languages" workflow.
- NEW: "Part-of-speech tagging" workflow.
- NEW: spaCy Developer Resources repo – scripts, tools and resources for developing spaCy.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @dafnevk, @jvdzwaan, @RvanNieuwpoort, @wrvhage, @jaspb, @savvopoulos and @davedwards for the pull requests!