⚠️ This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.
✨ New features and improvements
- NEW: Pretrained model families for Chinese, Danish, Japanese, Polish and Romanian, as well as larger models with vectors for Dutch, German, French, Italian, Greek, Lithuanian, Portuguese and Spanish. 29 new models and 46 model packages in total!
- NEW: 2-4× faster loading times for models with vectors and 2× smaller packages.
- NEW: Alpha support for Armenian, Gujarati and Malayalam.
- NEW: Lookup lemmatization for Polish.
- NEW: Allow `Matcher` to match on both
- NEW: Add
- Improve language data for Danish, Dutch, French, German, Italian, Lithuanian, Norwegian, Romanian and Spanish to better match UD corpora.
- Update language data for Danish, Kannada, Korean, Persian, Swedish and Urdu.
- Add support for
- Switch from
- Improve punctuation used in sentencizer.
- Switch to new and more consistent alignment method in
- Reduce stored lexemes data and move non-derivable features to
🔴 Bug fixes
- Fix issue #5056: Introduce support for matching
- Fix issue #5086: Remove
- Fix issue #5131: Improve data processing in named entity linking scripts.
- Fix issue #5137: Fix passing of component configuration to component.
- Fix issue #5144: Fix sentence comparison in test util.
- Fix issue #5166: Fix handling of `exclusive_classes` in textcat ensemble.
- Fix issue #5170: Set rank for new vector in
- Fix issue #5181: Prevent `None` values in gold fields.
- Fix issue #5191: Fix `GoldParse` initialization when the number of tokens has changed.
- Fix issue #5193: Correctly pin
- Fix issue #5200: Fix minor bugs in train CLI.
- Fix issue #5216: Modify `Vectors.resize` to work with
- Fix issue #5228: Raise error for inplace resize with new vector dimension.
- Fix issue #5230: Fix `unittest` warnings when saving a model.
- Fix issue #5257: Use inline flags in
- Fix issue #5278, #5359: Add missing `__init__.py` files to language data tests.
- Fix issue #5281: Fix comparison predicate handling for
- Fix issue #5287: Normalize
- Fix issue #5292: Fix typo in option name
- Fix issue #5303: Use `max(uint64)` for OOV lexeme rank.
- Fix issue #5311: Fix alignment of cards on landing page.
- Fix issue #5320: Fix `most_similar` for vectors with unused rows.
- Fix issue #5344: Prevent pip from installing spaCy on Python 3.4.
- Fix issue #5356: Fix bug in `Span.similarity` that could trigger
- Fix issue #5361: Fix problems with lower and whitespace in variants.
- Fix issue #5373: Improve exceptions for `'d` (would/had) in English.
- Fix issue #5387: Fix logic in train CLI timing eval on CPU/GPU.
- Fix issue #5393, #5458: Fix check for overlapping spans in noun chunks.
- Fix issue #5429: Modify array type to accommodate
- Fix issue #5430: Check that row is within bounds when adding vector.
- Fix issue #5435: Use
- Fix issue #5436: Fix
- Fix issue #5450: Disallow merging 0-length spans.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the `spacy validate` command to find out which models need updating, and print update instructions.
- If you're training new models, you'll want to install the package `spacy-lookups-data`, which now includes both the lemmatization tables (as in v2.2) and the normalization tables (new in v2.3). If you're using pretrained models, nothing changes, because the relevant tables are included in the model packages.
- Due to the updated Universal Dependencies training data, the fine-grained part-of-speech tags will change for many provided language models. The coarse-grained part-of-speech tagset remains the same, but the mapping from particular fine-grained to coarse-grained tags may show minor differences.
- For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech tagsets contain new merged tags related to contracted forms, such as `"au"`, which maps to UPOS `ADP` based on the head `"à"`. This increases the accuracy of the models by improving the alignment between spaCy's tokenization and Universal Dependencies multi-word tokens used for contractions.
- spaCy's custom warnings have been replaced with native Python `warnings`. Instead of setting `SPACY_WARNING_IGNORE`, use the `warnings` filters to manage warnings.
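
As a rough illustration of the last point, the sketch below silences one warning by its code using only the standard library `warnings` module. The helper `emit_library_style_warning` and both warning messages are hypothetical stand-ins (they are not spaCy calls), and it assumes the code you want to ignore appears in square brackets at the start of the message.

```python
import warnings


def emit_library_style_warning():
    # Hypothetical stand-in for a library call that emits a coded warning.
    # The code "[W008]" and the message text are illustrative only.
    warnings.warn("[W008] Example of a coded library warning.", UserWarning)


# record=True collects warnings in a list instead of printing them,
# which makes the effect of the filters easy to inspect.
with warnings.catch_warnings(record=True) as caught:
    # Show every warning by default...
    warnings.simplefilter("always")
    # ...except those whose message starts with the given code.
    # The message argument is a regex matched against the start of the text.
    warnings.filterwarnings("ignore", message=r"\[W008\]")

    emit_library_style_warning()
    warnings.warn("[W999] Another hypothetical warning.", UserWarning)

print([str(w.message) for w in caught])
```

Only the second warning should end up in `caught`, because the more recently added `ignore` filter takes precedence over the earlier `always` filter. In application code you would typically call `warnings.filterwarnings` once at startup rather than inside `catch_warnings`.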
📖 Documentation and examples
- Fix various typos and inconsistencies.
- Add new projects to the spaCy Universe.
- `bin/wiki_entity_linking` scripts for Wikipedia to
🔥 ICYMI: We recently updated the free and interactive spaCy course to include translations for German (with German NLP examples), Spanish (with Spanish NLP examples) and Japanese, as well as videos for English and German. Translations for Chinese (with Chinese NLP examples), French (with French NLP examples) and Russian coming soon!
📦 Model packages (46)
| Model | Language | Version | Vectors |
| ------------------- | ---------- | ------: | ----: |
zh_core_web_sm | Chinese | 2.3.0 | 𐄂 |
zh_core_web_md | Chinese | 2.3.0 | ✓ |
zh_core_web_lg | Chinese | 2.3.0 | ✓ |
da_core_news_sm | Danish | 2.3.0 | 𐄂 |
da_core_news_md | Danish | 2.3.0 | ✓ |
da_core_news_lg | Danish | 2.3.0 | ✓ |
nl_core_news_sm | Dutch | 2.3.0 | 𐄂 |
nl_core_news_md | Dutch | 2.3.0 | ✓ |
nl_core_news_lg | Dutch | 2.3.0 | ✓ |
en_core_web_sm | English | 2.3.0 | 𐄂 |
en_core_web_md | English | 2.3.0 | ✓ |
en_core_web_lg | English | 2.3.0 | ✓ |
fr_core_news_sm | French | 2.3.0 | 𐄂 |
fr_core_news_md | French | 2.3.0 | ✓ |
fr_core_news_lg | French | 2.3.0 | ✓ |
de_core_news_sm | German | 2.3.0 | 𐄂 |
de_core_news_md | German | 2.3.0 | ✓ |
de_core_news_lg | German | 2.3.0 | ✓ |
el_core_news_sm | Greek | 2.3.0 | 𐄂 |
el_core_news_md | Greek | 2.3.0 | ✓ |
el_core_news_lg | Greek | 2.3.0 | ✓ |
it_core_news_sm | Italian | 2.3.0 | 𐄂 |
it_core_news_md | Italian | 2.3.0 | ✓ |
it_core_news_lg | Italian | 2.3.0 | ✓ |
ja_core_news_sm | Japanese | 2.3.0 | 𐄂 |
ja_core_news_md | Japanese | 2.3.0 | ✓ |
ja_core_news_lg | Japanese | 2.3.0 | ✓ |
lt_core_news_sm | Lithuanian | 2.3.0 | 𐄂 |
lt_core_news_md | Lithuanian | 2.3.0 | ✓ |
lt_core_news_lg | Lithuanian | 2.3.0 | ✓ |
nb_core_news_sm | Norwegian Bokmål | 2.3.0 | 𐄂 |
nb_core_news_md | Norwegian Bokmål | 2.3.0 | ✓ |
nb_core_news_lg | Norwegian Bokmål | 2.3.0 | ✓ |
pl_core_news_sm | Polish | 2.3.0 | 𐄂 |
pl_core_news_md | Polish | 2.3.0 | ✓ |
pl_core_news_lg | Polish | 2.3.0 | ✓ |
pt_core_news_sm | Portuguese | 2.3.0 | 𐄂 |
pt_core_news_md | Portuguese | 2.3.0 | ✓ |
pt_core_news_lg | Portuguese | 2.3.0 | ✓ |
ro_core_news_sm | Romanian | 2.3.0 | 𐄂 |
ro_core_news_md | Romanian | 2.3.0 | ✓ |
ro_core_news_lg | Romanian | 2.3.0 | ✓ |
es_core_news_sm | Spanish | 2.3.0 | 𐄂 |
es_core_news_md | Spanish | 2.3.0 | ✓ |
es_core_news_lg | Spanish | 2.3.0 | ✓ |
xx_ent_wiki_sm | Multi-language | 2.3.0 | 𐄂 |
Thanks to @mabraham, @sloev, @pinealan, @pmbaumgartner, @Baciccin, @nlptechbook, @guerda, @Tiljander, @nikhilsaldanha, @tommilligan, @Jacse, @leicmi, @YohannesDatasci, @mirfan899, @koaning, @umarbutler, @chopeen, @paoloq, @thomasthiebaud, @sebastienharinck, @elben10, @laszabine, @Mlawrence95, @sabiqueqb, @punitvara, @michael-k, @louisguitton, @vondersam, @thoppe, @vishnupriyavr, @ilivans and @osori for the pull requests and contributions.
🙏 Special thanks to everyone who helped us develop and test the new models: @lixiepeng, @lingvisa and @howl-anderson (Chinese), @hvingelby (Danish), @hiroshi-matsuda-rit and @polm (Japanese), @ryszardtuora (Polish) and @avramandrei and @dumitrescustefan (Romanian).