✨ New features and improvements
- NEW: Trained pipelines for Catalan and a new transformer-based pipeline for Danish.
- NEW: Experimental `SpanCategorizer` component for labeling arbitrary and potentially overlapping spans of text.
- NEW: Use predicted annotations during training via the `[training.annotating_components]` config setting.
- Alpha tokenization support for Azerbaijani.
- Part-of-speech tag-based lemmatizers for Catalan and Italian.
- The TextCatCNN and TextCatBOW architectures are now resizable.
- Support updating the `EntityRecognizer` with known incorrect span annotations.
- Auto-generate a pretty `README.md` based on the meta in `spacy package`.
For more details, see the New in v3.1 usage guide.
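As a rough illustration of the `[training.annotating_components]` setting mentioned above, the sketch below embeds a hypothetical config excerpt in which the parser's predictions are made available to later components during training. The component names and the excerpt itself are assumptions chosen for illustration, and the standard-library `configparser` is used only to show that the fragment is well-formed; spaCy reads configs with its own config system.

```python
import configparser
import json

# Hypothetical config excerpt (not taken from the release notes): the parser
# is listed as an annotating component, so its predictions are set on the
# docs during training for use by components that run after it.
CONFIG_EXCERPT = """
[nlp]
pipeline = ["parser", "tagger"]

[training]
annotating_components = ["parser"]
"""

# Parse with the standard library purely as a well-formedness check.
parser = configparser.ConfigParser()
parser.read_string(CONFIG_EXCERPT)

# Values in the excerpt are written as JSON-style lists.
pipeline = json.loads(parser["nlp"]["pipeline"])
annotating = json.loads(parser["training"]["annotating_components"])

# Sanity check: annotating components are expected to be pipeline names.
missing = [name for name in annotating if name not in pipeline]
print(annotating, missing)
```

In a real project the same two lines under `[training]` would go into the training config file rather than a Python string.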
📦 New trained pipelines
| Package | Language | UPOS | Parser LAS | NER F |
|---|---|---|---|---|
| `ca_core_news_sm` | Catalan | 98.2 | 87.4 | 79.8 |
| `ca_core_news_md` | Catalan | 98.3 | 88.2 | 84.0 |
| `ca_core_news_lg` | Catalan | 98.5 | 88.4 | 84.2 |
| `ca_core_news_trf` | Catalan | 98.9 | 93.0 | 91.2 |
| `da_core_news_trf` | Danish | 98.0 | 85.0 | 82.9 |
⚠️ Upgrading from v3.0
- Due to the use of configs with extensive versioning, v3.0 pipelines should be compatible with v3.1, but you may see slight differences in performance. Test your v3.0 pipeline with v3.1 against your test suite and, if the performance is identical, extend the `spacy_version` in your model package meta to `">=3.0.0,<3.2.0"`. If you run into degraded performance, retrain your pipeline with v3.1.
- Use `spacy init fill-config` to update a v3.0 config for v3.1.
- When sourcing a pipeline component that requires static vectors, it is now required to include the source model's vectors in `[initialize.vectors]`.
- Logger warnings have been converted to Python warnings. Use `warnings.filterwarnings` or the new helper method `spacy.errors.filter_warning(action, error_msg='')` to manage warnings.
For more information, see Notes on upgrading from v3.0.
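Since logger warnings are now regular Python warnings, they can be managed with the standard `warnings` module. The following is a minimal sketch of that approach using only the standard library. The warning text `MESSAGE` and the function `noisy_call` are hypothetical stand-ins for a warning raised by library code, and the `spacy.errors.filter_warning` helper is deliberately not called so that the example has no dependencies.

```python
import warnings

# Hypothetical warning text, standing in for a message emitted by a library.
MESSAGE = "[W000] Example warning used only for this sketch"


def noisy_call():
    """Hypothetical stand-in for library code that emits a Python warning."""
    warnings.warn(MESSAGE, UserWarning)
    return "done"


with warnings.catch_warnings(record=True) as caught:
    # Show every warning so the first call is reliably recorded.
    warnings.simplefilter("always")
    noisy_call()  # recorded in `caught`

    # `message` is a regular expression matched against the start of the
    # warning text; the newly added filter takes precedence over "always".
    warnings.filterwarnings("ignore", message=r"\[W000\]")
    noisy_call()  # suppressed by the filter above

recorded = [str(w.message) for w in caught]
print(len(recorded))
```

Filtering on the message prefix keeps unrelated warnings visible, which is usually preferable to silencing a whole warning category.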
🔴 Bug fixes
- Fix issue #7036: Use a context manager when reading model.
- Fix issue #7629: Fix scoring normalization.
- Fix issue #7799: Ensure `spacy ray` command works.
- Fix issue #7807: Show warning if entity ruler runs without patterns.
- Fix issue #7886: Fix unknown tokens percentage in `debug data`.
- Fix issue #7930: Make `EntityLinker` robust for nO=None.
- Fix issue #7925: Skip vector ngram backoff if `minn` is not set.
- Fix issue #7973: Fix `debug model` for transformers.
- Fix issue #7988: Preserve existing `ENT_KB_ID` in `ner` annotation.
- Fix issue #8004: Handle errors while multiprocessing.
- Fix issue #8009: Fix `Doc.from_docs()` for all empty docs.
- Fix issue #8012: Fix ensemble `textcat` with listener.
- Fix issue #8054: Add `ENT_ID` and `NORM` to `DocBin` strings.
- Fix issue #8055: Handle partial entities in `Span.as_doc`.
- Fix issue #8062: Make all `Span` attrs writable.
- Fix issue #8066: Update `debug data` for `textcat`.
- Fix issue #8069: Custom warning if `DocBin` is too large.
- Fix issue #8099: Update Vietnamese tokenizer.
- Fix issue #8113: Support `to/from_bytes` for `KnowledgeBase` and `EntityLinker`.
- Fix issue #8116: Fix offsets in `Span.get_lca_matrix`.
- Fix issue #8132: Remove unsupported attrs from `attrs.IDS`.
- Fix issue #8158: Ensure tolerance is passed on in `spacy.batch_by_words.v1`.
- Fix issue #8169: Fix bug from `EntityRuler`: `ent_ids` returns None for phrases.
- Fix issue #8208: Address missing config overrides post load of models.
- Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
- Fix issue #8216: Don't add duplicate patterns in `EntityRuler`.
- Fix issue #8265: Address mypy errors.
- Fix issue #8335: Raise error if deps not provided with heads in `Doc`.
- Fix issue #8368: Preserve whitespace in `Span.lemma_`.
- Fix issue #8388: Don't clobber vectors when loading components from source models.
- Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
- Fix issue #8426: Fix setting empty entities in `Example.from_dict`.
- Fix issue #8441: Add correct types for `Language.pipe` return values.
- Fix issue #8487: Fix span offsets and keys in `Doc.from_docs`.
- Fix issue #8559: Fix vectors check for sourced components.
- Fix issue #8584: Raise an error for `textcat` with <2 labels.
👥 Contributors
@aajanki, @adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @explosion-bot, @fhopp, @frascuchon, @graue70, @gtoffoli, @honnibal, @ines, @jacopofar, @jenojp, @jhroy, @jklaise, @juliensalinas, @kevinlu1248, @ldorigo, @mathcass, @meghanabhange, @michael-k, @narayanacharya6, @NirantK, @nsorros, @polm, @sevdimali, @svlandeg, @themrmax, @xadrianzetx, @yohasebe, @ZeeD