✨ New features and improvements
- Improved speeds for many components, see speed benchmarks for trained pipelines:
  - Speed up parser and NER by using constant-time head lookups (#10048).
  - Support unnormalized softmax probabilities in `spacy.Tagger.v2` to speed up inference for the tagger, morphologizer, senter and trainable lemmatizer (#10197).
  - Speed up parser projectivization functions (#10241).
  - Replace `Ragged` with faster `AlignmentArray` in `Example` for training (#10319).
  - Improve `Matcher` speed (#10659).
  - Improve serialization speed for empty `Doc.spans` (#10250).
- NEW: A trainable lemmatizer component that uses edit trees to transform tokens to lemmas. Add it to your config with `spacy init config -p trainable_lemmatizer` or using the quickstart.
- Language updates:
- Big endian support with `thinc` v8.0.14+ and `thinc-bigendian-ops`.
- Config comparisons with `spacy debug diff-config`.
- displaCy support for overlapping span annotation and multiple labeled arcs between the same tokens.
- `SpanCategorizer.set_candidates` for debugging span suggesters.
- The quickstart now supports adding `spancat` and `trainable_lemmatizer` components.
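The tagger speed-up listed above relies on unnormalized softmax probabilities. A minimal pure-Python sketch of why skipping normalization is safe at prediction time: softmax preserves the ranking of the scores, so the highest-scoring label is the same with or without it. The scores below are made up for illustration, and this is not spaCy's or Thinc's implementation.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical per-label scores for a single token.
scores = [1.2, -0.3, 3.4, 0.7]

# The predicted label is identical with and without normalization.
print(argmax(scores) == argmax(softmax(scores)))
```

Normalized probabilities are still needed when the values themselves are used, for example when computing a loss during training.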
📦 Trained pipelines
v3.3 introduces trained pipelines for Finnish, Korean and Swedish which feature the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.
| Package | Language | UPOS | Parser LAS | NER F |
| --- | --- | --- | --- | --- |
| `fi_core_news_sm` | Finnish | 92.5 | 71.9 | 75.9 |
| `fi_core_news_md` | Finnish | 95.9 | 78.6 | 80.6 |
| `fi_core_news_lg` | Finnish | 96.2 | 79.4 | 82.4 |
| `ko_core_news_sm` | Korean | 86.1 | 65.6 | 71.3 |
| `ko_core_news_md` | Korean | 94.7 | 80.9 | 83.1 |
| `ko_core_news_lg` | Korean | 94.7 | 81.3 | 85.3 |
| `sv_core_news_sm` | Swedish | 95.0 | 75.9 | 74.7 |
| `sv_core_news_md` | Swedish | 96.3 | 78.5 | 79.3 |
| `sv_core_news_lg` | Swedish | 96.3 | 79.1 | 81.1 |
🙏 Special thanks to @aajanki, @thiippal (Finnish) and Elena Fano (Swedish) for their help with the new pipelines!
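To illustrate why vectors built from Bloom embeddings and subwords have no out-of-vocabulary words, here is a toy sketch of the idea: every word is split into character n-grams, and each piece is hashed into a fixed-size table, so any string maps to valid rows. The table size, n-gram range, number of hashes and hash function below are all assumptions for illustration and do not match floret's actual settings or code.

```python
import hashlib

NUM_ROWS = 1000   # hypothetical table size
NUM_HASHES = 2    # hypothetical number of hash functions per piece
MIN_N, MAX_N = 3, 4  # hypothetical character n-gram range


def subwords(word):
    """Return the marked word itself plus its character n-grams."""
    marked = f"<{word}>"
    pieces = [marked]
    for n in range(MIN_N, MAX_N + 1):
        pieces.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return pieces


def rows_for(piece):
    """Hash one piece into NUM_HASHES rows of the table."""
    rows = []
    for seed in range(NUM_HASHES):
        digest = hashlib.blake2b(
            piece.encode("utf8"), digest_size=8, salt=bytes([seed])
        ).digest()
        rows.append(int.from_bytes(digest, "big") % NUM_ROWS)
    return rows


def word_rows(word):
    """All table rows that would contribute to a word's vector."""
    return [row for piece in subwords(word) for row in rows_for(piece)]


# Even a nonsense word maps to rows inside the table.
print(word_rows("xyzzyqq"))
```

In a real vector table, the rows selected this way would be combined into a single vector for the word.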
The new trainable lemmatizer is used for Danish, Dutch, Finnish, German, Greek, Italian, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian and Swedish.
| Model | v3.2 Lemma Acc | v3.3 Lemma Acc |
| --- | --- | --- |
| `da_core_news_md` | 84.9 | 94.8 |
| `de_core_news_md` | 73.4 | 97.7 |
| `el_core_news_md` | 56.5 | 88.9 |
| `fi_core_news_md` | - | 86.2 |
| `it_core_news_md` | 86.6 | 97.2 |
| `ko_core_news_md` | - | 90.0 |
| `lt_core_news_md` | 71.1 | 84.8 |
| `nb_core_news_md` | 76.7 | 97.1 |
| `nl_core_news_md` | 81.5 | 94.0 |
| `pl_core_news_md` | 87.1 | 93.7 |
| `pt_core_news_md` | 76.7 | 96.9 |
| `ro_core_news_md` | 81.8 | 95.5 |
| `sv_core_news_md` | - | 95.5 |
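For readers unfamiliar with edit trees, the following pure-Python sketch shows the general idea behind the trainable lemmatizer: a tree is built from one form/lemma pair by keeping the longest common substring and recursively editing what is left on either side, and the same tree can then be applied to other forms with the same pattern. The class and function names are hypothetical, and spaCy's actual implementation, including how a tree is chosen for each token, differs from this sketch.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional, Union


@dataclass(frozen=True)
class Subst:
    """Leaf node: replace the exact string `old` with `new`."""
    old: str
    new: str


@dataclass(frozen=True)
class Match:
    """Interior node: keep the middle of the string, edit both ends."""
    prefix_len: int
    suffix_len: int
    left: "Tree"
    right: "Tree"


Tree = Union[Subst, Match]


def build_tree(form: str, lemma: str) -> Tree:
    """Build an edit tree that rewrites `form` into `lemma`."""
    matcher = SequenceMatcher(None, form, lemma, autojunk=False)
    m = matcher.find_longest_match(0, len(form), 0, len(lemma))
    if m.size == 0:
        # Nothing in common: fall back to a plain substitution.
        return Subst(form, lemma)
    return Match(
        prefix_len=m.a,
        suffix_len=len(form) - (m.a + m.size),
        left=build_tree(form[:m.a], lemma[:m.b]),
        right=build_tree(form[m.a + m.size:], lemma[m.b + m.size:]),
    )


def apply_tree(tree: Tree, form: str) -> Optional[str]:
    """Apply an edit tree to `form`; return None if it does not fit."""
    if isinstance(tree, Subst):
        return tree.new if form == tree.old else None
    if tree.prefix_len + tree.suffix_len > len(form):
        return None
    split = len(form) - tree.suffix_len
    left = apply_tree(tree.left, form[:tree.prefix_len])
    right = apply_tree(tree.right, form[split:])
    if left is None or right is None:
        return None
    return left + form[tree.prefix_len:split] + right


# A tree learned from one pair generalizes to other forms with the same affix.
tree = build_tree("walked", "walk")
print(apply_tree(tree, "talked"))  # talk
print(apply_tree(tree, "walks"))   # None, the "-ed" suffix is missing
```

Because a tree encodes the edit rather than the word, a small set of trees can cover many inflected forms, which is one plausible reason the approach suits morphologically rich languages.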
🔴 Bug fixes
- Fix issue #5447: Avoid overlapping arcs when using displaCy in manual mode.
- Fix issue #9443: Fix `Scorer.score_cats` for missing labels.
- Fix issue #9669: Fix entity linker batching.
- Fix issue #9903: Handle `_` value for UPOS in CoNLL-U converter.
- Fix issue #9904: Fix textcat loss scaling.
- Fix issue #9956: Compare all `Span` attributes consistently.
- Fix issue #10073: Add `"spans"` to the output of `doc.to_json`.
- Fix issue #10086: Add tokenizer option to allow `Matcher` handling for all special cases.
- Fix issue #10189: Allow `Example` to align whitespace annotation.
- Fix issue #10302: Fix check for NER annotation in MISC in CoNLL-U converter.
- Fix issue #10324: Fix `Tok2Vec` for empty batches.
- Fix issue #10347: Update basic functionality for `rehearse`.
- Fix issue #10394: Fix `Vectors.n_keys` for floret vectors.
- Fix issue #10400: Use `meta` in `util.load_model_from_config`.
- Fix issue #10451: Fix `Example.get_matching_ents`.
- Fix issue #10460: Fix initial special cases for `Tokenizer.explain`.
- Fix issue #10521: Stream large assets on download in spaCy projects.
- Fix issue #10536: Handle unknown tags in `KoreanTokenizer` tag map.
- Fix issue #10551: Add automatic vector deduplication for `init vectors`.
🚀 Notes about upgrading from v3.2
- To see the speed improvements for the `Tagger` architecture, edit your configs to switch from `spacy.Tagger.v1` to `spacy.Tagger.v2` and then run `init fill-config`.
- Span comparisons involving ordering (`<`, `<=`, `>`, `>=`) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order (#9956).
- Annotation on whitespace tokens is handled in the same way as annotation on non-whitespace tokens during training in order to allow custom whitespace annotation (#10189).
- `Doc.from_docs` now includes `Doc.tensor` by default and supports excludes with an `exclude` argument in the same format as `Doc.to_bytes`. The supported exclude fields are `spans`, `tensor` and `user_data`.
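The change to span ordering noted above can be pictured as a tuple comparison over all four attributes. The sketch below uses a hypothetical `SpanKey` stand-in rather than spaCy's `Span`, and it compares labels and KB IDs as plain strings, which is an assumption: how spaCy orders two different labels internally may differ.

```python
from typing import NamedTuple


class SpanKey(NamedTuple):
    """Hypothetical stand-in for the attributes used to order spans."""
    start: int
    end: int
    label: str
    kb_id: str


spans = [
    SpanKey(0, 2, "PERSON", ""),
    SpanKey(0, 2, "ORG", ""),
    SpanKey(0, 1, "ORG", "Q1"),
]

# Tuples compare field by field: start, then end, then label, then KB ID.
ordered = sorted(spans)
for span in ordered:
    print(span)
```

Two spans with identical offsets are no longer tied for ordering purposes when their label or KB ID differs, which is why the sorted order of existing data may shift slightly after upgrading.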
📖 Documentation and examples
- spaCy universe additions:
  - classy-classification: A Python library for classy few-shot and zero-shot classification within spaCy.
  - Concise Concepts: Concise Concepts uses few-shot NER based on word embedding similarity.
  - Crosslingual Coreference: Crosslingual coreference with an English coreference model plus crosslingual embeddings.
  - EDS-NLP: spaCy components to extract information from clinical notes written in French.
  - HuSpaCy: Industrial-strength Hungarian natural language processing.
  - Klayers: spaCy as an AWS Lambda Layer.
  - Named Entity Recognition (NER) using spaCy (video).
  - Scrubadub: Remove personally identifiable information from text using spaCy.
  - spacy-setfit-textcat: Experiments with SetFit & Few-Shot Classification.
  - tmtoolkit: Text mining and topic modeling toolkit.
👥 Contributors
@aajanki, @adrianeboyd, @apjanco, @bdura, @BramVanroy, @danieldk, @danmysak, @davidberenstein1957, @DuyguA, @fonfonx, @gremur, @HaakonME, @harmbuisman, @honnibal, @ines, @internaut, @jfainberg, @jnphilipp, @jsnfly, @kadarakos, @koaning, @ljvmiranda921, @martinjack, @mgrojo, @nrodnova, @ofirnk, @orglce, @pepemedigu, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @ryndaniels, @SamEdwardes, @Schero1994, @shadeMe, @single-fingal, @svlandeg, @thebugcreator, @thomashacker, @umaxfun, @y961996