✨ New features and improvements
- NEW: Registered scoring functions for each component in the config.
- NEW: `nlp()` and `nlp.pipe()` accept `Doc` input, which simplifies setting custom tokenization or extensions before processing.
- NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
- `overwrite` config settings for `entity_linker`, `morphologizer`, `tagger`, `sentencizer` and `senter`.
- `extend` config setting for `morphologizer` for whether existing feature types are preserved.
- Support for a wider range of language codes in `spacy.blank()` including IETF language tags, for example `fra` for French and `zh-Hans` for Chinese.
- New package `spacy-loggers` for additional loggers.
- New Irish lemmatizer.
- New Portuguese noun chunks and updated Spanish noun chunks.
- Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
- Japanese reading and inflection from `sudachipy` are annotated as `Token.morph` features.
- Additional `morph_micro_p/r/f` scores for morphological features from `Scorer.score_morph_per_feat()`.
- `LIKE_URL` attribute includes the tokenizer URL pattern.
- `--n-save-epoch` option for `spacy pretrain`.
- Trained pipelines:
  - New transformer pipeline for Japanese `ja_core_news_trf`, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
  - Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez, Carme Armentano and @TeMU-BSC!
  - Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
  - Universal Dependencies corpora updated to v2.8.
  - Trailing space added as a `tok2vec` feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
  - English attribute ruler patterns updated to improve `Token.pos` and `Token.morph`.
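A minimal sketch of two of the features above: passing a pre-built `Doc` to `nlp()` and creating a blank pipeline from an IETF-style language code. The words and spaces here are illustrative.

```python
import spacy
from spacy.tokens import Doc

# spacy.blank() now accepts a wider range of language codes, e.g. "fra"
nlp = spacy.blank("fra")

# Build a Doc with custom tokenization, then pass it to nlp() directly
words = ["Bonjour", "le", "monde", "!"]
spaces = [True, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# The custom tokenization is preserved; pipeline components run on the Doc
processed = nlp(doc)
print([t.text for t in processed])
```

The same applies to `nlp.pipe()`, which now accepts an iterable of `Doc` objects.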
For more details, see the New in v3.2 usage guide.
🔴 Bug fixes
- Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
- Fix issue #9032: Retain alignment between doc and context for `Language.pipe(as_tuples=True)` for multiprocessing with custom error handlers.
- Fix issue #9136: Ignore prefixes when applying suffix patterns in `Tokenizer`.
- Fix issue #9584: Use metaclass to subclass errors to allow better pickling.
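For context on the `as_tuples` fix, this is the pattern it affects: each context object stays aligned with its doc, including when `n_process > 1`. The texts and context dicts below are made up for illustration.

```python
import spacy

nlp = spacy.blank("en")
data = [
    ("This is the first text.", {"id": 1}),
    ("And a second one.", {"id": 2}),
]

# as_tuples=True yields (doc, context) pairs in order
results = []
for doc, context in nlp.pipe(data, as_tuples=True):
    results.append((doc.text, context["id"]))
```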
⚠️ Backwards incompatibilities
- In the `Tokenizer`, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of `°[cfk].` is now `° c .` instead of `° c.` for most languages.
- The tokenizer classes `ChineseTokenizer`, `JapaneseTokenizer`, `KoreanTokenizer`, `ThaiTokenizer` and `VietnameseTokenizer` require `Vocab` rather than `Language` in `__init__`.
- In `DocBin`, user data is now always serialized according to the `store_user_data` option, see #9190.
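A short sketch of the `store_user_data` behavior: with `store_user_data=True`, user data (including custom extensions) round-trips through serialization. The extension name and text are illustrative.

```python
import spacy
from spacy.tokens import Doc, DocBin

# Hypothetical custom extension, stored in doc.user_data when set
Doc.set_extension("source", default=None)

nlp = spacy.blank("en")
doc = nlp("DocBin round trip")
doc._.source = "example"

# User data is serialized only when store_user_data=True
db = DocBin(store_user_data=True)
db.add(doc)

db2 = DocBin(store_user_data=True).from_bytes(db.to_bytes())
restored = list(db2.get_docs(nlp.vocab))[0]
```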
📖 Documentation and examples
- Demo projects for floret vectors:
  - `pipelines/floret_vectors_demo`: basic floret vector training and importing.
  - `pipelines/floret_fi_core_demo`: Finnish UD+NER vector and pipeline training, comparing standard vs. floret vectors.
  - `pipelines/floret_ko_ud_demo`: Korean UD vector and pipeline training, comparing standard vs. floret vectors.
👥 Contributors
@adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker