✨ New features and improvements
- NEW: Registered scoring functions for each component in the config.
- NEW: `nlp()` and `nlp.pipe()` accept `Doc` input, which simplifies setting custom tokenization or extensions before processing.
- NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
- `overwrite` config settings for `entity_linker`, `morphologizer`, `tagger`, `sentencizer` and `senter`.
- `extend` config setting for `morphologizer` for whether existing feature types are preserved.
- Support for a wider range of language codes in `spacy.blank()` including IETF language tags, for example `fra` for French and `zh-Hans` for Chinese.
- New package `spacy-loggers` for additional loggers.
- New Irish lemmatizer.
- New Portuguese noun chunks and updated Spanish noun chunks.
- Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
- Japanese reading and inflection from `sudachipy` are annotated as `Token.morph` features.
- Additional `morph_micro_p/r/f` scores for morphological features from `Scorer.score_morph_per_feat()`.
- `LIKE_URL` attribute includes the tokenizer URL pattern.
- `--n-save-epoch` option for `spacy pretrain`.
- Trained pipelines:
  - New transformer pipeline for Japanese `ja_core_news_trf`, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
  - Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez, Carme Armentano and @TeMU-BSC!
  - Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
  - Universal Dependencies corpora updated to v2.8.
  - Trailing space added as a `tok2vec` feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
  - English attribute ruler patterns updated to improve `Token.pos` and `Token.morph`.
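A minimal sketch of two of the features above: passing a pre-built `Doc` to `nlp()` and creating a blank pipeline from an IETF-style language code. The words and spaces here are illustrative.

```python
import spacy
from spacy.tokens import Doc

# spacy.blank() now accepts a wider range of language codes, e.g. "fra"
nlp = spacy.blank("fra")

# Build a Doc with custom tokenization, then pass it to nlp() directly
words = ["Bonjour", "le", "monde", "!"]
spaces = [True, True, False, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# The custom tokenization is preserved; pipeline components run on the Doc
processed = nlp(doc)
print([t.text for t in processed])
```

The same applies to `nlp.pipe()`, which now accepts an iterable of `Doc` objects.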
For more details, see the New in v3.2 usage guide.
🔴 Bug fixes
- Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
- Fix issue #9032: Retain alignment between doc and context for `Language.pipe(as_tuples=True)` for multiprocessing with custom error handlers.
- Fix issue #9136: Ignore prefixes when applying suffix patterns in `Tokenizer`.
- Fix issue #9584: Use metaclass to subclass errors to allow better pickling.
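For context on the `as_tuples` fix, this is the pattern it affects: each context object stays aligned with its doc, including when `n_process > 1`. The texts and context dicts below are made up for illustration.

```python
import spacy

nlp = spacy.blank("en")
data = [
    ("This is the first text.", {"id": 1}),
    ("And a second one.", {"id": 2}),
]

# as_tuples=True yields (doc, context) pairs in order
results = []
for doc, context in nlp.pipe(data, as_tuples=True):
    results.append((doc.text, context["id"]))
```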
⚠️ Backwards incompatibilities
- In the `Tokenizer`, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of `°[cfk].` is now `° c .` instead of `° c.` for most languages.
- The tokenizer classes `ChineseTokenizer`, `JapaneseTokenizer`, `KoreanTokenizer`, `ThaiTokenizer` and `VietnameseTokenizer` require `Vocab` rather than `Language` in `__init__`.
- In `DocBin`, user data is now always serialized according to the `store_user_data` option, see #9190.
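A short sketch of the `store_user_data` behavior: with `store_user_data=True`, user data (including custom extensions) round-trips through serialization. The extension name and text are illustrative.

```python
import spacy
from spacy.tokens import Doc, DocBin

# Hypothetical custom extension, stored in doc.user_data when set
Doc.set_extension("source", default=None)

nlp = spacy.blank("en")
doc = nlp("DocBin round trip")
doc._.source = "example"

# User data is serialized only when store_user_data=True
db = DocBin(store_user_data=True)
db.add(doc)

db2 = DocBin(store_user_data=True).from_bytes(db.to_bytes())
restored = list(db2.get_docs(nlp.vocab))[0]
```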
📖 Documentation and examples
- Demo projects for floret vectors:
  - `pipelines/floret_vectors_demo`: basic floret vector training and importing.
  - `pipelines/floret_fi_core_demo`: Finnish UD+NER vector and pipeline training, comparing standard vs. floret vectors.
  - `pipelines/floret_ko_ud_demo`: Korean UD vector and pipeline training, comparing standard vs. floret vectors.
👥 Contributors
@adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker