pypi spacy 3.2.0
v3.2.0: Registered scoring functions, Doc input, floret vectors and more

latest releases: 3.7.5, 4.0.0.dev3, 3.7.4...
2 years ago

✨ New features and improvements

  • NEW: Registered scoring functions for each component in the config.
  • NEW: nlp() and nlp.pipe() accept Doc input, which simplifies setting custom tokenization or extensions before processing.
  • NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
  • overwrite config settings for entity_linker, morphologizer, tagger, sentencizer and senter.
  • extend config setting for morphologizer for whether existing feature types are preserved.
  • Support for a wider range of language codes in spacy.blank() including IETF language tags, for example fra for French and zh-Hans for Chinese.
  • New package spacy-loggers for additional loggers.
  • New Irish lemmatizer.
  • New Portuguese noun chunks and updated Spanish noun chunks.
  • Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
  • Japanese reading and inflection from sudachipy are annotated as Token.morph features.
  • Additional morph_micro_p/r/f scores for morphological features from Scorer.score_morph_per_feat().
  • LIKE_URL attribute includes the tokenizer URL pattern.
  • --n-save-epoch option for spacy pretrain.
  • Trained pipelines:
    • New transformer pipeline for Japanese ja_core_news_trf, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
    • Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez, Carme Armentano and @TeMU-BSC!
    • Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
    • Universal Dependencies corpora updated to v2.8.
    • Trailing space added as a tok2vec feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
    • English attribute ruler patterns updated to improve Token.pos and Token.morph.

For more details, see the New in v3.2 usage guide.

🔴 Bug fixes

  • Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
  • Fix issue #9032: Retain alignment between doc and context for Language.pipe(as_tuples=True) for multiprocessing with custom error handlers.
  • Fix issue #9136: Ignore prefixes when applying suffix patterns in Tokenizer.
  • Fix issue #9584: Use metaclass to subclass errors to allow better pickling.

⚠️ Backwards incompatibilities

  • In the Tokenizer, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of °[cfk]. is now ° c . instead of ° c. for most languages.
  • The tokenizer classes ChineseTokenizer, JapaneseTokenizer, KoreanTokenizer, ThaiTokenizer and VietnameseTokenizer require Vocab rather than Language in __init__.
  • In DocBin, user data is now always serialized according to the store_user_data option, see #9190.

📖 Documentation and examples

👥 Contributors

@adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker

Don't miss a new spacy release

NewReleases is sending notifications on new releases.