github explosion/spaCy v2.3.3
v2.3.3: Alpha support for Macedonian and Sanskrit, updates for many languages and bug fixes

✨ New features and improvements

  • NEW: Add alpha support for Macedonian and Sanskrit.
  • Update language data for Croatian, Czech, English, Hebrew, Hindi, Indonesian, Swedish, Thai and Turkish.
  • Add support for aarch64 and ppc64le on Linux, with binary packages available on conda-forge.

🔴 Bug fixes

  • Fix issue #5610: Make sure sys.argv exists.
  • Fix issue #5643: Add ent_id_ to strings serialized with Doc.
  • Fix issue #5727: Clarify warning for misaligned BILUO tags.
  • Fix issue #5768: Improve tag map initialization and updating.
  • Fix issue #5794: Improve warnings around normalization tables.
  • Fix issue #5796: Update invalid tag maps.
  • Fix issue #5799: Remove hard-coded GPU ID from pretrain.
  • Fix issue #5802: Mark Japanese documents as tagged.
  • Fix issue #5823: Fix typo in unit tests.
  • Fix issue #5838: Fix EntityRenderer to support line breaks (after last entity).
  • Fix issue #5843: Prefer earlier spans in EntityRuler.
  • Fix issue #5849: Allow Doc.char_span to snap to token boundaries.
  • Fix issue #5853: Fix span boundary handling in Spanish noun chunks.
  • Fix issue #5861: Add Span index boundary checks.
  • Fix issue #5904: Fix typos in comments.
  • Fix issue #5910: Update default sentencizer characters for Armenian, Greek and Arabic.
  • Fix issue #6014: Fix off-by-one error for best iteration calculation.
  • Fix issue #6112: Fix overlapping German noun chunks.
  • Fix issue #6148: Identify final Matcher pattern node by quantifier.
  • Fix issue #6164: Reorder so tag map is replaced only if a custom file is provided.
  • Fix issue #6218: Improve reproducibility for TextCategorizer and Tok2Vec.
  • Fix issue #6219: Add re-enabled pipe names back to the meta before serializing.
  • Fix issue #6300: Fix on_match callback and exclude empty match lists from results for DependencyMatcher.
  • Fix issue #6347: Fix memory leak issues with beam_parse (requires thinc>=7.4.3).
  • Fix issue #6373: Fix textcat reproducibility on GPU (requires thinc>=7.4.3).
  • Fix issue #6405: Add all vectors to vocab before pruning.
  • Fix issue #6413: Use int8_t instead of char in Matcher.
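
To illustrate what "snap to token boundaries" means for the Doc.char_span change (#5849) listed above, here is a minimal standalone sketch in plain Python. It does not use spaCy: the function name snap_char_span, the (start, end) token tuples and the mode names "strict", "inside" and "outside" are assumptions made for this illustration, and the real parameter name and accepted values should be checked against the Doc.char_span documentation.

```python
from typing import List, Optional, Tuple

# A token is modelled as (start_char, end_char), with end_char exclusive.
Token = Tuple[int, int]


def snap_char_span(
    tokens: List[Token], start: int, end: int, mode: str = "strict"
) -> Optional[Tuple[int, int]]:
    """Map a character range to a token range (first index, end index exclusive).

    Hypothetical helper for illustration only, not spaCy's implementation.
    - "strict":  both offsets must fall exactly on token boundaries.
    - "inside":  keep only tokens lying completely within the range.
    - "outside": keep every token that the range touches at least partly.
    Returns None when no token range can be produced.
    """
    if mode == "strict":
        starts = {s: i for i, (s, _) in enumerate(tokens)}
        ends = {e: i + 1 for i, (_, e) in enumerate(tokens)}
        if start in starts and end in ends and starts[start] < ends[end]:
            return starts[start], ends[end]
        return None
    if mode == "inside":
        hits = [i for i, (s, e) in enumerate(tokens) if s >= start and e <= end]
    elif mode == "outside":
        hits = [i for i, (s, e) in enumerate(tokens) if s < end and e > start]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    if not hits:
        return None
    return hits[0], hits[-1] + 1


# Character offsets of the tokens in "New York City".
TOKENS = [(0, 3), (4, 8), (9, 13)]

# The range 0..6 ends in the middle of "York".
print(snap_char_span(TOKENS, 0, 6, "strict"))   # None: 6 is not a token boundary
print(snap_char_span(TOKENS, 0, 6, "inside"))   # (0, 1): only "New" fits entirely
print(snap_char_span(TOKENS, 0, 6, "outside"))  # (0, 2): "New York" is touched
```

In this sketch the strict mode rejects misaligned offsets outright, while the two snapping modes shrink or grow the range to the nearest token boundaries. This is useful when character offsets come from annotations that do not match the tokenization.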

👥 Contributors

Thanks to @abchapman93, @baranitharan2020, @bittlingmayer, @bjascob, @borijang, @BramVanroy, @chopeen, @danielvasic, @delzac, @DuyguA, @erip, @florijanstamenkovic, @graue70, @hiroshi-matsuda-rit, @holubvl3, @idoshr, @jgutix, @KKsharma99, @leyendecker, @lizhe2004, @MartinoMensio, @nipunsadvilkar, @Nuccy90, @oculusrepairo, @rahul1990gupta, @rasyidf, @robertsipek, @SamEdwardes, @snsten, @solarmist, @Stannislav, @tamuhey, @tilusnet, @vha14, @wannaphong, @zaibacu for the pull requests and contributions.
