✨ New features and improvements
- New assemble CLI command for assembling a pipeline from a config without training.
- Add support for match alignments in the Matcher to align matched tokens with matcher patterns.
- Add support for training from streamed corpora.
- Add support for W&B data and model checkpoint logging and versioning in spacy.WandbLogger.v2.
- Extend Scorer.score_spans to support overlapping and unlabeled spans.
- Update debug data for new v3 components.
- Improve language data for Italian.
- Various improvements to error handling and UX.
🔴 Bug fixes
- Fix issue #7408: Add vocab kwarg to spacy.load.
- Fix issue #7419: Exclude user hooks in displacy conversion.
- Fix issue #7421: Update --code usage in CLI commands.
- Fix issue #7424: Preserve sent starts on retokenization without parse.
- Fix issue #7440: Fix pymorphy2 lookup lemmatizer.
- Fix issue #7471: Improve warnings related to listening components.
- Fix issue #7488: Fix upstream check in pretraining.
- Fix issue #7489: Support callbacks entry points.
- Fix issue #7497: Merge doc.spans in Doc.from_docs().
- Fix issue #7528: Preserve user data for DependencyMatcher on spans.
- Fix issue #7557: Fix __add__ method for PRFScore.
- Fix issue #7574: Fix conversion of custom extension data in Span.as_doc and Doc.from_docs.
- Fix issue #7620: Fix replace_listeners in configs.
- Fix issue #7626: Fix vectors data on GPU.
- Fix issue #7630: Update NEL for entities crossing sentence boundaries.
- Fix issue #7631: Fix parser sourcing in NER converter.
- Fix issue #7642: Fix handling of hyphen string value in config files.
- Fix issue #7655: Fix sent starts when converting from v2 JSON training format.
- Fix issue #7674: Fix handling of unknown tokens in StaticVectors.
- Fix issue #7690: Fix pickling of Lemmatizer.
- Fix issue #7749: Update Tokenizer.explain for special cases in v3.
- Fix issue #7755: Fix config parsing of ints/strings.
- Fix issue #7836: Fix tokenizer cache flushing.
- Fix issue #7847: Fix handling of boolean values in Example.from_dict for sent starts.
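
As an illustration of what the PRFScore fix (#7557) is about: adding two precision/recall/F-score objects is generally expected to sum the underlying true-positive, false-positive and false-negative counts, with the ratios derived afterwards. The sketch below shows that idea under stated assumptions. PRFCounts is a hypothetical stand-in written for this note; it is not spaCy's PRFScore class and makes no claim about its internals.

```python
from dataclasses import dataclass


@dataclass
class PRFCounts:
    """Hypothetical accumulator of raw counts (not spaCy's PRFScore)."""

    tp: int = 0  # true positives
    fp: int = 0  # false positives
    fn: int = 0  # false negatives

    def __add__(self, other: "PRFCounts") -> "PRFCounts":
        # Sum the raw counts; the ratios are recomputed from the totals.
        return PRFCounts(
            tp=self.tp + other.tp,
            fp=self.fp + other.fp,
            fn=self.fn + other.fn,
        )

    @property
    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    @property
    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    @property
    def fscore(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


first = PRFCounts(tp=8, fp=2, fn=0)
second = PRFCounts(tp=1, fp=0, fn=9)
total = first + second
print(total, round(total.precision, 3), round(total.recall, 3))
```

Summing counts rather than averaging the two precision or recall values keeps the combined score weighted by how many predictions each part contributed.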
📖 Documentation and examples
- Add documentation for legacy functions and architectures.
- Add documentation for pretrained pipeline design.
- Add more details about pipe and multiprocessing.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @alvaroabascar, @armsp, @AyushExel, @BramVanroy, @broaddeep, @bryant1410, @bsweileh, @dpalmasan, @Findus23, @graue70, @jaidevd, @koaning, @langdonholmes, @m0canu1, @meghanabhange, @paoloq, @plison, @richardpaulhudson, @SamEdwardes, @Stannislav for the pull requests and contributions!