explosion/spaCy v3.0.0rc3 on GitHub

🌙 This release is a nightly pre-release and not intended for production yet. We recommend using a new virtual environment. For more details on the new features and usage guides, see the v3 documentation.

⚠️⚠️⚠️ Make sure to retrain your models! ⚠️⚠️⚠️
This release includes changes to the config and model architectures, so if you've trained a custom pipeline with v3.0.0rc1 or v3.0.0rc2, you'll need to retrain it. We recommend using the new spaCy projects system to make it easy to re-run your training process. To auto-fill and update your configs, you can use the init fill-config command.

📣 NEW: Want to make the transition from spaCy v2 to spaCy v3 as smooth as possible for you and your organization? We're now offering commercial migration support for your spaCy pipelines! We've put a lot of work into making it easy to upgrade your existing code and training workflows – but custom projects may always need some custom work, especially when it comes to taking advantage of the new capabilities. Details & application →

🚀 Quickstart

pip install -U spacy-nightly --pre

Introducing spaCy v3.0 nightly
New in v3.0: New features, backwards incompatibilities and migration guide.
Installation Quickstart: Install the new version, pipelines and add-ons for your specific setup.
Training Quickstart: Generate a training config for your specific use case.
Benchmarks: Results and accuracy comparisons.
Projects & Project Templates: Get started by cloning a project template.

✨ New features and improvements

Transformer-based pipelines with support for multi-task learning.
Retrained model families for 18 languages and 58 trained pipelines in total, including 5 transformer-based pipelines.
New core pipelines for Macedonian and Russian. Thanks to @borijang, @buriy and @kuk for their contributions!
New training workflow and config system.
Implement custom models using any machine learning framework, including PyTorch, TensorFlow and MXNet.
spaCy Projects for managing end-to-end multi-step workflows from preprocessing to model deployment.
Integrations with Data Version Control (DVC), Streamlit, Weights & Biases, Ray and more.
Parallel training and distributed computing with Ray.
New built-in pipeline components: SentenceRecognizer, Morphologizer, Lemmatizer, AttributeRuler and Transformer.
New and improved pipeline component API and decorators for custom components.
Source trained components from other pipelines in your training config.
DependencyMatcher for matching patterns within the dependency parse using Semgrex operators.
Support for greedy patterns in Matcher.
Type hints and type-based data validation for custom registered functions.
Various new methods, attributes and commands.
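
As a rough illustration of the DependencyMatcher item above, here is a minimal sketch. It assumes the pattern keys (`RIGHT_ID`, `RIGHT_ATTRS`, `LEFT_ID`, `REL_OP`) and the absolute `heads` indices of the final v3.0 API, which may differ in earlier nightlies. The sentence, the hand-written parse and the pattern name `FOUNDED` are made up for the example.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Build a parsed Doc by hand so no trained pipeline is needed.
# Assumption: heads are absolute token indices, as in the final v3.0 API.
words = ["She", "founded", "a", "company"]
heads = [1, 1, 3, 1]
deps = ["nsubj", "ROOT", "det", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

matcher = DependencyMatcher(nlp.vocab)
pattern = [
    # Anchor token: the verb "founded"
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"ORTH": "founded"}},
    # ">" is the Semgrex-style operator for "verb is the immediate head of subject"
    {
        "LEFT_ID": "verb",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]
matcher.add("FOUNDED", [pattern])

# Each match is (match_id, token_ids), one token id per pattern entry
matches = matcher(doc)
matched_words = [[doc[i].text for i in token_ids] for _, token_ids in matches]
```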

⚠️ Backwards incompatibilities

For more info on how to migrate from spaCy v2.x, see the detailed migration guide.

API changes

Pipeline package symlinks, the link command and shortcut names are now deprecated. There can be many different trained pipelines and not just one "English model", so you should always use the full package name like en_core_web_sm explicitly.
A pipeline's meta.json is now only used to provide meta information like the package name, author, license and labels. It's not used to construct the processing pipeline anymore. This is all defined in the config.cfg, which also includes all settings used to train the pipeline.
The train, pretrain and debug data commands now only take a config.cfg.
Language.add_pipe now takes the string name of the component factory instead of the component function.
Custom pipeline components now need to be decorated with the @Language.component or @Language.factory decorator.
The Language.update, Language.evaluate and TrainablePipe.update methods now all take batches of Example objects instead of Doc and GoldParse objects, or raw text and a dictionary of annotations.
The begin_training methods have been renamed to initialize and now take a function that returns a sequence of Example objects to initialize the model instead of a list of tuples.
Matcher.add and PhraseMatcher.add now only accept a list of patterns as the second argument (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument.
The Doc flags like Doc.is_parsed or Doc.is_tagged have been replaced by Doc.has_annotation.
The spacy.gold module has been renamed to spacy.training.
The PRON_LEMMA symbol and -PRON- as an indicator for pronoun lemmas have been removed.
The TAG_MAP and MORPH_RULES in the language data have been replaced by the more flexible AttributeRuler.
The Lemmatizer is now a standalone pipeline component and doesn't provide lemmas by default or switch automatically between lookup and rule-based lemmas. You can now add it to your pipeline explicitly and set its mode on initialization.
Various keyword arguments across functions and methods are now explicitly declared as keyword-only arguments. Those arguments are documented accordingly across the API reference.
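
A few of the changes above can be sketched as follows. This is a minimal illustration, not an excerpt from the migration guide: the component name `token_counter`, the pattern name `GREETING` and the sample text are hypothetical, and a blank English pipeline is used so that no trained package is required.

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher


# Custom components must be registered with a decorator.
# "token_counter" is a hypothetical name chosen for this sketch.
@Language.component("token_counter")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
# add_pipe now takes the registered string name, not the function itself
nlp.add_pipe("token_counter")

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]
# Patterns are passed as a list; on_match is an optional keyword argument
matcher.add("GREETING", [pattern], on_match=None)

doc = nlp("Hello world, this is spaCy")
spans = [doc[start:end].text for _, start, end in matcher(doc)]

# Doc.has_annotation replaces flags like Doc.is_tagged.
# A blank pipeline has no tagger, so no TAG annotation is expected here.
has_tags = doc.has_annotation("TAG")
```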

Removed or renamed API

| Removed | Replacement |
| --- | --- |
| `Language.disable_pipes` | `Language.select_pipes`, `Language.disable_pipe`, `Language.enable_pipe` |
| `Language.begin_training`, `Pipe.begin_training`, ... | `Language.initialize`, `Pipe.initialize`, ... |
| `Doc.is_tagged`, `Doc.is_parsed`, ... | `Doc.has_annotation` |
| `GoldParse` | `Example` |
| `GoldCorpus` | `Corpus` |
| `KnowledgeBase.load_bulk`, `KnowledgeBase.dump` | `KnowledgeBase.from_disk`, `KnowledgeBase.to_disk` |
| `Matcher.pipe`, `PhraseMatcher.pipe` | not needed |
| `gold.offsets_from_biluo_tags`, `gold.spans_from_biluo_tags`, `gold.biluo_tags_from_offsets` | `training.biluo_tags_to_offsets`, `training.biluo_tags_to_spans`, `training.offsets_to_biluo_tags` |
| `spacy init-model` | `spacy init vectors` |
| `spacy debug-data` | `spacy debug data` |
| `spacy profile` | `spacy debug profile` |
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, symlinks are deprecated |
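
To make some of these renames concrete, the sketch below uses the replacements for `Language.disable_pipes`, `GoldParse` and the BILUO helpers. The sample sentence and its character offsets are invented for illustration, and the built-in `sentencizer` is added only so there is a component to disable.

```python
import spacy
from spacy.training import Example, biluo_tags_to_offsets, offsets_to_biluo_tags

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# select_pipes replaces disable_pipes and works as a context manager
with nlp.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp.pipe_names)

doc = nlp.make_doc("Apple is based in Cupertino")
# Character offsets and labels are made up for this example
entities = [(0, 5, "ORG"), (18, 27, "GPE")]

# Renamed helpers from spacy.gold, now in spacy.training
tags = offsets_to_biluo_tags(doc, entities)
roundtrip = biluo_tags_to_offsets(doc, tags)

# Example replaces GoldParse: it pairs a predicted Doc with reference annotations
example = Example.from_dict(doc, {"entities": entities})
reference_ents = [(ent.text, ent.label_) for ent in example.reference.ents]
```

A list of such `Example` objects is what `Language.update` and `Language.evaluate` now expect.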

The following deprecated methods, attributes and arguments were removed in v3.0. Most of them had been deprecated for a while, and many already raised errors. Many were also primarily internals. If you've been working with more recent versions of spaCy v2.x, it's unlikely that your code relied on them.

| Removed | Replacement |
| --- | --- |
| `Doc.tokens_from_list` | `Doc.__init__` |
| `Doc.merge`, `Span.merge` | `Doc.retokenize` |
| `Token.string`, `Span.string`, `Span.upper`, `Span.lower` | `Span.text`, `Token.text` |
| `Language.tagger`, `Language.parser`, `Language.entity` | `Language.get_pipe` |
| keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes` | `exclude=["vocab"]` |
| `n_threads` argument on `Tokenizer`, `Matcher`, `PhraseMatcher` | `n_process` |
| `verbose` argument on `Language.evaluate` | logging (`DEBUG`) |
| `SentenceSegmenter` hook, `SimilarityHook` | user hooks, `Sentencizer`, `SentenceRecognizer` |