We had to release another update to the v2.0.x branch of spaCy to resolve a dependency issue, so we decided to also include and/or backport a bunch of features and fixes that were originally intended for v2.1.0 (see here for the nightly version).
✨ New features and improvements
- NEW: Alpha tokenization and language data for Arabic, Urdu, Tatar and Greek.
- NEW: Mecab-based Japanese tokenization and lemmatization.
- NEW: Add Norwegian rule-based and lookup lemmatization.
- NEW: Add Danish lookup lemmatization based on the Den store danske SprogTeknologiske Ordbase (STO) dataset, courtesy of the University of Copenhagen.
- NEW: Romanian lookup lemmatization.
- Improve language data for Polish, Turkish, French, Romanian, Swedish and Japanese.
- Improve case-sensitive lookup lemmatization in German.
- Add `Token.sent` property that returns the sentence `Span` the token is part of.
- Add `remove_extension` method on `Doc`, `Token` and `Span`.
- Add `Doc.is_sentenced` property that returns `True` if sentence boundaries have been applied.
- Allow ignoring warnings by code via the `SPACY_WARNING_IGNORE` environment variable.
- Add `--silent` option to `info` command.
🔴 Bug fixes
- Fix issue #1456: Pass additional arguments of `download` command to `pip` and check if model is already installed before downloading it.
- Fix issue #2191: Update `README` section on tests and dependencies.
- Fix issue #2194: Ensure that `Doc.noun_chunks_iterator` isn't `None` before calling it.
- Fix issue #2196: Return data in `cli.info` and add `silent` option.
- Fix issue #2200: Correct typo in `spacy package` command message.
- Fix issue #2210: Fix bug in Spanish noun chunks.
- Fix issue #2211, #2320: Resolve problem in `download` command and use `requests` library again.
- Fix issue #2219: Fix token similarity of single-letter tokens.
- Fix issue #2222, #2223: Fix typos in documentation and docstrings.
- Fix issue #2226: Use correct, non-deprecated merge syntax in `merge_ents`.
- Fix issue #2228: Fix deserialization when using `tensor=False` or `sentiment=False`.
- Fix issue #2238: Correct Swedish lookup lemmatization.
- Fix issue #2242: Add `remove_extension` method on `Doc`, `Token` and `Span`.
- Fix issue #2266: Add `collapse_phrases` option to displaCy visualizer.
- Fix issue #2269: Fix `KeyError` by renaming `SP` to `_SP`.
- Fix issue #2304: Don't require `attrs` argument in `Doc.retokenize` and allow ints/unicode.
- Fix issue #2361: Escape HTML tags in `displacy.render`.
- Fix issue #2376: Improve `Matcher` examples and add section on using pipeline components.
- Fix issue #2385: Handle multi-word entities correctly in IOB to BILUO conversion.
- Fix issue #2452: Fix bug that would cause `displacy` arrows to only point in one direction.
- Fix issue #2477: Also allow `Span` objects in `displacy.render`.
- Fix issue #2490: Update Thinc's dependencies for Python 3.7 compatibility.
- Fix issue #2495: Fix loading tokenizer with custom prefix search.
- Fix issue #2514: Switch from `msgpack-python` to `msgpack` to hopefully prevent conda from downloading a two-year-old spaCy version when installing with the latest Anaconda distribution.
- Ensure that `Doc.is_tagged` is set correctly when using `Language.pipe`.
- Fix bug in `merge_noun_chunks` factory that would return `None` if `Doc` wasn't parsed.
- Explicitly require `pathlib` backport on Python 2 only.
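
For readers unfamiliar with the IOB to BILUO conversion touched by the fix for issue #2385, here is a minimal, standalone sketch of the idea in plain Python. It is not spaCy's implementation: the function name `iob_to_biluo_sketch` and the sample tags are assumptions made for illustration, and the input is assumed to use IOB2-style tags (`B-`, `I-`, `O`). The point it shows is that a multi-word entity must end in an `L-` tag and a single-token entity becomes `U-`.

```python
def iob_to_biluo_sketch(tags):
    """Convert a list of IOB2 tags to BILUO tags (illustrative only)."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, sep, label = tag.partition("-")
        if not sep or prefix not in ("B", "I"):
            raise ValueError("Unexpected IOB tag: %r" % (tag,))
        # The entity continues only if the next tag is I- with the same label.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label
        if prefix == "B":
            # First token: B- if more tokens follow, U- for a single-token entity.
            biluo.append(("B-" if continues else "U-") + label)
        else:
            # Later token: I- inside the entity, L- on its last token.
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


if __name__ == "__main__":
    print(iob_to_biluo_sketch(["O", "B-PER", "I-PER", "I-PER", "O", "B-LOC"]))
```

Under these assumptions, a three-token `PER` entity comes out as `B-PER`, `I-PER`, `L-PER`, and the lone `B-LOC` becomes `U-LOC`.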
📖 Documentation and examples
- NEW: Edit and execute code examples in your browser – all across the documentation!
- NEW: The spaCy Universe, a collection of plugins, extensions and other resources for spaCy.
- NEW: Experimental rule-based `Matcher` Explorer demo – create token patterns interactively, test them against your text and copy-paste the Python pattern code.
- NEW: Document Cython API.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @mollerhoj, @howl-anderson, @pktippa, @skrcode, @miroli, @ivyleavedtoadflax, @5hirish, @therealronnie, @alexvy86, @mn3mos, @polm, @knoxdw, @bellabie, @mauryaland, @LRAbbade, @janimo, @vishnumenon, @tzano, @cclauss, @armsp, @aristorinjuang, @BigstickCarpet, @idealley, @ansgar-t, @mpszumowski, @91ns, @msklvsk, @himkt, @DanielRuf, @nathanathan, @GolanLevy, @nipunsadvilkar, @cjhurst, @aliiae, @mirfan899, @ohenrik, @btrungchi, @kleinay, @DuyguA, @stefan-it, @Eleni170, @datascouting, @tjkemp, @x-ji, @giannisdaras, @kororo and @katarkor for the pull requests and contributions.