📊 Help us improve spaCy and take the User Survey 2018!
✨ New features and improvements
- NEW: Alpha Vietnamese support with tokenization via Pyvi.
- NEW: Improved system for error messages and warnings. Errors now have unique error codes and are referenced in one place, and all unspecified `assert`s have been replaced with descriptive errors. See #2163 for implementation details, and let us know if you have any suggestions for errors and warnings in #2164!
- Improve language data for Polish.
- Tidy up dependencies and drop `six`, `html5lib`, `ftfy` and `requests`.
- Improve efficiency (and potentially accuracy) of beam-search training, by randomly using greedy updates for some sentences. This can be controlled by changing the `beam_update_prob` entry in `nlp.parser.cfg`. The default value is 0.5, so 50% of beam updates will be done as greedy updates.
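To illustrate what the `beam_update_prob` setting means, here is a minimal sketch of the idea in plain Python. It is not spaCy's actual training code: `pick_update` and `count_updates` are hypothetical helpers that only simulate the random choice between a greedy update and a beam update.

```python
import random


def pick_update(rng, beam_update_prob=0.5):
    """Return "greedy" with probability beam_update_prob, otherwise "beam".

    Hypothetical helper for illustration only, not part of spaCy.
    """
    return "greedy" if rng.random() < beam_update_prob else "beam"


def count_updates(n_updates, beam_update_prob=0.5, seed=0):
    """Simulate n_updates training updates and count each update type."""
    rng = random.Random(seed)
    counts = {"greedy": 0, "beam": 0}
    for _ in range(n_updates):
        counts[pick_update(rng, beam_update_prob)] += 1
    return counts


# With the default of 0.5, roughly half of the updates fall back to greedy.
print(count_updates(10000, beam_update_prob=0.5))

# In spaCy itself the value is an entry in the parser's config, as described
# in the notes above. Assuming the config behaves like a dict, that could
# look like:
# nlp.parser.cfg["beam_update_prob"] = 0.3
```

A lower value would keep more of the updates as beam updates, while a higher value would trade them for cheaper greedy updates.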
🔴 Bug fixes
- Fix issue #1554, #1752, #2159: Fix `Token.ent_iob` after `Doc.merge()`, and ensure consistency in `Doc.ents`.
- Fix issue #1660: Fix loading of multiple vector models.
- Fix issue #1967: Allow entity types with dashes.
- Fix issue #2032: Fix accidentally quadratic runtime in `Vocab.set_vector`.
- Fix issue #2050: Correct mistakes in Italian lemmatizer data.
- Fix issue #2073: Make `Token.set_extension` work as expected.
- Fix issue #2100, #2151, #2181: Drop `six` and `html5lib` and prevent dependency conflict with TensorFlow / Keras.
- Fix issue #2101: Improve error message if token text is empty string.
- Fix issue #2121: Fix `Language.to_bytes` and pickling in Thinc.
- Fix issue #2156: Fix hashtag example in `Matcher` docs.
- Fix issue #2177: Don't raise error in `set_extension` if `getter` and `setter` are specified or if `default=None`, and add error if `setter` is specified with no `getter`.
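The argument rules described for #2177 can be sketched as a small standalone validator. This is an illustration only, not spaCy's implementation: `check_extension_args` and its error messages are hypothetical, and the requirement that at least one of `default`, `method` or `getter` be supplied is an assumption added to make the `default=None` case meaningful.

```python
def check_extension_args(**kwargs):
    """Hypothetical validator mirroring the rules described for #2177."""
    getter = kwargs.get("getter")
    setter = kwargs.get("setter")
    method = kwargs.get("method")

    # A setter without a getter is an error.
    if setter is not None and getter is None:
        raise ValueError("setter specified without a getter")

    # Assumption: something must define the extension's value. Checking for
    # the *key* "default" (not its value) lets default=None count as valid.
    has_value_source = (
        "default" in kwargs or method is not None or getter is not None
    )
    if not has_value_source:
        raise ValueError("specify one of default, method or getter")
    return True


def get_length(obj):
    """Example getter: compute a value from the object."""
    return len(obj)


def set_length(obj, value):
    """Example setter: placeholder that accepts a value."""
    pass


# Getter and setter together are accepted.
check_extension_args(getter=get_length, setter=set_length)

# An explicit default of None is accepted.
check_extension_args(default=None)

# A setter on its own is rejected.
try:
    check_extension_args(setter=set_length)
except ValueError as err:
    print("Rejected:", err)
```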
📖 Documentation and examples
- Add example for TensorBoard's standalone embedding projector.
- Improve example for training a new entity type.
- Add formal `CITATION` for assigning a DOI via Zenodo.
👥 Contributors
Thanks to @jimregan, @justindujardin, @trungtv, @katrinleinweber and @skrcode for the pull requests and contributions.