github piskvorky/gensim 4.0.0


Changes

4.0.0, 2021-03-24

⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

Gensim 4.0 is a major release with lots of performance & robustness improvements, and a new website.
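As a rough illustration of the kind of breaking change the Migration guide covers, the sketch below translates two Gensim 3.x constructor keywords (`size`, `iter`) to their 4.0 names (`vector_size`, `epochs`). The `upgrade_kwargs` helper and the `RENAMED_KWARGS` table are hypothetical names made up for this example and are not part of Gensim. The table is deliberately incomplete, so treat the Migration guide as the authoritative list.

```python
# Hypothetical helper, NOT part of Gensim: maps a few Gensim 3.x keyword
# names to their 4.0 equivalents. The mapping is intentionally incomplete.
RENAMED_KWARGS = {
    "size": "vector_size",  # dimensionality of the vectors
    "iter": "epochs",       # number of training passes
}


def upgrade_kwargs(old_kwargs):
    """Return a copy of old_kwargs with known 3.x names replaced by 4.0 names."""
    new_kwargs = {}
    for name, value in old_kwargs.items():
        new_name = RENAMED_KWARGS.get(name, name)
        if new_name in new_kwargs:
            # Both the old and the new spelling were passed: refuse to guess.
            raise ValueError("duplicate setting for %r" % new_name)
        new_kwargs[new_name] = value
    return new_kwargs


legacy = {"size": 100, "iter": 5, "window": 5, "min_count": 1}
print(upgrade_kwargs(legacy))

# Possible usage, assuming Gensim 4.x is installed and `sentences` is an
# iterable of token lists:
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences, **upgrade_kwargs(legacy))
```

Keeping the renames in one table makes it easy to extend the mapping while porting a 3.x code base.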

Main highlights

  • Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:

    a. Efficiency

    model    | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput
    fastText | 2.9h / 4.11 GB / 822k words/s            | 2.3h / 1.26 GB / 914k words/s
    word2vec | 1.7h / 0.36 GB / 1685k words/s           | 1.2h / 0.33 GB / 1762k words/s

    In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. (4.0 benchmarks)

    b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

    c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch &co.

    These improvements come to you transparently, aka "for free", but see the Migration guide for the changes that break the old Gensim 3.x API, and update your code accordingly.

  • Dropped a bunch of externally contributed modules and wrappers: summarization, pivoted TFIDF, Mallet…

    • Code quality was not up to our standards. Also, there was no one to maintain these modules, answer user questions, or support them.

      So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork & publish into your own repo. They can live happily outside of Gensim.

  • Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.

    • If you still need Python 2 for some reason, stay at Gensim 3.8.3.
  • A new Gensim website – finally! 🙃
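The ratios behind the efficiency summary in the first highlight can be checked with a few lines of Python. This is only a sketch: the figures are copied from the benchmark table above, while the `benchmarks` dictionary and the `improvement` helper are illustrative names, not part of Gensim. The table does not report initialization time or phrase detection, so those two claims cannot be verified this way.

```python
# Figures copied from the efficiency table:
# (wall time in hours, peak RAM in GB, throughput in thousand words/s).
benchmarks = {
    "fastText": {"3.8.3": (2.9, 4.11, 822), "4.0.0": (2.3, 1.26, 914)},
    "word2vec": {"3.8.3": (1.7, 0.36, 1685), "4.0.0": (1.2, 0.33, 1762)},
}


def improvement(model):
    """Return 3.8.3 -> 4.0.0 improvement ratios (values above 1 are better)."""
    old_hours, old_gb, old_kwps = benchmarks[model]["3.8.3"]
    new_hours, new_gb, new_kwps = benchmarks[model]["4.0.0"]
    return {
        "wall_time_speedup": old_hours / new_hours,
        "ram_reduction": old_gb / new_gb,
        "throughput_gain": new_kwps / old_kwps,
    }


for model in benchmarks:
    ratios = improvement(model)
    print(model, {name: round(value, 2) for name, value in ratios.items()})
```

For fastText the RAM ratio works out to roughly 3.26, which matches the "3x less RAM" summary.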

So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.

This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting concrete NLP & document similarity use-cases.

👍 New features

📚 Tutorials and docs

🔴 Bug fixes

  • #2891: Fix fastText word-vectors with ngrams off, by @gojomo
  • #2907: Fix doc2vec crash for large sets of doc-vectors, by @gojomo
  • #2899: Fix similarity bug in NMSLIB indexer, by @piskvorky
  • #2899: Fix deprecation warnings in Annoy integration, by @piskvorky
  • #2901: Fix inheritance of WikiCorpus from TextCorpus, by @jenishah
  • #2940: Fix deprecations in SoftCosineSimilarity, by @Witiko
  • #2944: Fix save_facebook_model failure after update-vocab & other initialization streamlining, by @gojomo
  • #2846: Fix for Python 3.9/3.10: remove xml.etree.cElementTree, by @hugovk
  • #2973: phrases.export_phrases() doesn't yield all bigrams, by @piskvorky
  • #2942: Segfault when training doc2vec, by @gojomo
  • #3041: Fix RuntimeError in export_phrases (change defaultdict to dict), by @thalishsajeed
  • #3059: Fix race condition in FastText tests, by @sleepy-owl
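To give a feel for the class of bug behind #3041, the sketch below shows how iterating over a `defaultdict` can raise a `RuntimeError`, and how a plain `dict` avoids it. This is a generic Python illustration under the assumption that the original failure followed this pattern. It is not the actual Gensim code, and the function names and sample counts are made up for the example.

```python
from collections import defaultdict


def sum_counts(vocab, words):
    """Sum counts for `words` once per vocab entry, using vocab[word]."""
    total = 0
    for _ in vocab:               # iterating over the mapping...
        for word in words:
            total += vocab[word]  # ...while a defaultdict inserts missing keys
    return total


def sum_counts_safe(vocab, words):
    """Same computation, but .get() never inserts missing keys."""
    total = 0
    for _ in vocab:
        for word in words:
            total += vocab.get(word, 0)
    return total


counts = {"new": 3, "york": 2}  # made-up sample data

# defaultdict: looking up the missing key "city" adds it mid-iteration,
# so the next step of the outer loop fails.
try:
    sum_counts(defaultdict(int, counts), ["new", "city"])
    failed = False
except RuntimeError as err:
    failed = True
    print("defaultdict:", err)

# Plain dict with .get(): the missing key reads as 0 and nothing is inserted.
print("dict:", sum_counts_safe(dict(counts), ["new", "city"]))
```

The difference is that a `defaultdict` lookup of a missing key mutates the mapping, whereas `dict.get()` only reads from it.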

⚠️ Removed functionality & deprecations

🔮 Testing, CI, housekeeping
