github piskvorky/gensim 3.2.0
Christmas Come Early

3.2.0, 2017-12-09

🌟 New features:

  • New download API for corpora and pre-trained models (@chaitaliSaini & @menshikh-iv, #1705 & #1632 & #1492)

    • Download large NLP datasets in one line of Python, then use with memory-efficient data streaming:
      import gensim.downloader as api
      
      for article in api.load("wiki-english-20171001"):
          print(article)
    • Don’t waste time searching for good word embeddings; use the curated ones:
      import gensim.downloader as api
      
      model = api.load("glove-twitter-25")
      model.most_similar("engineer")
      
      # [('specialist', 0.957542896270752),
      #  ('developer', 0.9548177123069763),
      #  ('administrator', 0.9432312846183777),
      #  ('consultant', 0.93915855884552),
      #  ('technician', 0.9368376135826111),
      #  ('analyst', 0.9342101216316223),
      #  ('architect', 0.9257484674453735),
      #  ('engineering', 0.9159940481185913),
      #  ('systems', 0.9123805165290833),
      #  ('consulting', 0.9112802147865295)]
    • Blog post introducing the API and design decisions.
    • Jupyter notebook with examples
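
The "memory-efficient data streaming" mentioned above means the corpus is consumed as an iterable, one document at a time, rather than loaded into RAM as a whole. A minimal sketch of that pattern follows, using only the standard library so it runs without any download. `StreamedCorpus` and the temporary file are hypothetical names made up for illustration; this is not gensim's implementation.

```python
import os
import tempfile


class StreamedCorpus:
    """Iterate over a text file one document (line) at a time.

    Hypothetical helper for illustration; not part of gensim.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


# Build a tiny throwaway corpus so the sketch runs offline.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Human machine interface\n\nGraph of trees\n")
    corpus_path = tmp.name

corpus = StreamedCorpus(corpus_path)
first_pass = list(corpus)   # only one line is held in memory while iterating
second_pass = list(corpus)  # a second pass works because __iter__ re-opens the file
os.remove(corpus_path)
```

Because the object is re-iterable rather than a one-shot generator, it can be handed to code that makes several passes over the data.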
  • New model: Poincaré embeddings (@jayantj, #1696 & #1700 & #1757 & #1734)

    • Embed a graph (taxonomy) in the same way as word2vec embeds words:
      from gensim.models.poincare import PoincareRelations, PoincareModel
      from gensim.test.utils import datapath
      
      data = PoincareRelations(datapath('poincare_hypernyms.tsv'))
      model = PoincareModel(data)
      model.kv.most_similar("cat.n.01")
      
      # [('kangaroo.n.01', 0.010581353439700418),
      #  ('gib.n.02', 0.011171531439892076),
      #  ('striped_skunk.n.01', 0.012025106076442395),
      #  ('metatherian.n.01', 0.01246679759214648),
      #  ('mammal.n.01', 0.013281303506525968),
      #  ('marsupial.n.01', 0.013941330203709653)]
    • Tutorial on Poincaré embeddings (Jupyter notebook).
    • Model introduction and the journey of its implementation (blog post).
    • Original paper on arXiv.
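
For readers unfamiliar with the geometry, the standard distance on the Poincaré ball is d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). The sketch below evaluates it in plain Python. The function name `poincare_distance` and the sample points are assumptions chosen for illustration, not gensim's API; the scores in the output above appear to be distances of this kind, where smaller means closer.

```python
import math


def poincare_distance(u, v):
    """Poincaré-ball distance between two points strictly inside the unit ball.

    Illustrative helper; not gensim's implementation.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(
        1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    )


origin = (0.0, 0.0)
near_center = (0.5, 0.0)
near_boundary = (0.95, 0.0)

d_center = poincare_distance(origin, near_center)      # equals ln(3)
d_boundary = poincare_distance(origin, near_boundary)  # equals ln(39)
```

Distances grow rapidly as a point approaches the boundary of the ball, which is what leaves room for the many leaves of a taxonomy near the edge while general concepts sit near the origin.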
  • Optimized FastText (@manneshiva, #1742)

    • New fast multithreaded implementation of FastText, natively in Python/Cython. Deprecates the existing wrapper for Facebook’s C++ implementation.
      import gensim.downloader as api
      from gensim.models import FastText
      
      model = FastText(api.load("text8"))
      model.wv.most_similar("cat")  # query via the model's word vectors (`wv`)
      
      # [('catnip', 0.8538144826889038),
      #  ('catwalk', 0.8136177062988281),
      #  ('catchy', 0.7828493118286133),
      #  ('caf', 0.7826495170593262),
      #  ('bobcat', 0.7745151519775391),
      #  ('tomcat', 0.7732658386230469),
      #  ('moat', 0.7728310823440552),
      #  ('caye', 0.7666271328926086),
      #  ('catv', 0.7651021480560303),
      #  ('caveat', 0.7643581628799438)]
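
The neighbours above share many letters with "cat" because FastText builds a word's vector from its character n-grams. The sketch below only enumerates those n-grams, assuming the usual `<` and `>` boundary markers and n-gram lengths from 3 to 6. `char_ngrams` is a hypothetical helper; the real implementation additionally hashes n-grams into buckets, which is omitted here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of `word`, with boundary markers added.

    Illustrative helper; not gensim's implementation.
    """
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


cat_ngrams = char_ngrams("cat")
# N-grams that "cat" and "catnip" have in common.
shared = set(cat_ngrams) & set(char_ngrams("catnip"))
```

Words with overlapping n-grams receive overlapping vector components, which also lets the model produce vectors for words it never saw during training.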
  • Binary pre-compiled wheels for Windows, OSX and Linux (@menshikh-iv, MacPython/gensim-wheels/#7)

    • Users no longer need a C compiler to use the fast (Cythonized) versions of word2vec, doc2vec, fasttext, etc.
    • Faster Gensim pip installation
  • Added DeprecationWarnings to deprecated methods and parameters, with a clear schedule for removal.
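
As a rough illustration of what attaching a DeprecationWarning to a method involves, here is a minimal sketch built on the standard `warnings` module. The `deprecated` decorator and the `old_tokenize` function are hypothetical names invented for this example; gensim's own helper may differ.

```python
import functools
import warnings


def deprecated(reason):
    """Decorator that emits a DeprecationWarning each time the function is called."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                "{} is deprecated: {}".format(func.__name__, reason),
                category=DeprecationWarning,
                stacklevel=2,  # point the warning at the caller, not the wrapper
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated("will be removed in the next major release")
def old_tokenize(text):
    return text.split()


# DeprecationWarning is hidden by default in many contexts, so record it explicitly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_tokenize("hello world")
```

The decorated function still returns its normal result; the warning only signals that callers should migrate before the removal.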

👍 Improvements:

🔴 Bug fixes:

📚 Tutorial and doc improvements:

⚠️ Deprecations (will be removed in the next major release)

  • Remove

    • gensim.examples
    • gensim.nosy
    • gensim.scripts.word2vec_standalone
    • gensim.scripts.make_wiki_lemma
    • gensim.scripts.make_wiki_online
    • gensim.scripts.make_wiki_online_lemma
    • gensim.scripts.make_wiki_online_nodebug
    • gensim.scripts.make_wiki
  • Move

    • gensim.scripts.make_wikicorpus → gensim.scripts.make_wiki.py
    • gensim.summarization → gensim.models.summarization
    • gensim.topic_coherence → gensim.models._coherence
    • gensim.utils → gensim.utils.utils (old imports will continue to work)
    • gensim.parsing.* → gensim.utils.text_utils
