piskvorky/gensim 3.2.0 on GitHub

3.2.0, 2017-12-09

🌟 New features:

New download API for corpora and pre-trained models (@chaitaliSaini & @menshikh-iv, #1705 & #1632 & #1492)

Download large NLP datasets in one line of Python, then use with memory-efficient data streaming:

import gensim.downloader as api

for article in api.load("wiki-english-20171001"):
    print(article)

Don’t waste time searching for good word embeddings; use the curated ones:

import gensim.downloader as api

model = api.load("glove-twitter-25")
model.most_similar("engineer")

# [('specialist', 0.957542896270752),
#  ('developer', 0.9548177123069763),
#  ('administrator', 0.9432312846183777),
#  ('consultant', 0.93915855884552),
#  ('technician', 0.9368376135826111),
#  ('analyst', 0.9342101216316223),
#  ('architect', 0.9257484674453735),
#  ('engineering', 0.9159940481185913),
#  ('systems', 0.9123805165290833),
#  ('consulting', 0.9112802147865295)]

Blog post introducing the API and design decisions.
Jupyter notebook with examples
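The "memory-efficient data streaming" mentioned above means the corpus is read from disk one item at a time instead of being loaded into RAM. The sketch below illustrates the general pattern only. The class name `JsonLinesCorpus` and the gzipped JSON-lines file layout are assumptions made for illustration, not the downloader's actual code.

```python
import gzip
import json


class JsonLinesCorpus:
    """Hypothetical streamed corpus: yields one JSON document per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass, so the corpus can be iterated
        # multiple times while keeping only one document in memory.
        with gzip.open(self.path, "rt", encoding="utf-8") as fin:
            for line in fin:
                yield json.loads(line)
```

Because `__iter__` restarts from the beginning of the file each time, an object like this can be passed to algorithms that make several passes over the data.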

New model: Poincaré embeddings (@jayantj, #1696 & #1700 & #1757 & #1734)

Embed a graph (taxonomy) in the same way as word2vec embeds words:

from gensim.models.poincare import PoincareRelations, PoincareModel
from gensim.test.utils import datapath

data = PoincareRelations(datapath('poincare_hypernyms.tsv'))
model = PoincareModel(data)
model.kv.most_similar("cat.n.01")

# [('kangaroo.n.01', 0.010581353439700418),
#  ('gib.n.02', 0.011171531439892076),
#  ('striped_skunk.n.01', 0.012025106076442395),
#  ('metatherian.n.01', 0.01246679759214648),
#  ('mammal.n.01', 0.013281303506525968),
#  ('marsupial.n.01', 0.013941330203709653)]

Tutorial on Poincaré embeddings (Jupyter notebook).
Model introduction and the journey of its implementation (blog post).
Original paper on arXiv.
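For intuition, Poincaré embeddings place nodes inside the unit ball and compare them with a hyperbolic distance rather than cosine similarity. The ascending scores in the output above are consistent with distances, where smaller means closer. The sketch below computes the standard Poincaré ball distance in plain Python. The function name `poincare_distance` is illustrative and this is not gensim's implementation.

```python
import math


def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# Points near the boundary are much farther from the origin than their
# Euclidean coordinates suggest, which leaves room for tree-like hierarchies.
near = poincare_distance([0.0, 0.0], [0.5, 0.0])
far = poincare_distance([0.0, 0.0], [0.9, 0.0])
```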

Optimized FastText (@manneshiva, #1742)

New fast multithreaded implementation of FastText, natively in Python/Cython. Deprecates the existing wrapper for Facebook’s C++ implementation.

import gensim.downloader as api
from gensim.models import FastText

model = FastText(api.load("text8"))
model.most_similar("cat")

# [('catnip', 0.8538144826889038),
#  ('catwalk', 0.8136177062988281),
#  ('catchy', 0.7828493118286133),
#  ('caf', 0.7826495170593262),
#  ('bobcat', 0.7745151519775391),
#  ('tomcat', 0.7732658386230469),
#  ('moat', 0.7728310823440552),
#  ('caye', 0.7666271328926086),
#  ('catv', 0.7651021480560303),
#  ('caveat', 0.7643581628799438)]
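Many of the neighbours above share spelling with "cat" rather than meaning, which follows from how FastText represents a word: as a bag of character n-grams. The sketch below extracts such n-grams in plain Python, assuming the commonly used boundary markers `<` and `>` and n-gram lengths 3 to 6. The helper name `char_ngrams` is hypothetical and not part of gensim.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


# n-grams that "cat" and "catnip" have in common
shared = set(char_ngrams("cat")) & set(char_ngrams("catnip"))
```

Words that share many n-grams end up with similar vectors, which is also why the model can produce a vector for a word it never saw during training.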

Binary pre-compiled wheels for Windows, OSX and Linux (@menshikh-iv, MacPython/gensim-wheels/#7)
- Users no longer need a C compiler to use the fast (Cythonized) versions of word2vec, doc2vec, fasttext, etc.
- Faster Gensim pip installation
Added DeprecationWarnings to deprecated methods and parameters, with a clear schedule for removal.
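Python hides `DeprecationWarning` in many contexts by default, so deprecated calls may go unnoticed unless warnings are switched on. The sketch below shows the general pattern with the standard `warnings` module. `old_method` and `new_method` are hypothetical stand-ins, not gensim functions.

```python
import warnings


def new_method(x):
    return x * 2


def old_method(x):
    # stacklevel=2 attributes the warning to the caller's line,
    # not to this wrapper.
    warnings.warn(
        "old_method() is deprecated; use new_method() instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_method(x)


# Make deprecation warnings visible in your own scripts or test runs.
warnings.simplefilter("default", DeprecationWarning)
```

Running a script with `python -W default` has a similar effect without changing any code.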

👍 Improvements:

Add Montemurro and Zanette's entropy-based keyword extraction algorithm. Fix #665 (@PeteBleackley, #1738)
Fix flake8 E731, E402, refactor tests & sklearn API code. Partial fix #1644 (@horpto, #1689)
Reduce distribution size. Fix #1698 (@menshikh-iv, #1699)
Improve scan_vocab speed, build_vocab_from_freq method (@jodevak, #1695)
Improve segment_wiki script (@piskvorky, #1707)
Add custom dtype support for LdaModel. Partially fix #1576 (@xelez, #1656)
Add doc2idx method for gensim.corpora.Dictionary. Fix #1634 (@roopalgarg, #1720)
Add tox and pytest to gensim, integration with Travis and Appveyor. Fix #1613, #1644 (@menshikh-iv, #1721)
Add flag for hiding outdated data for gensim.downloader.info (@menshikh-iv, #1736)
Add reproducible order between Python versions for gensim.corpora.Dictionary (@formi23, #1715)
Update tox.ini, setup.cfg, README.md (@menshikh-iv, #1741)
Add optimized logsumexp for LdaModel (@arlenk, #1745)
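As an illustration of the new `doc2idx` method listed above: it converts a tokenized document into a list of token ids, keeping word order and marking unknown tokens with a placeholder index. The plain-Python sketch below mimics that behaviour on an ordinary dict. The standalone function and the sample vocabulary are assumptions for illustration, not gensim's code.

```python
def doc2idx(token2id, document, unknown_word_index=-1):
    """Map each token to its id, using a placeholder for unknown tokens."""
    return [token2id.get(token, unknown_word_index) for token in document]


vocabulary = {"cat": 0, "dog": 1}
indices = doc2idx(vocabulary, ["dog", "bird", "cat"])
```

Unlike a bag-of-words conversion, the result preserves token order and repetitions, which suits sequence models.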

🔴 Bug fixes:

Fix ranking formula in gensim.summarization.bm25. Fix #1718 (@souravsingh, #1726)
Fix incompatibility in persistence for FastText wrapper. Fix #1642 (@chinmayapancholi13, #1723)
Fix gensim.sklearn_api bug with documents_columns parameter. Fix #1676 (@chinmayapancholi13, #1704)
Fix slowdown of CI, remove pytest-cov (@menshikh-iv, #1728)
Replace outdated packages in Dockerfile (@rbahumi, #1730)
Replace num_words with topn in LdaMallet.show_topics. Fix #1747 (@apoorvaeternity, #1749)
Fix os.rename in gensim.downloader when 'src' and 'dst' are on different partitions (@anotherbugmaster, #1733)
Fix DeprecationWarning from logsumexp (@dreamgonfly, #1703)
Fix backward compatibility problem in Phrases.load. Fix #1751 (@alexgarel, #1758)
Fix load_word2vec_format from FastText. Fix #1743 (@manneshiva, #1755)
Fix ipython kernel version in Dockerfile. Fix #1762 (@rbahumi, #1764)
Fix writing in segment_wiki (@horpto, #1763)
Fix "write method of file requires bytes-like object" error in segment_wiki (@horpto, #1750)
Fix incorrect vectors learned during online training for FastText. Fix #1752 (@manneshiva, #1756)
Fix dtype of model.wv.syn0_vocab on updating vocab for FastText. Fix #1759 (@manneshiva, #1760)
Fix hashing-trick from FastText.build_vocab. Fix #1765 (@manneshiva, #1768)
Add explicit DeprecationWarning for all outdated stuff. Fix #1753 (@menshikh-iv, #1769)
Fix epsilon according to dtype in LdaModel (@menshikh-iv, #1770)
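The `os.rename` item above reflects a general pitfall: a rename can fail with `OSError` when source and destination live on different filesystems. One common workaround, sketched below with the standard library only, is to fall back to `shutil.move`, which copies and then deletes. The helper name `move_file` is hypothetical, and this is not necessarily how the gensim fix is written.

```python
import os
import shutil


def move_file(src, dst):
    """Move a file, falling back to copy-and-delete if rename fails."""
    try:
        # Fast path: works when src and dst are on the same filesystem.
        os.rename(src, dst)
    except OSError:
        # Fallback: shutil.move copies the data and removes the source,
        # so it also works across partitions.
        shutil.move(src, dst)
```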

📚 Tutorial and doc improvements:

Update perf numbers of segment_wiki (@piskvorky, #1708)
Update docstring for gensim.summarization.summarize. Fix #1575 (@fbarrios, #1702)
Refactor API Reference for gensim.parsing. Fix #1664 (@CLearERR, #1684)
Fix typos in doc2vec-wikipedia notebook (@youqad, #1727)
Fix PyPI long description rendering (@edigaryev, #1739)
Fix twitter badge src (@menshikh-iv)
Fix mailing list badge color (@menshikh-iv)

⚠️ Deprecations (will be removed in the next major release)

Remove
- gensim.examples
- gensim.nosy
- gensim.scripts.word2vec_standalone
- gensim.scripts.make_wiki_lemma
- gensim.scripts.make_wiki_online
- gensim.scripts.make_wiki_online_lemma
- gensim.scripts.make_wiki_online_nodebug
- gensim.scripts.make_wiki
Move
- gensim.scripts.make_wikicorpus ➡ gensim.scripts.make_wiki.py
- gensim.summarization ➡ gensim.models.summarization
- gensim.topic_coherence ➡ gensim.models._coherence
- gensim.utils ➡ gensim.utils.utils (old imports will continue to work)
- gensim.parsing.* ➡ gensim.utils.text_utils

piskvorky/gensim 3.2.0 Christmas Come Early on GitHub
