piskvorky/gensim 3.7.0 on GitHub

3.7.0, 2019-01-18

🌟 New features

Fast Online NMF (@anotherbugmaster, #2007)

Benchmark wiki-english-20171001

Model	Perplexity	Coherence	L2 norm	Train time (minutes)
LDA	4727.07	-2.514	7.372	138
NMF	975.74	-2.814	7.265	73
NMF (with regularization)	985.57	-2.436	7.269	441

Simple to use (same interface as LdaModel)

from gensim.models.nmf import Nmf
from gensim.corpora import Dictionary
import gensim.downloader as api

text8 = api.load('text8')

dictionary = Dictionary(text8)
dictionary.filter_extremes()

corpus = [
    dictionary.doc2bow(doc) for doc in text8
]

nmf = Nmf(
    corpus=corpus,
    num_topics=5,
    id2word=dictionary,
    chunksize=2000,
    passes=5,
    random_state=42,
)

nmf.show_topics()
"""
[(0, '0.007*"km" + 0.006*"est" + 0.006*"islands" + 0.004*"league" + 0.004*"rate" + 0.004*"female" + 0.004*"economy" + 0.003*"male" + 0.003*"team" + 0.003*"elections"'),
 (1, '0.006*"actor" + 0.006*"player" + 0.004*"bwv" + 0.004*"writer" + 0.004*"actress" + 0.004*"singer" + 0.003*"emperor" + 0.003*"jewish" + 0.003*"italian" + 0.003*"prize"'),
 (2, '0.036*"college" + 0.007*"institute" + 0.004*"jewish" + 0.004*"universidad" + 0.003*"engineering" + 0.003*"colleges" + 0.003*"connecticut" + 0.003*"technical" + 0.003*"jews" + 0.003*"universities"'),
 (3, '0.016*"import" + 0.008*"insubstantial" + 0.007*"y" + 0.006*"soviet" + 0.004*"energy" + 0.004*"info" + 0.003*"duplicate" + 0.003*"function" + 0.003*"z" + 0.003*"jargon"'),
 (4, '0.005*"software" + 0.004*"games" + 0.004*"windows" + 0.003*"microsoft" + 0.003*"films" + 0.003*"apple" + 0.003*"video" + 0.002*"album" + 0.002*"fiction" + 0.002*"characters"')]
"""

See also:
- NMF tutorial
- Full NMF Benchmark
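
For readers new to the technique, the factorization that NMF computes can be sketched in plain Python. This is an illustration only: it uses the classic batch multiplicative-update rule, not the online algorithm implemented in gensim.models.nmf, and every helper name (`matmul`, `frobenius_error`, `nmf_multiplicative`) and the toy matrix `V` are made up for this sketch.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH (the reconstruction error)."""
    wh = matmul(w, h)
    return sum(
        (v_ij - wh_ij) ** 2
        for v_row, wh_row in zip(v, wh)
        for v_ij, wh_ij in zip(v_row, wh_row)
    ) ** 0.5


def nmf_multiplicative(v, num_topics, iterations=200, seed=42, eps=1e-9):
    """Approximate a non-negative matrix V (terms x docs) as W @ H.

    W (terms x topics) holds topic-term weights and H (topics x docs) holds
    document-topic weights. Hypothetical helper, NOT gensim's online algorithm.
    """
    rng = random.Random(seed)
    n_terms, n_docs = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(num_topics)] for _ in range(n_terms)]
    h = [[rng.random() + eps for _ in range(n_docs)] for _ in range(num_topics)]

    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        numer, denom = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h_kj * n / (d + eps) for h_kj, n, d in zip(h_row, n_row, d_row)]
             for h_row, n_row, d_row in zip(h, numer, denom)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        numer, denom = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w_ik * n / (d + eps) for w_ik, n, d in zip(w_row, n_row, d_row)]
             for w_row, n_row, d_row in zip(w, numer, denom)]
    return w, h


# Toy term-document counts (rows = terms, columns = documents).
V = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
]
W, H = nmf_multiplicative(V, num_topics=2)
print("reconstruction error:", frobenius_error(V, W, H))
```

Both factors stay non-negative because each update only multiplies by non-negative ratios, which is what makes the columns of W readable as topics.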

Massive improvement of FastText compatibilities (@mpenkov, #2313)

from gensim.models import FastText

# 'cc.ru.300.bin' - Russian Facebook FT model trained on Common Crawl
# Can be downloaded from https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ru.300.bin.gz

model = FastText.load_fasttext_format("cc.ru.300.bin")

# The fixed hash function produces the same output as Facebook's FastText and works correctly for non-Latin languages (for example, Russian)
assert "мяу" in model.wv.vocab  # 'мяу' is an in-vocabulary word
model.wv.most_similar("мяу")
"""
[('Мяу', 0.6820122003555298),
 ('МЯУ', 0.6373013257980347),
 ('мяу-мяу', 0.593108594417572),
 ('кис-кис', 0.5899622440338135),
 ('гав', 0.5866007804870605),
 ('Кис-кис', 0.5798211097717285),
 ('Кис-кис-кис', 0.5742273330688477),
 ('Мяу-мяу', 0.5699705481529236),
 ('хрю-хрю', 0.5508339405059814),
 ('ав-ав', 0.5479759573936462)]
"""

assert "котогород" not in model.wv.vocab  # 'котогород' is an out-of-vocabulary word
model.wv.most_similar("котогород", topn=3)
"""
[('автогород', 0.5463314652442932),
 ('ТагилНовокузнецкНовомосковскНовороссийскНовосибирскНовотроицкНовочеркасскНовошахтинскНовый',
  0.5423436164855957),
 ('областьНовосибирскБарабинскБердскБолотноеИскитимКарасукКаргатКуйбышевКупиноОбьТатарскТогучинЧерепаново',
  0.5377570390701294)]
"""

# The full model is loaded now, so training can be continued

from gensim.test.utils import datapath
from smart_open import smart_open

with smart_open(datapath("crime-and-punishment.txt"), encoding="utf-8") as infile:  # russian text
    corpus = [line.strip().split() for line in infile]

model.train(corpus, total_examples=len(corpus), epochs=5)
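
To make the hash-function fix more concrete, here is a rough sketch of how a word such as "котогород" can be mapped to n-gram buckets, which is what lets an out-of-vocabulary word receive a vector at all. It assumes the scheme used by Facebook's fastText as we understand it: 32-bit FNV-1a over UTF-8 bytes, with each byte read as a signed 8-bit value. That detail only matters for non-ASCII text. The function names and the defaults (`min_n=3`, `max_n=6`, `num_buckets=2_000_000`) are assumptions for this illustration, not gensim's internal API.

```python
def ft_hash(ngram: str) -> int:
    """FNV-1a (32-bit) over UTF-8 bytes, treating each byte as a signed int8."""
    h = 2166136261  # FNV offset basis
    for byte in ngram.encode("utf-8"):
        if byte >= 0x80:
            # Emulate C's uint32_t(int8_t(byte)): sign-extend bytes >= 128.
            byte |= 0xFFFFFF00
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # FNV prime, wrapped to 32 bits
    return h


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list:
    """Character n-grams of '<word>'; the angle brackets mark word boundaries."""
    extended = "<" + word + ">"
    return [
        extended[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(extended) - n + 1)
    ]


def ngram_buckets(word: str, num_buckets: int = 2_000_000,
                  min_n: int = 3, max_n: int = 6) -> list:
    """Bucket index of every character n-gram of `word`."""
    return [ft_hash(ng) % num_buckets for ng in char_ngrams(word, min_n, max_n)]


print(char_ngrams("кот", 3, 3))
print(ngram_buckets("котогород")[:5])
```

For pure ASCII input the sign extension never triggers, which is why a mismatch in this step would only surface with languages such as Russian.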

Similarity search improvements (@Witiko, #2016)
- Add similarity search using the Levenshtein distance in gensim.similarities.LevenshteinSimilarityIndex
- Performance optimizations to gensim.similarities.SoftCosineSimilarity (full benchmark)
  
  dictionary size corpus size speed
  
  1000 100 1.0×
  1000 1000 53.4×
  1000 100000 156784.8×
  100000 100 3.8×
  100000 1000 405.8×
  100000 100000 66262.0×
- See updated soft-cosine tutorial for more information and usage examples
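
As a rough illustration of how the two items above fit together, the sketch below derives term similarities from the Levenshtein distance and plugs them into a soft cosine measure. It is plain Python written for this note and does not call gensim. The normalization in `term_similarity` is one simple choice and may differ from what LevenshteinSimilarityIndex uses, and all function names are hypothetical.

```python
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def term_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; an illustrative normalization only."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def soft_cosine(x: dict, y: dict) -> float:
    """Soft cosine of two bag-of-words dicts of the form {term: weight}."""
    def form(u, v):
        # Bilinear form u^T S v, where S holds pairwise term similarities.
        return sum(wu * wv * term_similarity(tu, tv)
                   for tu, wu in u.items() for tv, wv in v.items())

    denom = sqrt(form(x, x) * form(y, y))
    return form(x, y) / denom if denom else 0.0


# Ordinary cosine similarity would be 0 here because the documents share no term.
print(soft_cosine({"color": 1.0}, {"colour": 1.0}))
```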

Add python3.7 support (@menshikh-iv, #2211)
- Wheels for Windows, OSX and Linux platforms (@menshikh-iv, MacPython/gensim-wheels/#12)
- Faster installation

👍 Improvements

Optimizations

Reduce Phraser memory usage (drop frequencies) (@jenishah, #2208)
Reduce memory consumption of summarizer (@horpto, #2298)
Replace the slow inline equivalent of mean_absolute_difference with the fast function (@horpto, #2284)
Reuse precalculated updated prior in ldamodel.update_dir_prior (@horpto, #2274)
Improve KeyedVectors.wmdistance (@horpto, #2326)
Optimize remove_unreachable_nodes in gensim.summarization (@horpto, #2263)
Optimize mz_entropy from gensim.summarization (@horpto, #2267)
Improve filter_extremes methods in Dictionary and HashDictionary (@horpto, #2303)

Additions

Add KeyedVectors.relative_cosine_similarity (@rsdel2007, #2307)
Add random_seed to LdaMallet (@Zohaggie & @menshikh-iv, #2153)
Add common_terms parameter to sklearn_api.PhrasesTransformer (@pmlk, #2074)
Add method to patch corpora.Dictionary with special tokens (@Froskekongen, #2200)
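
For context on the first addition, relative cosine similarity rescales the cosine similarity of a word pair by how similar the first word is to its nearest neighbours. The sketch below shows the idea in plain Python on made-up two-dimensional vectors. The standalone function and the `vectors` dictionary are hypothetical stand-ins and do not reproduce the exact signature of the KeyedVectors method.

```python
from math import sqrt


def cosine(u, v):
    """Plain cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def relative_cosine_similarity(vectors, wa, wb, topn=10):
    """cos(wa, wb) divided by the summed cosine of wa's `topn` nearest words."""
    neighbour_sims = sorted(
        (cosine(vectors[wa], vec) for word, vec in vectors.items() if word != wa),
        reverse=True,
    )[:topn]
    return cosine(vectors[wa], vectors[wb]) / sum(neighbour_sims)


# Toy embedding table (assumed values, for illustration only).
vectors = {
    "cat": [1.0, 0.0],
    "kitten": [1.0, 1.0],
    "car": [0.0, 1.0],
}
print(relative_cosine_similarity(vectors, "cat", "kitten", topn=2))
```

The sketch does not guard against a zero denominator, which would occur if all of the nearest neighbours were orthogonal to the query word.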

Cleanup

Improve six usage (xrange, map, zip) (@horpto, #2264)
Refactor line2doc methods of LowCorpus and MalletCorpus (@horpto, #2269)
Get rid of most warnings in testing (@menshikh-iv, #2191)
Fix non-deterministic test failures (pin PYTHONHASHSEED) (@menshikh-iv, #2196)
Fix "aliasing chunkize to chunkize_serial" warning on Windows (@aquatiko, #2202)
Remove __getitem__ code duplication in gensim.models.phrases (@jenishah, #2206)
Add flake8-rst for docstring code examples (@kataev, #2192)
Get rid of py26 stuff (@menshikh-iv, #2214)
Use itertools.chain instead of sum to concatenate lists (@Stigjb, #2212)
Fix flake8 warnings W605, W504 (@horpto, #2256)
Remove unnecessary list creations (@horpto, #2261)
Fix extra list creation in utils.get_max_id (@horpto, #2254)
Fix deprecation warning np.sum(generator) (@rsdel2007, #2296)
Refactor BM25 (@horpto, #2275)
Fix pyemd import (@ramprakash-94, #2240)
Set metadata=True for make_wikicorpus script by default (@Xinyi2016, #2245)
Remove unimportant warning from Phrases (@rsdel2007, #2331)
Replace open() by smart_open() in gensim.models.fasttext._load_fasttext_format (@rsdel2007, #2335)
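
One of the cleanups above swaps `sum` for `itertools.chain` when concatenating lists. A minimal standalone sketch of why, using made-up data rather than gensim code:

```python
from itertools import chain

lists = [[1, 2], [3], [], [4, 5, 6]]

# Quadratic: every `+` copies the accumulated list into a brand-new list.
flat_slow = sum(lists, [])

# Linear: chain walks each inner list once, with no intermediate copies.
flat_fast = list(chain.from_iterable(lists))

print(flat_slow == flat_fast)
```

Both forms give the same result, so the change affects only speed and memory, not behavior.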

🔴 Bug fixes

Fix overflow error for *Vec corpusfile-based training (@bm371613, #2239)
Fix malletmodel2ldamodel conversion (@horpto, #2288)
Replace custom epsilons with numpy equivalent in LdaModel (@horpto, #2308)
Add missing content to tarball (@menshikh-iv, #2194)
Fix division by zero when w_star_count==0 (@allenyllee, #2259)
Fix check for callbacks (@allenyllee, #2251)
Fix SvmLightCorpus.serialize when labels is an instance of numpy.ndarray (@aquatiko, #2243)
Fix poincare viz incompatibility with plotly>=3.0.0 (@jenishah, #2226)
Fix keep_n behavior for Dictionary.filter_extremes (@johann-petrak, #2232)
Fix for sphinx==1.8.1 (last r (@menshikh-iv, #None)
Fix np.issubdtype warnings (@marioyc, #2210)
Drop wrong key -c from gensim.downloader description (@horpto, #2262)
Fix gensim build (docs & pyemd issues) (@menshikh-iv, #2318)
Limit visdom version (avoid py2 issue from the latest visdom release) (@menshikh-iv, #2334)
Fix visdom integration (using viz.line() instead of viz.updatetrace()) (@allenyllee, #2252)

📚 Tutorial and doc improvements

Add gensim-data repo to gensim.downloader & fix rendering of code examples (@menshikh-iv, #2327)
Fix typos in gensim.models (@rsdel2007, #2323)
Fixed typos in notebooks (@rsdel2007, #2322)
Update Doc2Vec documentation: how tags are assigned in corpus_file mode (@persiyanov, #2320)
Fix typos in gensim/models/keyedvectors.py (@rsdel2007, #2290)
Add documentation about ranges to scoring functions for Phrases (@jenishah, #2242)
Update return sections for KeyedVectors.evaluate_word_* (@Stigjb, #2205)
Fix return type in KeyedVectors.evaluate_word_analogies (@Stigjb, #2207)
Fix WmdSimilarity documentation (@jagmoreira, #2217)
Replace fify -> fifty in gensim.parsing.preprocessing.STOPWORDS (@coderwassananmol, #2220)
Remove alpha="auto" from LdaMulticore (not supported yet) (@johann-petrak, #2225)
Update Adopters in README (@piskvorky, #2234)
Fix broken link in tutorials.md (@rsdel2007, #2302)

⚠️ Deprecations (will be removed in the next major release)

Remove
- gensim.models.wrappers.fasttext (obsoleted by the new native gensim.models.fasttext implementation)
- gensim.examples
- gensim.nosy
- gensim.scripts.word2vec_standalone
- gensim.scripts.make_wiki_lemma
- gensim.scripts.make_wiki_online
- gensim.scripts.make_wiki_online_lemma
- gensim.scripts.make_wiki_online_nodebug
- gensim.scripts.make_wiki (all of these obsoleted by the new native gensim.scripts.segment_wiki implementation)
- "deprecated" functions and attributes
Move
- gensim.scripts.make_wikicorpus ➡ gensim.scripts.make_wiki.py
- gensim.summarization ➡ gensim.models.summarization
- gensim.topic_coherence ➡ gensim.models._coherence
- gensim.utils ➡ gensim.utils.utils (old imports will continue to work)
- gensim.parsing.* ➡ gensim.utils.text_utils