3.7.0, 2019-01-18
🌟 New features
- Fast Online NMF (@anotherbugmaster, #2007)
  - Benchmark `wiki-english-20171001`

    | Model | Perplexity | Coherence | L2 norm | Train time (minutes) |
    |-------|------------|-----------|---------|----------------------|
    | LDA | 4727.07 | -2.514 | 7.372 | 138 |
    | NMF | 975.74 | -2.814 | 7.265 | 73 |
    | NMF (with regularization) | 985.57 | -2.436 | 7.269 | 441 |

  - Simple to use (same interface as `LdaModel`)

        from gensim.models.nmf import Nmf
        from gensim.corpora import Dictionary
        import gensim.downloader as api

        text8 = api.load('text8')

        dictionary = Dictionary(text8)
        dictionary.filter_extremes()

        corpus = [
            dictionary.doc2bow(doc)
            for doc in text8
        ]

        nmf = Nmf(
            corpus=corpus,
            num_topics=5,
            id2word=dictionary,
            chunksize=2000,
            passes=5,
            random_state=42,
        )

        nmf.show_topics()
        """
        [(0, '0.007*"km" + 0.006*"est" + 0.006*"islands" + 0.004*"league" + 0.004*"rate" + 0.004*"female" + 0.004*"economy" + 0.003*"male" + 0.003*"team" + 0.003*"elections"'),
         (1, '0.006*"actor" + 0.006*"player" + 0.004*"bwv" + 0.004*"writer" + 0.004*"actress" + 0.004*"singer" + 0.003*"emperor" + 0.003*"jewish" + 0.003*"italian" + 0.003*"prize"'),
         (2, '0.036*"college" + 0.007*"institute" + 0.004*"jewish" + 0.004*"universidad" + 0.003*"engineering" + 0.003*"colleges" + 0.003*"connecticut" + 0.003*"technical" + 0.003*"jews" + 0.003*"universities"'),
         (3, '0.016*"import" + 0.008*"insubstantial" + 0.007*"y" + 0.006*"soviet" + 0.004*"energy" + 0.004*"info" + 0.003*"duplicate" + 0.003*"function" + 0.003*"z" + 0.003*"jargon"'),
         (4, '0.005*"software" + 0.004*"games" + 0.004*"windows" + 0.003*"microsoft" + 0.003*"films" + 0.003*"apple" + 0.003*"video" + 0.002*"album" + 0.002*"fiction" + 0.002*"characters"')]
        """
- Massive improvement of `FastText` compatibility (@mpenkov, #2313)

      from gensim.models import FastText

      # 'cc.ru.300.bin' - Russian Facebook FT model trained on Common Crawl
      # Can be downloaded from https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.ru.300.bin.gz
      model = FastText.load_fasttext_format("cc.ru.300.bin")

      # The fixed hash function produces the same output as FB FastText
      # and works correctly for non-Latin languages (for example, Russian)
      assert "мяу" in model.wv.vocab  # 'мяу' - vocab word
      model.wv.most_similar("мяу")
      """
      [('Мяу', 0.6820122003555298),
       ('МЯУ', 0.6373013257980347),
       ('мяу-мяу', 0.593108594417572),
       ('кис-кис', 0.5899622440338135),
       ('гав', 0.5866007804870605),
       ('Кис-кис', 0.5798211097717285),
       ('Кис-кис-кис', 0.5742273330688477),
       ('Мяу-мяу', 0.5699705481529236),
       ('хрю-хрю', 0.5508339405059814),
       ('ав-ав', 0.5479759573936462)]
      """

      assert "котогород" not in model.wv.vocab  # 'котогород' - out-of-vocab word
      model.wv.most_similar("котогород", topn=3)
      """
      [('автогород', 0.5463314652442932),
       ('ТагилНовокузнецкНовомосковскНовороссийскНовосибирскНовотроицкНовочеркасскНовошахтинскНовый', 0.5423436164855957),
       ('областьНовосибирскБарабинскБердскБолотноеИскитимКарасукКаргатКуйбышевКупиноОбьТатарскТогучинЧерепаново', 0.5377570390701294)]
      """

      # The full model is now loaded, so training can be continued
      from gensim.test.utils import datapath
      from smart_open import smart_open

      with smart_open(datapath("crime-and-punishment.txt"), encoding="utf-8") as infile:  # russian text
          corpus = [line.strip().split() for line in infile]

      model.train(corpus, total_examples=len(corpus), epochs=5)
- Similarity search improvements (@Witiko, #2016)
  - Add similarity search using the Levenshtein distance in `gensim.similarities.LevenshteinSimilarityIndex`
  - Performance optimizations to `gensim.similarities.SoftCosineSimilarity` (full benchmark)

    | Dictionary size | Corpus size | Speedup |
    |-----------------|-------------|---------|
    | 1000 | 100 | 1.0× |
    | 1000 | 1000 | 53.4× |
    | 1000 | 100000 | 156784.8× |
    | 100000 | 100 | 3.8× |
    | 100000 | 1000 | 405.8× |
    | 100000 | 100000 | 66262.0× |

  - See the updated soft-cosine tutorial for more information and usage examples
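
For intuition about what a Levenshtein-based term similarity measures, here is a minimal plain-Python sketch. It does not call `gensim.similarities.LevenshteinSimilarityIndex`; the helper names `levenshtein` and `levenshtein_similarity` and the length normalization are assumptions chosen for illustration, and gensim's own index may scale the score differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical terms, lower as edits grow."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


distance = levenshtein("kitten", "sitting")
similarity = levenshtein_similarity("kitten", "sitting")
```

A similarity of this kind lets terms that differ only by a typo or an inflection contribute to a soft-cosine score even when no word embeddings are available.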
- Add `python3.7` support (@menshikh-iv, #2211)
  - Wheels for Windows, OSX and Linux platforms (@menshikh-iv, MacPython/gensim-wheels/#12)
  - Faster installation
👍 Improvements
Optimizations
- Reduce `Phraser` memory usage (drop frequencies) (@jenishah, #2208)
- Reduce memory consumption of summarizer (@horpto, #2298)
- Replace the slow inline equivalent of `mean_absolute_difference` with the fast one (@horpto, #2284)
- Reuse precalculated updated prior in `ldamodel.update_dir_prior` (@horpto, #2274)
- Improve `KeyedVector.wmdistance` (@horpto, #2326)
- Optimize `remove_unreachable_nodes` in `gensim.summarization` (@horpto, #2263)
- Optimize `mz_entropy` from `gensim.summarization` (@horpto, #2267)
- Improve `filter_extremes` methods in `Dictionary` and `HashDictionary` (@horpto, #2303)
Additions
- Add `KeyedVectors.relative_cosine_similarity` (@rsdel2007, #2307)
- Add `random_seed` to `LdaMallet` (@Zohaggie & @menshikh-iv, #2153)
- Add `common_terms` parameter to `sklearn_api.PhrasesTransformer` (@pmlk, #2074)
- Add method for patching `corpora.Dictionary` based on special tokens (@Froskekongen, #2200)
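
As a rough illustration of what a relative cosine similarity measures, the sketch below divides the cosine similarity of two words by the summed cosine similarity between the first word and its `topn` nearest neighbours. This is plain Python on made-up 2-d vectors, not gensim's implementation; the helper names `cosine` and `relative_cosine_similarity`, the `vectors` mapping and the toy data are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def relative_cosine_similarity(vectors, wa, wb, topn=10):
    """cos(wa, wb) divided by the summed cosine of wa's `topn` nearest neighbours."""
    neighbour_sims = sorted(
        (cosine(vectors[wa], vec) for word, vec in vectors.items() if word != wa),
        reverse=True,
    )[:topn]
    return cosine(vectors[wa], vectors[wb]) / sum(neighbour_sims)


# Toy 2-d vectors, made up for illustration only.
toy_vectors = {
    "cat": [1.0, 0.0],
    "kitten": [1.0, 0.0],
    "dog": [0.6, 0.8],
    "car": [0.0, 1.0],
}

# Neighbours of "cat" with topn=2 are "kitten" (1.0) and "dog" (about 0.6),
# so the score is roughly 1.0 / 1.6.
score = relative_cosine_similarity(toy_vectors, "cat", "kitten", topn=2)
```

Because the score is normalized by the neighbourhood of the first word, it is not symmetric: swapping `wa` and `wb` generally gives a different value.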
Cleanup
- Improve `six` usage (`xrange`, `map`, `zip`) (@horpto, #2264)
- Refactor `line2doc` methods of `LowCorpus` and `MalletCorpus` (@horpto, #2269)
- Get rid of most warnings in testing (@menshikh-iv, #2191)
- Fix non-deterministic test failures (pin `PYTHONHASHSEED`) (@menshikh-iv, #2196)
- Fix "aliasing chunkize to chunkize_serial" warning on Windows (@aquatiko, #2202)
- Remove `__getitem__` code duplication in `gensim.models.phrases` (@jenishah, #2206)
- Add `flake8-rst` for docstring code examples (@kataev, #2192)
- Get rid of `py26` stuff (@menshikh-iv, #2214)
- Use `itertools.chain` instead of `sum` to concatenate lists (@Stigjb, #2212)
- Fix flake8 warnings W605, W504 (@horpto, #2256)
- Remove unnecessary list creation (@horpto, #2261)
- Fix extra list creation in `utils.get_max_id` (@horpto, #2254)
- Fix deprecation warning `np.sum(generator)` (@rsdel2007, #2296)
- Refactor `BM25` (@horpto, #2275)
- Fix pyemd import (@ramprakash-94, #2240)
- Set `metadata=True` for `make_wikicorpus` script by default (@Xinyi2016, #2245)
- Remove unimportant warning from `Phrases` (@rsdel2007, #2331)
- Replace `open()` by `smart_open()` in `gensim.models.fasttext._load_fasttext_format` (@rsdel2007, #2335)
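
The motivation for preferring `itertools.chain` over `sum` when concatenating lists can be seen in a small standalone example. This is a generic illustration rather than the code changed in #2212; `nested` is sample data invented here.

```python
from itertools import chain

nested = [[1, 2], [3], [], [4, 5, 6]]

# sum() builds a new intermediate list on every addition, so the total work
# grows quadratically with the number of elements.
flat_with_sum = sum(nested, [])

# chain.from_iterable() walks each sublist once, and list() materializes the
# result in a single pass.
flat_with_chain = list(chain.from_iterable(nested))
```

Both expressions produce the same flat list; the difference only becomes noticeable in time and memory once the number of sublists is large.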
🔴 Bug fixes
- Fix overflow error for `*Vec` corpusfile-based training (@bm371613, #2239)
- Fix `malletmodel2ldamodel` conversion (@horpto, #2288)
- Replace custom epsilons with numpy equivalent in `LdaModel` (@horpto, #2308)
- Add missing content to tarball (@menshikh-iv, #2194)
- Fix division by zero when `w_star_count==0` (@allenyllee, #2259)
- Fix check for callbacks (@allenyllee, #2251)
- Fix `SvmLightCorpus.serialize` if `labels` is an instance of `numpy.ndarray` (@aquatiko, #2243)
- Fix poincare viz incompatibility with `plotly>=3.0.0` (@jenishah, #2226)
- Fix `keep_n` behavior for `Dictionary.filter_extremes` (@johann-petrak, #2232)
- Fix for `sphinx==1.8.1` (last r (@menshikh-iv, #None)
- Fix `np.issubdtype` warnings (@marioyc, #2210)
- Drop wrong key `-c` from `gensim.downloader` description (@horpto, #2262)
- Fix gensim build (docs & pyemd issues) (@menshikh-iv, #2318)
- Limit visdom version (avoid py2 issue from the latest visdom release) (@menshikh-iv, #2334)
- Fix visdom integration (using `viz.line()` instead of `viz.updatetrace()`) (@allenyllee, #2252)
📚 Tutorial and doc improvements
- Add gensim-data repo to `gensim.downloader` & fix rendering of code examples (@menshikh-iv, #2327)
- Fix typos in `gensim.models` (@rsdel2007, #2323)
- Fixed typos in notebooks (@rsdel2007, #2322)
- Update `Doc2Vec` documentation: how tags are assigned in `corpus_file` mode (@persiyanov, #2320)
- Fix typos in `gensim/models/keyedvectors.py` (@rsdel2007, #2290)
- Add documentation about ranges to scoring functions for `Phrases` (@jenishah, #2242)
- Update return sections for `KeyedVectors.evaluate_word_*` (@Stigjb, #2205)
- Fix return type in `KeyedVector.evaluate_word_analogies` (@Stigjb, #2207)
- Fix `WmdSimilarity` documentation (@jagmoreira, #2217)
- Replace `fify -> fifty` in `gensim.parsing.preprocessing.STOPWORDS` (@coderwassananmol, #2220)
- Remove `alpha="auto"` from `LdaMulticore` (not supported yet) (@johann-petrak, #2225)
- Update Adopters in README (@piskvorky, #2234)
- Fix broken link in `tutorials.md` (@rsdel2007, #2302)
⚠️ Deprecations (will be removed in the next major release)
- Remove
  - `gensim.models.wrappers.fasttext` (obsoleted by the new native `gensim.models.fasttext` implementation)
  - `gensim.examples`
  - `gensim.nosy`
  - `gensim.scripts.word2vec_standalone`
  - `gensim.scripts.make_wiki_lemma`
  - `gensim.scripts.make_wiki_online`
  - `gensim.scripts.make_wiki_online_lemma`
  - `gensim.scripts.make_wiki_online_nodebug`
  - `gensim.scripts.make_wiki` (all of these obsoleted by the new native `gensim.scripts.segment_wiki` implementation)
  - "deprecated" functions and attributes
- Move
  - `gensim.scripts.make_wikicorpus` ➡ `gensim.scripts.make_wiki.py`
  - `gensim.summarization` ➡ `gensim.models.summarization`
  - `gensim.topic_coherence` ➡ `gensim.models._coherence`
  - `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
  - `gensim.parsing.*` ➡ `gensim.utils.text_utils`