3.6.0, 2018-09-20
🌟 New features
- File-based training for `*2Vec` models (@persiyanov, #2127 & #2078 & #2048)

  New training mode for `*2Vec` models (word2vec, doc2vec, fasttext) that allows model training to scale linearly with the number of cores (full GIL elimination). The result of our Google Summer of Code 2018 project by Dmitry Persiyanov.

  Benchmark on the full English Wikipedia, Intel(R) Xeon(R) CPU @ 2.30GHz, 32 cores (GCE cloud), MKL BLAS:
  | Model | Queue-based version [sec] | File-based version [sec] | Speed-up | Accuracy (queue-based) | Accuracy (file-based) |
  |----------|-------|-------|-------|-----------------|-----------------|
  | Word2Vec | 9230 | 2437 | 3.79x | 0.754 (± 0.003) | 0.750 (± 0.001) |
  | Doc2Vec | 18264 | 2889 | 6.32x | 0.721 (± 0.002) | 0.683 (± 0.003) |
  | FastText | 16361 | 10625 | 1.54x | 0.642 (± 0.002) | 0.660 (± 0.001) |

  Usage:
  ```python
  import gensim.downloader as api
  from multiprocessing import cpu_count
  from gensim.utils import save_as_line_sentence
  from gensim.test.utils import get_tmpfile
  from gensim.models import Word2Vec, Doc2Vec, FastText

  # Convert any corpus to the needed format: 1 document per line, words delimited by " "
  corpus = api.load("text8")
  corpus_fname = get_tmpfile("text8-file-sentence.txt")
  save_as_line_sentence(corpus, corpus_fname)

  # Choose the number of cores you want to use (let's use all; models scale linearly now!)
  num_cores = cpu_count()

  # Train models using all cores
  w2v_model = Word2Vec(corpus_file=corpus_fname, workers=num_cores)
  d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=num_cores)
  ft_model = FastText(corpus_file=corpus_fname, workers=num_cores)
  ```
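  The `corpus_file` path can also be used with the usual explicit `build_vocab()` / `train()` flow. A minimal sketch, assuming (as the docs for this release suggest) that both methods accept `corpus_file` and that `corpus_count` / `corpus_total_words` are populated by `build_vocab()`; it reuses `corpus_fname` and `num_cores` from the snippet above:

  ```python
  from gensim.models import Word2Vec

  # Two-step, file-based training: build the vocabulary first, then train.
  model = Word2Vec(size=100, workers=num_cores)
  model.build_vocab(corpus_file=corpus_fname)
  model.train(
      corpus_file=corpus_fname,
      epochs=model.epochs,
      total_examples=model.corpus_count,
      total_words=model.corpus_total_words,
  )
  print(model.wv.most_similar("king")[:3])  # sanity check on the trained vectors
  ```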
👍 Improvements
- Add scikit-learn wrapper for `FastText` (@mcemilg, #2178)
- Add multiprocessing support for `BM25` (@Shiki-H, #2146; see the sketch after this list)
- Add `name_only` option for the downloader API (@aneesh-joshi, #2143; see the sketch after this list)
- Make the `word2vec2tensor` script compatible with `python3` (@vsocrates, #2147)
- Add custom filter for `WikiCorpus` (@mattilyra, #2089)
- Make `similarity_matrix` support non-contiguous dictionaries (@Witiko, #2047)
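A quick, hedged illustration of two of the items above, the downloader `name_only` option and the multiprocessing support in `BM25` (the parameter names `name_only` and `n_jobs`, and the shape of the returned values, are assumptions based on these notes rather than a guaranteed API):

```python
import gensim.downloader as api
from gensim.summarization.bm25 import get_bm25_weights

# Downloader: list the available corpora/models by name only, skipping full metadata
# (name_only flag and returned dict layout assumed).
catalogue = api.info(name_only=True)
print(catalogue["corpora"][:5])
print(catalogue["models"][:5])

# BM25: score a small tokenized corpus, fanning work out to all available cores
# (n_jobs assumed to be the new multiprocessing switch, with -1 meaning "all cores").
corpus = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "minors", "trees"],
]
weights = get_bm25_weights(corpus, n_jobs=-1)
print(weights[0])  # BM25 scores of document 0 against every document in the corpus
```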
🔴 Bug fixes
- Fix memory consumption in `AuthorTopicModel` (@philipphager, #2122)
- Correctly process empty documents in `AuthorTopicModel` (@probinso, #2133)
- Fix `ZeroDivisionError` in `keywords` with short input (@LShostenko, #2154)
- Fix `min_count` handling in phrases detection using `npmi_scorer` (@lopusz, #2072)
- Remove duplicate count from `Phraser` log message (@robguinness, #2151)
- Replace `np.integer` -> `np.int` in `AuthorTopicModel` (@menshikh-iv, #2145)
📚 Tutorial and doc improvements
- Update docstring with new analogy evaluation method (@akutuzov, #2130)
- Improve `prune_at` parameter description for `gensim.corpora.Dictionary` (@yxonic, #2128)
- Fix `default` -> `auto` prior parameter in documentation for LDA-related models (@Laubeee, #2156)
- Use heading instead of bold style in `gensim.models.translation_matrix` (@nzw0301, #2164)
- Fix quoting of the vocabulary in `gensim.models.Word2Vec` docs (@nzw0301, #2161)
- Replace deprecated parameters with new ones in the docstring of `gensim.models.Doc2Vec` (@xuhdev, #2165)
- Fix formula in Mallet documentation (@Laubeee, #2186)
- Fix minor semantic issue in docs for `Phrases` (@RunHorst, #2148)
- Fix typo in documentation (@KenjiOhtsuka, #2157)
- Additional documentation fixes (@piskvorky, #2121)
⚠️ Deprecations (will be removed in the next major release)
- Remove
  - `gensim.models.wrappers.fasttext` (obsoleted by the new native `gensim.models.fasttext` implementation)
  - `gensim.examples`
  - `gensim.nosy`
  - `gensim.scripts.word2vec_standalone`
  - `gensim.scripts.make_wiki_lemma`
  - `gensim.scripts.make_wiki_online`
  - `gensim.scripts.make_wiki_online_lemma`
  - `gensim.scripts.make_wiki_online_nodebug`
  - `gensim.scripts.make_wiki` (all of these obsoleted by the new native `gensim.scripts.segment_wiki` implementation; see the migration sketch after this list)
  - "deprecated" functions and attributes
- Move
  - `gensim.scripts.make_wikicorpus` ➡ `gensim.scripts.make_wiki.py`
  - `gensim.summarization` ➡ `gensim.models.summarization`
  - `gensim.topic_coherence` ➡ `gensim.models._coherence`
  - `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
  - `gensim.parsing.*` ➡ `gensim.utils.text_utils`
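For anyone migrating off the removed `make_wiki*` scripts, a hedged sketch of the `gensim.scripts.segment_wiki` replacement workflow; the CLI flags and JSON field names below are recalled from the script's documentation and may differ slightly, so check `python -m gensim.scripts.segment_wiki -h`:

```python
# Step 1 (shell, flags assumed): convert a Wikipedia XML dump into one JSON article per line, e.g.
#   python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz

# Step 2 (Python): iterate over the resulting articles.
import json
from smart_open import smart_open

with smart_open("enwiki-latest.json.gz") as fin:
    for line in fin:
        article = json.loads(line)
        # Each article carries a title plus parallel lists of section titles and texts
        # (field names assumed from the segment_wiki docs).
        print(article["title"], len(article["section_texts"]))
        break
```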