3.6.0, 2018-09-20
🌟 New features
- File-based training for `*2Vec` models (@persiyanov, #2127 & #2078 & #2048)

  New training mode for `*2Vec` models (word2vec, doc2vec, fasttext) that allows model training to scale linearly with the number of cores (full GIL elimination). The result of our Google Summer of Code 2018 project by Dmitry Persiyanov.

  Benchmark on the full English Wikipedia, Intel(R) Xeon(R) CPU @ 2.30GHz, 32 cores (GCE cloud), MKL BLAS:
  | Model | Queue-based version [sec] | File-based version [sec] | Speed-up | Accuracy (queue-based) | Accuracy (file-based) |
  |----------|-------|-------|-------|-----------------|-----------------|
  | Word2Vec | 9230 | 2437 | 3.79x | 0.754 (± 0.003) | 0.750 (± 0.001) |
  | Doc2Vec | 18264 | 2889 | 6.32x | 0.721 (± 0.002) | 0.683 (± 0.003) |
  | FastText | 16361 | 10625 | 1.54x | 0.642 (± 0.002) | 0.660 (± 0.001) |

  Usage:
  ```python
  import gensim.downloader as api
  from multiprocessing import cpu_count
  from gensim.utils import save_as_line_sentence
  from gensim.test.utils import get_tmpfile
  from gensim.models import Word2Vec, Doc2Vec, FastText

  # Convert any corpus to the needed format: 1 document per line, words delimited by " "
  corpus = api.load("text8")
  corpus_fname = get_tmpfile("text8-file-sentence.txt")
  save_as_line_sentence(corpus, corpus_fname)

  # Choose the number of cores you want to use (let's use all; models scale linearly now!)
  num_cores = cpu_count()

  # Train models using all cores
  w2v_model = Word2Vec(corpus_file=corpus_fname, workers=num_cores)
  d2v_model = Doc2Vec(corpus_file=corpus_fname, workers=num_cores)
  ft_model = FastText(corpus_file=corpus_fname, workers=num_cores)
  ```
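  The `corpus_file` path can also be used with the usual explicit `build_vocab()` / `train()` flow. A minimal sketch, assuming (as the docs for this release suggest) that both methods accept `corpus_file` and that `corpus_count` / `corpus_total_words` are populated by `build_vocab()`; it reuses `corpus_fname` and `num_cores` from the snippet above:

  ```python
  from gensim.models import Word2Vec

  # Two-step, file-based training: build the vocabulary first, then train.
  model = Word2Vec(size=100, workers=num_cores)
  model.build_vocab(corpus_file=corpus_fname)
  model.train(
      corpus_file=corpus_fname,
      epochs=model.epochs,
      total_examples=model.corpus_count,
      total_words=model.corpus_total_words,
  )
  print(model.wv.most_similar("king")[:3])  # sanity check on the trained vectors
  ```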
👍 Improvements
- Add scikit-learn wrapper for `FastText` (@mcemilg, #2178)
- Add multiprocessing support for `BM25` (@Shiki-H, #2146; see the sketch after this list)
- Add `name_only` option for the downloader API (@aneesh-joshi, #2143; see the sketch after this list)
- Make the `word2vec2tensor` script compatible with `python3` (@vsocrates, #2147)
- Add custom filter for `WikiCorpus` (@mattilyra, #2089)
- Make `similarity_matrix` support non-contiguous dictionaries (@Witiko, #2047)
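A quick, hedged illustration of two of the items above, the downloader `name_only` option and the multiprocessing support in `BM25` (the parameter names `name_only` and `n_jobs`, and the shape of the returned values, are assumptions based on these notes rather than a guaranteed API):

```python
import gensim.downloader as api
from gensim.summarization.bm25 import get_bm25_weights

# Downloader: list the available corpora/models by name only, skipping full metadata
# (name_only flag and returned dict layout assumed).
catalogue = api.info(name_only=True)
print(catalogue["corpora"][:5])
print(catalogue["models"][:5])

# BM25: score a small tokenized corpus, fanning work out to all available cores
# (n_jobs assumed to be the new multiprocessing switch, with -1 meaning "all cores").
corpus = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "minors", "trees"],
]
weights = get_bm25_weights(corpus, n_jobs=-1)
print(weights[0])  # BM25 scores of document 0 against every document in the corpus
```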
🔴 Bug fixes
- Fix memory consumption in `AuthorTopicModel` (@philipphager, #2122)
- Correctly process empty documents in `AuthorTopicModel` (@probinso, #2133)
- Fix `ZeroDivisionError` in `keywords` with short input (@LShostenko, #2154)
- Fix `min_count` handling in phrases detection using `npmi_scorer` (@lopusz, #2072)
- Remove duplicate count from `Phraser` log message (@robguinness, #2151)
- Replace `np.integer` -> `np.int` in `AuthorTopicModel` (@menshikh-iv, #2145)
📚 Tutorial and doc improvements
- Update docstring with new analogy evaluation method (@akutuzov, #2130)
- Improve `prune_at` parameter description for `gensim.corpora.Dictionary` (@yxonic, #2128)
- Fix `default` -> `auto` prior parameter in documentation for LDA-related models (@Laubeee, #2156)
- Use heading instead of bold style in `gensim.models.translation_matrix` (@nzw0301, #2164)
- Fix quoting of the vocabulary in `gensim.models.Word2Vec` docs (@nzw0301, #2161)
- Replace deprecated parameters with new ones in the docstring of `gensim.models.Doc2Vec` (@xuhdev, #2165)
- Fix formula in Mallet documentation (@Laubeee, #2186)
- Fix minor semantic issue in docs for `Phrases` (@RunHorst, #2148)
- Fix typo in documentation (@KenjiOhtsuka, #2157)
- Additional documentation fixes (@piskvorky, #2121)
⚠️ Deprecations (will be removed in the next major release)
- Remove
  - `gensim.models.wrappers.fasttext` (obsoleted by the new native `gensim.models.fasttext` implementation)
  - `gensim.examples`
  - `gensim.nosy`
  - `gensim.scripts.word2vec_standalone`
  - `gensim.scripts.make_wiki_lemma`
  - `gensim.scripts.make_wiki_online`
  - `gensim.scripts.make_wiki_online_lemma`
  - `gensim.scripts.make_wiki_online_nodebug`
  - `gensim.scripts.make_wiki` (all of these obsoleted by the new native `gensim.scripts.segment_wiki` implementation; see the migration sketch after this list)
  - "deprecated" functions and attributes
- Move
  - `gensim.scripts.make_wikicorpus` ➡ `gensim.scripts.make_wiki.py`
  - `gensim.summarization` ➡ `gensim.models.summarization`
  - `gensim.topic_coherence` ➡ `gensim.models._coherence`
  - `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
  - `gensim.parsing.*` ➡ `gensim.utils.text_utils`
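For anyone migrating off the removed `make_wiki*` scripts, a hedged sketch of the `gensim.scripts.segment_wiki` replacement workflow; the CLI flags and JSON field names below are recalled from the script's documentation and may differ slightly, so check `python -m gensim.scripts.segment_wiki -h`:

```python
# Step 1 (shell, flags assumed): convert a Wikipedia XML dump into one JSON article per line, e.g.
#   python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 -o enwiki-latest.json.gz

# Step 2 (Python): iterate over the resulting articles.
import json
from smart_open import smart_open

with smart_open("enwiki-latest.json.gz") as fin:
    for line in fin:
        article = json.loads(line)
        # Each article carries a title plus parallel lists of section titles and texts
        # (field names assumed from the segment_wiki docs).
        print(article["title"], len(article["section_texts"]))
        break
```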