github piskvorky/gensim 3.1.0


3.1.0, 2017-11-06

🌟 New features:

  • Massive optimizations to LSI model training (@isamaru, #1620 & #1622)

    • LSI model allows use of single precision (float32), to consume 40% less memory while being 40% faster.

    • LSI model can now also accept CSC matrix as input, for further memory and speed boost.

    • Overall, if your entire corpus fits in RAM: 3x faster LSI training (SVD) in 4x less memory!

      import numpy as np
      import gensim
      from gensim.models import LsiModel
      
      # just an example; the corpus stream is up to you
      streaming_corpus = gensim.corpora.MmCorpus("my_tfidf_corpus.mm.gz")
      
      # convert your corpus to a CSC sparse matrix (assumes the entire corpus fits in RAM)
      in_memory_csc_matrix = gensim.matutils.corpus2csc(streaming_corpus, dtype=np.float32)
      
      # then pass the CSC matrix to LsiModel directly
      model = LsiModel(corpus=in_memory_csc_matrix, num_topics=500, dtype=np.float32)
    • Even if you continue to use streaming corpora (your training dataset is too large for RAM), you should see significantly faster processing times and a lower memory footprint. In our experiments with a very large LSI model, we saw a drop from 29 GB peak RAM and 38 minutes (before) to 19 GB peak RAM and 26 minutes (now):

      model = LsiModel(corpus=streaming_corpus, num_topics=500, dtype=np.float32)
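    The float32 saving is easy to see on the sparse data itself. Here is a minimal sketch using plain scipy (illustration only, not gensim code; gensim's corpus2csc handles the actual corpus conversion): switching the nonzero values to float32 halves their storage, while the integer index arrays of the CSC matrix are unchanged, which is why the overall saving is around 40% rather than a full 50%.

    ```python
    import numpy as np
    from scipy.sparse import random as sparse_random

    # Build one random CSC matrix, then view its data at both precisions.
    # (Illustration only -- not gensim code.)
    csc64 = sparse_random(10000, 2000, density=0.01, format="csc",
                          dtype=np.float64, random_state=42)
    csc32 = csc64.astype(np.float32)

    # float32 halves the nonzero-value storage;
    # the indices/indptr arrays stay the same size.
    print("float64 data bytes:", csc64.data.nbytes)
    print("float32 data bytes:", csc32.data.nbytes)
    ```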
  • Add common terms to Phrases. Fix #1258 (@alexgarel, #1568)

    • Phrases can now treat a list of "common terms" (e.g. stop words) as connectors inside bigrams, inspired by Elasticsearch's common grams token filter. Previously, to reveal phrases like car_with_driver and car_without_driver, you could either remove stop words before processing (but then you would only find car_driver), or keep them and find neither form: those phrases span three words, and the high frequency of with prevents them from being scored correctly.

      from gensim.models.phrases import Phrases
      from nltk.corpus import stopwords
      
      phr_old = Phrases(corpus)
      phr_new = Phrases(corpus, common_terms=stopwords.words('english'))
      
      print(phr_old[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car_with", "driver"]
      print(phr_new[["we", "provide", "car", "with", "driver"]])  # ["we", "provide", "car_with_driver"]
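    To make the mechanics concrete, here is a simplified, self-contained sketch of the common-terms idea (a toy illustration, not gensim's actual scoring or API): a frequent connector word sitting between two content words lets the whole span be matched as one phrase instead of being broken at the connector.

    ```python
    # Simplified sketch of the common-terms idea; NOT gensim's implementation.
    COMMON_TERMS = {"with", "without", "of", "the"}

    def join_phrases(tokens, known_phrases):
        """Greedily join token spans found in `known_phrases`, allowing
        common terms (connectors) between the first and last word."""
        out, i = [], 0
        while i < len(tokens):
            matched = False
            # try the longest candidate span first: word + up to 2 connectors + word
            for j in range(min(len(tokens), i + 4), i + 1, -1):
                span = tokens[i:j]
                inner = span[1:-1]
                if all(t in COMMON_TERMS for t in inner) and "_".join(span) in known_phrases:
                    out.append("_".join(span))
                    i = j
                    matched = True
                    break
            if not matched:
                out.append(tokens[i])
                i += 1
        return out

    print(join_phrases(["we", "provide", "car", "with", "driver"],
                       {"car_with_driver"}))
    # -> ['we', 'provide', 'car_with_driver']
    ```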
  • New segment_wiki.py script (@menshikh-iv, #1483 & #1694)

    • CLI script for processing a raw Wikipedia dump (the xml.bz2 format provided by MediaWiki) to extract its articles in plain text. It extracts each article's title, section titles and section texts, and saves them in JSON Lines format:

      python -m gensim.scripts.segment_wiki -f enwiki-latest-pages-articles.xml.bz2 | gzip > enwiki-latest-pages-articles.json.gz

      Processing the entire English Wikipedia dump (13.5 GB) takes about 2.5 hours (i7-6700HQ, SSD).

      The output format is one article per line, serialized into JSON:

       import json
       from smart_open import smart_open
       
       for line in smart_open('enwiki-latest-pages-articles.json.gz'):  # read the file we just created
           article = json.loads(line)
           print("Article title: %s" % article['title'])
           for section_title, section_text in zip(article['section_titles'], article['section_texts']):
               print("Section title: %s" % section_title)
               print("Section text: %s" % section_text)
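      The JSON Lines layout itself needs nothing beyond the standard library; smart_open is only a convenience for transparent gzip handling and remote streams. A minimal in-memory round trip (with hypothetical sample data) shows the format:

      ```python
      import gzip
      import io
      import json

      # One JSON object per line, gzip-compressed -- the same layout
      # segment_wiki emits (sample data is made up for illustration).
      articles = [{
          "title": "Example article",
          "section_titles": ["Introduction"],
          "section_texts": ["Some section text."],
      }]

      buf = io.BytesIO()
      with gzip.open(buf, "wt", encoding="utf-8") as fout:
          for article in articles:
              fout.write(json.dumps(article) + "\n")

      buf.seek(0)
      with gzip.open(buf, "rt", encoding="utf-8") as fin:
          loaded = [json.loads(line) for line in fin]

      print(loaded[0]["title"])
      ```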

👍 Improvements:

🔴 Bug fixes:

📚 Tutorial and doc improvements:

⚠️ Deprecations (will come into force in the next major release)

  • Remove

    • gensim.examples
    • gensim.nosy
    • gensim.scripts.word2vec_standalone
    • gensim.scripts.make_wiki_lemma
    • gensim.scripts.make_wiki_online
    • gensim.scripts.make_wiki_online_lemma
    • gensim.scripts.make_wiki_online_nodebug
    • gensim.scripts.make_wiki
  • Move

    • gensim.scripts.make_wikicorpus → gensim.scripts.make_wiki.py
    • gensim.summarization → gensim.models.summarization
    • gensim.topic_coherence → gensim.models._coherence
    • gensim.utils → gensim.utils.utils (old imports will continue to work)
    • gensim.parsing.* → gensim.utils.text_utils

Also, we'll create an experimental subpackage for unstable models. Specific lists will be available in the next major release.
