piskvorky/gensim 3.7.2 on GitHub

3.7.2, 2019-04-06

🌟 New Features

gensim.models.fasttext.load_facebook_model function: load full model (slower, more CPU/memory intensive, supports training continuation)

>>> from gensim.models.fasttext import load_facebook_model
>>> from gensim.test.utils import datapath
>>>
>>> cap_path = datapath("crime-and-punishment.bin")
>>> fb_model = load_facebook_model(cap_path)
>>>
>>> 'landlord' in fb_model.wv.vocab  # Word is out of vocabulary
False
>>> oov_term = fb_model.wv['landlord']
>>>
>>> 'landlady' in fb_model.wv.vocab  # Word is in the vocabulary
True
>>> iv_term = fb_model.wv['landlady']
>>>
>>> new_sent = [['lord', 'of', 'the', 'rings'], ['lord', 'of', 'the', 'flies']]
>>> fb_model.build_vocab(new_sent, update=True)
>>> fb_model.train(sentences=new_sent, total_examples=len(new_sent), epochs=5)

gensim.models.fasttext.load_facebook_vectors function: load embeddings only (faster, less CPU/memory usage, does not support training continuation)

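>>> from gensim.models.fasttext import load_facebook_vectors
>>>
>>> fbkv = load_facebook_vectors(cap_path)  # cap_path as defined in the example above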
>>>
>>> 'landlord' in fbkv.vocab  # Word is out of vocabulary
False
>>> oov_vector = fbkv['landlord']
>>>
>>> 'landlady' in fbkv.vocab  # Word is in the vocabulary
True
>>> iv_vector = fbkv['landlady']

🔴 Bug fixes

Fix unicode error when loading FastText vocabulary (@mpenkov, #2390)
Avoid division by zero in fasttext_inner.pyx (@mpenkov, #2404)
Avoid incorrect filename inference when loading model (@mpenkov, #2408)
Handle invalid unicode when loading native FastText models (@mpenkov, #2411)
Avoid divide by zero when calculating vectors for terms with no ngrams (@mpenkov, #2411)

📚 Tutorial and doc improvements

Add link to bindr (@rogueleaderr, #2387)

👍 Improvements

Undo the hash2index optimization (@mpenkov, #2370)

⚠️ Changes in FastText behavior

Out-of-vocab word handling

To achieve consistency with the reference implementation from Facebook,
a FastText model will now always report any word, out-of-vocabulary or
not, as being in the model, and always return some vector for any word
looked-up. Specifically:

- 'any_word' in ft_model will always return True. Previously, it returned True only if the full word was in the vocabulary. (To test whether a full word is in the known vocabulary, consult the wv.vocab property: 'any_word' in ft_model.wv.vocab will return False if the full word wasn't learned during model training.)
- ft_model['any_word'] will always return a vector. Previously, it raised KeyError for OOV words when the model had no vectors for any ngrams of the word.
- If no ngrams from the term are present in the model, or if no ngrams can be extracted from the term, an all-zero vector (the origin) will be returned. Previously, a vector of NaN (not a number) was returned as a consequence of a divide-by-zero problem.
- Models may use more memory, or take longer for word-vector lookup, especially after training on smaller corpora, where the previous non-compliant behavior discarded some ngrams from consideration.
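
The zero-vector fallback described above can be illustrated with a minimal, pure-Python sketch. This is not gensim's implementation: the helper names char_ngrams and oov_vector are hypothetical, the default ngram lengths are arbitrary, and ngram vectors are kept in a plain dict rather than in the hashed buckets a real FastText model uses. It only shows why averaging over zero ngrams needs a guard.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return character ngrams of the word wrapped in '<' and '>' markers."""
    wrapped = "<" + word + ">"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def oov_vector(word, ngram_vectors, dim, min_n=3, max_n=6):
    """Average the vectors of the word's known ngrams (hypothetical helper).

    ngram_vectors is a plain dict mapping ngram -> list of floats; this is a
    simplification made for the sketch.
    """
    found = [
        ngram_vectors[ng]
        for ng in char_ngrams(word, min_n, max_n)
        if ng in ngram_vectors
    ]
    if not found:
        # Guard against dividing by zero: return the origin instead of NaN.
        return [0.0] * dim
    return [sum(column) / len(found) for column in zip(*found)]


toy_vectors = {"<ab": [1.0, 3.0], "ab>": [3.0, 5.0]}
averaged = oov_vector("ab", toy_vectors, dim=2, min_n=3, max_n=3)
fallback = oov_vector("", toy_vectors, dim=2)  # no ngrams can be extracted
```

Without the `if not found` guard, the average would divide by a count of zero, which is the kind of problem the release notes describe as producing NaN vectors.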

Loading models in Facebook .bin format

The gensim.models.FastText.load_fasttext_format function (deprecated) now loads the entire model contained in the .bin file, including the shallow neural network that enables training continuation.
Loading this neural network requires more CPU and RAM than before.

Since this function is deprecated, consider using one of its alternatives (see below).

Furthermore, you must now pass the full path to the file to load, including the file extension.
Previously, if you specified a model path that ends with anything other than .bin, the code automatically appended .bin to the path before loading the model.
This behavior was confusing, so we removed it.
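
To make the path change concrete, here is a small sketch contrasting the removed filename inference with the current behavior. Both function names are hypothetical and neither is taken from gensim's source; they only mimic the rule described above.

```python
def legacy_model_path(path):
    """Mimic the removed behavior: append '.bin' unless the path ends with it."""
    return path if path.endswith(".bin") else path + ".bin"


def current_model_path(path):
    """Mimic the new behavior: the path is used exactly as given."""
    return path


# A path with a different extension was silently rewritten before.
old_lookup = legacy_model_path("my_model.ftz")
new_lookup = current_model_path("my_model.ftz")
```

Under the old rule, a file saved as my_model.ftz would have been looked up as my_model.ftz.bin, which is the kind of surprise that motivated the removal.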

⚠️ Deprecations (will be removed in the next major release)

Remove
- gensim.models.FastText.load_fasttext_format: use load_facebook_vectors to load embeddings only (faster, less CPU/memory usage, does not support training continuation) and load_facebook_model to load full model (slower, more CPU/memory intensive, supports training continuation)
- gensim.models.wrappers.fasttext (obsoleted by the new native gensim.models.fasttext implementation)
- gensim.examples
- gensim.nosy
- gensim.scripts.word2vec_standalone
- gensim.scripts.make_wiki_lemma
- gensim.scripts.make_wiki_online
- gensim.scripts.make_wiki_online_lemma
- gensim.scripts.make_wiki_online_nodebug
- gensim.scripts.make_wiki (all of these obsoleted by the new native gensim.scripts.segment_wiki implementation)
- "deprecated" functions and attributes
Move
- gensim.scripts.make_wikicorpus ➡ gensim.scripts.make_wiki.py
- gensim.summarization ➡ gensim.models.summarization
- gensim.topic_coherence ➡ gensim.models._coherence
- gensim.utils ➡ gensim.utils.utils (old imports will continue to work)
- gensim.parsing.* ➡ gensim.utils.text_utils