3.7.2, 2019-04-06
🌟 New Features
-
gensim.models.fasttext.load_facebook_model
function: load full model (slower, more CPU/memory intensive, supports training continuation)>>> from gensim.test.utils import datapath >>> >>> cap_path = datapath("crime-and-punishment.bin") >>> fb_model = load_facebook_model(cap_path) >>> >>> 'landlord' in fb_model.wv.vocab # Word is out of vocabulary False >>> oov_term = fb_model.wv['landlord'] >>> >>> 'landlady' in fb_model.wv.vocab # Word is in the vocabulary True >>> iv_term = fb_model.wv['landlady'] >>> >>> new_sent = [['lord', 'of', 'the', 'rings'], ['lord', 'of', 'the', 'flies']] >>> fb_model.build_vocab(new_sent, update=True) >>> fb_model.train(sentences=new_sent, total_examples=len(new_sent), epochs=5)
-
gensim.models.fasttext.load_facebook_vectors
function: load embeddings only (faster, less CPU/memory usage, does not support training continuation)>>> fbkv = load_facebook_vectors(cap_path) >>> >>> 'landlord' in fbkv.vocab # Word is out of vocabulary False >>> oov_vector = fbkv['landlord'] >>> >>> 'landlady' in fbkv.vocab # Word is in the vocabulary True >>> iv_vector = fbkv['landlady']
🔴 Bug fixes
- Fix unicode error when loading FastText vocabulary (@mpenkov, #2390)
- Avoid division by zero in fasttext_inner.pyx (@mpenkov, #2404)
- Avoid incorrect filename inference when loading model (@mpenkov, #2408)
- Handle invalid unicode when loading native FastText models (@mpenkov, #2411)
- Avoid divide by zero when calculating vectors for terms with no ngrams (@mpenkov, #2411)
📚 Tutorial and doc improvements
- Add link to bindr (rogueleaderr, #2387)
👍 Improvements
⚠️ Changes in FastText behavior
Out-of-vocab word handling
To achieve consistency with the reference implementation from Facebook,
a FastText
model will now always report any word, out-of-vocabulary or
not, as being in the model, and always return some vector for any word
looked-up. Specifically:
'any_word' in ft_model
will always returnTrue
. Previously, it
returnedTrue
only if the full word was in the vocabulary. (To test if a
full word is in the known vocabulary, you can consult thewv.vocab
property:'any_word' in ft_model.wv.vocab
will returnFalse
if the full
word wasn't learned during model training.)ft_model['any_word']
will always return a vector. Previously, it
raisedKeyError
for OOV words when the model had no vectors
for any ngrams of the word.- If no ngrams from the term are present in the model,
or when no ngrams could be extracted from the term, a vector pointing
to the origin will be returned. Previously, a vector of NaN (not a number)
was returned as a consequence of a divide-by-zero problem. - Models may use more more memory, or take longer for word-vector
lookup, especially after training on smaller corpuses where the previous
non-compliant behavior discarded some ngrams from consideration.
Loading models in Facebook .bin format
The gensim.models.FastText.load_fasttext_format
function (deprecated) now loads the entire model contained in the .bin file, including the shallow neural network that enables training continuation.
Loading this NN requires more CPU and RAM than previously required.
Since this function is deprecated, consider using one of its alternatives (see below).
Furthermore, you must now pass the full path to the file to load, including the file extension.
Previously, if you specified a model path that ends with anything other than .bin, the code automatically appended .bin to the path before loading the model.
This behavior was confusing, so we removed it.
⚠️ Deprecations (will be removed in the next major release)
-
Remove
gensim.models.FastText.load_fasttext_format
: use load_facebook_vectors to load embeddings only (faster, less CPU/memory usage, does not support training continuation) and load_facebook_model to load full model (slower, more CPU/memory intensive, supports training continuation)gensim.models.wrappers.fasttext
(obsoleted by the new nativegensim.models.fasttext
implementation)gensim.examples
gensim.nosy
gensim.scripts.word2vec_standalone
gensim.scripts.make_wiki_lemma
gensim.scripts.make_wiki_online
gensim.scripts.make_wiki_online_lemma
gensim.scripts.make_wiki_online_nodebug
gensim.scripts.make_wiki
(all of these obsoleted by the new nativegensim.scripts.segment_wiki
implementation)- "deprecated" functions and attributes
-
Move
gensim.scripts.make_wikicorpus
➡gensim.scripts.make_wiki.py
gensim.summarization
➡gensim.models.summarization
gensim.topic_coherence
➡gensim.models._coherence
gensim.utils
➡gensim.utils.utils
(old imports will continue to work)gensim.parsing.*
➡gensim.utils.text_utils