piskvorky/gensim 4.1.2 on GitHub

4.1.2, 2021-09-17

This is a bugfix release that addresses leftover compatibility issues with older versions of NumPy and macOS.

4.1.1, 2021-09-14

This is a bugfix release that addresses compatibility issues with older versions of NumPy.

4.1.0, 2021-08-15

Gensim 4.1 brings two major new functionalities:

Ensemble LDA for robust training, selection and comparison of LDA models.
FastSS module for super fast Levenshtein "fuzzy search" queries. Used e.g. for "soft term similarity" calculations.
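To make "fuzzy search" concrete, here is a minimal brute-force sketch of a Levenshtein query over a small vocabulary. It is not the FastSS algorithm and does not use Gensim's API; it only shows the kind of query that FastSS is designed to speed up. The function names `levenshtein` and `fuzzy_search` are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def fuzzy_search(query: str, vocabulary, max_dist: int = 1):
    """Return all vocabulary words within max_dist edits of query, sorted."""
    return sorted(w for w in vocabulary if levenshtein(query, w) <= max_dist)
```

This baseline compares the query against every word in the vocabulary, so its cost grows linearly with vocabulary size. Avoiding that full scan is the motivation for a dedicated index such as FastSS.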

There are several minor changes that are not backwards compatible with previous versions of Gensim.
The affected functionality is rarely used and the changes are unlikely to affect most users, so we have opted not to require a major version bump.
Nevertheless, we describe them below.

Improved parameter edge-case handling in KeyedVectors most_similar and most_similar_cosmul methods

We now handle both positive and negative keyword parameters consistently.
They may now be either:

A string, in which case the value is reinterpreted as a list of one element (the string value)
A vector, in which case the value is reinterpreted as a list of one element (the vector)
A list of strings
A list of vectors

So you can now simply do:

    model.most_similar(positive='war', negative='peace')

instead of the slightly more involved

    model.most_similar(positive=['war'], negative=['peace'])

Both invocations remain correct, so you can use whichever is most convenient.
If you were somehow expecting gensim to interpret the strings as a list of characters, e.g.

    model.most_similar(positive=['w', 'a', 'r'], negative=['p', 'e', 'a', 'c', 'e'])

then you will need to specify the lists explicitly in gensim 4.1.
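The argument handling described above can be sketched in plain Python. This is a minimal illustration and not Gensim's actual implementation: the helper name `as_key_list` is hypothetical, and a "vector" is approximated here as a flat sequence of numbers rather than a NumPy array.

```python
from numbers import Real


def as_key_list(value):
    """Normalize a positive/negative-style argument to a list of keys or vectors."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]  # a single string key, never a list of characters
    items = list(value)
    if items and all(isinstance(x, Real) for x in items):
        return [items]  # a flat sequence of numbers is treated as one vector
    return items  # already a list of keys or a list of vectors
```

Under these assumptions, `as_key_list('war')` and `as_key_list(['war'])` both give `['war']`, which is why the two invocations shown earlier are equivalent.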

Removed the obsolete `steps` parameter from doc2vec

With the newer version, do this:

    model.infer_vector(..., epochs=123)

instead of this:

    model.infer_vector(..., steps=123)
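If older code builds its keyword arguments dynamically, a small shim can rename the legacy keyword before the call. The sketch below is one possible approach, not part of Gensim: the helper name `translate_infer_kwargs` is hypothetical.

```python
import warnings


def translate_infer_kwargs(kwargs):
    """Return a copy of kwargs with a legacy `steps` key renamed to `epochs`."""
    updated = dict(kwargs)  # do not mutate the caller's dict
    if "steps" in updated:
        if "epochs" in updated:
            raise TypeError("pass either `steps` or `epochs`, not both")
        warnings.warn(
            "`steps` is no longer accepted; use `epochs` instead",
            DeprecationWarning,
            stacklevel=2,
        )
        updated["epochs"] = updated.pop("steps")
    return updated
```

A call site could then read `model.infer_vector(doc_words, **translate_infer_kwargs(old_kwargs))`, where `model`, `doc_words` and `old_kwargs` stand in for your own objects.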

Plus a large number of smaller improvements and fixes, as usual.

⚠️ If migrating from old Gensim 3.x, read the Migration guide first.

👍 New features

#3169: Implement shrink_windows argument for Word2Vec, by @M-Demay
#3163: Optimize word mover distance (WMD) computation, by @flowlight0
#3157: New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary, by @Witiko
#3153: Vectorize word2vec.predict_output_word for speed, by @M-Demay
#3146: Use FastSS for fast kNN over Levenshtein distance, by @Witiko
#3128: Materialize and copy the corpus passed to SoftCosineSimilarity, by @Witiko
#3115: Make LSI dispatcher CLI param for number of jobs optional, by @robguinness
#3091: LsiModel: Only log top words that actually exist in the dictionary, by @kmurphy4
#2980: Added EnsembleLda for stable LDA topics, by @sezanzeb
#2978: Optimize performance of Author-Topic model, by @horpto
#3000: Tidy up KeyedVectors.most_similar() API, by @simonwiles

📚 Tutorials and docs

#3155: Correct parameter name in documentation of fasttext.py, by @bizzyvinci
#3148: Fix broken link to mycorpus.txt in documentation, by @rohit901
#3142: Use more permanent pdf link and update code link, by @dymil
#3141: Update link for online LDA paper, by @dymil
#3133: Update link to Hoffman paper (online VB LDA), by @jonaschn
#3129: [MRG] Add bronze sponsor: TechTarget, by @piskvorky
#3126: Fix typos in make_wiki_online.py and make_wikicorpus.py, by @nicolasassi
#3125: Improve & unify docs for dirichlet priors, by @jonaschn
#3123: Fix hyperlink for doc2vec tutorial, by @AdityaSoni19031997
#3121: [MRG] Add bronze sponsor: eaccidents.com, by @piskvorky
#3120: Fix URL for ldamodel.py, by @jonaschn
#3118: Fix URL in doc string, by @jonaschn
#3107: Draw attention to sponsoring in README, by @piskvorky
#3105: Fix documentation links: Travis to Github Actions, by @piskvorky
#3057: Clarify doc comment in LdaModel.inference(), by @yocen
#2964: Document that preprocessing.strip_punctuation is limited to ASCII, by @sciatro

🔴 Bug fixes

#3178: Fix Unicode string incompatibility in gensim.similarities.fastss.editdist, by @Witiko
#3174: Fix loading Phraser models stored in Gensim 3.x into Gensim 4.0, by @emgucv
#3136: Fix indexing error in word2vec_inner.pyx, by @bluekura
#3131: Add missing import to NMF docs and models/__init__.py, by @properGrammar
#3116: Fix bug where saved Phrases model did not load its connector_words, by @aloknayak29
#2830: Fixed KeyError in coherence model, by @pietrotrope

⚠️ Removed functionality & deprecations

#3176: Eliminate obsolete step parameter from doc2vec infer_vector and similarity_unseen_docs, by @rock420
#2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro
#3180: Move preprocessing functions from gensim.corpora.textcorpus and gensim.corpora.lowcorpus to gensim.parsing.preprocessing, by @rock420

🔮 Testing, CI, housekeeping

#3156: Update Numpy minimum version to 1.17.0, by @PrimozGodec
#3143: Replace _mul function with explicit casts, by @mpenkov
#2952: Allow newer versions of the Morfessor module for the tests, by @pabs3