gensim v4.1.0 Release Notes

Release Date: 2021-08-15
  • Gensim 4.1 brings two major new functionalities:

    There are several minor changes that are not backwards compatible with previous versions of Gensim. The affected functionality is rarely used and unlikely to affect most users, so we have opted not to require a major version bump. Nevertheless, we describe the changes below.

    Improved parameter edge-case handling in KeyedVectors most_similar and most_similar_cosmul methods

    We now handle both positive and negative keyword parameters consistently. They may now be either:

    1. A string, in which case the value is reinterpreted as a list of one element (the string value)
    2. A vector, in which case the value is reinterpreted as a list of one element (the vector)
    3. A list of strings
    4. A list of vectors

    So you can now simply do:

        model.most_similar(positive='war', negative='peace')
    

    instead of the slightly more involved

    model.most_similar(positive=['war'], negative=['peace'])
    

    Both invocations remain correct, so you can use whichever is most convenient. If you were somehow expecting gensim to interpret the strings as a list of characters, e.g.

    model.most_similar(positive=['w', 'a', 'r'], negative=['p', 'e', 'a', 'c', 'e'])
    

    then you will need to specify the lists explicitly in gensim 4.1.
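
    As a rough illustration of the rule above, the normalization can be sketched in plain Python. The helper name ensure_list is hypothetical and is not presented as gensim's internal code; the sketch also assumes a vector is a flat sequence of numbers, whereas gensim itself works with NumPy arrays.

```python
from numbers import Real


def ensure_list(value):
    """Normalize a most_similar-style argument into a list of keys or vectors.

    Hypothetical helper for illustration only; not gensim's own code.
    """
    if value is None:
        return []
    if isinstance(value, str):
        # A bare string is one key, not a sequence of characters.
        return [value]
    items = list(value)
    if items and all(isinstance(x, Real) for x in items):
        # Assumption: a flat sequence of numbers is a single vector.
        return [items]
    # Otherwise it is already a list of keys or a list of vectors.
    return items
```

    With this sketch, ensure_list('war') gives ['war'], while a caller who wants individual characters has to pass ['w', 'a', 'r'] explicitly.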

    Deprecated obsolete steps parameter from doc2vec

    With the newer version, do this:

    model.infer_vector(..., epochs=123)
    

    instead of this:

    model.infer_vector(..., steps=123)
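
    For code that still builds its keyword arguments with the old name, a small shim can ease the migration. The helper below, migrate_infer_vector_kwargs, is a hypothetical name and not part of gensim; it only renames a steps entry to epochs before the call is made.

```python
import warnings


def migrate_infer_vector_kwargs(kwargs):
    """Return a copy of kwargs with a legacy 'steps' entry renamed to 'epochs'.

    Hypothetical migration helper for illustration; not part of gensim.
    """
    updated = dict(kwargs)
    if "steps" in updated:
        if "epochs" in updated:
            raise TypeError("pass either 'steps' or 'epochs', not both")
        warnings.warn(
            "'steps' is obsolete; use 'epochs' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Carry the old value over under the new name.
        updated["epochs"] = updated.pop("steps")
    return updated
```

    It could then be used as model.infer_vector(doc_words, **migrate_infer_vector_kwargs(old_kwargs)), where doc_words and old_kwargs are placeholders for your own data.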
    

    Plus a large number of smaller improvements and fixes, as usual.

    If migrating from old Gensim 3.x, read the Migration guide first.

    :+1: New features

    • #3169: Implement shrink_windows argument for Word2Vec, by @M-Demay
    • #3163: Optimize word mover distance (WMD) computation, by @flowlight0
    • #3157: New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary, by @Witiko
    • #3153: Vectorize word2vec.predict_output_word for speed, by @M-Demay
    • #3146: Use FastSS for fast kNN over Levenshtein distance, by @Witiko
    • #3128: Materialize and copy the corpus passed to SoftCosineSimilarity, by @Witiko
    • #3115: Make LSI dispatcher CLI param for number of jobs optional, by @robguinness
    • #3091: LsiModel: Only log top words that actually exist in the dictionary, by @kmurphy4
    • #2980: Added EnsembleLda for stable LDA topics, by @sezanzeb
    • #2978: Optimize performance of Author-Topic model, by @horpto
    • #3000: Tidy up KeyedVectors.most_similar() API, by @simonwiles
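
    To illustrate what the shrink_windows argument from #3169 controls, here is a plain-Python sketch, not gensim's implementation. It assumes that with shrink_windows=True the effective window is drawn uniformly from 1 to window for each target word, and that otherwise the full window is always used; context_indices and its parameters are hypothetical names.

```python
import random


def context_indices(position, sentence_length, window, shrink_windows=True, rng=random):
    """Return the indices of the context words around one target position.

    Illustrative sketch only; names and structure are assumptions.
    """
    # Either sample a reduced window for this target word or use the full one.
    effective = rng.randint(1, window) if shrink_windows else window
    start = max(0, position - effective)
    end = min(sentence_length, position + effective + 1)
    # The target word itself is not part of its own context.
    return [i for i in range(start, end) if i != position]
```

    Under this assumption, nearby words are picked as context more often than distant ones when shrinking is enabled, while a fixed window treats every word within range the same way.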

    :books: Tutorials and docs

    :red_circle: Bug fixes

    • #3178: Fix Unicode string incompatibility in gensim.similarities.fastss.editdist, by @Witiko
    • #3174: Fix loading Phraser models stored in Gensim 3.x into Gensim 4.0, by @emgucv
    • #3136: Fix indexing error in word2vec_inner.pyx, by @bluekura
    • #3131: Add missing import to NMF docs and models/__init__.py, by @properGrammar
    • #3116: Fix bug where saved Phrases model did not load its connector_words, by @aloknayak29
    • #2830: Fixed KeyError in coherence model, by @pietrotrope

    :warning: Removed functionality & deprecations

    • #3176: Eliminate obsolete step parameter from doc2vec infer_vector and similarity_unseen_docs, by @rock420
    • #2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro
    • #3180: Move preprocessing functions from gensim.corpora.textcorpus and gensim.corpora.lowcorpus to gensim.parsing.preprocessing, by @rock420

    🔮 Testing, CI, housekeeping

    • #3156: Update Numpy minimum version to 1.17.0, by @PrimozGodec
    • #3143: replace _mul function with explicit casts, by @mpenkov
    • #2952: Allow newer versions of the Morfessor module for the tests, by @pabs3