Changelog History

  • v4.2.0 Changes

    April 29, 2022

    :+1: New features

    • #3188: Add get_sentence_vector() to FastText and get_mean_vector() to KeyedVectors, by @rock420
    • #3194: Added random_seed parameter to make LsiModel reproducible, by @parashardhapola
    • #3247: Sparse2Corpus: update getitem to work on slices, lists and ellipsis, by @PrimozGodec
    • #3264: Detect when a fasttext executable is available in PATH, by @pabs3
    • #3271: Added new ValueError in place of assertion error for no model data provided in lsi model, by @mark-todd
    • #3299: Enable test_word2vec_stand_alone_script by using sys.executable for python, by @pabs3
    • #3317: Added encoding parameter to TextDirectoryCorpus, by @Sandman-Ren
    • #2656: Streamlining most_similar_cosmul and evaluate_word_analogies, by @n3hrox

    :books: Tutorials and docs

    • #3227: Fix FastText doc-comment example for build_vocab and train to use correct argument names, by @HLasse
    • #3235: Fix TFIDF docs, by @piskvorky
    • #3257: Dictionary doc: ref FAQ entry about filter_extremes corpus migration, by @zacchiro
    • #3279: Add the FastSS and Levenshtein modules to docs, by @piskvorky
    • #3284: Documentation fixes + added CITATION.cff, by @piskvorky
    • #3289: Typos, text and code fix in LDA tutorial, by @davebulaval
    • #3301: Remove unused Jupyter screenshots, by @pabs3
    • #3307: Documentation fixes, by @piskvorky
    • #3339: Fix parsing error in FastText docs, by @MattYoon
    • #3251: Apply new convention of delimiting instance params in str function, by @menshikh-iv

    :red_circle: Bug fixes

    :warning: Removed functionality & deprecations

    🔮 Testing, CI, housekeeping

  • v4.1.2 Changes

    September 17, 2021

    This is a bugfix release that addresses leftover compatibility issues with older versions of numpy and macOS.

  • v4.1.1 Changes

    September 14, 2021

    This is a bugfix release that addresses compatibility issues with older versions of numpy.

  • v4.1.0 Changes

    August 15, 2021

    Gensim 4.1 brings two major new functionalities:

    There are several minor changes that are not backwards compatible with previous versions of Gensim. The affected functionality is rarely used and unlikely to affect most users, so we have opted not to require a major version bump. Nevertheless, we describe them below.

    Improved parameter edge-case handling in KeyedVectors most_similar and most_similar_cosmul methods

    We now handle both positive and negative keyword parameters consistently. They may now be either:

    1. A string, in which case the value is reinterpreted as a list of one element (the string value)
    2. A vector, in which case the value is reinterpreted as a list of one element (the vector)
    3. A list of strings
    4. A list of vectors

    So you can now simply do:

    model.most_similar(positive='war', negative='peace')
    

    instead of the slightly more involved

    model.most_similar(positive=['war'], negative=['peace'])
    

    Both invocations remain correct, so you can use whichever is most convenient. If you were somehow expecting gensim to interpret the strings as a list of characters, e.g.

    model.most_similar(positive=['w', 'a', 'r'], negative=['p', 'e', 'a', 'c', 'e'])
    

    then you will need to specify the lists explicitly in gensim 4.1.
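
The four accepted input shapes can be illustrated with a small normalization sketch. This is not Gensim's actual implementation: `normalize_keys` and `_is_vector` are hypothetical names, and a vector is modeled here as a plain tuple or list of numbers rather than a NumPy array.

```python
def _is_vector(value):
    """Hypothetical check: a non-empty flat sequence of numbers counts as one vector."""
    return (
        isinstance(value, (list, tuple))
        and len(value) > 0
        and all(isinstance(x, (int, float)) for x in value)
    )


def normalize_keys(value):
    """Turn a lone key or a lone vector into a one-element list.

    Lists of keys and lists of vectors pass through unchanged.
    """
    if value is None:
        return []
    if isinstance(value, str):
        # A bare string is one key, never a list of characters.
        return [value]
    if _is_vector(value):
        return [value]
    return list(value)


print(normalize_keys('war'))               # one key
print(normalize_keys(['war', 'peace']))    # list of keys
print(normalize_keys((0.1, 0.2)))          # one vector
print(normalize_keys(list('war')))         # characters must be listed explicitly
```

The key design point is the string check coming first: because a string is itself iterable, it has to be wrapped before any generic "convert to list" step, otherwise `'war'` would silently become `['w', 'a', 'r']`.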

    Deprecated obsolete steps parameter from doc2vec

    With the newer version, do this:

    model.infer_vector(..., epochs=123)
    

    instead of this:

    model.infer_vector(..., steps=123)
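
For code that still has to accept the old keyword during a migration, a thin wrapper is one option. The sketch below is an assumption, not part of Gensim: `infer_vector_compat` is a hypothetical helper that forwards a legacy `steps` value as `epochs` and emits a `DeprecationWarning`.

```python
import warnings


def infer_vector_compat(model, doc_words, epochs=None, steps=None, **kwargs):
    """Hypothetical shim: accept the obsolete `steps` keyword and pass it on as `epochs`."""
    if steps is not None:
        if epochs is not None:
            raise TypeError("pass either 'epochs' or the obsolete 'steps', not both")
        warnings.warn(
            "'steps' is obsolete; use 'epochs' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        epochs = steps
    # `model` is expected to expose infer_vector(doc_words, epochs=...),
    # as a Gensim 4.1 Doc2Vec model does.
    return model.infer_vector(doc_words, epochs=epochs, **kwargs)
```

Old call sites can then keep passing `steps=123` until they are updated, while new ones use `epochs=123` directly.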
    

    Plus a large number of smaller improvements and fixes, as usual.

    If migrating from old Gensim 3.x, read the Migration guide first.

    :+1: New features

    • #3169: Implement shrink_windows argument for Word2Vec, by @M-Demay
    • #3163: Optimize word mover distance (WMD) computation, by @flowlight0
    • #3157: New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary, by @Witiko
    • #3153: Vectorize word2vec.predict_output_word for speed, by @M-Demay
    • #3146: Use FastSS for fast kNN over Levenshtein distance, by @Witiko
    • #3128: Materialize and copy the corpus passed to SoftCosineSimilarity, by @Witiko
    • #3115: Make LSI dispatcher CLI param for number of jobs optional, by @robguinness
    • #3091: LsiModel: Only log top words that actually exist in the dictionary, by @kmurphy4
    • #2980: Added EnsembleLda for stable LDA topics, by @sezanzeb
    • #2978: Optimize performance of Author-Topic model, by @horpto
    • #3000: Tidy up KeyedVectors.most_similar() API, by @simonwiles

    :books: Tutorials and docs

    :red_circle: Bug fixes

    • #3178: Fix Unicode string incompatibility in gensim.similarities.fastss.editdist, by @Witiko
    • #3174: Fix loading Phraser models stored in Gensim 3.x into Gensim 4.0, by @emgucv
    • #3136: Fix indexing error in word2vec_inner.pyx, by @bluekura
    • #3131: Add missing import to NMF docs and models/__init__.py, by @properGrammar
    • #3116: Fix bug where saved Phrases model did not load its connector_words, by @aloknayak29
    • #2830: Fixed KeyError in coherence model, by @pietrotrope

    :warning: Removed functionality & deprecations

    • #3176: Eliminate obsolete step parameter from doc2vec infer_vector and similarity_unseen_docs, by @rock420
    • #2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro
    • #3180: Move preprocessing functions from gensim.corpora.textcorpus and gensim.corpora.lowcorpus to gensim.parsing.preprocessing, by @rock420

    🔮 Testing, CI, housekeeping

    • #3156: Update Numpy minimum version to 1.17.0, by @PrimozGodec
    • #3143: replace _mul function with explicit casts, by @mpenkov
    • #2952: Allow newer versions of the Morfessor module for the tests, by @pabs3

  • v4.0.1 Changes

    April 01, 2021

  • v4.0 Changes

    Production stability is important to Gensim, so we're improving the process of upgrading already-trained saved models. There'll be an explicit model upgrade script between each 4.n to 4.(n+1) Gensim release. Check progress here.

  • v4.0.0.rc1 Changes

    March 19, 2021

    ⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

    Gensim 4.0 is a major release with lots of performance & robustness improvements and a new website.

    Main highlights (see also 👍 Improvements below)

    a. Efficiency

    | model | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
    |----------|------------|--------|
    | fastText | 2.9h / 4.11 GB / 822k words/s | 2.3h / **1.26 GB** / 914k words/s |
    | word2vec | 1.7h / 0.36 GB / 1685k words/s | **1.2h** / 0.33 GB / 1762k words/s |
    
    In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. ([4.0 benchmarks](https://github.com/RaRe-Technologies/gensim/issues/2887#issuecomment-711097334))
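
The headline ratios can be checked directly from the table. The snippet below is only a worked-arithmetic sketch; the dictionary layout and the `improvement` function are illustrative names, and the numbers are copied from the table above.

```python
# (wall time in hours, peak RAM in GB, throughput in thousand words/s)
benchmarks = {
    "fastText": {"3.8.3": (2.9, 4.11, 822), "4.0.0": (2.3, 1.26, 914)},
    "word2vec": {"3.8.3": (1.7, 0.36, 1685), "4.0.0": (1.2, 0.33, 1762)},
}


def improvement(model):
    """Return old/new ratios for time and RAM, and new/old for throughput."""
    old_hours, old_gb, old_kwps = benchmarks[model]["3.8.3"]
    new_hours, new_gb, new_kwps = benchmarks[model]["4.0.0"]
    return {
        "wall_time_speedup": old_hours / new_hours,
        "ram_reduction": old_gb / new_gb,
        "throughput_gain": new_kwps / old_kwps,
    }


for model in benchmarks:
    ratios = improvement(model)
    print(model, {name: round(value, 2) for name, value in ratios.items()})
```

For fastText this gives a peak-RAM reduction of roughly 3.3x, which matches the "3x less RAM" claim. The "2x faster init" figure for word2vec and the phrase-detection speedup are not derivable from this table; they come from the linked benchmarks.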
    

    b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

    c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch &co.

    These improvements come to you transparently aka "for free", but see Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.

    • Dropped a bunch of externally contributed modules: summarization, pivoted TFIDF normalization, FIXME.

      • Code quality was not up to our standards. Also there was no one to maintain them, answer user questions, support these modules.

      So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork them into your own repo; they can live happily outside of Gensim.

    • Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.

      • If you still need Python 2 for some reason, stay at Gensim 3.8.3.
    • A new Gensim website – finally! 🙃

    So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.

    This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting common concrete NLP & document similarity use-cases.

    :star2: New Features

    :red_circle: Bug fixes

    • fix RuntimeError in export_phrases (change defaultdict to dict) (thalishsajeed, #3041)

    :books: Tutorial and doc improvements

    • fix various documentation warnings (mpenkov, #3077)
    • Fix broken link in run_doc how-to (sezanzeb, #2991)
    • Point WordEmbeddingSimilarityIndex documentation to gensim.similarities (Witiko, #3003)
    • Make the link to the Gensim 3.8.3 documentation dynamic (Witiko, #2996)

    :warning: Removed functionality

    🔮 Miscellaneous

  • v4.0.0.beta Changes

    October 31, 2020

    4.0.0beta, 2020-10-31

    ⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

    Main highlights

    Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:

    a. Efficiency

    | model | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
    | --- | --- | --- |
    | fastText | 2.9h / 4.11 GB / 822k words/s | 2.3h / 1.26 GB / 914k words/s |
    | word2vec | 1.7h / 0.36 GB / 1685k words/s | 1.2h / 0.33 GB / 1762k words/s |

    In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. 4.0 benchmarks.

    b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

    c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch &co.

    These improvements come to you transparently aka "for free", but see Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.

    Dropped a bunch of externally contributed modules: summarization, pivoted TFIDF normalization, wrappers for 3rd party libraries: Mallet, scikit-learn, DTM model, Vowpal Wabbit, wordrank, varembed.

    Why? Code quality was not up to our standards. Also there was no one to maintain them, answer user questions, support these modules and wrappers.

    So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork them into your own repo; they can live happily outside of Gensim, linked to as "contributed" from Gensim docs.

    Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.

    - If you still need Python 2 for some reason, stay at Gensim 3.8.3.

    A new Gensim website – finally! 🙃

    So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.

    This is the direction we'll keep going forward: less kitchen-sink of "latest academic fad", more focus on robust engineering, targeting common NLP & document similarity use-cases.

    Why a pre-release?

    This 4.0.0beta pre-release is for users who want the cutting-edge performance and bug fixes, plus users who want to help out by testing and providing feedback: code, documentation, workflows… Please let us know on the mailing list!

    Install the pre-release with:

    pip install --pre --upgrade gensim
    

    What will change between this pre-release and a "full" 4.0 release?

    Check progress here.


    ๐Ÿฑ ๐Ÿ‘ Improvements

    ๐Ÿฑ ๐Ÿ“š Tutorials and docs

    ๐Ÿฑ ๐Ÿ”ด Bug fixes

    • #2891: Fix fastText word-vectors with ngrams off, by @gojomo
    • #2907: Fix doc2vec crash for large sets of doc-vectors, by @gojomo
    • #2899: Fix similarity bug in NMSLIB indexer, by @piskvorky
    • #2899: Fix deprecation warnings in Annoy integration, by @piskvorky
    • #2901: Fix inheritance of WikiCorpus from TextCorpus, by @jenishah
    • #2940: Fix deprecations in SoftCosineSimilarity, by @Witiko
    • #2944: Fix save_facebook_model failure after update-vocab & other initialization streamlining, by @gojomo
    • #2846: Fix for Python 3.9/3.10: remove xml.etree.cElementTree, by @hugovk
    • #2973: phrases.export_phrases() doesn't yield all bigrams
    • #2942: Segfault when training doc2vec

    ๐Ÿฑ โš ๏ธ Removed functionality & deprecations

    • #6: No more binary wheels for x32 platforms, by @menshikh-iv
    • #2899: Renamed overly broad similarities.index to the more appropriate similarities.annoy, by @piskvorky
    • #2958: Remove gensim.summarization subpackage, docs and test data, by @mpenkov
    • #2926: Rename num_words to topn in dtm_coherence, by @MeganStodel
    • #2937: Remove Keras dependency, by @piskvorky
    • Removed all code, methods, attributes and functions marked as deprecated in Gensim 3.8.3.

  • v3.8.3 Changes

    May 03, 2020

    ๐Ÿฑ โš ๏ธ 3.8.x will be the last gensim version to support Py2.7. Starting with 4.0.0, gensim will only support Py3.5 and above

    3.8.3, 2020-05-03

    This is primarily a bugfix release to bring back Py2.7 compatibility to gensim 3.8.

    🔴 Bug fixes

    • Bring back Py27 support (PR #2812, @mpenkov)
    • Fix wrong version reported by setup.py (Issue #2796)
    • Fix missing C extensions (Issues #2794 and #2802)

    👍 Improvements

    📚 Tutorial and doc improvements

    ⚠️ Deprecations (will be removed in the next major release)

    Remove

    • gensim.models.FastText.load_fasttext_format: use load_facebook_vectors to load embeddings only (faster, less CPU/memory usage, does not support training continuation) and load_facebook_model to load full model (slower, more CPU/memory intensive, supports training continuation)
    • gensim.models.wrappers.fasttext (obsoleted by the new native gensim.models.fasttext implementation)
    • gensim.examples
    • gensim.nosy
    • gensim.scripts.word2vec_standalone
    • gensim.scripts.make_wiki_lemma
    • gensim.scripts.make_wiki_online
    • gensim.scripts.make_wiki_online_lemma
    • gensim.scripts.make_wiki_online_nodebug
    • gensim.scripts.make_wiki (all of these obsoleted by the new native gensim.scripts.segment_wiki implementation)

    - "deprecated" functions and attributes
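
For the first item in the removal list, the two replacement loaders can be wrapped as follows. This is a minimal sketch: the wrapper names `load_vectors_only` and `load_full_model` are hypothetical, the file path in the usage comment is a placeholder, and Gensim is imported lazily inside each function so that only the loader actually used is touched.

```python
def load_vectors_only(path):
    """Load just the embeddings from a Facebook fastText binary file.

    Faster and lighter, but the result cannot be trained further.
    """
    from gensim.models.fasttext import load_facebook_vectors
    return load_facebook_vectors(path)


def load_full_model(path):
    """Load the complete model from a Facebook fastText binary file.

    Slower and more memory-hungry, but training can be continued.
    """
    from gensim.models.fasttext import load_facebook_model
    return load_facebook_model(path)


# Usage (placeholder path, not a file shipped with Gensim):
# vectors = load_vectors_only("my_model.bin")
# model = load_full_model("my_model.bin")
```

Choosing between the two comes down to whether training continuation is needed; for pure lookup and similarity queries, the vectors-only loader is the cheaper option.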

    Move

    • gensim.scripts.make_wikicorpus ➡ gensim.scripts.make_wiki.py
    • gensim.summarization ➡ gensim.models.summarization
    • gensim.topic_coherence ➡ gensim.models._coherence
    • gensim.utils ➡ gensim.utils.utils (old imports will continue to work)
    • gensim.parsing.* ➡ gensim.utils.text_utils

  • v3.8.3-pre

    April 28, 2020