Changelog History

  • v4.0.0 Changes

    December 10, 2020

    This will be the last release of chardet to support Python 2.7. chardet 5.0 will only support Python 3.6+.

    Major Changes

    This release is multiple years in the making and provides some quality-of-life improvements to chardet. The primary user-facing changes are:

    1. Single-byte charset probers now use nested dictionaries under the hood, so they are usually a little faster than before. (See #121 for details.)
    2. The CharsetGroupProber class now properly short-circuits when one of the probers in the group is considered a definite match. This led to a substantial speedup.
    3. There is now a chardet.detect_all function that returns a list of possible encodings for the input with associated confidences.
    4. We have dropped support for Python 2.6, 3.4, and 3.5, as they are all past end-of-life.
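As a rough illustration of the new chardet.detect_all function, the sketch below ranks the candidate encodings for a byte string. The helper name rank_encodings is hypothetical and not part of chardet. It assumes each candidate is a dict with encoding and confidence keys, as chardet.detect results are, and it returns an empty list when chardet 4.0.0 or later is not installed.

```python
try:
    import chardet  # third-party package; detect_all needs chardet >= 4.0.0
except ImportError:
    chardet = None


def rank_encodings(data):
    """Hypothetical helper: (encoding, confidence) pairs, most confident first."""
    if chardet is None or not hasattr(chardet, "detect_all"):
        return []  # chardet is missing or older than 4.0.0
    candidates = chardet.detect_all(data)
    pairs = [
        (c["encoding"], c["confidence"])
        for c in candidates
        if c.get("encoding") is not None  # skip "no guess" placeholder results
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sample = "Привет, мир! Это проверка определения кодировки.".encode("windows-1251")
    for encoding, confidence in rank_encodings(sample):
        print(f"{encoding}: {confidence:.2f}")
```

Seeing several candidates with their confidences can help when the top guess is only marginally ahead of the next one.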

    The changes in this release have also laid the groundwork for retraining the models to make them more accurate, and to support some more encodings/languages (see #99 for progress). This is our main focus for chardet 5.0 (beyond dropping Python 2 support).

    Benchmarks

    Running on a MacBook Pro (15-inch, 2018) with a 2.2 GHz 6-core i7 processor and 32 GB RAM:

    old version (chardet 3.0.4)
    Benchmarking chardet 3.0.4 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42)
    [Clang 11.0.3 (clang-1103.0.32.62)]
    --------------------------------------------------------------------------------
    Calls per second for each encoding:
    ascii: 25559.439366240098
    big5: 7.187002209518091
    cp932: 4.71090956645177
    cp949: 2.937256786994428
    euc-jp: 4.870580412090848
    euc-kr: 6.6910755971933416
    euc-tw: 87.71098043480079
    gb2312: 6.614302607154443
    ibm855: 27.595893549680685
    ibm866: 29.93483661732791
    iso-2022-jp: 3379.5052775763434
    iso-2022-kr: 26181.67290886392
    iso-8859-1: 120.63424740403983
    iso-8859-5: 32.65106262196898
    iso-8859-7: 62.480089080556084
    koi8-r: 13.72481001727257
    maccyrillic: 33.018537255804496
    shift_jis: 4.996013583677438
    tis-620: 14.323112928341818
    utf-16: 166771.53081510935
    utf-32: 198782.18009478672
    utf-8: 13.966236809766901
    utf-8-sig: 193732.28637413395
    windows-1251: 23.038910006925768
    windows-1252: 99.48409117053738 
    windows-1255: 6.336261495718825
    
    Total time: 357.05358052253723s (10.054513372323958 calls per second)
    
    new version (chardet 4.0.0)
    Benchmarking chardet 4.0.0 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42)
    [Clang 11.0.3 (clang-1103.0.32.62)]
    --------------------------------------------------------------------------------
    .......................................................................................................................................................................................................................................................................................................................................................................
    Calls per second for each encoding:
    ascii: 38176.31067961165
    big5: 12.86915132656389
    cp932: 4.656400877065864
    cp949: 7.282976434315926
    euc-jp: 4.329381447610525
    euc-kr: 8.16386823884839
    euc-tw: 90.230745070368
    gb2312: 14.248865889128146
    ibm855: 33.30225548069821
    ibm866: 44.181691968506
    iso-2022-jp: 3024.2295767539117
    iso-2022-kr: 25055.57945041816
    iso-8859-1: 59.25262902122995
    iso-8859-5: 39.7069713674529
    iso-8859-7: 61.008422013862194
    koi8-r: 41.21560517643845
    maccyrillic: 31.402474369805002
    shift_jis: 4.9091652743515155
    tis-620: 14.408875278821073
    utf-16: 177349.00634249471
    utf-32: 186413.51111111112
    utf-8: 108.62174360115105
    utf-8-sig: 181965.46637744035
    windows-1251: 43.16933400329809
    windows-1252: 211.27653358317968
    windows-1255: 16.15113643694104
    
    Total time: 268.0230791568756s (13.394368915143872 calls per second)
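
To put the two runs side by side, the short calculation below derives speedup ratios from a few of the figures reported above. The selection of encodings and the helper name speedup are choices made for this illustration only; the numbers are copied (rounded) from the benchmark output.

```python
# Calls per second copied from the benchmark output above (rounded).
OLD = {"utf-8": 13.966, "koi8-r": 13.725, "windows-1252": 99.484, "iso-8859-1": 120.634}
NEW = {"utf-8": 108.622, "koi8-r": 41.216, "windows-1252": 211.277, "iso-8859-1": 59.253}

OLD_TOTAL_SECONDS = 357.054
NEW_TOTAL_SECONDS = 268.023


def speedup(old_rate, new_rate):
    """Ratio of new to old throughput; values below 1.0 mean a slowdown."""
    return new_rate / old_rate


if __name__ == "__main__":
    for name in OLD:
        print(f"{name}: {speedup(OLD[name], NEW[name]):.2f}x")
    # Less total time is better, so the ratio is old time over new time.
    print(f"overall: {OLD_TOTAL_SECONDS / NEW_TOTAL_SECONDS:.2f}x")
```

On these figures the overall run is roughly a third faster, with UTF-8 detection improving the most, while a few encodings such as iso-8859-1 were slower in this particular run.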
    

    Thank you to @aaaxx, @edumco, @hrnciar, @hroncok, @jdufresne, @mdamien, @saintamh, and @xeor for submitting pull requests, and to all of our users for being patient with how long this release has taken.

    Full changelog

  • v3.0.4 Changes

    June 08, 2017

    This minor bugfix release just fixes some packaging and documentation issues:

    • Fix issue with setup.py where pytest_runner was always being installed. (PR #119, thanks @zmedico)
    • Make sure test.py is included in the manifest (PR #118, thanks @zmedico)
    • Fix a bunch of old URLs in the README and other docs. (PRs #123 and #129, thanks @qfan and @jdufresne)
    • Update documentation to no longer imply we test/support Python 3 versions before 3.3 (PR #130, thanks @jdufresne)
  • v3.0.3 Changes

    May 16, 2017

    This release fixes a crash when debug logging was enabled. (Issue #115, PRs #117 and #125)

  • v3.0.2 Changes

    April 12, 2017

    Fixes an issue where detect would sometimes return None instead of a dict with the keys encoding, language, and confidence (Issue #113, PR #114).
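
Because detect now consistently returns a dict, callers can index the result directly, though the encoding value itself may still be None when nothing could be detected. A minimal sketch of defensive use follows; the helper name decode_with_fallback and the UTF-8 fallback are assumptions made for this example, not chardet behavior.

```python
try:
    import chardet  # third-party package
except ImportError:
    chardet = None


def decode_with_fallback(data, fallback="utf-8"):
    """Hypothetical helper: decode bytes using chardet's guess, else a fallback."""
    encoding = None
    if chardet is not None:
        result = chardet.detect(data)  # dict with encoding and confidence keys
        encoding = result["encoding"]  # may be None if no encoding was detected
    try:
        return data.decode(encoding or fallback, errors="replace")
    except LookupError:
        # The detected name is not a codec Python knows; use the fallback.
        return data.decode(fallback, errors="replace")


if __name__ == "__main__":
    print(decode_with_fallback(b"plain ascii text"))
```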

  • v3.0.1 Changes

    April 11, 2017

    This bugfix release fixes a crash in the EUC-TW prober when it encountered certain strings (Issue #67).

  • v3.0.0 Changes

    April 11, 2017

    This release is long overdue, but still mostly serves as a placeholder for the impending 4.0.0 release, which will have retrained models for better accuracy. For now, this release will get the following improvements up on PyPI:

    • Added support for Turkish ISO-8859-9 detection (PR #41, thanks @queeup)
    • Commented out large unused sections of Big5 and EUC-KR tables to save memory (8bc4b89)
    • Removed Python 3.2 from testing, and added 3.4 through 3.6
    • Ensure that stdin is open with mode 'rb' for chardetect CLI. (PR #38, thanks @lpsinger)
    • Fixed chardetect crash with non-ASCII file names (PR #39, thanks @nkanaev)
    • Made naming conventions more Pythonic throughout (no more mTypicalPositiveRatio, and instead typical_positive_ratio)
    • Modernized test scripts and infrastructure so we've got Travis testing and all that stuff
    • Renamed filter_without_english_words to filter_international_words and made it match the current Mozilla implementation (PR #44, thanks @rsnair2)
    • Updated filter_english_letters to match C implementation (c665459)
    • Temporarily disabled Hungarian ISO-8859-2 and Windows-1250 detection because it is very inaccurate (da6c0a0)
    • Allow CLI sub-package to be importable (PR #55)
    • Added a Hypothesis-based test (PR #66, thanks @DRMacIver)
    • Strip endianness from UTF with BOM predictions so that the encoding can be passed directly to bytes.decode() (PR #73, thanks @snoack)
    • Fixed broken links in docs (PR #90, thanks @roskakori)
    • Added early exit to chardetect when encoding is detected instead of looping through entire file (PR #103, thanks @jpz)
    • Use bytearray objects internally instead of wrap_ord calls, which provides a nice performance boost across the board (PR #106)
    • Added language property to probers and UniversalDetector results (PR #180)
    • Mark the 5 known test failures as such so we can have more useful Travis build results in the meantime (d588407)
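
The point of stripping endianness from BOM-based predictions (PR #73) can be seen with the standard library alone: Python's generic utf-16 codec reads the byte order mark and removes it, while an endianness-specific codec keeps it as a leading U+FEFF character. The sample text below is an assumption for illustration.

```python
import codecs

# UTF-16 little-endian data that starts with a byte order mark (BOM).
data = codecs.BOM_UTF16_LE + "chardet".encode("utf-16-le")

# The generic codec uses the BOM to pick the byte order and drops it.
generic = data.decode("utf-16")

# The endianness-specific codec leaves the BOM in the text as U+FEFF.
specific = data.decode("utf-16-le")

print(repr(generic))
print(repr(specific))
```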
  • v2.3.0 Changes

    October 07, 2014

    In this release, we:

    • Added support for CP932 detection (thanks to @hashy).
    • Fixed an issue where UTF-8 with a BOM would not be detected as UTF-8-SIG (#8).
    • Modified chardetect to use argparse for argument parsing.
    • Moved docs to a gh-pages branch. You can now access them at http://chardet.github.io.
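
The distinction between UTF-8 and UTF-8-SIG matters when decoding, as this standard-library sketch shows: only the utf-8-sig codec removes a leading byte order mark. The sample bytes are made up for illustration.

```python
import codecs

# UTF-8 text prefixed with the UTF-8 byte order mark (EF BB BF).
data = codecs.BOM_UTF8 + "naïve".encode("utf-8")

with_sig = data.decode("utf-8-sig")  # BOM is consumed by the codec
plain = data.decode("utf-8")         # BOM survives as a leading U+FEFF

print(repr(with_sig))
print(repr(plain))
```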
  • v2.2.1 Changes

    October 21, 2014

    Fix missing paren in chardetect.py

  • v2.2.0 Changes

    October 21, 2014

    First version after merger with charade. Loads of little changes.

  • v1.1

    July 26, 2012