chardet/CHANGELOG and chardet Releases

Changelog History

v4.0.0 Changes
December 10, 2020
🚀 ⚠️ This will be the last release of chardet to support Python 2.7. chardet 5.0 will only support 3.6+ ⚠️

Major Changes

🚀 This release is multiple years in the making, and provides some quality of life improvements to chardet. The primary user-facing changes are:

1. Single-byte charset probers now use nested dictionaries under the hood, so they are usually a little faster than before. (See #121 for details)
2. The CharsetGroupProber class now properly short-circuits when one of the probers in the group is considered a definite match. This led to a substantial speedup.
3. There is now a chardet.detect_all function that returns a list of possible encodings for the input with associated confidences.
4. We have dropped support for Python 2.6, 3.4, and 3.5 as they are all past end-of-life.

🚀 The changes in this release have also laid the groundwork for retraining the models to make them more accurate, and to support some more encodings/languages (see #99 for progress). This is our main focus for chardet 5.0 (beyond dropping Python 2 support).
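
As a usage sketch for the chardet.detect_all function mentioned above: the helper name ranked_guesses and the sample text are illustrative assumptions, not part of chardet. The import is guarded so the snippet still runs where chardet is not installed, in which case it simply returns an empty list.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # lets the sketch run (with no output) without chardet

# Illustrative sample: Russian text encoded as windows-1251.
sample = "Привет, мир! Это пример текста для определения кодировки.".encode(
    "windows-1251"
)


def ranked_guesses(data):
    """Return chardet's candidate encodings, most confident first.

    ranked_guesses is a hypothetical helper, not a chardet function.
    """
    if chardet is None:
        return []
    candidates = chardet.detect_all(data)
    return sorted(candidates, key=lambda c: c["confidence"], reverse=True)


for guess in ranked_guesses(sample):
    print(guess["encoding"], guess["language"], guess["confidence"])
```

Each entry is a dict in the same shape that chardet.detect returns, so existing code that reads "encoding" and "confidence" should be able to consume the list one item at a time.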

Benchmarks

⚙ Running on a MacBook Pro (15-inch, 2018) with 2.2GHz 6-core i7 processor and 32GB RAM

old version (chardet 3.0.4)
```
Benchmarking chardet 3.0.4 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42)
[Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 25559.439366240098
big5: 7.187002209518091
cp932: 4.71090956645177
cp949: 2.937256786994428
euc-jp: 4.870580412090848
euc-kr: 6.6910755971933416
euc-tw: 87.71098043480079
gb2312: 6.614302607154443
ibm855: 27.595893549680685
ibm866: 29.93483661732791
iso-2022-jp: 3379.5052775763434
iso-2022-kr: 26181.67290886392
iso-8859-1: 120.63424740403983
iso-8859-5: 32.65106262196898
iso-8859-7: 62.480089080556084
koi8-r: 13.72481001727257
maccyrillic: 33.018537255804496
shift_jis: 4.996013583677438
tis-620: 14.323112928341818
utf-16: 166771.53081510935
utf-32: 198782.18009478672
utf-8: 13.966236809766901
utf-8-sig: 193732.28637413395
windows-1251: 23.038910006925768
windows-1252: 99.48409117053738 
windows-1255: 6.336261495718825

Total time: 357.05358052253723s (10.054513372323958 calls per second)
```
🆕 new version (chardet 4.0.0)
```
Benchmarking chardet 4.0.0 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42)
[Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
.......................................................................................................................................................................................................................................................................................................................................................................
Calls per second for each encoding:
ascii: 38176.31067961165
big5: 12.86915132656389
cp932: 4.656400877065864
cp949: 7.282976434315926
euc-jp: 4.329381447610525
euc-kr: 8.16386823884839
euc-tw: 90.230745070368
gb2312: 14.248865889128146
ibm855: 33.30225548069821
ibm866: 44.181691968506
iso-2022-jp: 3024.2295767539117
iso-2022-kr: 25055.57945041816
iso-8859-1: 59.25262902122995
iso-8859-5: 39.7069713674529
iso-8859-7: 61.008422013862194
koi8-r: 41.21560517643845
maccyrillic: 31.402474369805002
shift_jis: 4.9091652743515155
tis-620: 14.408875278821073
utf-16: 177349.00634249471
utf-32: 186413.51111111112
utf-8: 108.62174360115105
utf-8-sig: 181965.46637744035
windows-1251: 43.16933400329809
windows-1252: 211.27653358317968
windows-1255: 16.15113643694104

Total time: 268.0230791568756s (13.394368915143872 calls per second)
```
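
To put the two runs side by side, the short script below computes speedup ratios from figures copied out of the benchmark output above. Only four encodings are included to keep it brief; that selection is an editorial choice, not part of the original benchmark script.

```python
# Total times (seconds) copied from the two benchmark runs above.
old_total_s = 357.05358052253723
new_total_s = 268.0230791568756

# Calls per second copied from the runs above.
calls_per_second = {
    # encoding: (chardet 3.0.4, chardet 4.0.0)
    "utf-8": (13.966236809766901, 108.62174360115105),
    "koi8-r": (13.72481001727257, 41.21560517643845),
    "windows-1252": (99.48409117053738, 211.27653358317968),
    "iso-8859-1": (120.63424740403983, 59.25262902122995),
}

overall_speedup = old_total_s / new_total_s
print(f"overall: {overall_speedup:.2f}x")

# A ratio above 1 means 4.0.0 handled more calls per second than 3.0.4.
ratios = {
    encoding: new_rate / old_rate
    for encoding, (old_rate, new_rate) in calls_per_second.items()
}
for encoding, ratio in ratios.items():
    print(f"{encoding}: {ratio:.2f}x")
```

On these numbers the full run is roughly a third faster overall and utf-8 detection improves by more than 7x. Not every encoding improved, though: iso-8859-1 throughput roughly halved between the two runs.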
🚀 Thank you to @aaaxx, @edumco, @hrnciar, @hroncok, @jdufresne, @mdamien, @saintamh, and @xeor for submitting pull requests, and to all of our users for being patient with how long this release has taken.

Full changelog
- Convert single-byte charset probers to use nested dicts for language models (#121) @dan-blanchard
- ➕ Add API option to get all the encodings confidence (#111) @mdamien
- 👉 Make sure pyc files are not in tarballs (d7c7343) @dan-blanchard
- ➕ Add benchmark script (d702545, 8dccd00, 726973e, 71a0fad) @dan-blanchard
- 📦 Include license file in the generated wheel package (#141) @jdufresne
- ⬇️ Drop support for Python 2.6 (#143) @jdufresne
- ✂ Remove unused coverage configuration (#142) @jdufresne
- 📦 Doc the chardet package suitable for production (#144) @jdufresne
- Pass python_requires argument to setuptools (#150) @jdufresne
- ⚡️ Update pypi.python.org URL to pypi.org (#155) @jdufresne
- Typo fix (#159) @saintamh
- 👌 Support pytest 4, don't apply marks directly to parameters (PR #174, Issue #173) @hroncok
- ✅ Test Python 3.7 and 3.8 and document support (#175) @jdufresne
- ⬇️ Drop support for end-of-life Python 3.4 (#181) @jdufresne
- ↪ Workaround for distutils bug in python 2.7 (#165) @xeor
- ✂ Remove deprecated license_file from setup.cfg (#182) @jdufresne
- ✂ Remove deprecated 'sudo: false' from Travis configuration (#200) @jdufresne
- ➕ Add testing for Python 3.9 (#201) @jdufresne
- ➕ Adds explicit os and distro definitions (#140) @edumco
- ✂ Remove shebang from nonexecutable script (#192) @hrnciar
- ✂ Remove use of deprecated 'setup.py test' (#187) @jdufresne
- ✂ Remove unnecessary numeric placeholders from format strings (#176) @jdufresne
- ⚡️ Update links (#152) @aaaxx
- ✂ Remove shebang and executable bit from chardet/cli/chardetect.py (#171) @jdufresne
- 🌲 Handle weird logging edge case in universaldetector.py (056a2a4) @dan-blanchard
- Switch from Travis to GitHub Actions (#204) @dan-blanchard
- Properly set CharsetGroupProber.state to FOUND_IT (PR #203, Issue #202) @dan-blanchard
- ➕ Add language to detect_all output (1e208b7) @dan-blanchard
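
The CharsetGroupProber.state fix listed above (PR #203) is what enables the short-circuit behaviour. The sketch below shows the general idea only: ProbingState, StubProber, and GroupProber are hypothetical stand-ins defined here and are not chardet's own classes or its actual implementation.

```python
from enum import Enum


class ProbingState(Enum):
    DETECTING = 0
    FOUND_IT = 1
    NOT_ME = 2


class StubProber:
    """Hypothetical prober that reports a fixed state whenever it is fed."""

    def __init__(self, name, state):
        self.name = name
        self._state = state
        self.calls = 0

    def feed(self, data):
        self.calls += 1
        return self._state


class GroupProber:
    """Feeds each child prober in turn and stops at the first definite match."""

    def __init__(self, probers):
        self.probers = probers
        self.state = ProbingState.DETECTING
        self.best = None

    def feed(self, data):
        for prober in self.probers:
            if prober.feed(data) is ProbingState.FOUND_IT:
                # Record the match on the group itself so callers can stop too.
                self.best = prober
                self.state = ProbingState.FOUND_IT
                break
        return self.state


probers = [
    StubProber("a", ProbingState.DETECTING),
    StubProber("b", ProbingState.FOUND_IT),
    StubProber("c", ProbingState.DETECTING),
]
group = GroupProber(probers)
group.feed(b"example bytes")
print(group.state.name, group.best.name, [p.calls for p in probers])
```

Because the group stops at prober "b", prober "c" is never fed. Skipping the remaining probers once one of them reports a definite match is where the time saving comes from.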
v3.0.4 Changes
June 08, 2017
🛠 This minor bugfix release just fixes some packaging and documentation issues:
- 🛠 Fix issue with setup.py where pytest_runner was always being installed. (PR #119, thanks @zmedico)
- ✅ Make sure test.py is included in the manifest (PR #118, thanks @zmedico)
- 🛠 Fix a bunch of old URLs in the README and other docs. (PRs #123 and #129, thanks @qfan and @jdufresne)
- 📚 Update documentation to no longer imply we test/support Python 3 versions before 3.3 (PR #130, thanks @jdufresne)
v3.0.3 Changes
May 16, 2017
🚀 This release fixes a crash when debugging logging was enabled. (Issue #115, PRs #117 and #125)
v3.0.2 Changes
April 12, 2017
🛠 Fixes an issue where detect would sometimes return None instead of a dict with the keys encoding, language, and confidence (Issue #113, PR #114).
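
For callers that still need to tolerate the older behaviour, a defensive wrapper along these lines is one option. safe_detect and EXPECTED_KEYS are hypothetical names introduced for this sketch, and the import is guarded so it runs even where chardet is not installed.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None

EXPECTED_KEYS = {"encoding", "language", "confidence"}


def safe_detect(data):
    """Call chardet.detect and always return a dict with the expected keys.

    safe_detect is a hypothetical wrapper, not part of chardet.
    """
    result = chardet.detect(data) if chardet is not None else None
    if result is None:
        # Covers the pre-3.0.2 None return and the missing-package case.
        result = {"encoding": None, "language": None, "confidence": 0.0}
    return result


result = safe_detect(b"")
print(sorted(result))
```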
v3.0.1 Changes
April 11, 2017
🛠 This bugfix release fixes a crash in the EUC-TW prober when it encountered certain strings (Issue #67).
v3.0.0 Changes
April 11, 2017
🚀 This release is long overdue, but still mostly serves as a placeholder for the impending 4.0.0 release, which will have retrained models for better accuracy. For now, this release will get the following improvements up on PyPI:
- ➕ Added support for Turkish ISO-8859-9 detection (PR #41, thanks @queeup)
- Commented out large unused sections of Big5 and EUC-KR tables to save memory (8bc4b89)
- ✂ Removed Python 3.2 from testing, but added 3.4 - 3.6
- Ensure that stdin is open with mode 'rb' for chardetect CLI. (PR #38, thanks @lpsinger)
- 🛠 Fixed chardetect crash with non-ascii file names (PR #39, thanks @nkanaev)
- Made naming conventions more Pythonic throughout (no more mTypicalPositiveRatio, and instead typical_positive_ratio)
- ✅ Modernized test scripts and infrastructure so we've got Travis testing and all that stuff
- Rename filter_without_english_words to filter_international_words and make it match current Mozilla implementation (PR #44, thanks @rsnair2)
- Updated filter_english_letters to match C implementation (c665459)
- 🏁 Temporarily disabled Hungarian ISO-8859-2 and Windows-1250 detection because it is very inaccurate (da6c0a0)
- 👍 Allow CLI sub-package to be importable (PR #55)
- ➕ Add a hypothesis-based test (PR #66, thanks @DRMacIver)
- Strip endianness from UTF with BOM predictions so that the encoding can be passed directly to bytes.decode() (PR #73, thanks @snoack)
- 🛠 Fixed broken links in docs (PR #90, thanks @roskakori)
- ➕ Added early exit to chardetect when encoding is detected instead of looping through entire file (PR #103, thanks @jpz)
- 🐎 Use bytearray objects internally instead of wrap_ord calls, which provides a nice performance boost across the board (PR #106)
- ➕ Add language property to probers and UniversalDetector results (PR #180)
- 🏗 Mark the 5 known test failures as such so we can have more useful Travis build results in the meantime (d588407)
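
The BOM item in the list above (PR #73) is easier to see with a small standard-library example. It only demonstrates how Python's codecs treat a byte order mark, which is why an endianness-free encoding name is convenient to pass to bytes.decode(); it does not reproduce chardet's internals.

```python
import codecs

text = "chardet"
# Little-endian UTF-16 bytes with a byte order mark in front.
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The generic "utf-16" codec reads the BOM to pick the byte order and drops it.
generic = data.decode("utf-16")
print(repr(generic))

# An endian-specific codec keeps the BOM as a leading U+FEFF character.
specific = data.decode("utf-16-le")
print(repr(specific))
```

With the generic codec the decoded string equals the original text, while the endian-specific codec leaves a stray U+FEFF at the start that the caller would have to strip.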
v2.3.0 Changes
October 07, 2014
🚀 In this release, we:
- ➕ Added support for CP932 detection (thanks to @hashy).
- 🛠 Fixed an issue where UTF-8 with a BOM would not be detected as UTF-8-SIG (#8).
- 📜 Modified chardetect to use argparse for argument parsing.
- 🚚 Moved docs to a gh-pages branch. You can now access them at http://chardet.github.io.
v2.2.1 Changes
October 21, 2014
🛠 Fix missing paren in chardetect.py
v2.2.0 Changes
October 21, 2014
🔀 First version after merger with charade. Loads of little changes.
v1.1
July 26, 2012
