Changelog History
-
v4.0.0 Changes
December 10, 2020
This will be the last release of chardet to support Python 2.7. chardet 5.0 will only support Python 3.6+.
Major Changes
This release is multiple years in the making, and provides some quality of life improvements to chardet. The primary user-facing changes are:
1. Single-byte charset probers now use nested dictionaries under the hood, so they are usually a little faster than before. (See #121 for details)
2. The CharsetGroupProber class now properly short-circuits when one of the probers in the group is considered a definite match. This led to a substantial speedup.
3. There is now a chardet.detect_all function that returns a list of possible encodings for the input with associated confidences.
4. We have dropped support for Python 2.6, 3.4, and 3.5 as they are all past end-of-life.
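As a rough illustration of the new function, the sketch below wraps chardet.detect_all and keeps only the stronger candidates. It assumes chardet 4.0.0 or later is installed and that each result is a dict with 'encoding' and 'confidence' keys. The helper names filter_candidates and candidate_encodings, and the 0.5 cutoff, are hypothetical and not part of chardet.

```python
from typing import Dict, List


def filter_candidates(results: List[Dict], min_confidence: float = 0.5) -> List[Dict]:
    """Keep candidates with a known encoding and enough confidence, best first.

    `results` is assumed to look like the output of chardet.detect_all():
    a list of dicts with 'encoding' and 'confidence' keys.
    """
    usable = [
        r for r in results
        if r.get("encoding") is not None and r.get("confidence", 0.0) >= min_confidence
    ]
    return sorted(usable, key=lambda r: r["confidence"], reverse=True)


def candidate_encodings(data: bytes, min_confidence: float = 0.5) -> List[Dict]:
    """Hypothetical helper: run chardet.detect_all() and filter its output."""
    import chardet  # third-party; detect_all() requires chardet >= 4.0.0

    return filter_candidates(chardet.detect_all(data), min_confidence)
```

Calling candidate_encodings(raw_bytes) would then return the filtered list, which may be empty when no candidate clears the cutoff, so callers should be prepared to fall back to a default encoding.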
The changes in this release have also laid the groundwork for retraining the models to make them more accurate, and to support some more encodings/languages (see #99 for progress). This is our main focus for chardet 5.0 (beyond dropping Python 2 support).
Benchmarks
Running on a MacBook Pro (15-inch, 2018) with a 2.2GHz 6-core i7 processor and 32GB RAM
old version (chardet 3.0.4)
Benchmarking chardet 3.0.4 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42)
[Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 25559.439366240098
big5: 7.187002209518091
cp932: 4.71090956645177
cp949: 2.937256786994428
euc-jp: 4.870580412090848
euc-kr: 6.6910755971933416
euc-tw: 87.71098043480079
gb2312: 6.614302607154443
ibm855: 27.595893549680685
ibm866: 29.93483661732791
iso-2022-jp: 3379.5052775763434
iso-2022-kr: 26181.67290886392
iso-8859-1: 120.63424740403983
iso-8859-5: 32.65106262196898
iso-8859-7: 62.480089080556084
koi8-r: 13.72481001727257
maccyrillic: 33.018537255804496
shift_jis: 4.996013583677438
tis-620: 14.323112928341818
utf-16: 166771.53081510935
utf-32: 198782.18009478672
utf-8: 13.966236809766901
utf-8-sig: 193732.28637413395
windows-1251: 23.038910006925768
windows-1252: 99.48409117053738
windows-1255: 6.336261495718825
Total time: 357.05358052253723s (10.054513372323958 calls per second)
new version (chardet 4.0.0)
Benchmarking chardet 4.0.0 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42)
[Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 38176.31067961165
big5: 12.86915132656389
cp932: 4.656400877065864
cp949: 7.282976434315926
euc-jp: 4.329381447610525
euc-kr: 8.16386823884839
euc-tw: 90.230745070368
gb2312: 14.248865889128146
ibm855: 33.30225548069821
ibm866: 44.181691968506
iso-2022-jp: 3024.2295767539117
iso-2022-kr: 25055.57945041816
iso-8859-1: 59.25262902122995
iso-8859-5: 39.7069713674529
iso-8859-7: 61.008422013862194
koi8-r: 41.21560517643845
maccyrillic: 31.402474369805002
shift_jis: 4.9091652743515155
tis-620: 14.408875278821073
utf-16: 177349.00634249471
utf-32: 186413.51111111112
utf-8: 108.62174360115105
utf-8-sig: 181965.46637744035
windows-1251: 43.16933400329809
windows-1252: 211.27653358317968
windows-1255: 16.15113643694104
Total time: 268.0230791568756s (13.394368915143872 calls per second)
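To put the two runs side by side, the short calculation below turns a few of the reported figures into ratios. The numbers are copied from the benchmark output above; the choice of which encodings to compare and the helper name ratio are only for illustration.

```python
# Figures copied from the chardet 3.0.4 and 4.0.0 benchmark runs above.
OLD = {
    "total_seconds": 357.05358052253723,
    "utf-8": 13.966236809766901,        # calls per second
    "iso-8859-1": 120.63424740403983,   # calls per second
}
NEW = {
    "total_seconds": 268.0230791568756,
    "utf-8": 108.62174360115105,        # calls per second
    "iso-8859-1": 59.25262902122995,    # calls per second
}


def ratio(numerator: float, denominator: float) -> float:
    """Plain quotient, kept as a function so each comparison reads clearly."""
    return numerator / denominator


# Total time: lower is better, so old / new expresses the speedup.
total_speedup = ratio(OLD["total_seconds"], NEW["total_seconds"])

# Calls per second: higher is better, so new / old expresses the speedup.
utf8_speedup = ratio(NEW["utf-8"], OLD["utf-8"])
latin1_speedup = ratio(NEW["iso-8859-1"], OLD["iso-8859-1"])

for label, value in [
    ("total run", total_speedup),
    ("utf-8", utf8_speedup),
    ("iso-8859-1", latin1_speedup),
]:
    print(f"{label}: {value:.2f}x")
```

On these figures the whole run is roughly 1.33x faster and utf-8 detection is close to 7.8x faster, while iso-8859-1 ran at about half its previous rate in this particular benchmark. This is a single run on one machine, so the per-encoding ratios are indicative only.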
Thank you to @aaaxx, @edumco, @hrnciar, @hroncok, @jdufresne, @mdamien, @saintamh, and @xeor for submitting pull requests, and to all of our users for being patient with how long this release has taken.
Full changelog
- Convert single-byte charset probers to use nested dicts for language models (#121) @dan-blanchard
- Add API option to get all the encodings confidence (#111) @mdamien
- Make sure pyc files are not in tarballs (d7c7343) @dan-blanchard
- Add benchmark script (d702545, 8dccd00, 726973e, 71a0fad) @dan-blanchard
- Include license file in the generated wheel package (#141) @jdufresne
- Drop support for Python 2.6 (#143) @jdufresne
- Remove unused coverage configuration (#142) @jdufresne
- Doc the chardet package suitable for production (#144) @jdufresne
- Pass python_requires argument to setuptools (#150) @jdufresne
- Update pypi.python.org URL to pypi.org (#155) @jdufresne
- Typo fix (#159) @saintamh
- Support pytest 4, don't apply marks directly to parameters (PR #174, Issue #173) @hroncok
- Test Python 3.7 and 3.8 and document support (#175) @jdufresne
- Drop support for end-of-life Python 3.4 (#181) @jdufresne
- Workaround for distutils bug in python 2.7 (#165) @xeor
- Remove deprecated license_file from setup.cfg (#182) @jdufresne
- Remove deprecated 'sudo: false' from Travis configuration (#200) @jdufresne
- Add testing for Python 3.9 (#201) @jdufresne
- Adds explicit os and distro definitions (#140) @edumco
- Remove shebang from nonexecutable script (#192) @hrnciar
- Remove use of deprecated 'setup.py test' (#187) @jdufresne
- Remove unnecessary numeric placeholders from format strings (#176) @jdufresne
- Update links (#152) @aaaxx
- Remove shebang and executable bit from chardet/cli/chardetect.py (#171) @jdufresne
- Handle weird logging edge case in universaldetector.py (056a2a4) @dan-blanchard
- Switch from Travis to GitHub Actions (#204) @dan-blanchard
- Properly set CharsetGroupProber.state to FOUND_IT (PR #203, Issue #202) @dan-blanchard
- Add language to detect_all output (1e208b7) @dan-blanchard
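The FOUND_IT fix in the list above is what makes the group prober short-circuit. The sketch below is a toy model of that idea, not chardet's actual code: the ProbingState names mirror the states mentioned in the changelog, while KeywordProber and GroupProber are hypothetical stand-ins written only to show the control flow.

```python
from enum import Enum
from typing import List, Optional


class ProbingState(Enum):
    DETECTING = 0
    FOUND_IT = 1
    NOT_ME = 2


class KeywordProber:
    """Toy prober (hypothetical): a definite match if `marker` occurs in the input."""

    def __init__(self, name: str, marker: bytes) -> None:
        self.name = name
        self.marker = marker
        self.calls = 0  # counts how often this prober was consulted

    def feed(self, data: bytes) -> ProbingState:
        self.calls += 1
        return ProbingState.FOUND_IT if self.marker in data else ProbingState.DETECTING


class GroupProber:
    """Toy group prober (hypothetical) that stops at the first definite match."""

    def __init__(self, probers: List[KeywordProber]) -> None:
        self.probers = probers
        self.state = ProbingState.DETECTING
        self.best: Optional[KeywordProber] = None

    def feed(self, data: bytes) -> ProbingState:
        if self.state is ProbingState.FOUND_IT:
            return self.state  # already decided, so later input is skipped entirely
        for prober in self.probers:
            if prober.feed(data) is ProbingState.FOUND_IT:
                self.best = prober
                self.state = ProbingState.FOUND_IT  # record the match on the group itself
                break  # do not consult the remaining probers
        return self.state
```

The point of the sketch is the two early exits: once the group's own state is set to FOUND_IT, neither the remaining probers nor any later chunks of input cost anything.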
-
v3.0.4 Changes
June 08, 2017
This minor bugfix release just fixes some packaging and documentation issues:
- Fix issue with setup.py where pytest_runner was always being installed. (PR #119, thanks @zmedico)
- Make sure test.py is included in the manifest (PR #118, thanks @zmedico)
- Fix a bunch of old URLs in the README and other docs. (PRs #123 and #129, thanks @qfan and @jdufresne)
- Update documentation to no longer imply we test/support Python 3 versions before 3.3 (PR #130, thanks @jdufresne)
-
v3.0.3 Changes
May 16, 2017
-
v3.0.2 Changes
April 12, 2017
-
v3.0.1 Changes
April 11, 2017
This bugfix release fixes a crash in the EUC-TW prober when it encountered certain strings (Issue #67).
-
v3.0.0 Changes
April 11, 2017
This release is long overdue, but still mostly serves as a placeholder for the impending 4.0.0 release, which will have retrained models for better accuracy. For now, this release will get the following improvements up on PyPI:
- Added support for Turkish ISO-8859-9 detection (PR #41, thanks @queeup)
- Commented out large unused sections of Big5 and EUC-KR tables to save memory (8bc4b89)
- Removed Python 3.2 from testing, but added 3.4 - 3.6
- Ensure that stdin is open with mode 'rb' for chardetect CLI. (PR #38, thanks @lpsinger)
- Fixed chardetect crash with non-ascii file names (PR #39, thanks @nkanaev)
- Made naming conventions more Pythonic throughout (no more mTypicalPositiveRatio, and instead typical_positive_ratio)
- Modernized test scripts and infrastructure so we've got Travis testing and all that stuff
- Rename filter_without_english_words to filter_international_words and make it match current Mozilla implementation (PR #44, thanks @rsnair2)
- Updated filter_english_letters to match C implementation (c665459)
- Temporarily disabled Hungarian ISO-8859-2 and Windows-1250 detection because it is very inaccurate (da6c0a0)
- Allow CLI sub-package to be importable (PR #55)
- Add a hypothesis-based test (PR #66, thanks @DRMacIver)
- Strip endianness from UTF with BOM predictions so that the encoding can be passed directly to bytes.decode() (PR #73, thanks @snoack)
- Fixed broken links in docs (PR #90, thanks @roskakori)
- Added early exit to chardetect when encoding is detected instead of looping through entire file (PR #103, thanks @jpz)
- Use bytearray objects internally instead of wrap_ord calls, which provides a nice performance boost across the board (PR #106)
- Add language property to probers and UniversalDetector results (PR #180)
- Mark the 5 known test failures as such so we can have more useful Travis build results in the meantime (d588407)
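The endianness change in the list above matters because of how Python's own codecs treat a byte order mark. The snippet below uses only the standard library and a made-up byte string to show the difference. It does not call chardet; it only illustrates why a generic name such as UTF-16 is more convenient to pass to bytes.decode() when the data starts with a BOM.

```python
import codecs

text = "chardet"
# Sample input (made up for this example): a little-endian BOM followed by UTF-16-LE data.
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The generic codec reads the BOM, picks the byte order from it, and drops it.
decoded_generic = data.decode("utf-16")

# An endianness-specific codec treats the BOM as ordinary data,
# so it survives in the result as the character U+FEFF.
decoded_explicit = data.decode("utf-16-le")

print(repr(decoded_generic))
print(repr(decoded_explicit))
```

With the endianness stripped from the reported name, the caller gets the first behaviour and does not have to remove a stray U+FEFF afterwards.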
-
v2.3.0 Changes
October 07, 2014
In this release, we:
- Added support for CP932 detection (thanks to @hashy).
- Fixed an issue where UTF-8 with a BOM would not be detected as UTF-8-SIG (#8).
- Modified chardetect to use argparse for argument parsing.
- Moved docs to a gh-pages branch. You can now access them at http://chardet.github.io.
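The UTF-8-SIG fix in the list above is easier to appreciate with a small standard-library example. The snippet below does not use chardet; it only shows how Python's utf-8 and utf-8-sig codecs differ on input that begins with a BOM. The sample bytes and the helper name has_utf8_bom are made up for illustration.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Hypothetical helper: True if the data starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


# Sample input (made up): a UTF-8 BOM followed by UTF-8 encoded text.
data = codecs.BOM_UTF8 + "naïve".encode("utf-8")

with_sig = data.decode("utf-8-sig")  # the BOM is consumed by the codec
plain = data.decode("utf-8")         # the BOM is kept as a leading U+FEFF

print(has_utf8_bom(data), repr(with_sig), repr(plain))
```

Reporting UTF-8-SIG for such input lets callers decode it cleanly in one step instead of stripping the leading U+FEFF by hand.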
-
v2.2.1 Changes
October 21, 2014
Fix missing paren in chardetect.py
-
v2.2.0 Changes
October 21, 2014
First version after merger with charade. Loads of little changes.
-
v1.1
July 26, 2012