Changelog History
  • v5.0.1 Changes

    March 10, 2017

    Bug fix:

    • The unescape_html fixer will decode entities between &#128; and &#159; as what they would be in Windows-1252, even without the help of fix_encoding.

    This better matches what Web browsers do, and fixes a regression that version 4.4 introduced in an example that uses &#133; as an ellipsis.
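
    The Windows-1252 reading of these numeric entities is the same rule HTML5 parsers apply. As a rough illustration using only Python's standard library (this is not ftfy's own unescape_html code), html.unescape treats entity number 133 as the character that byte 0x85 represents in Windows-1252:

```python
import html

# Byte 0x85 (decimal 133) is the horizontal ellipsis in Windows-1252.
ellipsis_from_byte = bytes([133]).decode("windows-1252")

# html.unescape follows the HTML5 rule for numeric references 128-159.
ellipsis_from_entity = html.unescape("&#133;")

print(ellipsis_from_byte == ellipsis_from_entity == "\u2026")
```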

  • v5.0 Changes

    February 17, 2017

    Breaking changes:

    • Dropped support for Python 2. If you need Python 2 support, you should get version 4.4, which has the same features as this version.

    • The top-level functions require their arguments to be given as keyword arguments.
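
    Python 3 can enforce this kind of calling convention with a bare * in the signature. The sketch below uses a hypothetical function, fix_text_sketch, with a made-up option; it only shows the mechanism and assumes that the requirement applies to the optional settings rather than to the text itself:

```python
def fix_text_sketch(text, *, normalization="NFC"):
    """Hypothetical stand-in for a top-level function with keyword-only options."""
    return (text, normalization)


# Passing the option by name works.
print(fix_text_sketch("some text", normalization="NFKC"))

# Passing it positionally is rejected by Python itself.
try:
    fix_text_sketch("some text", "NFKC")
    rejected = False
except TypeError as error:
    rejected = True
    print("rejected:", error)
```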

    Version 5.0 also now has tests for the command-line invocation of ftfy.

  • v4.4.0 Changes

    February 17, 2017

    Heuristic changes:

    • ftfy can now fix mojibake involving the Windows-1250 or ISO-8859-2 encodings.

    • The fix_entities fixer is now applied after fix_encoding. This makes more situations resolvable when both fixes are needed.

    • With a few exceptions for commonly-used characters such as ^, it is now considered "weird" whenever a diacritic appears in non-combining form, such as the diaeresis character ¨.

    • It is also now weird when IPA phonetic letters, besides ə, appear next to capital letters.

    • These changes to the heuristics, and others we've made in recent versions, let us lower the "cost" for fixing mojibake in some encodings, causing them to be fixed in more cases.
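
    To make the Windows-1250 case concrete, here is a minimal standard-library sketch of how such mojibake arises and how it is reversed once the wrong encoding is known. It does not use ftfy, which has to detect the situation heuristically; the sample word is just an illustration:

```python
original = "żółć"

# Simulate the corruption: UTF-8 bytes misread as Windows-1250.
garbled = original.encode("utf-8").decode("windows-1250")

# Undo it by reversing the two steps.
repaired = garbled.encode("windows-1250").decode("utf-8")

print(garbled)
print(repaired == original)
```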

  • v4.3.1 Changes

    January 12, 2017

    Bug fix:

    • remove_control_chars was removing U+0D ('\r') prematurely. That's the job of fix_line_breaks.
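
    The order matters because a bare '\r' can be a line ending in its own right. The following sketch uses two hypothetical helpers (not ftfy's implementations) to show a control-character pass that leaves '\r' alone so that a later line-break pass can convert it:

```python
def remove_control_chars_sketch(text):
    """Hypothetical: strip C0 control characters except tab, LF and CR."""
    keep = {"\t", "\n", "\r"}
    return "".join(ch for ch in text if ch in keep or ord(ch) >= 0x20)


def fix_line_breaks_sketch(text):
    """Hypothetical: normalise CRLF and bare CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


old_mac_text = "first line\rsecond line\x07"
cleaned = fix_line_breaks_sketch(remove_control_chars_sketch(old_mac_text))
print(repr(cleaned))
```

    Had the first pass deleted '\r', the two lines would have been merged into one before the line-break pass could see them.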

  • v4.3.0 Changes

    December 29, 2016

    ftfy has gotten by for four years without dependencies on other Python libraries, but now we can spare ourselves some code and some maintenance burden by delegating certain tasks to other libraries that already solve them well. This version now depends on the html5lib and wcwidth libraries.

    Feature changes:

    • The remove_control_chars fixer will now remove some non-ASCII control characters as well, such as deprecated Arabic control characters and byte-order marks. Bidirectional controls are still left as is.

    This should have no impact on well-formed text, while cleaning up many characters that the Unicode Consortium deems "not suitable for markup" (see Unicode Technical Report #20).

    • The unescape_html fixer uses a more thorough list of HTML entities, which it imports from html5lib.

    • ftfy.formatting now uses wcwidth to compute the width that a string will occupy in a text console.

    Heuristic changes:

    • Updated the data file of Unicode character categories to Unicode 9, as used in Python 3.6.0. (No matter what version of Python you're on, ftfy uses the same data.)

    Pending deprecations:

    • The remove_bom option will become deprecated in 5.0, because it has been superseded by remove_control_chars.

    • ftfy 5.0 will remove the previously deprecated name fix_text_encoding. It was renamed to fix_encoding in 4.0.

    • ftfy 5.0 will require Python 3.2 or later, as planned. Python 2 users, please specify ftfy < 5 in your dependencies if you haven't already.

  • v4.2.0 Changes

    September 28, 2016

    Heuristic changes:

    • Math symbols next to currency symbols are no longer considered 'weird' by the heuristic. This fixes a false positive where text that involved the multiplication sign and British pounds or euros (as in '5×£35') could turn into Hebrew letters.

    • A heuristic that used to be a bonus for certain punctuation now also gives a bonus to successfully decoding other common codepoints, such as the non-breaking space, the degree sign, and the byte order mark.

    • In version 4.0, we tried to "future-proof" the categorization of emoji (as a kind of symbol) to include codepoints that would likely be assigned to emoji later. The future happened, and there are even more emoji than we expected. We have expanded the range to include those emoji, too.

    ftfy is still mostly based on information from Unicode 8 (as Python 3.5 is), but this expanded range should include the emoji from Unicode 9 and 10.

    • Emoji are increasingly being modified by variation selectors and skin-tone modifiers. Those codepoints are now grouped with 'symbols' in ftfy, so they fit right in with emoji, instead of being considered 'marks' as their Unicode category would suggest.

    This enables fixing mojibake that involves iOS's new diverse emoji.

    • An old heuristic that wasn't necessary anymore considered Latin text with high-numbered codepoints to be 'weird', but this is normal in languages such as Vietnamese and Azerbaijani. This does not seem to have caused any false positives, but it caused ftfy to be too reluctant to fix some cases of broken text in those languages.

    The heuristic has been changed, and all languages that use Latin letters should be on even footing now.
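
    The false positive with the multiplication sign and the pound sign can be reproduced with the standard library alone. This sketch does not run ftfy; it only shows why the byte pair looked like fixable mojibake, assuming the text is re-encoded as Latin-1:

```python
text = "5\u00d7\u00a335"  # '5', multiplication sign, pound sign, '35'

# Encode as Latin-1, then decode those bytes as UTF-8.
# The bytes for the two symbols (0xD7 0xA3) form one valid UTF-8 sequence.
misfixed = text.encode("latin-1").decode("utf-8")

# The second character is now a Hebrew letter (U+05E3).
print(hex(ord(misfixed[1])))
```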

  • v4.1.1 Changes

    April 13, 2016

    Bug fix:

    • In the command-line interface, the -e option had no effect on Python 3 when using standard input. Now, it correctly lets you specify a different encoding for standard input.

  • v4.1.0 Changes

    February 25, 2016

    Heuristic changes:

    • ftfy can now deal with "lossy" mojibake. If your text has been run through a strict Windows-1252 decoder, such as the one in Python, it may contain the replacement character ๏ฟฝ (U+FFFD) where there were bytes that are unassigned in Windows-1252.

    Although ftfy won't recover the lost information, it can now detect this situation, replace the entire lossy character with ๏ฟฝ, and decode the rest of the characters. Previous versions would be unable to fix any string that contained U+FFFD.

    As an example, text in curly quotes that gets corrupted รขโ‚ฌล“ like this รขโ‚ฌ๏ฟฝ now gets fixed to be โ€œ like this ๏ฟฝ.

    • โšก๏ธ Updated the data file of Unicode character categories to Unicode 8.0, as used in Python 3.5.0. (No matter what version of Python you're on, ftfy uses the same data.)

    • Heuristics now count characters such as ~ and ^ as punctuation instead of wacky math symbols, improving the detection of mojibake in some edge cases.
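
    The "lossy" case in the first item above can be recreated with Python's own Windows-1252 codec. This is a standard-library sketch of how the damage happens, not of how ftfy repairs it:

```python
original = "\u201c like this \u201d"  # text in curly quotes

# Misdecode the UTF-8 bytes as Windows-1252. Byte 0x9D (the last byte of
# the closing quote) is unassigned in Python's codec, so it becomes U+FFFD.
lossy = original.encode("utf-8").decode("windows-1252", errors="replace")

print(lossy)
print("\ufffd" in lossy)
```

    The opening quote survives as three recoverable characters, while the closing quote has lost its final byte, which is why only the former can be restored exactly.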

    New features:

    • A new module, ftfy.formatting, can be used to justify Unicode text in a monospaced terminal. It takes into account that each character can take up anywhere from 0 to 2 character cells.

    • Internally, the utf-8-variants codec was simplified and optimized.
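
    The idea of characters occupying 0, 1, or 2 cells can be approximated with the standard unicodedata module. The helpers below are hypothetical and deliberately simplified; they are not the ftfy.formatting API and will disagree with a real terminal for some characters:

```python
import unicodedata


def display_width_sketch(text):
    """Approximate the number of terminal cells a string occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks occupy no cell of their own
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters take two cells
        else:
            width += 1
    return width


def ljust_sketch(text, cells, fill=" "):
    """Left-justify by display width rather than by len()."""
    return text + fill * max(0, cells - display_width_sketch(text))


print(display_width_sketch("abc"), display_width_sketch("日本語"))
print(repr(ljust_sketch("日本語", 8)))
```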

  • v4.0.0 Changes

    April 10, 2015

    Breaking changes:

    • The default normalization form is now NFC, not NFKC. NFKC replaces a large number of characters with 'equivalent' characters, and some of these replacements are useful, but some are not desirable to do by default.

    • The fix_text function has some new options that perform more targeted operations that are part of NFKC normalization, such as fix_character_width, without requiring hitting all your text with the huge mallet that is NFKC.

      • If you were already using NFC normalization, or in general if you want to preserve the spacing of CJK text, you should be sure to set fix_character_width=False.

    • The remove_unsafe_private_use parameter has been removed entirely, after two versions of deprecation. The function name fix_bad_encoding is also gone.
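
    The difference between the two normalization forms is easy to see with the standard unicodedata module. This sketch is independent of ftfy; the three sample strings are chosen here for illustration:

```python
import unicodedata

# e + combining acute accent, the "fi" ligature, and a fullwidth capital A
samples = ["e\u0301", "\ufb01", "\uff21"]

for s in samples:
    nfc = unicodedata.normalize("NFC", s)
    nfkc = unicodedata.normalize("NFKC", s)
    print(ascii(s), ascii(nfc), ascii(nfkc))
```

    NFC only composes the accented letter. NFKC additionally rewrites the ligature as two plain letters and the fullwidth letter as an ASCII one, which is the kind of replacement that is not always wanted by default.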

    New features:

    • Fixers for strange new forms of mojibake, including particularly clear cases of mixed UTF-8 and Windows-1252.

    • New heuristics, so that ftfy can fix more stuff, while maintaining approximately zero false positives.

    • The command-line tool trusts you to know what encoding your input is in, and assumes UTF-8 by default. You can still tell it to guess with the -g option.

    • The command-line tool can be configured with options, and can be used as a pipe.

    • Recognizes characters that are new in Unicode 7.0, as well as emoji from Unicode 8.0+ that may already be in use on iOS.

    Deprecations:

    • fix_text_encoding is being renamed again, for conciseness and consistency. It's now simply called fix_encoding. The name fix_text_encoding is available but emits a warning.

    Pending deprecations:

    • Python 2.6 support is largely coincidental.

    • Python 2.7 support is on notice. If you use Python 2, be sure to pin a version of ftfy less than 5.0 in your requirements.

  • v3.4.0 Changes

    January 15, 2015

    New features:

    • ftfy.fixes.fix_surrogates will fix all 16-bit surrogate codepoints, which would otherwise break various encoding and output functions.
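
    For readers unfamiliar with the problem, the sketch below shows one way to join surrogate pairs using a UTF-16 round trip from the standard library. The helper name is hypothetical and this is not necessarily how ftfy.fixes.fix_surrogates is implemented:

```python
def join_surrogates_sketch(text):
    """Combine UTF-16 surrogate pairs into the characters they encode.

    Unpaired surrogates cannot be decoded and are replaced instead.
    """
    utf16 = text.encode("utf-16", "surrogatepass")
    return utf16.decode("utf-16", "replace")


broken = "\ud83d\udca9"  # two surrogate code points, not one character
fixed = join_surrogates_sketch(broken)

print(len(broken), len(fixed))
print(fixed == "\U0001f4a9")
```

    The broken string cannot be encoded as UTF-8 at all, which is how such text ends up breaking output functions.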

    Deprecations:

    • remove_unsafe_private_use emits a warning, and will disappear in the next minor or major version.