Changelog History

  • v4.3.1 Changes

    January 12, 2017

    Bug fix:

    • remove_control_chars was removing U+0D ('\r') prematurely. That's the job of fix_line_breaks.
  • v4.3.0 Changes

    December 29, 2016

    ftfy has gotten by for four years without dependencies on other Python libraries, but now we can spare ourselves some code and some maintenance burden by delegating certain tasks to other libraries that already solve them well. This version now depends on the html5lib and wcwidth libraries.

    Feature changes:

    • The remove_control_chars fixer will now remove some non-ASCII control characters as well, such as deprecated Arabic control characters and byte-order marks. Bidirectional controls are still left as is.

    This should have no impact on well-formed text, while cleaning up many characters that the Unicode Consortium deems "not suitable for markup" (see Unicode Technical Report #20).

    • The unescape_html fixer uses a more thorough list of HTML entities, which it imports from html5lib.

    • ftfy.formatting now uses wcwidth to compute the width that a string will occupy in a text console.

    Heuristic changes:

    • Updated the data file of Unicode character categories to Unicode 9, as used in Python 3.6.0. (No matter what version of Python you're on, ftfy uses the same data.)

    Pending deprecations:

    • The remove_bom option will become deprecated in 5.0, because it has been superseded by remove_control_chars.

    • ftfy 5.0 will remove the previously deprecated name fix_text_encoding. It was renamed to fix_encoding in 4.0.

    • ftfy 5.0 will require Python 3.2 or later, as planned. Python 2 users, please specify ftfy < 5 in your dependencies if you haven't already.

  • v4.2.0 Changes

    September 28, 2016

    Heuristic changes:

    • Math symbols next to currency symbols are no longer considered 'weird' by the heuristic. This fixes a false positive where text that involved the multiplication sign and British pounds or euros (as in '5×£35') could turn into Hebrew letters.

    • A heuristic that used to be a bonus for certain punctuation now also gives a bonus to successfully decoding other common codepoints, such as the non-breaking space, the degree sign, and the byte order mark.

    • In version 4.0, we tried to "future-proof" the categorization of emoji (as a kind of symbol) to include codepoints that would likely be assigned to emoji later. The future happened, and there are even more emoji than we expected. We have expanded the range to include those emoji, too.

    ftfy is still mostly based on information from Unicode 8 (as Python 3.5 is), but this expanded range should include the emoji from Unicode 9 and 10.

    • Emoji are increasingly being modified by variation selectors and skin-tone modifiers. Those codepoints are now grouped with 'symbols' in ftfy, so they fit right in with emoji, instead of being considered 'marks' as their Unicode category would suggest.

    This enables fixing mojibake that involves iOS's new diverse emoji.

    • An old heuristic that wasn't necessary anymore considered Latin text with high-numbered codepoints to be 'weird', but this is normal in languages such as Vietnamese and Azerbaijani. This does not seem to have caused any false positives, but it caused ftfy to be too reluctant to fix some cases of broken text in those languages.

    The heuristic has been changed, and all languages that use Latin letters should be on even footing now.
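
The multiplication-sign false positive mentioned above comes from a byte-level coincidence: the Windows-1252 bytes for '×£' are also a valid UTF-8 sequence for a Hebrew letter. The following standard-library illustration shows only that coincidence; it is not ftfy's heuristic.

```python
# Illustration only: how a well-meant "fix" could misread legitimate text.
text = "5\u00d7\u00a335"  # "5×£35": multiplication sign, then pound sign

# '×' is byte 0xD7 and '£' is byte 0xA3 in Windows-1252.
raw = text.encode("windows-1252")

# 0xD7 0xA3 is also the UTF-8 encoding of U+05E3 (HEBREW LETTER FINAL PE),
# so re-decoding as UTF-8 succeeds and silently changes the text.
misread = raw.decode("utf-8")

print(repr(text), "->", repr(misread))
```

Because the re-decoding succeeds without error, only a heuristic about which result looks more plausible can tell the two readings apart.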

  • v4.1.1 Changes

    April 13, 2016
    • Bug fix: in the command-line interface, the -e option had no effect on Python 3 when using standard input. Now, it correctly lets you specify a different encoding for standard input.
  • v4.1.0 Changes

    February 25, 2016

    Heuristic changes:

    • ftfy can now deal with "lossy" mojibake. If your text has been run through a strict Windows-1252 decoder, such as the one in Python, it may contain the replacement character � (U+FFFD) where there were bytes that are unassigned in Windows-1252.

    Although ftfy won't recover the lost information, it can now detect this situation, replace the entire lossy character with �, and decode the rest of the characters. Previous versions would be unable to fix any string that contained U+FFFD.

    As an example, text in curly quotes that gets corrupted â€œ like this â€� now gets fixed to be “ like this �.

    • Updated the data file of Unicode character categories to Unicode 8.0, as used in Python 3.5.0. (No matter what version of Python you're on, ftfy uses the same data.)

    • Heuristics now count characters such as ~ and ^ as punctuation instead of wacky math symbols, improving the detection of mojibake in some edge cases.
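
The "lossy" mojibake case described above can be reproduced with the standard library alone. The sketch below shows how the replacement character arises and why the usual round-trip repair fails. The final repair step is a hand-written illustration for this one string, not ftfy's actual algorithm.

```python
original = "\u201c like this \u201d"  # curly quotes around " like this "

# Mis-decode the UTF-8 bytes as Windows-1252. Byte 0x9D (the last byte of
# the closing quote) is unassigned in Python's cp1252 codec, so it is lost
# and becomes U+FFFD.
lossy = original.encode("utf-8").decode("cp1252", errors="replace")

# The usual repair (re-encode as Windows-1252, decode as UTF-8) cannot even
# start, because U+FFFD has no Windows-1252 encoding.
try:
    lossy.encode("cp1252")
    reversible = True
except UnicodeEncodeError:
    reversible = False

# Illustrative repair for this specific string: collapse the damaged
# three-character sequence into a single U+FFFD, then repair the intact
# segments on either side of it.
collapsed = lossy.replace("\u00e2\u20ac\ufffd", "\ufffd")
parts = [p.encode("cp1252").decode("utf-8") for p in collapsed.split("\ufffd")]
repaired = "\ufffd".join(parts)

print(repr(lossy), reversible, repr(repaired))
```

The opening quote is recovered because all three of its bytes survived. The closing quote stays a single U+FFFD because one of its bytes is gone for good.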

    New features:

    • A new module, ftfy.formatting, can be used to justify Unicode text in a monospaced terminal. It takes into account that each character can take up anywhere from 0 to 2 character cells.

    • Internally, the utf-8-variants codec was simplified and optimized.
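
As a rough illustration of the "0 to 2 character cells" idea behind ftfy.formatting, here is a sketch that estimates display width using only the standard unicodedata module. The function names approx_display_width and ljust_cells are hypothetical, and the rules are a simplification. They do not reproduce ftfy's own width tables or those of the wcwidth library.

```python
import unicodedata


def approx_display_width(text):
    """Estimate how many terminal cells `text` occupies (approximation)."""
    width = 0
    for ch in text:
        # Combining marks and format characters take up no cell of their own.
        if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # East Asian Wide and Fullwidth characters take two cells.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


def ljust_cells(text, cells, fill=" "):
    """Left-justify `text` to `cells` terminal cells rather than codepoints."""
    return text + fill * max(0, cells - approx_display_width(text))


print(approx_display_width("abc"), approx_display_width("\u65e5\u672c\u8a9e"))
```

Padding by cell count rather than by len() is what keeps columns aligned when CJK text and combining characters are mixed with ASCII.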

  • v4.0.0 Changes

    April 10, 2015

    Breaking changes:

    • The default normalization form is now NFC, not NFKC. NFKC replaces a large number of characters with 'equivalent' characters, and some of these replacements are useful, but some are not desirable to do by default.

    • The fix_text function has some new options that perform more targeted operations that are part of NFKC normalization, such as fix_character_width, without requiring hitting all your text with the huge mallet that is NFKC.

      • If you were already using NFC normalization, or in general if you want to preserve the spacing of CJK text, you should be sure to set fix_character_width=False.
    • The remove_unsafe_private_use parameter has been removed entirely, after two versions of deprecation. The function name fix_bad_encoding is also gone.
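
The difference between NFC and NFKC discussed above can be seen directly with the standard unicodedata module. This small comparison does not call ftfy. It only shows what each normalization form does to three sample strings.

```python
import unicodedata

samples = {
    "ligature fi": "\ufb01",         # single-codepoint 'fi' ligature
    "fullwidth A": "\uff21",         # fullwidth Latin capital A
    "e + combining acute": "e\u0301",
}

for label, s in samples.items():
    nfc = unicodedata.normalize("NFC", s)
    nfkc = unicodedata.normalize("NFKC", s)
    # NFC only composes canonical sequences; NFKC also replaces
    # "compatibility" characters such as ligatures and fullwidth forms.
    print(f"{label}: NFC={nfc!r} NFKC={nfkc!r}")
```

NFC leaves the ligature and the fullwidth letter untouched, while NFKC rewrites both. Both forms compose 'e' plus the combining accent into a single codepoint.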

    New features:

    • Fixers for strange new forms of mojibake, including particularly clear cases of mixed UTF-8 and Windows-1252.

    • New heuristics, so that ftfy can fix more stuff, while maintaining approximately zero false positives.

    • The command-line tool trusts you to know what encoding your input is in, and assumes UTF-8 by default. You can still tell it to guess with the -g option.

    • The command-line tool can be configured with options, and can be used as a pipe.

    • Recognizes characters that are new in Unicode 7.0, as well as emoji from Unicode 8.0+ that may already be in use on iOS.

    Deprecations:

    • fix_text_encoding is being renamed again, for conciseness and consistency. It's now simply called fix_encoding. The name fix_text_encoding is available but emits a warning.

    Pending deprecations:

    • Python 2.6 support is largely coincidental.

    • Python 2.7 support is on notice. If you use Python 2, be sure to pin a version of ftfy less than 5.0 in your requirements.

  • v3.4.0 Changes

    January 15, 2015

    New features:

    • ftfy.fixes.fix_surrogates will fix all 16-bit surrogate codepoints, which would otherwise break various encoding and output functions.
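
To illustrate why surrogate codepoints "break various encoding and output functions", the sketch below shows a paired surrogate failing to encode as UTF-8, then merges the pair with a standard-library round trip. The helper join_surrogate_pairs is hypothetical and handles only properly paired surrogates. It is not the implementation of ftfy.fixes.fix_surrogates, and a lone surrogate would make its decode step raise an error.

```python
def join_surrogate_pairs(text):
    """Merge properly paired UTF-16 surrogates into single codepoints."""
    # 'surrogatepass' lets the encoder emit the surrogates as raw UTF-16
    # code units; decoding those units then yields the real character.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


broken = "\ud83d\udca9"  # the surrogate pair for U+1F4A9, as two codepoints

try:
    broken.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

fixed = join_surrogate_pairs(broken)
print(encodable, len(broken), len(fixed))
```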

    Deprecations:

    • remove_unsafe_private_use emits a warning, and will disappear in the next minor or major version.
  • v3.3.1 Changes

    December 12, 2014

    This version restores compatibility with Python 2.6.

  • v3.3.0 Changes

    August 16, 2014

    Heuristic changes:

    • Certain symbols are marked as "ending punctuation" that may naturally occur after letters. When they follow an accented capital letter and look like mojibake, they will not be "fixed" without further evidence. An example is that "MARQUÉ…" will become "MARQUÉ...", and not "MARQUɅ".
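
The "MARQUÉ…" example works the same way as other false positives: the Windows-1252 bytes of the accented capital and the ellipsis happen to form one valid UTF-8 character. This standard-library snippet shows only the byte coincidence, not ftfy's heuristic.

```python
text = "MARQU\u00c9\u2026"  # "MARQUÉ…": E with acute accent, then an ellipsis

# In Windows-1252, 'É' is byte 0xC9 and '…' is byte 0x85.
raw = text.encode("windows-1252")

# 0xC9 0x85 is also the UTF-8 encoding of U+0245 (LATIN CAPITAL LETTER
# TURNED V), so a naive repair would merge the two characters into one.
misread = raw.decode("utf-8")

print(repr(misread))
```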

    New features:

    • ftfy.explain_unicode is a diagnostic function that shows you what's going on in a Unicode string. It shows you a table with each code point in hexadecimal, its glyph, its name, and its Unicode category.

    • ftfy.fixes.decode_escapes adds a feature missing from the standard library: it lets you decode a Unicode string with backslashed escape sequences in it (such as "\u2014") the same way that Python itself would.

    • ftfy.streamtester is a release of the code that I use to test ftfy on an endless stream of real-world data from Twitter. With the new heuristics, the false positive rate of ftfy is about 1 per 6 million tweets. (See the "Accuracy" section of the documentation.)
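
The gap that decode_escapes fills can be shown with the standard library: the built-in unicode_escape codec mangles any non-ASCII text that sits next to the escapes. The sketch below contrasts that with a regex-based approach that decodes only the escape sequences themselves. The names ESCAPE_RE and decode_escapes_sketch are hypothetical, the pattern covers only a subset of Python's escapes, and this is not presented as ftfy's exact implementation.

```python
import codecs
import re

# Matches a subset of Python's backslash escapes (no octal or \N{...}).
ESCAPE_RE = re.compile(
    r"""\\(?:
        U[0-9A-Fa-f]{8}
      | u[0-9A-Fa-f]{4}
      | x[0-9A-Fa-f]{2}
      | [\\'"abfnrtv]
    )""",
    re.VERBOSE,
)


def decode_escapes_sketch(text):
    """Decode backslash escapes in `text`, leaving all other characters alone."""
    return ESCAPE_RE.sub(
        lambda m: codecs.decode(m.group(0), "unicode_escape"), text
    )


sample = "caf\u00e9 \\u2014 ok"  # real accented e, plus a literal \u2014 escape

# Naive approach: the codec treats every non-escape byte as Latin-1,
# so the UTF-8 bytes of the accented e turn into two wrong characters.
naive = sample.encode("utf-8").decode("unicode_escape")

careful = decode_escapes_sketch(sample)
print(repr(naive), repr(careful))
```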

    Deprecations:

    • Python 2.6 is no longer supported.

    • remove_unsafe_private_use is no longer needed in any current version of Python. This fixer will disappear in a later version of ftfy.

  • v3.2.0 Changes

    June 27, 2014
    • fix_line_breaks now handles three additional characters that are considered line breaks in some environments, such as JavaScript and Python's "codecs" library. These are all now replaced with \n:

      • U+0085 <control>, with alias "NEXT LINE"
      • U+2028 LINE SEPARATOR
      • U+2029 PARAGRAPH SEPARATOR
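
Python's own str.splitlines treats these three characters as line boundaries, which is one way they can surprise code that expects only \n. The sketch below shows that behavior and a minimal replacement step. The helper normalize_extra_line_breaks is hypothetical and covers only these three characters, not everything fix_line_breaks does.

```python
# Map each of the three additional line-break characters to "\n".
EXTRA_LINE_BREAKS = str.maketrans({
    "\u0085": "\n",  # NEXT LINE
    "\u2028": "\n",  # LINE SEPARATOR
    "\u2029": "\n",  # PARAGRAPH SEPARATOR
})


def normalize_extra_line_breaks(text):
    """Replace U+0085, U+2028 and U+2029 with a plain newline."""
    return text.translate(EXTRA_LINE_BREAKS)


sample = "a\u0085b\u2028c\u2029d"

# splitlines() already sees four lines here, but split("\n") sees one.
print(sample.splitlines(), sample.split("\n"))

normalized = normalize_extra_line_breaks(sample)
print(normalized.split("\n"))
```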