⚡️ Updated the heuristic to fix the letter ß in UTF-8/MacRoman mojibake, which had regressed since version 5.6.
🛠 Packaging fixes to pyproject.toml.
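For readers unfamiliar with this kind of mojibake, here is a minimal stdlib-only sketch of how the ß case arises and how it reverses. It does not use ftfy or its heuristic, and the helper names are made up for this illustration.

```python
def make_mojibake(text: str, wrong_encoding: str = "mac_roman") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def undo_mojibake(text: str, wrong_encoding: str = "mac_roman") -> str:
    """Reverse the mistake: re-encode with the wrong codec, decode as UTF-8."""
    return text.encode(wrong_encoding).decode("utf-8")


garbled = make_mojibake("Straße")
print(garbled)                 # the ß appears as two unrelated characters
print(undo_mojibake(garbled))  # Straße
```

The hard part, which this sketch skips, is deciding when a string is actually mojibake and therefore safe to reverse. That decision is what the heuristic is for.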
⚡️ Updated the heuristic to fix the letter Ñ with more confidence.
🛠 Fixed type annotations and added py.typed.
📦 ftfy is packaged using Poetry now, and wheels are created and uploaded to PyPI.
👍 Allow the keyword argument fix_entities as a deprecated alias for unescape_html, raising a warning.
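The general pattern for keeping a renamed keyword argument alive might look like the sketch below. This is an illustrative stand-in and not ftfy's actual code: fix_text_sketch and its body are hypothetical, and only the names fix_entities and unescape_html come from the note above.

```python
import html
import warnings


def fix_text_sketch(text: str, unescape_html: bool = True, **kwargs) -> str:
    """Hypothetical fixer that accepts `fix_entities` as a deprecated alias."""
    if "fix_entities" in kwargs:
        warnings.warn(
            "`fix_entities` has been renamed to `unescape_html`",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this function
        )
        unescape_html = kwargs.pop("fix_entities")
    if kwargs:
        raise TypeError(f"unexpected keyword arguments: {sorted(kwargs)}")
    return html.unescape(text) if unescape_html else text


print(fix_text_sketch("Fish &amp; chips"))  # Fish & chips
```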
ftfy.formatting functions now disregard ANSI terminal escapes when calculating text width.
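To show why escapes matter for width, here is a rough stdlib-only approximation of the idea. It is an assumption-laden sketch, not ftfy.formatting itself: the regex only covers CSI sequences such as color codes, and display_width_sketch is a hypothetical name.

```python
import re
import unicodedata

# Matches CSI sequences such as "\x1b[31m" (colors, cursor movement).
# Assumption: other escape types (OSC and so on) are out of scope here.
ANSI_CSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def display_width_sketch(text: str) -> int:
    """Approximate the terminal width of text, ignoring ANSI CSI escapes."""
    visible = ANSI_CSI_RE.sub("", text)
    width = 0
    for ch in visible:
        if unicodedata.combining(ch):
            continue  # combining marks take no column of their own
        # Wide and fullwidth characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


colored = "\x1b[31mred\x1b[0m"
print(len(colored), display_width_sketch(colored))  # 12 3
```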
💄 This version is purely a cosmetic change, updating the maintainer's e-mail address and the project's canonical location on GitHub.
The remove_terminal_escapes step was accidentally not being used. This version restores it.
Specified in setup.py that ftfy 6 requires Python 3.6 or later.
📄 Use a lighter link color when the docs are viewed in dark mode.
ftfy.fix_and_explain() can describe all the transformations that happen when fixing a string. This is similar to what ftfy.fixes.fix_encoding_and_explain() did in previous versions, but it can fix more than the encoding.
fix_encoding_and_explain() are now in the top-level ftfy module.
🔄 Changed the heuristic entirely. ftfy no longer needs to categorize every Unicode character, but only characters that are expected to appear in mojibake.
🚀 Because of the new heuristic, ftfy will no longer have to release a new version for every new version of Unicode. It should also run faster and use less RAM when imported.
ftfy.badness.is_bad(text) can be used to determine whether there appears to be mojibake in a string. Some users were already using the old function sequence_weirdness() for that, but this one is actually designed for that purpose.
Instead of a pile of named keyword arguments, ftfy functions now take in a TextFixerConfig object. The keyword arguments still work, and become settings that override the defaults in TextFixerConfig.
➕ Added support for UTF-8 mixups with Windows-1253 and Windows-1254.
📚 Overhauled the documentation: https://ftfy.readthedocs.org
This version is brought to you by the letter à and the number 0xC3.
👉 Tweaked the heuristic to decode, for example, "Ã " as the letter "à" more often.
This combines with the non-breaking-space fixer to decode "Ã " as "à" as well. However, in many cases, the text " Ã " was intended to be " à ", preserving the space -- the underlying mojibake had two spaces after it, but the Web coalesced them into one. We detect this case based on common French and Portuguese words, and preserve the space when it appears intended.
Thanks to @zehavoc for bringing to my attention how common this case is.
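The byte-level reason behind this case can be seen with the standard library alone. The sketch below assumes the mix-up is UTF-8 read as Latin-1; the helper name can_reverse is hypothetical, and nothing here is ftfy code.

```python
NBSP = "\u00a0"  # non-breaking space

utf8_bytes = "à".encode("utf-8")
print(utf8_bytes.hex())  # c3a0

# Read as Latin-1, byte 0xC3 becomes "Ã" and byte 0xA0 becomes a non-breaking space.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake == "Ã" + NBSP)  # True


def can_reverse(text: str) -> bool:
    """Report whether text survives the Latin-1 -> UTF-8 round trip."""
    try:
        text.encode("latin-1").decode("utf-8")
        return True
    except UnicodeError:
        return False


# Once the NBSP has been turned into a plain space, the bytes are no longer
# valid UTF-8, so a naive re-decode cannot recover the "à".
damaged = mojibake.replace(NBSP, " ")
print(can_reverse(mojibake), can_reverse(damaged))  # True False
```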
⚡️ Updated the data file of Unicode character categories to Unicode 13, as used in Python 3.9. (No matter what version of Python you're on, ftfy uses the same data.)
👌 Improved detection of UTF-8 mojibake of Greek, Cyrillic, Hebrew, and Arabic scripts.
🛠 Fixed the undeclared dependency on setuptools by removing the use of
⚡️ Updated the data file of Unicode character categories to Unicode 12.1, as used in Python 3.8. (No matter what version of Python you're on, ftfy uses the same data.)
Corrected an omission where short sequences involving the ACUTE ACCENT character were not being fixed.
The unescape_html function now supports all the HTML5 entities that appear in html.entities.html5, including those with long names such as
Unescaping of numeric HTML entities now uses the standard library's html.unescape, making edge cases consistent.
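A few of the numeric edge cases that html.unescape handles can be checked directly. This sketch only exercises the standard library function and says nothing about which inputs ftfy chooses to pass to it.

```python
import html

cases = {
    "&#233;": "é",       # decimal reference
    "&#xE9;": "é",       # hexadecimal reference
    "&#233": "é",        # missing semicolon is tolerated
    "&#128;": "€",       # C1 code points are remapped as in Windows-1252
    "&#0;": "\ufffd",    # NUL becomes the replacement character
}

for escaped, expected in cases.items():
    result = html.unescape(escaped)
    print(f"{escaped!r} -> {result!r} (matches: {result == expected})")
```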
(The reason we don't run html.unescape on all text is that it's not always appropriate to apply, and can lead to false positive fixes. The text "This&NotThat" should not have "&Not" replaced by a symbol, as
👍 On top of Python's support for HTML5 entities, ftfy will also convert HTML escapes of common Latin capital letters that are (nonstandardly) written in all caps, such as