PyPDF2 v2.1.0 Release Notes

Release Date: 2022-06-06
  • The highlight of the 2.1.0 release is the largest improvement to PyPDF2's text extraction capabilities since 2016. A very big thank you goes to pubpub-zz, who invested a lot of time and knowledge of the PDF format to finally get those improvements into PyPDF2. Thank you!

    In case the new function causes any issues, you can use _extract_text_old for the old functionality. Please also open a bug ticket in that case.
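
    The fallback described above could be wrapped roughly as follows. This is a minimal sketch, not part of PyPDF2: the helper name extract_text_with_fallback is hypothetical, and it assumes only that a page object offers extract_text() and the _extract_text_old method named in these notes. The file name in the usage comment is a placeholder.

```python
def extract_text_with_fallback(page):
    """Return text from the new extractor, or from the old one if it raises.

    `page` is assumed to be a PyPDF2 2.1.0 page object exposing
    extract_text() and _extract_text_old(). This helper is hypothetical
    and is not shipped with PyPDF2.
    """
    try:
        return page.extract_text()
    except Exception:
        # The new extractor failed; use the pre-2.1.0 code path instead.
        # Please still report the failure as a bug ticket.
        return page._extract_text_old()


# Usage sketch ("example.pdf" is a placeholder path):
#   from PyPDF2 import PdfReader
#   reader = PdfReader("example.pdf")
#   print(extract_text_with_fallback(reader.pages[0]))
```

    Catching a broad Exception is a deliberate simplification here; since _extract_text_old is a private method, it may change or disappear in later releases.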

    Several people have attempted to bring similar improvements to PyPDF2, and all of those contributions were valuable. The main reason they were not merged is the large number of open PRs / issues. The PR by pubpub-zz was the most comprehensive and also incorporated the latest changes of PyPDF2 2.0.0.

    Thank you to VictorCarlquist for #858 and asabramo for #464.

    New Features (ENH)

    • Massive text extraction improvement (#924). Closed many open issues:
      • Exceptions / missing spaces in extract_text() method (#17)
        • Whitespace issues in extract_text() (#42)
        • pypdf2 reads the hifenated words in a new line (#246)
      • PyPDF2 failing to read unicode character (#37)
        • Unable to read bullets (#230)
      • ExtractText yields nothing for apparently good PDF (#168)
      • Encoding issue in extract_text() (#235)
      • extractText() doesn't work on Chinese PDF (#252)
      • encoding error (#260)
      • Trouble with apostophes in names in text "O'Doul" (#384)
      • extract_text works for some PDF files, but not the others (#437)
      • Euro sign not being recognized by extractText (#443)
      • Failed extracting text from French texts (#524)
      • extract_text doesn't extract ligatures correctly (#598)
      • reading spanish text - mark convert issue (#635)
      • Read PDF changed from text to random symbols (#654)
      • .extractText() reads / as 1. (#789)
    • Update glyphlist (#947) - inspired by #464
    • Allow adding PageRange objects (#948)

    Bug Fixes (BUG)

    • Delete .python-version file (#944)
    • Compare StreamObject.decoded_self with None (#931)

    Robustness (ROB)

    • Fix some conversion errors on non-conformant PDFs (#932)

    Documentation (DOC)

    • Elaborate on PDF text extraction difficulties (#939)
    • Add logo (#942)
    • rotate vs Transformation().rotate (#937)
    • Example how to use PyPDF2 with AWS S3 (#938)
    • How to deprecate (#930)
    • Fix typos on robustness page (#935)
    • Remove scripts (pdfcat) from docs (#934)

    Developer Experience (DEV)

    • Ignore .python-version file
    • Mark deprecated code with no-cover (#943)
    • Automatically create GitHub releases from tags (#870)

    Testing (TST)

    • Text extraction for non-latin alphabets (#954)
    • Ignore PdfReadWarning in benchmark (#949)
    • writer.remove_text (#946)
    • Add test for Tree and _security (#945)

    Code Style (STY)

    • black, isort, Flake8, splitting buildCharMap (#950)
