PyPDF2 v2.1.0 Release Notes
Release Date: 2022-06-06
The highlight of the 2.1.0 release is the most massive improvement to the text extraction capabilities of PyPDF2 since 2016. A very big thank you goes to pubpub-zz, who invested a lot of time and knowledge about the PDF format to finally get those improvements into PyPDF2. Thank you!
In case the new function causes any issues, you can use _extract_text_old for the old functionality. Please also open a bug ticket in that case.
Several people have attempted to bring similar improvements to PyPDF2. All of those were valuable. The main reason why they didn't get merged is the large number of open PRs / issues. pubpub-zz's PR was the most comprehensive one, and it also incorporated the latest changes of PyPDF2 2.0.0.
Thank you to VictorCarlquist for #858 and asabramo for #464.
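The fallback mentioned above can be sketched roughly as follows. This is a minimal illustration, not an official recipe: the helper names extract_text_with_fallback and extract_all are hypothetical, and the code assumes a PyPDF2 2.1.0 page object that exposes both extract_text() and the private _extract_text_old() named in these notes.

```python
from typing import Any, List


def extract_text_with_fallback(page: Any) -> str:
    """Try the new extractor first; use the old one if it raises.

    `page` is assumed to be a PyPDF2 2.1.0 page object that provides
    both extract_text() and _extract_text_old().
    """
    try:
        return page.extract_text()
    except Exception:
        # _extract_text_old is private, so it may be removed in later versions.
        return page._extract_text_old()


def extract_all(path: str) -> List[str]:
    """Extract the text of every page in the PDF at `path` (hypothetical helper)."""
    # Imported lazily so the helper above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return [extract_text_with_fallback(page) for page in reader.pages]
```

Because the fallback hides the original exception, it is worth logging the error as well and reporting it in a bug ticket, as requested above.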
New Features (ENH)
- Massive text extraction improvement (#924). Closed many open issues:
  - Exceptions / missing spaces in extract_text() method (#17)
  - Whitespace issues in extract_text() (#42)
  - pypdf2 reads the hifenated words in a new line (#246)
  - PyPDF2 failing to read unicode character (#37)
  - Unable to read bullets (#230)
  - ExtractText yields nothing for apparently good PDF (#168)
  - Encoding issue in extract_text() (#235)
  - extractText() doesn't work on Chinese PDF (#252)
  - encoding error (#260)
  - Trouble with apostophes in names in text "O'Doul" (#384)
  - extract_text works for some PDF files, but not the others (#437)
  - Euro sign not being recognized by extractText (#443)
  - Failed extracting text from French texts (#524)
  - extract_text doesn't extract ligatures correctly (#598)
  - reading spanish text - mark convert issue (#635)
  - Read PDF changed from text to random symbols (#654)
  - .extractText() reads / as 1. (#789)
- Update glyphlist (#947) - inspired by #464
- Allow adding PageRange objects (#948)
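The PageRange addition from #948 can be pictured with plain Python slices. The sketch below is not PyPDF2's implementation: the helper name merge_ranges is hypothetical, and it assumes that "adding" means merging two stride-free ranges that overlap or touch, while rejecting ranges separated by a gap.

```python
def merge_ranges(a: slice, b: slice) -> slice:
    """Merge two stride-free slices that overlap or are adjacent.

    Both slices are assumed to have explicit, non-negative start and stop.
    """
    if a.step is not None or b.step is not None:
        raise ValueError("cannot merge ranges that have a stride")
    # Order the two ranges so that `a` is the one that starts first.
    if a.start > b.start:
        a, b = b, a
    if b.start > a.stop:
        raise ValueError("cannot merge ranges separated by a gap")
    return slice(a.start, max(a.stop, b.stop))


merged = merge_ranges(slice(0, 3), slice(2, 9))
print(merged)  # slice(0, 9, None)
```

With PyPDF2 itself, the corresponding expression would presumably look like PageRange("0:3") + PageRange("2:9"); check the PageRange documentation for the exact behavior.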
Bug Fixes (BUG)
- Delete .python-version file (#944)
- Compare StreamObject.decoded_self with None (#931)
Robustness (ROB)
- Fix some conversion errors on non conform PDF (#932)
Documentation (DOC)
- Elaborate on PDF text extraction difficulties (#939)
- Add logo (#942)
- rotate vs Transformation().rotate (#937)
- Example how to use PyPDF2 with AWS S3 (#938)
- How to deprecate (#930)
- Fix typos on robustness page (#935)
- Remove scripts (pdfcat) from docs (#934)
Developer Experience (DEV)
- Ignore .python-version file
- Mark deprecated code with no-cover (#943)
- Automatically create GitHub releases from tags (#870)
Testing (TST)
- Text extraction for non-latin alphabets (#954)
- Ignore PdfReadWarning in benchmark (#949)
- writer.remove_text (#946)
- Add test for Tree and _security (#945)
Code Style (STY)
- black, isort, Flake8, splitting buildCharMap (#950)