Popularity

7.5

Growing

Activity

9.8

Stars 4,002

Watchers 53

Forks 409

Last Commit 8 days ago

Description

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

Programming language: Python

License: GNU Affero General Public License v3.0

Tags: Text Processing, Specific Formats Processing, PDF, Natural Language Processing, OCR, Utilities

PyMuPDF alternatives and similar packages

Based on the "PDF" category.

PyPDF2

8.6 9.5 L2 PyMuPDF VS PyPDF2

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
WeasyPrint

8.3 9.4 L1 PyMuPDF VS WeasyPrint

The awesome document factory

PDFMiner

8.3 0.0 L3 PyMuPDF VS PDFMiner

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
borb

6.8 5.2 PyMuPDF VS borb

borb is a library for reading, creating and manipulating PDF files in python.
Camelot

6.7 6.9 PyMuPDF VS Camelot

A Python library to extract tabular data from PDFs
pdftabextract

6.5 0.0 L3 PyMuPDF VS pdftabextract

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
ReportLab

3.4 - PyMuPDF VS ReportLab

Allowing Rapid creation of rich PDF documents.

* Code Quality Rankings and insights are calculated and provided by Lumnify.
They vary from L1 to L5 with "L5" being the highest.

README

PyMuPDF 1.21.0

Release date: August 13, 2022

On PyPI since August 2016.

Author

Artifex, based on code by Jorj X. McKie and Ruikai Liu.

Introduction

PyMuPDF adds Python bindings and abstractions to MuPDF, a lightweight PDF, XPS, and eBook viewer, renderer, and toolkit. Both PyMuPDF and MuPDF are maintained and developed by Artifex Software, Inc.

MuPDF can access files in PDF, XPS, OpenXPS, CBZ, EPUB and FB2 (eBooks) formats, and it is known for its top performance and exceptional rendering quality.

With PyMuPDF you can access files with extensions like .pdf, .xps, .oxps, .cbz, .fb2 or .epub. In addition, about 10 popular image formats can also be handled like documents: .png, .jpg, .bmp, .tiff, etc.

Usage

For all supported document types (including images), you can:

Decrypt the document.
Access meta information, links and bookmarks.
Render pages in raster formats (PNG and some others), or the vector format SVG.
Search for text.
Extract text and images.
Convert to other formats: PDF, (X)HTML, XML, JSON, text.
Do OCR (Optical Character Recognition) if Tesseract is installed.
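
As a rough illustration of the read-side features listed above, the sketch below collects a few document facts, searches for a string, and extracts plain text. The helper names (document_summary, find_text, all_text) are made up for this example. Each helper expects an already opened document, such as the object returned by fitz.open("input.pdf"), and relies only on page_count, metadata, get_toc(), get_text() and search_for().

```python
def document_summary(doc):
    """Collect basic facts about an opened document.

    `doc` is assumed to behave like a PyMuPDF Document.
    """
    metadata = doc.metadata or {}  # some document types carry no metadata
    return {
        "pages": doc.page_count,
        "title": metadata.get("title") or "",
        "bookmarks": len(doc.get_toc()),  # one entry per table-of-contents item
    }


def find_text(doc, needle):
    """Return (page_number, hit_count) pairs for pages that contain `needle`."""
    hits = []
    for page in doc:  # iterating a document yields its pages
        rects = page.search_for(needle)  # one rectangle per occurrence
        if rects:
            hits.append((page.number, len(rects)))  # page.number is 0-based
    return hits


def all_text(doc):
    """Concatenate the plain text of every page, separated by newlines."""
    return "\n".join(page.get_text() for page in doc)
```

With PyMuPDF installed, doc = fitz.open("input.pdf") followed by print(document_summary(doc)) is one way to try it; the file name is a placeholder.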

To some degree, PyMuPDF can also be used as an image converter: it can read a range of input formats and can produce Portable Network Graphics (PNG), Portable Anymaps (PNM, etc.), Portable Arbitrary Maps (PAM), Adobe PostScript and Adobe Photoshop documents, making other graphics packages unnecessary in these cases. Interfacing with, for example, PIL/Pillow for image input and output is easy as well.
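
The Pillow hand-off mentioned above might look like the following sketch. The helper name pillow_args is hypothetical. It assumes an object that behaves like a PyMuPDF Pixmap (attributes n, alpha, width, height, samples) and only covers grayscale, RGB, RGBA and CMYK layouts; anything else is rejected rather than guessed.

```python
def pillow_args(pix):
    """Return (mode, size, data) in the order PIL.Image.frombytes() expects.

    `pix.n` is the number of bytes per pixel including alpha, `pix.alpha`
    tells whether an alpha channel is present, and `pix.samples` holds the
    raw pixel bytes.
    """
    has_alpha = bool(pix.alpha)
    color_components = pix.n - int(has_alpha)
    modes = {
        (1, False): "L",     # grayscale
        (3, False): "RGB",
        (3, True): "RGBA",
        (4, False): "CMYK",
    }
    key = (color_components, has_alpha)
    if key not in modes:
        raise ValueError(f"unsupported pixmap layout: n={pix.n}, alpha={pix.alpha}")
    return modes[key], (pix.width, pix.height), pix.samples
```

With both packages installed, Image.frombytes(*pillow_args(fitz.Pixmap("input.jpg"))) would pass the pixels to Pillow. For a plain format conversion, fitz.Pixmap("input.jpg").save("output.png") should be enough. Both file names are placeholders.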

For PDF documents, there exists a plethora of additional features: they can be created, joined or split up. Pages can be inserted, deleted, re-arranged or modified in many ways (including annotations and form fields).

Images and fonts can be extracted or inserted.

You may want to have a look at this cool GUI example script, which lets you insert, delete, replace or re-position images under your visual control.

If fontTools is installed, subsets can be built for eligible fonts based on their usage in the document. Especially for new PDFs, this can lead to significant file size reductions.
Embedded files are fully supported.
PDFs can be reformatted to support double-sided printing, posterizing, applying logos or watermarks.
Password protection is fully supported: decryption, encryption, encryption method selection, permission level and user / owner password setting.
Support of the PDF Optional Content concept for images, text and drawings.
Low-level PDF structures can be accessed and modified.
Command line module "python -m fitz ...": a versatile utility with the following features:
- encryption / decryption / optimization
- creation of sub-documents
- document joining
- image / font extraction
- full support of embedded files
- layout-preserving text extraction (all documents)
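
Joining, splitting and re-arranging PDFs can also be done from Python rather than the command line. The sketch below is one possible shape for it. The function names are invented for this example, and the arguments are assumed to be opened PDF documents that provide insert_pdf(), select() and page_count.

```python
def join_documents(target, sources):
    """Append all pages of each source PDF to `target`; return the new page count."""
    for source in sources:
        target.insert_pdf(source)  # default arguments copy every page to the end
    return target.page_count


def copy_page_range(target, source, first, last):
    """Append pages first..last (0-based, inclusive) of `source` to `target`."""
    target.insert_pdf(source, from_page=first, to_page=last)
    return target.page_count


def reverse_pages(doc):
    """Re-arrange `doc` in place so that its pages appear in reverse order."""
    doc.select(list(range(doc.page_count - 1, -1, -1)))
```

A typical call would start from an empty PDF created with fitz.open(), pass it as target, and finish with target.save("joined.pdf"); the output name is a placeholder.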

Have a look at the basic demos, the examples (which contain complete, working programs), and notebooks.

Documentation

Documentation is written using Sphinx. It is currently a combination of a reference guide and user manual.

You can view it online at Read the Docs, which also provides a PDF download option.
For a quick start, look at the tutorial and the recipes chapters.

The latest changelog can be viewed here.

Installation

PyMuPDF requires Python 3.7 or later.

For these Python versions, wheels exist for Windows (32-bit and 64-bit), Linux (64-bit, Intel and ARM) and macOS (64-bit, Intel only), so PyMuPDF can be installed from PyPI in the usual way. To ensure that pip supports the latest wheel platform tags, we strongly recommend always upgrading pip first:

python -m pip install --upgrade pip
python -m pip install --upgrade pymupdf

There are no mandatory external dependencies. However, some optional features become available only if additional packages are installed:

Pillow for using Pillow image output directly from PyMuPDF.
fontTools for creating font subsets.
pymupdf-fonts for a selection of additional fonts for your text output.
Tesseract-OCR for optical character recognition in images and document pages. Tesseract is separate software, not a Python package. To enable OCR functions in PyMuPDF, the system environment variable "TESSDATA_PREFIX" must be defined and point to the tessdata folder of the Tesseract installation.
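
To see which of these optional features are likely to be available on a given machine, a small standard-library check along the following lines may help. It is only a sketch: the function name is hypothetical, the import names in OPTIONAL_MODULES are assumptions about how each package is exposed, and the Tesseract test merely confirms that TESSDATA_PREFIX names an existing folder.

```python
import importlib.util
import os

# Assumed import names for the optional packages (distribution name -> module).
OPTIONAL_MODULES = {
    "Pillow": "PIL",
    "fontTools": "fontTools",
    "pymupdf-fonts": "pymupdf_fonts",
}


def optional_feature_report(environ=None):
    """Map each optional feature to True if it looks usable, else False."""
    if environ is None:
        environ = os.environ
    report = {
        name: importlib.util.find_spec(module) is not None
        for name, module in OPTIONAL_MODULES.items()
    }
    # OCR needs TESSDATA_PREFIX to point at Tesseract's tessdata folder.
    tessdata = environ.get("TESSDATA_PREFIX", "")
    report["Tesseract-OCR"] = bool(tessdata) and os.path.isdir(tessdata)
    return report


if __name__ == "__main__":
    for feature, available in optional_feature_report().items():
        print(f"{feature}: {'available' if available else 'missing'}")
```

Passing a plain dictionary as environ makes it easy to try the check without changing the real environment.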

Older wheels - also with support for older Python versions - can be found here and on PyPI.

Note: If pip cannot find a wheel that is compatible with your platform, it will automatically build and install from source using the PyMuPDF sdist; this requires only that SWIG is installed on your system.

License and Copyright

PyMuPDF and MuPDF are available under both open-source AGPL and commercial license agreements.

Please read the full text of the AGPL license agreement (which is also included here in file COPYING) to ensure that your use case complies with the guidelines of this license. If you determine you cannot meet the requirements of the AGPL, please contact Artifex for more information regarding a commercial license.

Artifex is the exclusive commercial licensing agent for MuPDF.

Artifex, the Artifex logo, MuPDF, and the MuPDF logo are registered trademarks of Artifex Software, Inc. PyMuPDF and the PyMuPDF logo are trademarks of Artifex Software, Inc. © 2022 Artifex Software, Inc. All rights reserved.

Contact

Please use the Discussions menu for questions, comments, or asking for help, and submit issues here.

*Note that all licence references and agreements mentioned in the PyMuPDF README section above are relevant to that project's source code only.