This repository contains a set of tools written in Python 3 for extracting tabular data from (OCR-processed) PDF files. Before these files can be processed, they need to be converted to XML files in pdf2xml format. The conversion is simple -- see the section below for instructions.
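To give a feel for what a pdf2xml file holds, here is a minimal sketch that reads text boxes from a small hand-written sample using only the Python standard library. The sample document, the element and attribute names (page, text, top, left, width, height) and the helper parse_text_boxes are assumptions modeled on typical pdf2xml output; this is not the API of the common submodule.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed pdf2xml layout: one <page> element
# containing <text> elements that carry position attributes.
SAMPLE = (
    '<pdf2xml>'
    '<page number="1" position="absolute" top="0" left="0" height="1263" width="892">'
    '<text top="100" left="50" width="40" height="12" font="0">Name</text>'
    '<text top="100" left="200" width="25" height="12" font="0"><b>Age</b></text>'
    '</page>'
    '</pdf2xml>'
)


def parse_text_boxes(xml_string):
    """Return {page number: [text box dicts]} for a pdf2xml-style string.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_string)
    pages = {}
    for page in root.iter("page"):
        boxes = []
        for t in page.iter("text"):
            boxes.append({
                "left": float(t.get("left")),
                "top": float(t.get("top")),
                "width": float(t.get("width")),
                "height": float(t.get("height")),
                # itertext() also collects text nested in <b> or <i> tags
                "text": "".join(t.itertext()).strip(),
            })
        pages[int(page.get("number"))] = boxes
    return pages


pages = parse_text_boxes(SAMPLE)
for box in pages[1]:
    print(box["left"], box["top"], box["text"])
```

Each text box ends up as a plain dictionary with its position and content, which is the kind of input the later steps (deskewing, clustering, grid fitting) work on.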

Module overview

After the conversion, you can view the extracted text boxes with the pdf2xml-viewer tool if you like. The pdf2xml format can be loaded and parsed with functions in the common submodule. Lines can be detected in the scanned images using the imgproc module. If the pages are skewed or rotated, this can be detected and fixed with methods from imgproc and functions in textboxes. Lines or text box positions can be clustered using the clustering module in order to detect table columns and rows. Once columns and rows have been detected, they can be converted to a page grid with the extract module, and their contents can be extracted using fit_texts_into_grid in the same module. The extract module also allows you to export the data as a pandas DataFrame.
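The clustering and grid-fitting idea described above can be sketched in plain Python. This is an illustration of the general technique under simple assumptions (clean, unskewed positions and a fixed gap threshold), not the implementation in the clustering or extract modules; the functions cluster_1d, cluster_starts and cell_index and the sample text boxes are hypothetical.

```python
def cluster_1d(values, max_gap):
    """Group 1D positions: a new cluster starts when the gap exceeds max_gap."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def cluster_starts(values, max_gap):
    """Use the smallest position of each cluster as the column/row start."""
    return [min(c) for c in cluster_1d(values, max_gap)]


def cell_index(pos, starts):
    """Index of the last start that is <= pos (starts are ascending)."""
    idx = 0
    for i, s in enumerate(starts):
        if pos >= s:
            idx = i
    return idx


# Hypothetical text boxes with slightly jittered positions, as after OCR.
boxes = [
    {"left": 50, "top": 100, "text": "Name"},
    {"left": 202, "top": 101, "text": "Age"},
    {"left": 51, "top": 130, "text": "Ada"},
    {"left": 200, "top": 131, "text": "36"},
    {"left": 49, "top": 160, "text": "Alan"},
    {"left": 201, "top": 159, "text": "41"},
]

# Cluster x positions into columns and y positions into rows.
col_starts = cluster_starts([b["left"] for b in boxes], max_gap=10)
row_starts = cluster_starts([b["top"] for b in boxes], max_gap=10)

# Fit each text box into the grid cell it falls into.
table = [["" for _ in col_starts] for _ in row_starts]
for b in boxes:
    r = cell_index(b["top"], row_starts)
    c = cell_index(b["left"], col_starts)
    table[r][c] = b["text"]

for row in table:
    print(row)
```

The gap threshold (max_gap) is the main tuning knob in this sketch: too small and jittered positions split into separate columns, too large and neighboring columns merge.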

Code Quality Rank: L3
Programming language: Python
License: Apache License 2.0
