Markus Konrad | WZB Data Science Blog

Data Mining OCR PDFs — Using pdftabextract to liberate tabular data from scanned documents

Detected clusters of vertical lines with pdftabextract

Over the last few months I often had to deal with the problem of extracting tabular data from scanned documents. These documents included quite old sources, such as catalogs of German newspapers from the 1920s to 1930s, and newer sources, such as lists of schools in Germany from the 1990s. All sources were of mixed scanning quality (including rotated or skewed pages) and had very different table layouts. Some had visible table column borders; others had only table header borders, so the actual table cells were separated only by white space. Automated data extraction with tools from ABBYY or with Tabula failed in most cases. Because of the wide variety in scanning quality and table layouts, a single general solution didn't work out. Hence I created a set of common tools that detect table layouts on scanned pages in OCR PDFs, enable visual verification of the detected layouts, and finally extract the data from the tables. To detect and extract the data I created a Python library named pdftabextract, which is now published on PyPI and can be installed with pip. The detected layouts can be verified page by page using pdf2xml-viewer. This post introduces both tools by showing all the steps needed to extract tabular data from an example page. The necessary files can be found in the examples directory of the pdftabextract GitHub repository. A Jupyter Notebook for this example is also available there.
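To give a rough idea of what "detecting a table layout" can mean when cells are separated only by white space, here is a minimal sketch in plain Python. It does not use the pdftabextract API; the function names, the sample coordinates and the `max_gap` threshold are all made up for illustration. The idea is simply to group the x-positions of text boxes into clusters and treat each cluster as a candidate column.

```python
from typing import List


def cluster_positions(positions: List[float], max_gap: float) -> List[List[float]]:
    """Group 1D positions into clusters.

    Positions are sorted first; a new cluster starts whenever the distance
    to the previous position is larger than max_gap.
    """
    clusters: List[List[float]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def cluster_centers(clusters: List[List[float]]) -> List[float]:
    """Use the mean of each cluster as the estimated column position."""
    return [sum(c) / len(c) for c in clusters]


# Hypothetical left edges (in page units) of text boxes on one scanned page.
xs = [50.0, 52.0, 51.0, 200.0, 203.0, 198.0, 400.0, 401.0]

clusters = cluster_positions(xs, max_gap=10.0)
borders = cluster_centers(clusters)
print(borders)
```

With these sample values the positions fall into three groups, so three candidate columns are reported. On real scans the threshold has to be tuned per document, which is one reason visual verification of the detected layout is useful.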

Creating a “balloon plot” as alternative to a heat map with ggplot2

Heat maps are great for comparing observations with lots of variables (which must be comparable in terms of unit, domain, etc.). In some cases, however, traditional heat maps might not suffice, for example when you want to compare multiple groups of observations. One solution is to use facets. Another solution, which I want to explain here, is to make a “balloon plot” with a fixed grid of rows and columns.

Author Archives: Markus Konrad

R

Python

Interesting articles, projects and news
