I wanted to share a small lab report on a project about the development of school sites in eastern Germany since 1992. Rita Nikolai (HU Berlin), Marcel Helbig (WZB) and I published our results a few months ago (see this WZB Discussion Paper or this WZBrief), but I’d like to provide some additional information on the technical background in this post, since that was beyond the scope of those papers.
Checkboxes and crosses: data mining PDFs with the help of image processing
From time to time, I work with “open data” published by public authorities. Often, these data do not deserve the label “open data”, mainly because they are provided as PDF files. PDFs are not machine readable, at least not without a lot of programming work. I don’t know whether this way of publishing data is done on purpose (because authorities are required to publish open data but do not want it to actually be analyzed at scale) or whether it is sheer ignorance.
For a recent project I came across a particularly nasty type of PDF: scores from a school inspection are listed in a large table where each score is marked with a cross (see a full PDF of such a school inspection):
While most data can be extracted from a PDF by converting it to a plain text representation, this is not possible for such PDFs, because the most important information, the scores, does not exist in the plain text representation at all. The crosses that mark the scores are essentially vector graphics embedded in the PDF. In this article I will explain how to extract such information.
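The core idea can be sketched with a toy example: once a table cell has been rendered to a small grayscale image, a cell containing a cross mark has a much higher share of dark pixels than a blank one. The function name and thresholds below are illustrative, not the project's actual code — a minimal sketch of the decision step, assuming the cell regions have already been located and rasterized:

```python
import numpy as np

def cell_is_checked(cell_img, dark_thresh=128, fill_ratio=0.05):
    """Guess whether a cell image contains a cross mark.

    cell_img: 2D numpy array of grayscale pixels (0 = black, 255 = white).
    A cross drawn into an otherwise empty cell pushes the share of dark
    pixels well above that of a blank cell, so a simple fill-ratio test
    is often enough to read off which score was marked.
    """
    dark = (cell_img < dark_thresh).sum()
    return dark / cell_img.size >= fill_ratio

# Toy data: a blank 20x20 cell vs. one with an "X" on its diagonals
blank = np.full((20, 20), 255, dtype=np.uint8)
crossed = blank.copy()
for i in range(20):
    crossed[i, i] = 0        # main diagonal stroke
    crossed[i, 19 - i] = 0   # anti-diagonal stroke
```

In practice the thresholds would have to be tuned against real scans, and noise (specks, cell borders) makes the decision less clean than in this toy case.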
Data Mining OCR PDFs — Using pdftabextract to liberate tabular data from scanned documents
During the last months I often had to deal with the problem of extracting tabular data from scanned documents. These documents included quite old sources like catalogs of German newspapers from the 1920s to 30s, as well as newer sources like lists of schools in Germany from the 1990s. All sources were of mixed scanning quality (including rotated or skewed pages) and had very different table layouts. Some had visible table column borders, others only table header borders, so the actual table cells were only visually separated by white space. Automated data extraction with tools from ABBYY or with Tabula failed in most cases. Because of the big variety of scanning quality and table layouts, a general single-solution approach didn’t work out. Hence I created a set of common tools that allow detecting table layouts on scanned pages in OCR PDFs, enable visual verification of the detected layouts, and finally allow extracting the data in the tables. To detect and extract the data I created a Python library named pdftabextract, which is now published on PyPI and can be installed with pip. The detected layouts can be verified page by page using pdf2xml-viewer. This post will cover an introduction to both tools by showing all necessary steps to extract tabular data from an example page. The necessary files can be found in the examples directory of the pdftabextract GitHub repository. A Jupyter Notebook for this example is also available there.
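A central step in this kind of layout detection is clustering the one-dimensional coordinates of the text boxes: x-positions that pile up around the same value indicate a column, y-positions indicate a row. pdftabextract ships its own clustering helpers; the following is only a simplified, self-contained sketch of the idea, not the library's code:

```python
def cluster_positions(positions, gap=10.0):
    """Group 1D coordinates into clusters wherever the distance between
    sorted neighbours exceeds `gap`. Each cluster's mean then marks a
    likely column position (for x-coordinates) or row position (for
    y-coordinates)."""
    pos = sorted(positions)
    clusters = [[pos[0]]]
    for p in pos[1:]:
        if p - clusters[-1][-1] > gap:
            clusters.append([p])   # gap too large: start a new cluster
        else:
            clusters[-1].append(p)
    return [sum(c) / len(c) for c in clusters]

# Made-up x-coordinates of text boxes on a page with three columns
xs = [50, 52, 49, 200, 203, 350, 351, 349]
print(cluster_positions(xs))  # three cluster centers near x ≈ 50, 201, 350
```

The `gap` parameter plays the role of the "break distance" that has to be chosen per document, depending on resolution and column spacing.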
Data Mining OCR PDFs — Getting Things Straight
The first article of my series about extracting tabular data from PDFs focused on rather simple cases; cases that allowed us to convert the PDFs to plain text documents and parse the extracted text line by line. We also learned from the first article that the only information that we can access in PDFs is the textual data that is distributed across the pages in the form of individual text boxes, which have properties like a position, width, height and the actual content (text). There’s usually no information stored about rows/columns or other table-like structures.
Now in the next two articles I want to focus on rather complicated documents: PDFs that have complex table structures or are even scans of documents that were processed via Optical Character Recognition (OCR). Such documents are often “messy” — someone scanned hundreds of pages and of course sometimes the pages are sloped or skewed and the margins differ. It is mostly impossible to extract structured information from such a messy data source by just converting the PDF to a plain text document as described in the previous article. Hence we must use the attributes of the OCR-processed text boxes (such as the texts’ positions) to recognize patterns in them from which we might infer the table layout.
So the basic goal is to analyse the text boxes and their properties, especially their positions in the form of the distribution of their x- and y-coordinates on the page, and see whether we can construct a table layout from that, so that we can “fit” the text boxes into the calculated table cells. This is something that I’ll explain in the third article of this series. But before we can do that, we need to clarify some prerequisites, which I’ll do in this article:
- What should we pay attention to when using OCR?
- How can we extract the text boxes and their properties from a PDF document?
- How can we display and inspect the text boxes?
- How can we straighten skewed pages?
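On the second question — getting at the text boxes — a common route is to convert the PDF to the pdf2xml format (e.g. with poppler's `pdftohtml -xml`, which is also what pdf2xml-viewer consumes), where each text box appears as a `<text>` element with position attributes. A minimal sketch of reading such a file with the standard library, using a made-up snippet as input:

```python
from xml.etree import ElementTree as ET

# A tiny, made-up snippet in the pdf2xml format: each <text> element
# carries the box position and size as attributes.
sample = """
<pdf2xml>
  <page number="1" width="612" height="792">
    <text top="100" left="50" width="40" height="12" font="0">School</text>
    <text top="100" left="200" width="30" height="12" font="0">Score</text>
    <text top="120" left="50" width="60" height="12" font="0">Example</text>
  </page>
</pdf2xml>
"""

def read_textboxes(xml_str):
    """Collect text boxes with their positions from a pdf2xml document.

    For brevity this reads only the element's direct text and ignores
    nested formatting tags like <b> or <i> that can occur in real files.
    """
    boxes = []
    for page in ET.fromstring(xml_str).iter("page"):
        for t in page.iter("text"):
            boxes.append({
                "page": int(page.attrib["number"]),
                "left": int(t.attrib["left"]),
                "top": int(t.attrib["top"]),
                "width": int(t.attrib["width"]),
                "height": int(t.attrib["height"]),
                "text": (t.text or "").strip(),
            })
    return boxes

boxes = read_textboxes(sample)
```

With the boxes in hand, equal `top` values hint at rows and recurring `left` values at columns — the raw material for the layout analysis in the next article.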
Data Mining PDFs — The simple cases
Extracting data from PDFs can be a laborious task. When you only want to extract all text from a PDF and don’t care about which text is a headline or a paragraph or how text boxes relate to each other, you won’t have many headaches with PDFs, because this is quite straightforward to achieve. But if you want to extract structured information (especially tabular data) it really gets cumbersome, because unlike many other document formats, PDFs usually don’t carry any information about row-column relationships, even if it looks like you have a table in front of you when you open a PDF document. From a technical point of view, the only information we usually have in PDFs comes in the form of text boxes, which have some attributes like:
- position in relation to the page
- width and height
- font attributes (font family, size, etc.)
- the actual content (text) of the text box
So there’s no information in the document like “this text is in row 3, column 5” of a table. All we have is the above attributes, from which we might infer a cell position in a table. In a short series of blog posts I want to explain how this can be done. In this first post I will focus on the “simple cases” of data extraction from PDFs, meaning cases where we can extract tabular information without the need to calculate the table cells from the individual text box positions. In the upcoming posts I will explain how to handle the harder cases of PDFs: so-called “sandwich” documents, i.e. PDFs that contain the scanned pages from some document together with “hidden” text from optical character recognition (OCR) of the scanned pages.
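For these simple cases, a plain-text conversion that preserves the layout (e.g. poppler's `pdftotext -layout`) often leaves the columns separated by runs of spaces, so the table can be parsed line by line. A minimal sketch with made-up data (the school names and numbers are invented for illustration):

```python
import re

# Hypothetical excerpt of a PDF converted to layout-preserving plain
# text: columns are separated by runs of two or more spaces.
plain = """\
School              City        Pupils
A-Gymnasium         Berlin         820
B-Oberschule        Potsdam        455
"""

def parse_table(text):
    """Split each non-empty line on runs of 2+ spaces and zip the body
    rows with the header row. This is enough for the 'simple cases'
    where the plain-text layout already preserves the columns."""
    rows = [re.split(r"\s{2,}", line.strip())
            for line in text.splitlines() if line.strip()]
    header, *body = rows
    return [dict(zip(header, row)) for row in body]

records = parse_table(plain)
```

This breaks as soon as cell values themselves contain double spaces or columns are only visually aligned per page — exactly the point where the position-based approach of the later posts takes over.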