October | 2018 | WZB Data Science Blog

Checkboxes and crosses: data mining PDFs with the help of image processing

From time to time, I work with “open data” published by public authorities. Often, these data do not deserve the label “open data” and this is mainly because they are provided as PDF files. PDFs are not machine readable, at least not without lot of programming work. I don’t know if this way of publishing data is done on purpose (because authorities are requested to publish open data but they do not want it to be actually analyzed in large scale) or if it is sheer ignorance.

For a recent project I came across a particular nasty type of PDFs: Scores from a school inspection are listed in a large table where each score is marked with a cross (see a full PDF for such a school inspection):

While most data can be extracted from PDF by converting them to a plain text representation, this is not possible for such PDFs. This is because the most important information, the scores, is not existent in the plain text representation of the PDF. The crosses that mark the score are essentially vector-graphics embedded in the PDF. In this article I will explain how to extract such information.

Monthly Archives: October 2018

Checkboxes and crosses: data mining PDFs with the help of image processing

Linkdump #94

R

Python

Interesting articles, projects and news

Recent posts

Categories

Links

Links

Recent Posts

Recent Comments

Archives

Categories

Meta