Author Archives: Markus Konrad

Linkdump #96

R
Python
Interesting articles, projects and news

Linkdump #95

R
Python
Interesting articles, projects and news

Checkboxes and crosses: data mining PDFs with the help of image processing

From time to time, I work with “open data” published by public authorities. Often, these data do not deserve the label “open data”, mainly because they are provided as PDF files. PDFs are not machine-readable, at least not without a lot of programming work. I don’t know whether this way of publishing data is chosen on purpose (because authorities are required to publish open data but do not want it to be analyzed on a large scale) or whether it is sheer ignorance.

For a recent project I came across a particularly nasty type of PDF: scores from a school inspection are listed in a large table where each score is marked with a cross (see a full PDF for such a school inspection).

While most data can be extracted from PDFs by converting them to a plain text representation, this is not possible for such PDFs. The most important information, the scores, does not exist in the plain text representation of the PDF: the crosses that mark the scores are essentially vector graphics embedded in the PDF. In this article I will explain how to extract such information.
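
To give a rough idea of what an image-based approach to such crosses can look like, here is a minimal sketch. It is not necessarily the method from the full article. It assumes the PDF page has already been rendered to a grayscale image (here simply a list of pixel rows with values from 0 to 255) and that the pixel boxes of the candidate table cells are already known. The names `ink_fraction` and `marked_cell` and both threshold values are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence, Tuple

# A cell box in pixel coordinates: (top, left, bottom, right),
# with bottom and right exclusive. This layout is an assumption.
Box = Tuple[int, int, int, int]


def ink_fraction(image: Sequence[Sequence[int]], box: Box,
                 dark_below: int = 128) -> float:
    """Return the share of pixels inside `box` darker than `dark_below`."""
    top, left, bottom, right = box
    total = (bottom - top) * (right - left)
    if total <= 0:
        return 0.0
    dark = sum(
        1
        for y in range(top, bottom)
        for x in range(left, right)
        if image[y][x] < dark_below
    )
    return dark / total


def marked_cell(image: Sequence[Sequence[int]], cells: Sequence[Box],
                min_fraction: float = 0.1) -> Optional[int]:
    """Return the index of the cell with the most ink, or None if no
    cell reaches `min_fraction` (i.e. the row seems to have no cross)."""
    if not cells:
        return None
    fractions = [ink_fraction(image, cell) for cell in cells]
    best = max(range(len(cells)), key=fractions.__getitem__)
    return best if fractions[best] >= min_fraction else None


# Synthetic example: one table row with three 6x6 cells on a white page,
# with a cross drawn into the middle cell.
WHITE, BLACK = 255, 0
page = [[WHITE] * 18 for _ in range(6)]
for i in range(6):
    page[i][6 + i] = BLACK        # diagonal from top-left to bottom-right
    page[i][6 + 5 - i] = BLACK    # diagonal from top-right to bottom-left

cells = [(0, 0, 6, 6), (0, 6, 6, 12), (0, 12, 6, 18)]
score_column = marked_cell(page, cells)  # index of the crossed cell
```

With real scans or rendered pages, the thresholds would have to be tuned, and the cell boxes would first have to be found, for example by detecting the table lines.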


Linkdump #94

R
Python
Interesting articles, projects and news

Linkdump #93

R
Python
Interesting articles, projects and news

Linkdump #92

R
Python
Interesting articles, projects and news

Linkdump #91

R
Python
Interesting articles, projects and news

Linkdump #90

R
Python
Interesting articles, projects and news

Linkdump #89

R
Python
Interesting articles, projects and news

Tools and packages for geospatial processing with Python

In the social sciences, geospatial data appears quite often. You may have social indicators for different places on earth at different administrative levels, e.g. countries, states or municipalities. Or you may study the spatial distribution of hospitals or schools in a given area, or visualize GPS-referenced data from an experiment. Fortunately, there is a rich supply of open-source tools and packages for such scenarios. As I have recently worked quite a lot with geospatial data, I want to introduce some of this software, especially the tools available for the Python programming language.
