I wanted to share a small lab report on a project about the development of school sites in eastern Germany since 1992. Rita Nikolai (HU Berlin), Marcel Helbig (WZB) and I published our results a few months ago (see this WZB Discussion Paper or this WZBrief), but I’d like to provide some additional information on the (technical) background in this post as this was not the aim of the mentioned papers.
Checkboxes and crosses: data mining PDFs with the help of image processing
From time to time, I work with “open data” published by public authorities. Often, these data do not deserve the label “open data” and this is mainly because they are provided as PDF files. PDFs are not machine readable, at least not without lot of programming work. I don’t know if this way of publishing data is done on purpose (because authorities are requested to publish open data but they do not want it to be actually analyzed in large scale) or if it is sheer ignorance.
For a recent project I came across a particular nasty type of PDFs: Scores from a school inspection are listed in a large table where each score is marked with a cross (see a full PDF for such a school inspection):
While most data can be extracted from PDF by converting them to a plain text representation, this is not possible for such PDFs. This is because the most important information, the scores, is not existent in the plain text representation of the PDF. The crosses that mark the score are essentially vector-graphics embedded in the PDF. In this article I will explain how to extract such information.
Tools and packages for geospatial processing with Python
In the social sciences, geospatial data appears quite often. You may have social indicators for different places on earth at different administrative levels, e.g. countries, states or municipalities. Or you may study spatial distribution of hospitals or schools in a given area, or visualize GPS referenced data from an experiment. For such scenarios, there’s fortunately a rich supply of open-source tools and packages. As I’ve worked recently quite a lot with geospatial data, I want to introduce some of this software, especially those available for the Python programming language.
Creating and plotting Voronoi regions for geographic data with geovoronoi
Recently, I’ve worked a lot with geospatial data in Python. One thing that we needed for our analysis was generating Voronoi regions (or “cells”) from a given set of coordinates inside certain administrative boundaries (a country, a state, etc.). Such regions are interesting for spatial analysis, because each random point inside a Voronoi region is closest to the cell’s “origin point” (the point the cell was generated from) than to any other cell’s origin. As a practical example: In Melbourne parents can see which is the closest school for their home, by looking at an online map of Voronoi regions of schools.
These regions allow to calculate an estimate of a “coverage”: For each point’s Voronoi region, the area can be calculated, which represents the area theoretically covered by this point. Referring to the Melbourne example: The schools at the edge of the city cover a larger area than those in the city center. This approach of course does not take geographic properties into account. So if there’s a large lake inside a cell, it is also part of the covered area. Still, Voronoi tessellation is useful when looking at how the shape of the Voronoi regions changes over time, for example when new schools open or others close. We could then see for example, if the coverage of schools in the city center becomes better over the years, whereas in the rural areas it gets more sparse.
So all in all, Voronoi regions can be a very useful tool in spatial data analysis. QGIS provides a tool for Voronoi tessellation but we needed a more flexible approach that also fit into our workflow and could be used in our Python scripts. I decided to write a small Python package named geovoronoi that takes a set of points, a boundary object (the geographic shape enclosing the points – e.g. a country boundary) and then calculates the Voronoi regions using SciPy. These regions are then “cut” to the enclosing shape (using the excellent shapely package). The resulting Voronoi cells can then be used for further calculations (areas, distances, unions, etc.) and can also be visualized on a map.
The package geovoronoi is now available on PyPI (install it with pip install geovoronoi[plotting]
) and the source is uploaded on the WZB’s GitHub page.
Vectorization and parallelization in Python with NumPy and Pandas
Modern computers are equipped with processors that allow fast parallel computation at several levels: Vector or array operations, which allow to execute similar operations simultaneously on a bunch of data, and parallel computing, which allows to distribute data chunks on several CPU cores and process them in parallel. When working with large amounts of data, it is important to know how to exploit these features because this can reduce computation time drastically. Taking advantage of this usually requires some extra effort during implementation. With packages like NumPy and Python’s multiprocessing
module the additional work is manageable and usually pays off when compared to the enormous waiting time that you may need when doing large-scale calculations inefficiently.
Web scraping with automated browsers using Selenium
Web scraping, i.e. automated data mining from websites, usually involves fetching a web page’s HTML document, parsing it, extracting the required information, and optionally follow links within this document to other web pages to repeat this process. This approach is sufficient for many websites that display information in a static way, i.e. do not respond to user interaction dynamically by the means of JavaScript. In these cases, web scraping can be implemented with Python packages such as requests and BeautifulSoup. Even interactive elements such as forms can be emulated by observing the HTTP POST and GET data that is send to the server, whenever a form is submitted. However, this approach has limits. Sometimes, it is necessary to automate a whole browser in order to implement web scraping on JavaScript-heavy websites as will be shown with a short example in this post.
Topic Model Evaluation in Python with tmtoolkit
Topic modeling is a method for finding abstract topics in a large collection of documents. With it, it is possible to discover the mixture of hidden or “latent” topics that varies from document to document in a given corpus. As an unsupervised machine learning approach, topic models are not easy to evaluate since there is no labelled “ground truth” data to compare with. However, since topic modeling typically requires defining some parameters beforehand (first and foremost the number of topics k to be discovered), model evaluation is crucial in order to find an “optimal” set of parameters for the given data.
Several metrics exist for this task and some of them will be covered in this post. Furthermore, as calculating many models on a large text corpus is a computationally intensive task, I introduce the Python package tmtoolkit which allows to utilize all availabel CPU cores in your machine by computing and evaluating the models in parallel.
Geocoding an address and performing point-polygon tests with GDAL/OGR in Python
Suppose you have a list of addresses and want to connect them with some kind of location-based information. For example, your addresses might scatter across several neighborhoods and you want to find out to which neighborhood each address belongs, because you have further information (like mean income, percentage of migrants, etc.) about each neighborhood and want to combine it with your data. In many countries, administrative authorities gather such geographical information and provide the data on their websites.
In the given scenario, three steps are necessary in order to combine the addresses with geographical information:
- Geocoding the address, i.e. finding out the geographical coordinates (latitude, longitude) for this address
- Given a file with geographical information (GIS data) that form several distinct areas as polygons, finding out which of these polygons contains the geocoded address
- Obtain necessary information such as a neighborhood identifier from the polygon
This short post shows how to do that with the Python packages googlemaps and GDAL.