I wanted to share a small lab report on a project about the development of school sites in eastern Germany since 1992. Rita Nikolai (HU Berlin), Marcel Helbig (WZB) and I published our results a few months ago (see this WZB Discussion Paper or this WZBrief), but I’d like to provide some additional information on the (technical) background in this post as this was not the aim of the mentioned papers.
After the German reunification, school policy in eastern Germany had a lot of challenges to master since the birth rate decreased massively and many East Germans left the region. The dense network of public schools, which was a relic of the GDR, was thinned out. At the same time, many private schools were founded. Some claim that private schools now contribute substantially to school site coverage in rural areas (e.g. German education report 2016, p.7). In order to verify this claim, we had a look at historical data from 1992 to 2015 to analyze the development of school sites in all eastern German federal states except for Berlin.
Obtaining and preparing the data
The technically most challenging part was obtaining, preparing and merging the data from the five eastern German federal states, especially for the data from the 90s. The ultimate goal for data preparation was to obtain a list of schools with their name, full address and school form for each year and each federal state. The address could then be used for geocoding, i.e. to obtain geographic coordinates which are the foundation for further analysis on geographic density of the school network.
Education policy lies in the domain of federal states in Germany and hence each state had to be queried separately in order to obtain the relevant information. Furthermore, each of those states would provide the data in different formats, ranging from paper booklets to PDFs and Excel files. Those formats also varied in each state for the different years.
Paper booklets were of course the hardest to process. They had to be scanned and OCR software was applied to generate “sandwich PDFs” that contained the scanned pages and the detected text. From some of the PDFs, the tabular data could be extracted with the help of a parser script tailored to the specific table layout (regular expressions were used extensively). For other PDFs the table layouts were more complex or the quality of the OCR was not good enough to use text parsing in order to extract the tabular data. Here, my Python package pdftabextract came in handy.
After the schools’ addresses could be extracted, the data was ready for geocoding. We chose the Google Maps Geocoding API for that, because it (1) provides an API client as Python package, and (2) seems to provide good results also for partial addresses or slightly misspelled addresses.
The latter was especially important because of the sometimes problematic text recognition quality from the scanned documents. No OCR software is 100% correct, so some names and more crucially some addresses of schools would be misspelled. Another issue was that postal codes and their areas changed quite often in our region and period of the study. Hence, these postal codes were unreliable and so could not be used as part of the query during geocoding. Especially for the early 90s, there was another problem: Many street names in eastern Germany were changed after the reunification (e.g. those honoring former GDR presidents like Wilhelm Pieck) and since Google’s geocoding service does not use historical data these addresses could not be geocoded correctly.
We used a semi-automated approach to handle these issues. First, the geocoding was done using the street name, house number, city name and – to avoid problems with duplicate addresses in other federal states – the name of the state. I created a small wrapper package for the googlemaps package which implements result caching. For those addresses for which geocoding could determine a coordinate, automated validity checks were applied which identified coordinates outside of the state’s boundaries. For addresses with incorrect or unknown coordinates, manual research had to be done. Luckily, we had only a few dozens of these. Additionally, a simple interactive map was created also for identifying false geocoding coordinates. This map would later evolve into a full interactive website, which I will present later on.
In order to check if a coordinate is inside a federal state’s border and also in order to show these borders later, it’s necessary to obtain the shape of the border as vector data. I found it surprisingly difficult to find up-to-date and exact border shapes. One idea was to extract administrative borders from Open Street Map, however it was quite difficult to process the large amount of complex geographical data. For example, I could extract the administrative boundaries of a federal state, but unfortunately it would always be the borders including the sea and not only the landmass. Luckily, I found the OSM Admin Boundaries Map website which provides an extremely handy interface to extract administrative boundaries of all levels worldwide using Open Street Map data.
As already explained, the educational landscape in Germany is quite complex due to federal states’ sovereignty about that topic. Each state may have different school forms and in order to be able to compare results across different federal states, those school forms were classified into three coarse categories: Primary schools, secondary schools on which an university-entrance diploma can be obtained (“gymnasiale Oberstufe”) and secondary schools on which such a diploma can’t be obtained. The challenge here was that school forms also changed over the years because of education reforms having been passed in the federal states.
Covering five eastern German states, 23 years and three school categories, the final dataset contains more than 130,000 entries.
Clusters and Voronoi regions
Once all addresses were geocoded, the coordinates can be used for calculations. One interesting measure for us was the distance of private schools to their nearest public school. While it would be possible to get the driving distance between two places using the Google Maps Directions API, it is impractical to do so for the thousands of possible address pairs. Instead, we used the great-circle distance which is a good approximation given the dense road network in Germany. Using a fast vectorized version of the haversine formula which I already presented in this blog post, the computations could be done quite quickly.
We were also interested in clusters of school sites in order to see if private schools tend to be located in such clusters (i.e. urban areas). We used hierarchical clustering with distance criterion (implemented in SciPy). This distance criterion was different per school category and describes the maximum pairwise distance within a cluster. Google’s API delivers WGS84 coordinates, which are geographic (longitude / latitude) coordinates, but not suitable for mapping. Hence the coordinates have to be transformed to a suitable coordinate reference system (CRS) for our area. In our case this means the data was transformed to the ETRS89 / LCC Germany CRS which is suitable for Germany. Applying such transformations can be done with the package pyproj. The clusters could then be plotted using matplotlib with the boundary shape of the respective state. An example is shown below. Each cluster is denoted by an ID and several dots with similar color around that ID show the individual school sites of that cluster. On this map, many schools form their own single-unit cluster because Mecklenburg-Vorpommern is quite sparsely populated and many schools were closed by 2015.
Another measure that we were interested in was the “coverage area” of each school. We turned to Voronoi regions as an approximation which gives the theoretical area (not taking into account geographical features such as lakes, forests, etc.) covered by each school site. See also my blog post on Voronoi regions for geo-data for more information. This project motivated me to create the geovoronoi package for Python which allows to construct Voronoi regions from geographic sites within some shape (e.g. administrative borders of a state). The following figure shows Voronoi regions for the same data as in the cluster example (secondary schools in Mecklenburg-Vorpommern 2015):
Website
The online map which was used initially as a quick way to evaluate the geocoding results turned more and more into a very useful interactive tool for research as more “layers” for data display were added: (1) showing distances between schools, when clicking on a site; (2) displaying the school clusters; and (3) representing the Voronoi regions. Eventually this led to the companion website schulenkarte.wzb.eu that allows to explore visually what is presented in the research papers.
The website is backed by an SQL database that contains all information on school sites, their coordinates, the clusters they form and the shape of the Voronoi region around them. A minimalistic REST webservice was built with Flask which is queried by the actual frontend. This frontend only consists of an HTML page with some JavaScript using the excellent leaflet library for rendering and handling the interactive map. A sub-page shows several figures which are referred to in the paper and were generated with R and ggplot2.
Future research
For now, we concentrated on gathering and merging the data, doing exploratory analysis, drawing descriptive conclusions and retracing historical developments since the German reunification. Further research may combine this rich dataset with other sources for example in order to examine effects of different legislature in different federal states, or effects of demographic developments.
It would also be interesting to be able to retrace the opening and/or closing of individual schools over time. Unfortunately, most data sources that were available to us did not have an identifier for each school which would allow to track it over time. Both the address and the name are not reliable enough to be used as such an identifier (street names or postal codes change, school names change or are misspelled due to OCR, etc.). To solve this issue, an approach could be implemented which combines string distances between addresses or names and geographic distances in order to track schools over time (i.e. two school entries in year Y and Y+1 must be geographically close and their names or addresses must be similar).
It would also be important to add new data since 2016 and make yearly updates in the future so that the development of school sites can be further monitored.
The full data is available for download on the website schulenkarte.wzb.eu.
Recent Comments