In the social sciences, geospatial data appears quite often. You may have social indicators for different places on earth at different administrative levels, e.g. countries, states or municipalities. Or you may study spatial distribution of hospitals or schools in a given area, or visualize GPS referenced data from an experiment. For such scenarios, there’s fortunately a rich supply of open-source tools and packages. As I’ve worked recently quite a lot with geospatial data, I want to introduce some of this software, especially those available for the Python programming language.
Important file formats
Before we start, I want to introduce some important file formats. In the geographic information system (GIS) world, there’s a myriad of file formats, for example ESRI shapefiles, GeoJSON, KML, etc. In databases, geographic shapes are often stored in WKT or WKB format. The most important formats for our scenarios are ESRI shapefiles (.shp) and GeoJSON files (.GeoJson or .json). File conversion between those formats is possible with the ogr2ogr tool.
Getting boundary shapes and other geospatial data
You may need different kinds of geographic entities for your analysis: 1) certain spots or locations that can be represented as a single geographic coordinate (an address, a hospital, etc.), 2) routes or streets which can be represented as a series of coordinates or 3) shapes such as administrative borders that can be represented as polygons of geographic coordinates.
Even in the time of Google Maps, OpenStreetMap (OSM) and so on, I found it surprisingly difficult to obtain some of these data. Finding the geographic coordinates of addresses is the most straight-forward: This process is called geocoding and can for example be done via Google Maps APIs with the googlemaps package in Python. With its Places API it’s also possible to query the locations of certain points of interest such as hospitals or schools within a given area. Such things can also be achieved with OSM, but its Overpass API seems way more complex. At least there is a Python package overpy that wraps the API calls.
The Python package osmnx is designed for retrieving and analyzing street networks. It obtains data from OSM. With this package it’s also possible to retrieve administrative boundary shapes. However, for administrative boundaries that have a sea shore, it would always return the sea border, i.e. some kilometers away from the shore. I couldn’t find a way to retrieve only the landmass border, neither with this package nor with OSM alone.
Fortunately, there’s the OSM Boundaries map which provides an export of all administrative boundaries in OSM from country level (level 2) down to municipality and sometimes even neighborhood level (level 10) with the option to provide only the landmass border. In order to download a selected set of boundaries, it is necessary to login with an OSM account and deselect the CLI option at the bottom. Via CLI (a special web address sent to the server) it is also possible to download a complete range of administrative boundaries for example for a given country. See the documentation. For the examples that follow below, I downloaded the administrative boundaries of level 4 (federal states) for three states in Germany: Mecklenburg-Vorpommern, Brandenburg and Berlin in GeoJSON format.
Loading data with fiona
The package fiona allows for reading and writing of a big variety of GIS file formats. The API follows the standard Python IO style. For opening the Mecklenburg-Vorpommern GeoJSON file in read-only mode and with automatic file type detection we can simply write:
import fiona
mv_collection = fiona.open('Mecklenburg-Vorpommern_AL4.GeoJson')
Each file is thought to be a collection of features (the geographic entities). These have common properties, such as the driver (file format), CRS (coordinate reference system) and schema (the template of properties and shape type for each geographic entity):
>>> mv_collection.driver
'GeoJSON'
>>> mv_collection.crs
{'init': 'epsg:4326'}
>>> mv_collection.schema
{'properties': OrderedDict([('srid', 'str'),
('id', 'str'),
('name', 'str'),
('localname', 'str'),
...
('alltags', 'str')]),
'geometry': 'MultiPolygon'}
There can be several features in a collection, for example if we decided to export the administrative boundaries of several states at once into a single file with the OSM Boundaries map. One can iterate over these features with a for
loop. However, in this case we only have a single feature (the boundary of Mecklenburg-Vorpommern) in the file, so we can only take this feature:
>>> len(list(mv_collection)) # only a single feature in this file:
1
>>> feature = list(mv_collection)[0]
>>> feature
{'type': 'Feature',
'id': '0',
'geometry': {
'type': 'MultiPolygon',
'coordinates': [[[(14.2638222, 53.7046631),
(14.2636663, 53.7045455),
(14.2635121, 53.7044352),
...]]]},
'properties': OrderedDict([('srid', '4326'),
('id', '28322'),
('name', 'Mecklenburg-Vorpommern'),
...])}
We see that a feature contains a type, an ID, some properties like the name (from OSM) and most importantly a geometry structure that denotes the type of geometry (a MultiPolygon, i.e. a set of polygons that form the whole shape) and holds a list of coordinates for each polygon.
Plotting with matplotlib and descartes
Let’s display this feature. We can use Python’s standard plotting package matplotlib in conjunction with descartes which enables plotting geographic shapes. However, matplotlib is made for plotting graphs and needs to be tuned a little bit for displaying a geographic shape on a map. We do so by setting an equal aspect ratio to prevent skewing and also disable displaying the x- and y-axis. Let’s write a function since we’ll need this several times:
import matplotlib.pyplot as plt
from descartes import PolygonPatch
def fig_ax_for_map():
"""
Return a matplotlib Figure and Axes object setup
for displaying a map.
"""
fig, ax = plt.subplots(dpi=180)
ax.set_aspect('equal')
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
return fig, ax
Additionally, we will need to specify the limits (i.e. range) of the x- and y-axis. If not, matplotlib will use a default range of [0, 1] for both axes and hence any shape whose coordinates are outside of this range will be clipped or completely invisible. For example, when we look at the coordinates in the “geometry” structure of our feature, we’ll see that they are between 10 and 15 for the x-coordinates (this is the longitude in WGS84 coordinates) and 53 to 55 for the y-coordinates (WGS84 latitude). If we’d simply try to plot this, we’d get an empty canvas, because these coordinates are way off from matplotlib’s default axes’ limits.
One way to resolve this, is to find the minimum and maximum values of longitude and latitude coordinates and then set the axes’ limits to this range. To do so, we need to define some functions:
def flatten_list(l):
return sum(l, [])
def recursively_flatten_list(l):
if l and isinstance(l[0], list):
return flatten_list([recursively_flatten_list(m) for m in l])
else:
return l
def limits_of_values(v, buffer=0.0):
return min(v) - buffer, max(v) + buffer
The first two are for recursively flattening a nested list (because the coordinates in our feature’s geometry structure are a nested list). The last one creates the limits for a list of coordinates and optionally allows to specify a buffer
so that we can create small margins to the plot border.
Now we can create the first plot:
fig, ax = fig_ax_for_map()
all_geom_coords = recursively_flatten_list(feature['geometry']['coordinates'])
lngs, lats = zip(*all_geom_coords)
lng_range = limits_of_values(lngs, buffer=0.1)
lat_range = limits_of_values(lats, buffer=0.1)
ax.set_xlim(lng_range)
ax.set_ylim(lat_range)
poly_patch = PolygonPatch(feature['geometry'], fc='blue', ec='black', alpha=0.5)
ax.add_patch(poly_patch)
plt.show()
And the result looks like this:
Transforming shapes to other projections with pyproj
Now you may not know the shape of Mecklenburg-Vorpommern, but this looks a bit arkward. The reason is that the coordinates were projected with the WGS84 coordinate reference system (CRS). At the beginning, calling mv_collection.crs
showed us the somewhat cryptic message {'init': 'epsg:4326'}
. This refers to the EPSG code 4326 which in turn refers to the already mentioned WGS84 coordinate system. This coordinate system is not well suited for a cartographic projection on a smaller scale far from the equator, such as federal states in Germany. Hence we should re-project the coordinates with a CRS that better fits our needs. The EPSG website can be helpful to find coordinate systems suited for our area of interest such as the ETRS89 / LCC Germany coordinate system, which has the EPSG code 5243.
Now that we know which CRS we can use, we need to transform the coordinates of our Mecklenburg-Vorpommern borders. We need two packages for this: 1) pyproj, which is a Python interface to PROJ.4, a ” powerful generic coordinate transformation engine” and 2) shapely, a package for “manipulation and analysis of geometric objects in the Cartesian plane”, which I’ll cover later in more detail.
A transformation to a different projection can be implemented like this with pyproj and shapely:
from pyproj import Proj
from shapely.ops import transform
from shapely.geometry import shape
ger_crs = 'epsg:5243'
ger_project = Proj(init=ger_crs)
feat_shape = shape(feature['geometry'])
feat_shape_epsg5243 = transform(ger_project, feat_shape)
At first, we define our target projection, here via EPSG code as 'epsg:5243'
. Then we create a projection function ger_project
. A shapely shape object named feat_shape
is formed by passing the geometry information of our feature to the shape
function. Finally, we use the transform
function which applies the new projection to each coordinate of feat_shape
. Our projected shape is now stored in feat_shape_epsg5243
an object of type MultiPolygon
. Again, we should plot this shape, but this time we’ll implement a short function for later use:
def plot_shape(ax, s, buffer=3000):
"""Plot a shape `s` on matplotlib Axes `ax`"""
minx, miny, maxx, maxy = s.bounds
ax.set_xlim((minx-buffer, maxx+buffer))
ax.set_ylim((miny-buffer, maxy+buffer))
poly_patch = PolygonPatch(s, fc='blue', ec='black', alpha=0.5)
ax.add_patch(poly_patch)
Please note two things: First, the default buffer (i.e. margin around the object) is set to 3000, which seems quite large. This is because with the change to the ETRS89 projection, we also changed the coordinate units. While WGS84’s unit is degrees, “ETRS89 / LCC Germany” uses meters. Hence, we define a default margin of 3km to the plotting borders.
The second thing to note is that somewhat cumbersome calculation of x- and y-axis limits is simplified. This is because we can exploit one of the many useful attributes of shapely objects, their “bounds”. This gives us the extent of the shape, and hence the minimum/maximum of its coordinates, directly.
Now we can use our function:
fig, ax = fig_ax_for_map()
plot_shape(ax, feat_shape_epsg5243)
plt.show()
Now this really looks like Mecklenburg-Vorpommern (as I’m typing, I’m already regretting that I chose a state with such a long name as an example)! We could also extend the above function to plot several shapes, each shape with a different color, maybe depending on some variable like GDP. By this, we could essentially create a choropleth plot.
Calculations and shape manipulations with shapely
Once you converted your geographic entity to a shapely object, you can apply a wide range of transformations and calculations to it. The excellent manual gives a good overview of what can be done. We already got to know the bounds
attribute that returns the extent of the object. Another interesting attribute might be area
which is defined for objects such as polygons. With this, we can find out the area of Mecklenburg-Vorpommern’s landmass. However, at first we need to make sure that our coordinates have a suitable unit. This is the case now, because we converted the shape to the “ETRS89 / LCC Germany” CRS which uses meters, as we’ve found out before. Now to calculate the area in km², we can do the following:
>>> mv_area_km2 = feat_shape_epsg5243.area / 1000**2
>>> mv_area_km2
23104.905300735827
This is about the same area as reported on Wikipedia.
Another interesting method is distance
, which for any two shapes returns the minimum distance between them (e.g. the shortest line between two polygons). Again, it is very important here, that you converted the shapes to a suitable CRS before. It is especially important that for all operations, all shapes use the same CRS.
Predicates and relationships are also very useful. With this, we can find out for example if one shape contains or crosses another shape. A practical example might be that you want to link GPS data, that you obtained from an experiment, with geographic areas like municipalities or city streets.
Set-theoretic operations come in handy when you need to combine or cut shapes. As a practical example, let’s load the boundary data for Berlin and Brandenburg and combine both to form a “Berlin-Brandenburg” shape:
# Berlin
# (we assume that we only have a single feature per file)
bln_collection = fiona.open('Berlin_AL4.GeoJson')
bln_feature = list(bln_collection)[0]
bln_shape = shape(bln_feature['geometry'])
bln_shape = transform(ger_project, bln_shape)
# Brandenburg
# (we assume that we only have a single feature per file)
bb_collection = fiona.open('Brandenburg_AL4.GeoJson')
bb_feature = list(bb_collection)[0]
bb_shape = shape(bb_feature['geometry'])
bb_shape = transform(ger_project, bb_shape)
We projected both to a common CRS (the “ETRS89 / LCC Germany” CRS as before). Plotting them individually would look like this (at the same scale):
Combining both can be done with the union operation:
bbbln_shape = bb_shape.union(bln_shape)
fig, ax = fig_ax_for_map()
plot_shape(ax, bbbln_shape)
plt.show()
Saving data with fiona
Finally, we can use the fiona package again to save the result to a file. This time, some more things must be considered: 1) We need to specify which file type we want — this time let’s use an ESRI shapefile; 2) we must specify a schema (the format of the metadata and the geometry type) and a CRS; and 3) we must convert our shapely object back to a geometry dictionary that fiona will write to the file. Shapely’s mapping
function is made for this. The full file export can be implemented as follows:
from shapely.geometry import mapping
schema = {
'geometry': 'MultiPolygon',
'properties': {'name': 'str'}
}
with fiona.open('berlin_brandenburg.shp', 'w', 'ESRI Shapefile',
schema=schema, crs=ger_crs)\
as output_collection:
output_collection.write({
'geometry': mapping(bbbln_shape),
'properties': {'name': 'Berlin und Brandenburg'}
})
The result is a shapefile that contains our merged Berlin-Brandenburg. This file can now be loaded for example into QGIS.
Other packages
There are many more packages in the Python world for geospatial data. PySAL, for example, is a collection of advanced spatial analysis methods. geopandas is a convenience wrapper around the above mentioned packages that allows to link observations with geospatial data in a special pandas dataframe. It’s good to start with as it has a convenient programming interface, but also has many limitations so that you will sooner or later end up with the more flexible tools described in this post.
Recent Comments