Suppose you have a list of addresses and want to connect them with some kind of location-based information. For example, your addresses might scatter across several neighborhoods and you want to find out to which neighborhood each address belongs, because you have further information (like mean income, percentage of migrants, etc.) about each neighborhood and want to combine it with your data. In many countries, administrative authorities gather such geographical information and provide the data on their websites.
In the given scenario, three steps are necessary in order to combine the addresses with geographical information:
- Geocoding the address, i.e. finding out the geographical coordinates (latitude, longitude) for this address
- Given a file with geographical information (GIS data) that form several distinct areas as polygons, finding out which of these polygons contains the geocoded address
- Obtain necessary information such as a neighborhood identifier from the polygon
This short post shows how to do that with the Python packages googlemaps and GDAL.
1. Geocoding
Accurate geocoding can be done using the googlemaps package which interfaces with Google’s Maps API. You will need to obtain an API key in order to use the service. See the documentation on how to do that.
After you obtained an API key and installed the package, it is quite straight forward to geocode an address. Let’s try it out with the address of the WZB:
import googlemaps
gmaps = googlemaps.Client(key=API_KEY) # you need to provide your API key here
geocode_results = gmaps.geocode('Reichpietschufer 50 Berlin')
The returned data structure is quite big, here is a snipped of it:
[{'address_components': [{'long_name': '50',
'short_name': '50',
'types': ['street_number']},
{'long_name': 'Reichpietschufer',
'short_name': 'Reichpietschufer',
'types': ['route']},
...
{'long_name': '10785',
'short_name': '10785',
'types': ['postal_code']}],
'formatted_address': 'Reichpietschufer 50, 10785 Berlin, Germany',
'geometry': {'location': {'lat': 52.5063971, 'lng': 13.3650865},
'location_type': 'ROOFTOP',
'viewport': {'northeast': {'lat': 52.5077460802915,
'lng': 13.3664354802915},
'southwest': {'lat': 52.5050481197085,
'lng': 13.3637375197085}}},
'place_id': 'ChIJmUmG5rRRqEcRMqws5hB5mTU',
'types': ['street_address']}]
The most important thing is that the geocoder always returns a list of results and when the search address is ambiguous, we might get several results. Since the results are ordered from higher to lower certainty, it’s usually safe to fetch only the first result. In our case, we are also only interested in the geographic coordinates, which are inside the geometry
→ location
structure:
coord = geocode_results[0]['geometry']['location']
coord
>>> {'lat': 52.5063971, 'lng': 13.3650865}
This is already it in terms of geocoding! It seems very simple but you should consider the following potential pitfalls:
- You might get inaccurate/wrong results, especially when there are typos in the addresses, missing postal codes, ambiguous addresses, etc. You should always check your results 1) automatically for plausibility (e.g. when working with data from a specific country, check that the geocoordinates are all within a plausible range for that country) and 2) manually by checking a sample of your data;
- Something might go wrong during the API request. The internet connection might be broken, the server might not respond, etc. In such cases, an Exception will be raised. You should wrap your request in a try-except block and log the event for later investigation;
- Of course, with a real data set we would need to geocode many addresses. Please be aware that the Google API service employs rate limiting and will deny further requests for 24h once you cross the limit of currently 2,500 requests per day. See this table for more information.
2. Point-in-polygon tests with GDAL/OGR
Suppose that you now want to combine your addresses with further regional information, e.g. with data about their respective neighborhood, community, or other. Sometimes, it’s sufficient to use information in the address_components
structure from Google’s API response because this structure already contains information to which area an address belongs at different administrative levels:
[{'long_name': '50', 'short_name': '50', 'types': ['street_number']},
{'long_name': 'Reichpietschufer',
'short_name': 'Reichpietschufer',
'types': ['route']},
{'long_name': 'Mitte',
'short_name': 'Mitte',
'types': ['political', 'sublocality', 'sublocality_level_1']},
{'long_name': 'Berlin',
'short_name': 'Berlin',
'types': ['locality', 'political']},
{'long_name': 'Berlin',
'short_name': 'Berlin',
'types': ['administrative_area_level_1', 'political']},
{'long_name': 'Germany',
'short_name': 'DE',
'types': ['country', 'political']},
{'long_name': '10785', 'short_name': '10785', 'types': ['postal_code']}]
We can see here that this address is located in the state of Berlin, and there in the district of Mitte. However, this information might not be accurate enough for your purposes. For example, data on neighborhoods is usually much more fine-grained. Geographic information of this scale is often provided by administrative authorities and might be available for download in different GIS formats such as KML, KMZ, GML, GeoJSON or Shapefile (.shp). Such files can be processed in Python with the GDAL package. GDAL stands for Geospatial Data Abstraction Library – a comprehensive software library for handling a wide range of geospatial data formats. We can use it to process the above mentioned GIS formats.
If you want to install the Python GDAL package system-wide on a Linux OS, you can simply install python-gdal
or python3-gdal
via apt or other package managers. Installing it in a Python Virtual Environment (venv) is recommended but a little trickier: At first, install the libraries libgdal
and libgdal-dev
. Find out which package version is available system-wide, e.g. by running apt show python3-gdal
. On Ubuntu 16.04 this gives GDAL version 2.1.3. Then set the correct header file paths in your terminal shell:
export CPLUS_INCLUDE_PATH=/usr/include/gdal
export C_INCLUDE_PATH=/usr/include/gdal
Finally, install the correct package version in your venv via pip: pip install GDAL==2.1.3
.
Let’s continue by using a GIS file which is provided by the Berlin Senate Department for Urban Development and Housing. We will work with something called Lebensweltlich orientierte Räume (LOR), which are hierarchical spaces reaching from 60 larger areas to fine-grained 447 neighborhoods in Berlin. Each of these areas (LORs) can be associated with a rich set of data, for example from the Berlin Open Data website. But in order to do so, we at first need to identify the LOR number of our addresses. We can do so by processing a GIS file which contains the LOR areas with their associated LOR number and checking which LOR area contains our given address, i.e. our already obtained geo-coordinates. The senate department provides a GIS file in KMZ format on their website. It’s important to note that this file uses the WGS 84 coordinate system, which is the same coordinate system that the Google API uses for geocoding. If our GIS file used another coordinate system than WGS 84, we would need to convert the coordinates before using them, otherwise any further processing wouldn’t work!
Before processing the file with Python, we should have a look at it using a GIS application such as QGIS, which is an open source tool available for most OS. By using the “browse panel” on the left, you can simply double click on a KMZ file in order to display it. Here we can see the projected areas of the 447 LOR neighborhoods of Berlin in the file “Planungsraum_FL.kmz”:
Another useful tool of QGIS is the “Identify Features” button on the top. Features are the geometries in your GIS file and with this tool you can select individual geometries and see their attributes. In our case, this means that we can select the polygons of the neighborhood areas. On the right side, a panel will open that shows the attributes of this area. The only important attribute for us is “Name” which includes the LOR code:
Now that we have identified the attribute that we will need to extract, we can start to work with the KMZ file in Python. At first, we import the required GDAL module, load a driver for handling KML/KMZ files and then we open our KMZ file:
from osgeo import ogr
driver = ogr.GetDriverByName('LIBKML')
source = driver.Open('Planungsraum_FL.kmz')
source
>>> prints something like "<osgeo.ogr.DataSource; proxy of <Swig Object ...> >"
Each GIS file organizes its geometries in layers. Let’s see how many layers are inside this file and select a layer to work with:
source.GetLayerCount()
>>> 1
layer = source.GetLayer(0) # select the first (and only) layer
layer
>>> prints something like "<osgeo.ogr.Layer; proxy of <Swig Object ...> >"
We opened the file with the 447 neighborhood areas. Let’s check that we really have 447 “features” (~ areas) in that layer:
layer.GetFeatureCount()
>>> 447
So this seems right! Let’s have a look inside a specific feature:
f1 = layer.GetFeature(1) # here, the first feature is not 0 but 1 (don't know why...)
f1.items() # items() returns all attributes of a feature
>>> {'Name': 'Flaeche 01011101',
'altitudeMode': 'relativeToGround',
...
'visibility': -1}
We can see that for a given feature we can extract it’s LOR by using the “Name” attribute. Now the crucial part is, how do we decide if an address, i.e. its geocoded point, is inside an area, i.e. a feature’s geometry? Luckily, GDAL provides a function for that, namely the Contains() method of a Geometry object. It tests if one geometry (in our case: a neighborhood area) contains another geometry (a coordinate of an address). In order to use it, we need to convert our coordinate obtained from the Google API to a Geometry object first:
pt = ogr.Geometry(ogr.wkbPoint)
pt.AddPoint(coord['lng'], coord['lat']) # watch out: first longitude, then latitude here!
We can use it as follows:
f1.geometry().Contains(pt)
>>> False
This returns False because the WZB address is not contained in the first LOR area that we picked from the 447 features. Of course we need to check all areas for the point. So we basically have all the information now in order to achieve our goal. What’s left is a loop that does the checks for a given point and returns the LOR if the point is contained in a feature’s geometry:
def lor_for_pt(layer, pt):
for i in range(1, layer.GetFeatureCount()+1):
feature = layer.GetFeature(i)
if feature and feature.geometry() and feature.geometry().Contains(pt):
name = feature.items().get('Name', '')
if len(name) >= 8:
return name[-8:]
return None
lor_for_pt(layer, pt)
>>> '01011105'
The LOR Code ‘01011105’ for the WZB address seems right since the hierarchical area names associated to this code are 01-Mitte, 01-Zentrum, 11-Tiergarten Süd, 05-Nördl. Landwehrkanal.
Please note that this quickly implemented method is a “brute force” approach. When you have many addresses and/or many geometries in a GIS file, the process of checking all addresses with all available geometries might get very slow. You may improve performance by:
- harnessing geographic hierarchies if available
- first calculating the centroid of all feature geometries; then for each point calculating the distances to the centroids; sort by ascending distance and then do the point-in-polygon checks
- use computation on all available CPUs via multiprocessing.
More things that can be done with GDAL are explained in the GDAL / OGR cookbook.
Recent Comments