Author Archives: Markus Konrad

Linkdump #52

R
Python
Interesting articles, projects and news

Geocoding an address and performing point-polygon tests with GDAL/OGR in Python

Suppose you have a list of addresses and want to connect them with some kind of location-based information. For example, your addresses might be scattered across several neighborhoods and you want to find out which neighborhood each address belongs to, because you have further information (like mean income, percentage of migrants, etc.) about each neighborhood and want to combine it with your data. In many countries, administrative authorities gather such geographical information and provide the data on their websites.

In the given scenario, three steps are necessary in order to combine the addresses with geographical information:

  1. Geocoding the address, i.e. finding out the geographical coordinates (latitude, longitude) for this address
  2. Given a file with geographical information (GIS data) that defines several distinct areas as polygons, finding out which of these polygons contains the geocoded address
  3. Obtaining the necessary information, such as a neighborhood identifier, from that polygon

This short post shows how to do that with the Python packages googlemaps and GDAL.
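
To make steps 2 and 3 more concrete, here is a minimal sketch in plain Python. It is not the GDAL-based code from the post: it swaps in a hand-written ray-casting test so that it runs without GDAL or a Google API key (with OGR, a geometry containment test would take its place). The neighborhood identifiers, the polygon coordinates and the function names `point_in_polygon` and `find_neighborhood` are all hypothetical, and the point is assumed to be geocoded already (step 1).

```python
# Hypothetical neighborhoods: identifier -> polygon ring as (lon, lat) pairs.
# In practice these would be read from a GIS file; the values are made up.
NEIGHBORHOODS = {
    "A": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "B": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0)],
}


def point_in_polygon(lon, lat, ring):
    """Ray-casting test: return True if (lon, lat) lies inside the ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]  # wrap around to close the ring
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > lat) != (y2 > lat):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside


def find_neighborhood(lon, lat, areas=NEIGHBORHOODS):
    """Return the identifier of the first area containing the point, or None."""
    for area_id, ring in areas.items():
        if point_in_polygon(lon, lat, ring):
            return area_id
    return None
```

Calling `find_neighborhood(lon, lat)` with the coordinates of a geocoded address then yields the identifier that can be joined with the neighborhood-level data. Points that lie exactly on a shared border are an edge case this sketch does not treat specially.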


Linkdump #51

R
Python
Interesting articles, projects and news

Linkdump #50

R
Interesting articles, projects and news

Linkdump #49

R
Python
Interesting articles, projects and news

Slides on Text Preprocessing and Feature Extraction for Quantitative Text Analysis

I recently gave a small workshop on Text Preprocessing and Feature Extraction for Quantitative Text Analysis with Python at the WZB. In the first part, we discussed different methods for normalizing, parsing and filtering the raw input text, such as tokenization, Part-of-Speech tagging, stemming and lemmatization. The second part focused on feature extraction, explaining the Bag-of-Words model and the tf-idf approach as prominent examples. Both are the foundation for many text analysis algorithms used in text classification, topic modeling or clustering. The slides emphasize the importance of these processing steps, which come before the actual text analysis algorithms are applied, because: garbage in, garbage out.
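
As a rough illustration of the two feature extraction approaches, here is a minimal sketch in plain Python. It is not taken from the slides. It assumes naive whitespace tokenization and uses one common tf-idf variant (relative term frequency times `log(N / df)`); other weighting schemes exist. The function names `tokenize`, `bag_of_words` and `tfidf` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenization: lowercase and split on whitespace (an assumption)."""
    return text.lower().split()


def bag_of_words(docs):
    """Return the sorted vocabulary and a document-term count matrix."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    counts = [Counter(tokenize(doc)) for doc in docs]
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def tfidf(matrix):
    """Weight each count by relative term frequency times log(N / df)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in matrix:
        total = sum(row)  # number of tokens in this document
        weighted.append([
            (row[j] / total) * math.log(n_docs / df[j]) if total else 0.0
            for j in range(n_terms)
        ])
    return weighted
```

With the two toy documents `"the cat sat"` and `"the dog sat down"`, terms that occur in both documents ("the", "sat") receive a weight of zero under this variant, while terms unique to one document keep a positive weight.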

The explanations on the slides are quite detailed, so I thought putting them online might be informative for others. So here we go:

Slides for Text Processing and Feature Extraction for Quantitative Text Analysis (WZB Python User Group Workshop)

I can recommend the following supplementary resources:

Linkdump #48

R
Python
Interesting articles, projects and news

Linkdump #47

R
Python
Interesting articles, projects and news

LATINNO Database online

This week the LATINNO project published its comprehensive database on democratic innovations in South and Latin America on its official website. 2,400 cases of these innovations have been collected, coded and reviewed and are now publicly available. They can be browsed with the online search tool. Several interactive visualizations have been created to summarize the data.

As reported before, this project, which I have also been working on in recent months, was created with the Django framework using the hvad extension for multilingual support. The visualizations were implemented with d3.js.

LATINNO is an ongoing project, and more cases of innovations are expected to be added to the database in the coming months.

Linkdump #46

R
Python
Interesting articles, projects and news