Markus Konrad | WZB Data Science Blog

Geocoding an address and performing point-polygon tests with GDAL/OGR in Python

Suppose you have a list of addresses and want to connect them with some kind of location-based information. For example, your addresses might be scattered across several neighborhoods and you want to find out which neighborhood each address belongs to, because you have further information about each neighborhood (like mean income, percentage of migrants, etc.) and want to combine it with your data. In many countries, administrative authorities gather such geographical information and provide the data on their websites.

In the given scenario, three steps are necessary in order to combine the addresses with geographical information:

Geocoding the address, i.e. finding out the geographical coordinates (latitude, longitude) for this address
Given a file with geographical information (GIS data) that describes several distinct areas as polygons, finding out which of these polygons contains the geocoded address
Obtaining the necessary information, such as a neighborhood identifier, from that polygon

This short post shows how to do that with the Python packages googlemaps and GDAL.
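To make the three steps concrete, here is a minimal sketch of the logic in plain Python. It is not the googlemaps/GDAL code from the post: the geocoding result is hard-coded (a dict shaped like the `lat`/`lng` location that a geocoding service might return), the two rectangular neighborhoods and their identifiers are made up, and the point-in-polygon test is a simple ray-casting routine swapped in so that the example runs without any dependencies. All names (`point_in_polygon`, `find_neighborhood`, `NEIGHBORHOODS`) are hypothetical.

```python
def point_in_polygon(lon, lat, ring):
    """Ray casting: cast a ray east from the point and count edge crossings.

    `ring` is a list of (lon, lat) vertices; an odd number of crossings
    means the point lies inside the polygon.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical GIS data: neighborhood identifier -> outer ring of the polygon.
NEIGHBORHOODS = {
    "north": [(13.30, 52.50), (13.40, 52.50), (13.40, 52.60), (13.30, 52.60)],
    "south": [(13.30, 52.40), (13.40, 52.40), (13.40, 52.50), (13.30, 52.50)],
}


def find_neighborhood(lon, lat, neighborhoods=NEIGHBORHOODS):
    """Steps 2 and 3: find the containing polygon and return its identifier."""
    for name, ring in neighborhoods.items():
        if point_in_polygon(lon, lat, ring):
            return name
    return None  # the address lies outside all known areas


# Step 1 (assumed): coordinates as a geocoder might report them for an address.
geocoded = {"lat": 52.55, "lng": 13.35}
result = find_neighborhood(geocoded["lng"], geocoded["lat"])
print(result)
```

With real GIS data, the polygons would be read from a file and the containment check would be left to the geometry library (in OGR, a geometry's `Contains` method) rather than a hand-written routine, which also takes care of holes and multi-part polygons.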

Slides on Text Preprocessing and Feature Extraction for Quantitative Text Analysis

I recently gave a small workshop on Text Preprocessing and Feature Extraction for Quantitative Text Analysis with Python at the WZB. The first part covers different methods for normalizing, parsing and filtering the raw input text, such as tokenization, Part-of-Speech tagging, stemming and lemmatization. The second part focuses on feature extraction, explaining the Bag-of-Words model and the tf-idf approach as prominent examples. Both are the foundation for many text analysis algorithms used in text classification, topic modeling or clustering. The slides emphasize the importance of the processing steps that come before the actual text analysis algorithms are applied, because: garbage in, garbage out.
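As a rough illustration of the two feature extraction ideas mentioned above, the following sketch builds a Bag-of-Words representation and tf-idf weights in plain Python. It is not taken from the slides: the tokenizer is a bare lowercase-and-split stand-in for real preprocessing, the three toy documents are invented, and the weighting uses one common variant (relative term frequency times the natural log of the inverse document frequency); other variants exist.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real tokenization."""
    return text.lower().split()


def bag_of_words(docs):
    """Represent each document as a mapping of term -> count."""
    return [Counter(tokenize(doc)) for doc in docs]


def tfidf(docs):
    """Weight each term by its frequency in a document and its rarity overall."""
    bows = bag_of_words(docs)
    n_docs = len(bows)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for bow in bows:
        df.update(bow.keys())
    weights = []
    for bow in bows:
        total = sum(bow.values())
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in bow.items()
        })
    return weights


# Invented toy corpus.
docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tfidf(docs)
for doc, w in zip(docs, weights):
    print(doc, w)
```

A term like "the" that occurs in every document gets a weight of zero, while a term that occurs in only one document is weighted highest, which is the intuition behind tf-idf.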

The explanations on the slides are quite detailed, so I thought putting them online might be informative for others. So here we go:

Slides for Text Processing and Feature Extraction for Quantitative Text Analysis (WZB Python User Group Workshop)

I can recommend the following supplementary resources:

the free NLTK book (focused on English texts, also gives introduction to working with Python, quite linguistics-heavy)
D. Sarkar, Text Analytics with Python (Apress, 2016): a good overview of many different algorithms and models that also gives an introduction to working with Python; the source code examples are often unnecessarily complicated for beginners (triple nested list comprehensions!)
Gensim Tutorials

LATINNO Database online

This week the LATINNO project published its comprehensive database on democratic innovations in South and Latin America on its official website. A total of 2,400 cases of these innovations have been collected, coded and reviewed and are now publicly available. They can be browsed with the online search tool. Several interactive visualizations have been created to summarize the data.

As reported before, this project, which I have also been working on in recent months, was built with the Django framework, using the hvad extension for multilingual support. The visualizations were implemented with d3.js.

LATINNO is an ongoing project and more cases of innovations are expected to be added to the database in the coming months.

Author Archives: Markus Konrad
