Vectorization and parallelization in Python with NumPy and Pandas

Modern computers are equipped with processors that allow fast parallel computation at several levels: vector or array operations, which apply the same operation simultaneously to many data elements, and parallel computing, which distributes chunks of data across several CPU cores and processes them in parallel. When working with large amounts of data, it is important to know how to exploit these features because they can reduce computation time drastically. Taking advantage of this usually requires some extra effort during implementation. With packages like NumPy and Python’s multiprocessing module, the additional work is manageable and usually pays off compared to the enormous waiting times you may face when doing large-scale calculations inefficiently.
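As a minimal sketch of both ideas (with made-up data, not an example from the post): element-wise multiplication of two arrays as a vectorized NumPy operation, and a sum of squares distributed over four worker processes with multiprocessing.

```python
import numpy as np
from multiprocessing import Pool

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

# slow: a pure Python loop over the individual elements
slow = [x * y for x, y in zip(a, b)]

# fast: one vectorized array operation that NumPy executes in compiled code
fast = a * b

def chunk_sum_sq(chunk):
    # runs in a worker process on its own chunk of the data
    return float(np.sum(chunk ** 2))

if __name__ == '__main__':
    chunks = np.array_split(a, 4)       # one chunk per worker process
    with Pool(4) as pool:
        partial_sums = pool.map(chunk_sum_sq, chunks)
    total = sum(partial_sums)
```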

Read More →

Slides on Topic Modeling – Background, Hyperparameters and common pitfalls

I just uploaded my slides on probabilistic Topic Modeling with LDA, which give an overview of the theory, the basic assumptions and prerequisites of LDA, and some notes on common pitfalls that often occur when trying out this method for the first time. Furthermore, I added a Jupyter Notebook that contains a toy implementation of the Gibbs sampling algorithm for LDA, with lots of comments and plots to illustrate each step of the algorithm.
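The core of such a toy implementation is the collapsed Gibbs sampling update. As a rough sketch of the idea (simplified, with hypothetical variable names; this is not the notebook’s actual code):

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, n_iter=200, alpha=0.1, beta=0.01):
    """Toy collapsed Gibbs sampler for LDA; `docs` is a list of lists of word IDs."""
    n_dk = np.zeros((len(docs), n_topics))   # document-topic counts
    n_kw = np.zeros((n_topics, vocab_size))  # topic-word counts
    n_k = np.zeros(n_topics)                 # total tokens assigned to each topic
    z = []                                   # topic assignment of each token

    # random initialization of the topic assignments
    for d, doc in enumerate(docs):
        z_d = np.random.randint(n_topics, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the current assignment from the counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # full conditional distribution over topics for this token
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = np.random.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_dk, n_kw

# toy corpus: two documents as lists of vocabulary indices
n_dk, n_kw = lda_gibbs([[0, 1, 2, 1], [2, 3, 3, 0]], n_topics=2, vocab_size=4)
```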

Web scraping with automated browsers using Selenium

Web scraping, i.e. automated data mining from websites, usually involves fetching a web page’s HTML document, parsing it, extracting the required information, and optionally following links within this document to other web pages to repeat the process. This approach is sufficient for many websites that display information in a static way, i.e. that do not respond to user interaction dynamically by means of JavaScript. In these cases, web scraping can be implemented with Python packages such as requests and BeautifulSoup. Even interactive elements such as forms can be emulated by observing the HTTP POST and GET data that is sent to the server whenever a form is submitted. However, this approach has limits. Sometimes it is necessary to automate a whole browser in order to scrape JavaScript-heavy websites, as will be shown with a short example in this post.
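A minimal sketch of the browser automation approach with Selenium (placeholder URL and CSS selectors; the driver binary for the chosen browser must be installed):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# start an automated Firefox instance (requires the geckodriver binary)
browser = webdriver.Firefox()
browser.get('https://example.com/results')   # placeholder URL

# the browser executes the page's JavaScript, so dynamically generated
# elements end up in the DOM and can be queried like static content
for elem in browser.find_elements(By.CSS_SELECTOR, 'div.result'):
    print(elem.text)

# user interaction can be automated as well, e.g.:
# browser.find_element(By.ID, 'load-more').click()

browser.quit()
```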

Read More →

Topic Model Evaluation in Python with tmtoolkit

Topic modeling is a method for finding abstract topics in a large collection of documents. With it, it is possible to discover the mixture of hidden or “latent” topics that varies from document to document in a given corpus. As an unsupervised machine learning approach, topic models are not easy to evaluate since there is no labelled “ground truth” data to compare with. However, since topic modeling typically requires defining some parameters beforehand (first and foremost the number of topics k to be discovered), model evaluation is crucial in order to find an “optimal” set of parameters for the given data.

Several metrics exist for this task and some of them will be covered in this post. Furthermore, as calculating many models on a large text corpus is a computationally intensive task, I introduce the Python package tmtoolkit, which makes it possible to utilize all available CPU cores in your machine by computing and evaluating the models in parallel.
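A sketch of how such a parallel evaluation might look with tmtoolkit (using a random toy document-term matrix; check tmtoolkit’s documentation for the exact interface, which has changed across versions):

```python
import numpy as np
from tmtoolkit.topicmod.tm_lda import evaluate_topic_models
from tmtoolkit.topicmod.evaluate import results_by_parameter

# toy document-term matrix; in practice this comes from your preprocessed corpus
dtm = np.random.randint(0, 10, size=(100, 500))

# candidate numbers of topics; one model per parameter set, computed in parallel
varying_params = [{'n_topics': k} for k in range(20, 101, 20)]
constant_params = {'n_iter': 500}

eval_results = evaluate_topic_models(dtm,
                                     varying_parameters=varying_params,
                                     constant_parameters=constant_params,
                                     metric='cao_juan_2009')

# pair each number of topics with its metric value to spot an optimum
results_by_k = results_by_parameter(eval_results, 'n_topics')
```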

Read More →

Geocoding an address and performing point-polygon tests with GDAL/OGR in Python

Suppose you have a list of addresses and want to connect them with some kind of location-based information. For example, your addresses might be scattered across several neighborhoods and you want to find out which neighborhood each address belongs to, because you have further information (like mean income, percentage of migrants, etc.) about each neighborhood and want to combine it with your data. In many countries, administrative authorities gather such geographical information and provide the data on their websites.

In the given scenario, three steps are necessary in order to combine the addresses with geographical information:

  1. Geocoding the address, i.e. finding out the geographical coordinates (latitude, longitude) for this address
  2. Given a file with geographical information (GIS data) that form several distinct areas as polygons, finding out which of these polygons contains the geocoded address
  3. Obtaining the necessary information, such as a neighborhood identifier, from that polygon

This short post shows how to do that with the Python packages googlemaps and GDAL.
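A condensed sketch of all three steps (with a placeholder API key, file name and attribute field; the address is only an example):

```python
import googlemaps
from osgeo import ogr

# step 1: geocode the address (placeholder API key)
gmaps = googlemaps.Client(key='YOUR_API_KEY')
result = gmaps.geocode('Reichpietschufer 50, 10785 Berlin')
loc = result[0]['geometry']['location']
lat, lng = loc['lat'], loc['lng']

# step 2: load the polygons from a GIS file and test which one contains the point
point = ogr.Geometry(ogr.wkbPoint)
point.AddPoint(lng, lat)   # note: OGR expects x = longitude, y = latitude

datasource = ogr.Open('neighborhoods.geojson')   # placeholder file name
layer = datasource.GetLayer()
for feature in layer:
    if feature.GetGeometryRef().Contains(point):
        # step 3: read an attribute from the matching polygon (hypothetical field)
        print(feature.GetField('neighborhood_id'))
        break
```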

Read More →

Slides on Text Preprocessing and Feature Extraction for Quantitative Text Analysis

I’ve recently given a small workshop on Text Preprocessing and Feature Extraction for Quantitative Text Analysis with Python at the WZB. In the first part, we discussed different methods for normalizing, parsing and filtering the raw input text, such as tokenization, Part-of-Speech tagging, stemming and lemmatization. The second part focused on feature extraction, explaining the Bag-of-Words model and the tf-idf approach as prominent examples. Both are the foundation for many text analysis algorithms used in text classification, topic modeling or clustering. The slides emphasize the importance of these processing steps that come before the actual text analysis algorithms are applied, because: garbage in, garbage out.
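To make the two parts concrete, here is a small sketch of such a pipeline with NLTK and scikit-learn (the toy corpus and the choice of scikit-learn are mine, not necessarily what the slides use):

```python
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

# requires: nltk.download('punkt'); nltk.download('wordnet')
corpus = ['The cats sat on the mat.',
          'A cat chased the mice.',
          'Mice hid under the mat.']

lemmatizer = nltk.stem.WordNetLemmatizer()

def preprocess(doc):
    # normalization: lowercase, tokenize, keep alphabetic tokens, lemmatize
    tokens = nltk.word_tokenize(doc.lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t.isalpha()]

# feature extraction: tf-idf weighted Bag-of-Words matrix (documents x terms)
vectorizer = TfidfVectorizer(tokenizer=preprocess, lowercase=False)
tfidf = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
```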

The explanations on the slides are quite detailed, so I thought putting them online might be informative for others as well. Here we go:

Slides for Text Processing and Feature Extraction for Quantitative Text Analysis (WZB Python User Group Workshop)


LATINNO Database online

This week, the LATINNO project published its comprehensive database on democratic innovations in Latin America on its official website. 2,400 cases of these innovations have been collected, coded and reviewed, and are now publicly available. They can be browsed with the online search tool, and several interactive visualizations have been created to summarize the data.

As reported before, this project, on which I have also been working in the last months, was created with the Django framework, using the hvad extension for multilingual support. The visualizations were implemented with d3.js.

LATINNO is an ongoing project and more cases of innovations are expected to be added to the database in the coming months.

Speeding up NLTK with parallel processing

When doing text processing with NLTK on large corpora, you often need a lot of patience, since even simple methods like word tokenization take quite some time when you’re processing a large amount of text data. This is because NLTK often does not harness the power of modern multicore computers: the code will run on a single core only, even if you have four processing cores in your machine. You will need to add parallel processing of your documents yourself. Fortunately, this is quite straightforward to implement with Python’s multiprocessing module, and I will show how to do this in this short post.
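The basic pattern is to let a pool of worker processes tokenize the documents in parallel. A minimal sketch (with a made-up corpus; NLTK’s punkt tokenizer models must be downloaded first):

```python
import nltk
from multiprocessing import Pool

# made-up corpus: one string per document
# requires: nltk.download('punkt')
documents = ['This is a sample document. It has two sentences.'] * 1000

def tokenize(doc):
    # executed in a worker process, one document at a time
    return nltk.word_tokenize(doc)

if __name__ == '__main__':
    # spawn one worker per CPU core and distribute the documents among them
    with Pool() as pool:
        tokenized_docs = pool.map(tokenize, documents)
```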

Read More →

Lemmatization of German language text

Lemmatization is the process of finding the base (or dictionary) form of a possibly inflected word — its lemma. It is similar to stemming, which tries to find the “root stem” of a word, but such a root stem is often not a lexicographically correct word, i.e. a word that can be found in dictionaries. A lemma is always a lexicographically correct word.

When using text mining models that depend on term frequency, such as Bag-of-Words or tf-idf, accurate lemmatization is often crucial, because you might not want to count the occurrences of the terms “book” and “books” separately; you might want to reduce “books” to its lemma “book” so that it is included in the term frequency of “book”.

For English, automatic lemmatization is supported in many Python packages, for example in NLTK (via WordNetLemmatizer) or spaCy. For German, however, I could only find the CLiPS pattern package, which is of limited use (e.g. it cannot handle declined nouns) and is not supported in Python 3. Using the annotated TIGER corpus of the University of Stuttgart, I will measure the accuracy of a lemmatizer based on the pattern.de module and present an improved lemmatizer that raises pattern.de’s accuracy by about 10%.
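For illustration, the difference between stemming and lemmatization can be seen with NLTK’s English tools (a small made-up example, not from the post):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

# requires: nltk.download('wordnet')
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ['books', 'studies', 'better']:
    print(word, '->', stemmer.stem(word), '/', lemmatizer.lemmatize(word))

# the stemmer may produce non-words ('studies' -> 'studi'), while the
# lemmatizer returns dictionary forms ('studies' -> 'study'); irregular
# forms need a POS hint: lemmatizer.lemmatize('better', pos='a') -> 'good'
```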

Read More →

Interactive Balloon Charts with d3-balloon

As explained before, balloon plots can be a good way to compare many observations with lots of variables. I have now created a small extension for d3.js v4 that makes it easy to implement interactive balloon charts. It is available on GitHub and can be seen live in action in a recent news article about segregation in schools on the WZB website.

Read More →