Some thoughts about the use of cloud services and web APIs in social science research

In the recent weeks I’ve collaborated on the online book APIs for social scientists and added two chapters: a chapter about the API and a chapter about the GitHub API. The book seeks to provide an overview about web or cloud services and their APIs that might be useful for social scientists and covers a wide range from text translation to accessing social media APIs complete with code examples in R. By harnessing the GitHub workflow model, the book itself is also a nice example of fruitful collaboration via work organization methods that were initially developed in the open source software community.

While working on the two chapters and playing around with the APIs, I once again noticed the double-edged nature of using web APIs in research. It can greatly improve research or even enable research that was not possible before. At the same time, data collected from these APIs can inject bias and the use of these APIs may cause issues with research transparency and replicability. I noted some of these issues in the respective book chapters and I’ve written about them before,[1]See this article in WZB Mitteilungen (only in German) together with Jonas Wiedner. but the two APIs that I covered for the book provide some very practical examples of the main issues when working with web APIs and I wanted to point them out in this blog post.

Read More →

Continuous Integration testing with GitHub Actions using tox and hypothesis

I recently published a major update for the Python tmtoolkit package for text mining and topic modeling. Since it is a fairly large research software package, I’m using a Continuous Integration (CI) system for automated testing on different platforms. This system makes sure that every code update that is pushed to the software repository is automatically checked by running the test suite on all three major operating systems (Linux, MacOS, Windows). For the recent update of tmtoolkit, I decided to move the CI system from Travis CI to GitHub Actions (GHA) since GHA is directly integrated into GitHub and easy to set up. Still, there are some obstacles to overcome so this short post shows how to set up GHA for a Python project with a few extra requirements such as installing system packages on the test runner machine or running tests with tox and hypothesis.

Read More →

Batch transfer GitLab projects with the GitLab API

This is a bit off-topic to be filed under DevOps / workflow automation but I still wanted to share it: We use GitLab at the WZB for collaborative software development and project management and I recently had to transfer all my GitLab projects to a GitLab group.[1]In case you don’t know GitLab: It’s similar to GitHub but open-source and you can install your own instance on your server so that all your data stays within your organization’s IT … Continue reading Since transferring a personal project to a group is not something that is done regularly, it’s quite hidden in the project settings and involves a lot of steps. Transferring a project manually with the GitLab web interface means visiting the project page, navigating to the “transfer project” pane in its advanced settings, selecting the group, clicking “Transfer group” and typing a confirmation string. Nobody want’s to do this manually with more than a handful of projects. Luckily GitLab comes with it’s own, well-documented REST API which can save us a lot of time by letting us automating such tedious tasks.

Read More →


1 In case you don’t know GitLab: It’s similar to GitHub but open-source and you can install your own instance on your server so that all your data stays within your organization’s IT realm. That’s better for data projection, customizability and you’re less dependent on the services of an external company.

Spatially weighted averages in R with sf

Spatial joins allow to augment one spatial dataset with information from another spatial dataset by linking overlapping features. In this post I will provide an example showing how to augment a dataset containing school locations with socioeconomic data of their surrounding statistical region using R and the package sf (Pebesma 2018). This approach has the drawback that the surrounding statistical region doesn’t reflect the actual catchment area of the school. I will present an alternative approach where the overlaps of the schools’ catchment areas with the statistical regions allow to calculate the weighted average of the socioeconomic statistics. If we have no data about the actual catchment areas of the schools, we may resort to approximating these areas as circular regions or as Voronoi regions around schools.

Read More →

Clustered standard errors with R

In many scenarios, data are structured in groups or clusters, e.g. pupils within classes (within schools), survey respondents within countries or, for longitudinal surveys, survey answers per subject. Simply ignoring this structure will likely lead to spuriously low standard errors, i.e. a misleadingly precise estimate of our coefficients. This in turn leads to overly-narrow confidence intervals, overly-low p-values and possibly wrong conclusions.

Clustered standard errors are a common way to deal with this problem. Unlike Stata, R doesn’t have built-in functionality to estimate clustered standard errors. There are several packages though that add this functionality and this article will introduce three of them, explaining how they can be used and what their advantages and disadvantages are. Before that, I will outline the theory behind (clustered) standard errors for linear regression. The last section is used for a performance comparison between the three presented packages. If you’re already familiar with the concept of clustered standard errors, you may skip to the hands-on part right away.

Read More →

Interactive visualization of geospatial data with R Shiny

As a supplement to a recently published study by Marcel Helbig and Katja Salomo (available only in German) about socioeconomic inequalities for children in seven German cities, I’ve created an interactive web visualization with R Shiny and I wanted to share a few experiences that I made during development. This will be mainly about interactive visualization of geospatial data and custom UI elements. Below is a link to an example showing social welfare support rate amongst children and several environmental characteristics in Saarbrucken.

Read More →

Simplifying geospatial features in R with sf and rmapshaper

When working with geospatial data, memory consumption and computation time can become quite a problem, since these datasets are often very large. You may have very granular, high resolution data although that isn’t really necessary for your use-case, for example when plotting small scale maps or when applying calculations at a spatial level for which lower granularity is sufficient. In such scenarios, it’s best to first simplify the geospatial features (the sets of points, lines and polygons that represent geographic entities) in your data. By simplify I mean that we generally reduce the amount of information that is used to represent these features, i.e. we remove complete features (e.g. small islands), we join features (e.g. we combine several overlapping or adjacent features) or we reduce the complexity of features (e.g. we remove vertices or holes in a feature). Since applying these operations comes with information loss you should be very careful about how much you simplify and if this in some way biases you results.

In R, we can apply this sort of simplification with a few functions from the packages sf and, for some special cases explained below, rmapshaper. In the following, I will show how to apply them and what the effects of their parameters are. The data and source code are available on GitHub.

Read More →

Robust web scraping or web API based data collection

There are thousands of articles on the web about web scraping and accessing web APIs. Most of them show you how to extract information from specific elements on a web page or how to communicate with a specific API in order to collect data. For smaller data collection projects, this knowledge may be sufficient, but large scale data collection which must run reliably over days or even weeks brings up additional problems that mainly focus on the robustness of the data collection process. I will try to tackle some of these problems in this post. I will use examples in Python, but the basic concepts can easily be translated to R or other programming languages.

Read More →

Spiegel Online news topics and COVID-19 – a topic modeling approach

I created a project to showcase topic modeling with the tmtoolkit Python package: I use a corpus of articles from the German online news website Spiegel Online (SPON) to create a topic model for before and during the COVID-19 pandemic. This topic model is then used to analyze the volume of media coverage regarding the pandemic and how it changed over time.

National daily infection numbers clearly drive the volume of media coverage on COVID-19 during the observation period (January 2020 to end of August 2020) on SPON, which is probably not very surprising. Even though infection rates increased dramatically in the world in summer 2020 (e.g. in Brazil, India and USA), media coverage first decreased and then stayed at a moderate level, indicating that SPON doesn’t respond so much to rising infection rates at an international level.

You can have a look at the report here. All scripts are available in the GitHub repository.

Using Google Places data to analyze changes in mobility during the COVID-19 pandemic

During the COVID-19 pandemic, it’s apparent that location data gathered by private IT companies and telcos is a primary source for many studies about the effect of mobility restrictions on people’s behaviors and movements. In this blog post, I’d like to have a look at the “popular times” data provided by Google Places. I explain the limitations of this data, show how to gather it and provide some results from data that I fetched during March and April.

Read More →