Author Archives: Markus Konrad

Spatially weighted averages in R with sf

Spatial joins allow us to augment one spatial dataset with information from another by linking overlapping features. In this post I provide an example showing how to augment a dataset of school locations with socioeconomic data from their surrounding statistical regions, using R and the package sf (Pebesma 2018). This approach has the drawback that the surrounding statistical region doesn't necessarily reflect a school's actual catchment area. I therefore present an alternative approach in which the overlaps between the schools' catchment areas and the statistical regions are used to calculate a weighted average of the socioeconomic statistics. If we have no data about the actual catchment areas of the schools, we may approximate them as circular regions or as Voronoi regions around the schools.
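
A minimal sketch of both approaches with sf. The object names (`schools` as point features with a `school_id` column, `regions` as polygons with a socioeconomic indicator `welfare_rate`), the 1 km buffer radius and the projected CRS in meters are assumptions for illustration, not the data used in the post:

```r
library(sf)

# simple approach: attach the statistic of the region each school falls into
schools_joined <- st_join(schools, regions["welfare_rate"])

# alternative: approximate catchment areas as 1 km circles around the schools,
# then compute the area-weighted average of the indicator over all statistical
# regions each circle overlaps
catchments <- st_buffer(schools, dist = 1000)    # assumes a projected CRS in meters
pieces <- st_intersection(catchments, regions)   # overlap polygons with attributes from both layers
pieces$w <- as.numeric(st_area(pieces))          # overlap areas serve as weights

# area-weighted mean of the indicator per school
agg <- aggregate(
  list(weighted_sum = pieces$welfare_rate * pieces$w, w = pieces$w),
  by = list(school_id = pieces$school_id),
  FUN = sum
)
agg$welfare_rate_wavg <- agg$weighted_sum / agg$w
```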

Read More →

Clustered standard errors with R

In many scenarios, data are structured in groups or clusters, e.g. pupils within classes (within schools), survey respondents within countries or, for longitudinal surveys, repeated answers from the same subject. Simply ignoring this structure will likely lead to spuriously low standard errors, i.e. a misleadingly precise estimate of our coefficients. This in turn leads to overly narrow confidence intervals, too-small p-values and possibly wrong conclusions.

Clustered standard errors are a common way to deal with this problem. Unlike Stata, R doesn't have built-in functionality to estimate clustered standard errors, but several packages add it. This article introduces three of them, explaining how they are used and what their advantages and disadvantages are. Before that, I outline the theory behind (clustered) standard errors for linear regression. The last section compares the performance of the three packages. If you're already familiar with the concept of clustered standard errors, you may skip right to the hands-on part.
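
To illustrate the basic idea (not necessarily with the packages covered in the post), the sandwich and lmtest packages can re-estimate a coefficient table with standard errors clustered by group. The data below are simulated purely for this sketch:

```r
library(sandwich)  # vcovCL(): cluster-robust covariance matrix estimators
library(lmtest)    # coeftest(): coefficient table with a custom vcov

set.seed(1)

# simulated example: 500 pupils in 20 classes, with class-level noise
d <- data.frame(class = rep(1:20, each = 25), hours = rnorm(500))
d$score <- 0.5 * d$hours + rnorm(20)[d$class] + rnorm(500)

m <- lm(score ~ hours, data = d)

coeftest(m)                                       # default i.i.d. standard errors
coeftest(m, vcov = vcovCL(m, cluster = ~ class))  # standard errors clustered by class
```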

Read More →

Interactive visualization of geospatial data with R Shiny

As a supplement to a recently published study by Marcel Helbig and Katja Salomo (available only in German) about socioeconomic inequalities for children in seven German cities, I've created an interactive web visualization with R Shiny, and I'd like to share a few lessons learned during development. This post is mainly about interactive visualization of geospatial data and custom UI elements. Below is a link to an example showing the social welfare support rate amongst children and several environmental characteristics in Saarbrücken.
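
As a rough idea of how such an app can be structured, here is a generic Shiny sketch, not the code of the published app: the leaflet package and a hypothetical sf polygon layer `districts` with a numeric column `welfare_rate` are assumptions for illustration:

```r
library(shiny)
library(leaflet)
library(sf)

ui <- fluidPage(
  titlePanel("Welfare support rate amongst children"),
  leafletOutput("map", height = 600)
)

server <- function(input, output, session) {
  output$map <- renderLeaflet({
    # color scale over the indicator and a choropleth layer with a legend
    pal <- colorNumeric("YlOrRd", districts$welfare_rate)
    leaflet(districts) %>%
      addTiles() %>%
      addPolygons(fillColor = ~pal(welfare_rate), fillOpacity = 0.7,
                  weight = 1, color = "#444444") %>%
      addLegend(pal = pal, values = ~welfare_rate, title = "Welfare rate")
  })
}

shinyApp(ui, server)
```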

Read More →

Simplifying geospatial features in R with sf and rmapshaper

When working with geospatial data, memory consumption and computation time can quickly become a problem, since these datasets are often very large. You may have very granular, high-resolution data even though that level of detail isn't necessary for your use case, for example when plotting small-scale maps or when running calculations at a spatial level for which lower granularity is sufficient. In such scenarios, it's best to first simplify the geospatial features (the sets of points, lines and polygons that represent geographic entities) in your data. By simplifying I mean reducing the amount of information used to represent these features: we remove complete features (e.g. small islands), join features (e.g. combine several overlapping or adjacent features) or reduce the complexity of individual features (e.g. remove vertices or holes). Since these operations come with a loss of information, you should be careful about how much you simplify and whether this in some way biases your results.

In R, we can apply this sort of simplification with a few functions from the packages sf and, for some special cases explained below, rmapshaper. In the following, I will show how to apply them and what the effects of their parameters are. The data and source code are available on GitHub.
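For a first impression, here is a sketch of the two central simplification functions, assuming a hypothetical sf object `regions` with detailed polygon boundaries; the tolerance and keep values are arbitrary examples:

```r
library(sf)
library(rmapshaper)

object.size(regions)  # size before simplification

# sf: Douglas-Peucker style simplification; dTolerance is given in units of
# the CRS (e.g. meters for a projected CRS)
regions_simple <- st_simplify(regions, preserveTopology = TRUE, dTolerance = 100)

# rmapshaper: Visvalingam simplification that keeps shared borders between
# adjacent polygons intact; `keep` is the share of vertices to retain
regions_ms <- ms_simplify(regions, keep = 0.05, keep_shapes = TRUE)

object.size(regions_simple)
object.size(regions_ms)
```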

Read More →

Linkdump #136

R
Python
Other interesting articles, projects and news

Robust web scraping or web API based data collection

There are thousands of articles on the web about web scraping and accessing web APIs. Most of them show how to extract information from specific elements on a web page or how to communicate with a specific API in order to collect data. For smaller data collection projects this knowledge may be sufficient, but large-scale data collection that must run reliably for days or even weeks raises additional problems, mainly concerning the robustness of the data collection process. I will try to tackle some of these problems in this post. The examples are in Python, but the basic concepts can easily be translated to R or other programming languages.

Read More →

Linkdump #135

R
Python
Other interesting articles, projects and news

Spiegel Online news topics and COVID-19 – a topic modeling approach

I created a project to showcase topic modeling with the tmtoolkit Python package: using a corpus of articles from the German online news website Spiegel Online (SPON), I build a topic model covering the time before and during the COVID-19 pandemic. This model is then used to analyze the volume of media coverage of the pandemic and how it changed over time.

National daily infection numbers clearly drive the volume of COVID-19 coverage on SPON during the observation period (January 2020 to the end of August 2020), which is probably not very surprising. Even though infection rates increased dramatically elsewhere in the world in summer 2020 (e.g. in Brazil, India and the USA), media coverage first decreased and then stayed at a moderate level, indicating that SPON responds much less to rising infection rates at the international level.

You can have a look at the report here. All scripts are available in the GitHub repository.

Linkdump #134

R
Python
Other interesting articles, projects and news

Linkdump #133

R
Python
Other interesting articles, projects and news