Category Archives: Data Mining

Some thoughts about the use of cloud services and web APIs in social science research

In recent weeks I’ve collaborated on the online book APIs for social scientists and added two chapters: one about the genderize.io API and one about the GitHub API. The book seeks to provide an overview of web and cloud services and their APIs that might be useful for social scientists, covering a wide range of topics from text translation to accessing social media APIs, complete with code examples in R. By harnessing the GitHub workflow model, the book itself is also a nice example of fruitful collaboration via work organization methods that were originally developed in the open source software community.

While working on the two chapters and playing around with the APIs, I once again noticed the double-edged nature of using web APIs in research. It can greatly improve research or even enable research that was not possible before. At the same time, data collected from these APIs can inject bias, and their use may cause issues with research transparency and replicability. I noted some of these issues in the respective book chapters and have written about them before (see this article in WZB Mitteilungen, only available in German, written together with Jonas Wiedner), but the two APIs that I covered for the book provide some very practical examples of the main issues when working with web APIs, so I wanted to point them out in this blog post.
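
For illustration, here is a minimal Python sketch of a single genderize.io request (using the public endpoint at https://api.genderize.io and the requests package; the name is just a placeholder and this is not code from the book chapter):

    # A minimal sketch of a genderize.io query, assuming the public endpoint
    # https://api.genderize.io (free tier, rate-limited, no API key required).
    import requests

    resp = requests.get("https://api.genderize.io", params={"name": "kim"}, timeout=10)
    resp.raise_for_status()
    data = resp.json()

    # The API returns a predicted gender, a probability and the number of records
    # behind the prediction; how that sample of records was collected is not
    # transparent, which is exactly the kind of bias/replicability issue noted above.
    print(data.get("gender"), data.get("probability"), data.get("count"))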

Read More →

Robust web scraping or web API based data collection

There are thousands of articles on the web about web scraping and accessing web APIs. Most of them show you how to extract information from specific elements on a web page or how to communicate with a specific API in order to collect data. For smaller data collection projects, this knowledge may be sufficient, but large-scale data collection that must run reliably over days or even weeks brings up additional problems, mainly concerning the robustness of the data collection process. I will try to tackle some of these problems in this post. I will use examples in Python, but the basic concepts can easily be translated to R or other programming languages.
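
As a taste of what this means in practice, here is a small Python sketch of one common building block (not a complete solution): an HTTP session with automatic retries and exponential backoff, so that transient network or server errors don’t immediately abort a long-running collection job. The URL is a placeholder.

    # Sketch: a requests session with automatic retries and backoff, so transient
    # failures (timeouts, 5xx responses, rate limits) don't abort a long job.
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    retry = Retry(
        total=5,                                    # up to 5 retries per request
        backoff_factor=1,                           # wait 1s, 2s, 4s, ... between attempts
        status_forcelist=[429, 500, 502, 503, 504]  # retry on these HTTP status codes
    )

    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))

    resp = session.get("https://example.org/api/data", timeout=30)  # placeholder URL
    resp.raise_for_status()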

Read More →

Spiegel Online news topics and COVID-19 – a topic modeling approach

I created a project to showcase topic modeling with the tmtoolkit Python package: I use a corpus of articles from the German online news website Spiegel Online (SPON) to create a topic model for the time before and during the COVID-19 pandemic. This topic model is then used to analyze the volume of media coverage of the pandemic and how it changed over time.
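
The project itself is built with tmtoolkit; purely to illustrate the general idea of fitting a topic model to a document-term matrix (this is not the project’s pipeline), here is a minimal LDA sketch with scikit-learn on a few made-up documents:

    # Minimal, generic LDA illustration with scikit-learn (not the tmtoolkit
    # pipeline used in the project). The documents are placeholders.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "corona virus infection numbers rise",
        "government debates new economic policy",
        "virus pandemic lockdown measures announced",
        "election campaign and party politics",
    ]

    vec = CountVectorizer()
    dtm = vec.fit_transform(docs)                  # document-term matrix

    lda = LatentDirichletAllocation(n_components=2, random_state=1)
    doc_topic = lda.fit_transform(dtm)             # per-document topic proportions

    # Show the top words per topic.
    terms = vec.get_feature_names_out()
    for i, comp in enumerate(lda.components_):
        top = [terms[j] for j in comp.argsort()[::-1][:4]]
        print(f"topic {i}:", ", ".join(top))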

National daily infection numbers clearly drive the volume of COVID-19 coverage on SPON during the observation period (January 2020 to the end of August 2020), which is probably not very surprising. Even though infection rates increased dramatically around the world in summer 2020 (e.g. in Brazil, India and the USA), media coverage first decreased and then stayed at a moderate level, indicating that SPON doesn’t respond as strongly to rising infection rates at the international level.

You can have a look at the report here. All scripts are available in the GitHub repository.

A Twitter network of members of the 19th German Bundestag – part II

This is the second part of my project on the Twitter network of members of the Bundestag. After getting the necessary data, which was explained in part 1, we will now focus on creating a network graph with links between the representatives’ Twitter accounts for exploratory network analysis.
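
The project is implemented in R, but the core step can be sketched in a few lines of Python with networkx (the edge list below is made up): build a directed graph from “who follows whom” pairs and compute a simple centrality measure.

    # Rough sketch of the network construction step (the project itself uses R).
    import networkx as nx

    # Each tuple means: the first account follows the second one (made-up data).
    follows = [
        ("account_a", "account_b"),
        ("account_a", "account_c"),
        ("account_b", "account_c"),
    ]

    G = nx.DiGraph()
    G.add_edges_from(follows)

    # In-degree = number of followers within the network, a simple measure of
    # how central a representative's account is.
    for node, indeg in sorted(G.in_degree(), key=lambda x: -x[1]):
        print(node, indeg)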

Read More →

A Twitter network of members of the 19th German Bundestag – part I

For the R tutorial that I gave at the WZB last semester, I introduced how to query web APIs – specifically the Twitter API – and how to automate data extraction from websites (i.e. web scraping). I showed an example that combined both techniques with the goal of collecting data about the Twitter activities of members of the current (19th) German Bundestag, the federal German parliament. The focus was especially on the question of “who follows whom” on Twitter. I thought it was a nice little project showing how to use the Twitter API, do web scraping, combine the collected data and do some exploratory network analysis – all within the R environment. So I decided to polish the code a little, put it on GitHub and write two blog posts. The first part, i.e. this part, is all about getting the data.
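
As a rough sketch of the web scraping half of this workflow (in Python rather than R, with a placeholder URL and selector instead of the actual sources used in the project): fetch a page that lists the representatives and collect the Twitter handles linked there.

    # Sketch: collect Twitter handles linked from a member list page.
    # URL and selector are hypothetical placeholders.
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.org/bundestag-members", timeout=10)  # placeholder URL
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # Collect links pointing to Twitter profiles and keep the handle part.
    handles = set()
    for a in soup.select("a[href*='twitter.com/']"):
        handle = a["href"].rstrip("/").rsplit("/", 1)[-1]
        handles.add(handle.lstrip("@"))

    print(sorted(handles))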

Read More →

Lab report: Development of school sites in eastern Germany

I wanted to share a small lab report on a project about the development of school sites in eastern Germany since 1992. Rita Nikolai (HU Berlin), Marcel Helbig (WZB) and I published our results a few months ago (see this WZB Discussion Paper or this WZBrief), but I’d like to provide some additional information on the (technical) background in this post, as this was beyond the scope of those papers.

Read More →

Checkboxes and crosses: data mining PDFs with the help of image processing

From time to time, I work with “open data” published by public authorities. Often, these data do not deserve the label “open data”, mainly because they are provided as PDF files. PDFs are not machine readable, at least not without a lot of programming work. I don’t know whether this way of publishing data is done on purpose (because authorities are required to publish open data but do not want it to actually be analyzed at scale) or whether it is sheer ignorance.

For a recent project I came across a particularly nasty type of PDF: scores from a school inspection are listed in a large table in which each score is marked with a cross (see a full PDF of such a school inspection).

While most data can be extracted from PDFs by converting them to a plain text representation, this is not possible for such PDFs, because the most important information, the scores, is not present in the plain text representation at all. The crosses that mark the scores are essentially vector graphics embedded in the PDF. In this article I will explain how to extract such information.
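
To sketch the basic idea (a simplified illustration, not the exact approach from the article): render a PDF page to an image and decide whether a checkbox cell contains a cross by measuring the share of dark pixels inside it. The file name, page and cell coordinates below are hypothetical.

    # Minimal sketch: render a PDF page to an image and check whether a cell
    # contains a cross by counting dark pixels. Coordinates are placeholders.
    import numpy as np
    from pdf2image import convert_from_path  # requires poppler to be installed

    pages = convert_from_path("school_inspection.pdf", dpi=200,
                              first_page=1, last_page=1)   # placeholder file name
    page_img = np.array(pages[0].convert("L"))              # grayscale pixel matrix

    # Hypothetical bounding box of one checkbox cell: rows 400-430, columns 900-930.
    cell = page_img[400:430, 900:930]

    # A cross adds many dark pixels compared to an empty box.
    dark_share = np.mean(cell < 128)
    print("checked" if dark_share > 0.05 else "empty", round(dark_share, 3))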

Read More →

Web scraping with automated browsers using Selenium

Web scraping, i.e. automated data mining from websites, usually involves fetching a web page’s HTML document, parsing it, extracting the required information, and optionally following links within this document to other web pages to repeat the process. This approach is sufficient for many websites that display information in a static way, i.e. do not respond dynamically to user interaction by means of JavaScript. In these cases, web scraping can be implemented with Python packages such as requests and BeautifulSoup. Even interactive elements such as forms can be emulated by observing the HTTP POST and GET data that is sent to the server whenever a form is submitted. However, this approach has its limits. Sometimes it is necessary to automate a whole browser in order to scrape JavaScript-heavy websites, as will be shown with a short example in this post.
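
A short sketch of what this looks like with Selenium’s Python bindings (URL and CSS selector are placeholders; a matching chromedriver is assumed to be installed):

    # Sketch: scraping a JavaScript-heavy page with an automated headless browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")          # run without opening a browser window
    driver = webdriver.Chrome(options=options)  # assumes chromedriver is installed

    try:
        driver.get("https://example.org/dynamic-page")   # placeholder URL
        # Wait until the JavaScript-rendered elements are present in the DOM.
        items = WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".result-item"))
        )
        for item in items:
            print(item.text)
    finally:
        driver.quit()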

Read More →

Data Mining OCR PDFs — Using pdftabextract to liberate tabular data from scanned documents

Figure: Detected clusters of vertical lines with pdftabextract

During the last few months I often had to deal with the problem of extracting tabular data from scanned documents. These documents included quite old sources like catalogs of German newspapers from the 1920s and 30s, as well as newer sources like lists of schools in Germany from the 1990s. All sources were of mixed scanning quality (including rotated or skewed pages) and had very different table layouts. Some had visible table column borders, others only table header borders, so the actual table cells were only separated visually by white space. Automated data extraction with tools from ABBYY or with Tabula failed in most cases. Because of the big variety of scanning quality and table layouts, a single general-purpose approach didn’t work out.

Hence I created a set of common tools that allow detecting table layouts on scanned pages in OCR PDFs, enable visual verification of the detected layouts, and finally allow extracting the data in the tables. To detect and extract the data I created a Python library named pdftabextract, which is now published on PyPI and can be installed with pip. The detected layouts can be verified page by page using pdf2xml-viewer. This post will give an introduction to both tools by showing all the steps necessary to extract tabular data from an example page. The necessary files can be found in the examples directory of the pdftabextract GitHub repository. A Jupyter Notebook for this example is also available there.
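
To illustrate the core idea behind the layout detection (a deliberately simplified sketch, not pdftabextract’s actual API): cluster the x-coordinates of detected vertical lines or text box borders to estimate the column positions of a table. The coordinate values below are made up.

    # Simplified illustration: estimate table column positions by clustering
    # x-coordinates of detected vertical lines (values are made up).
    import numpy as np

    xs = np.array(sorted([102, 104, 99, 250, 252, 398, 401, 403, 560, 558]))

    # Break the sorted coordinates into clusters wherever the gap between
    # neighbouring values exceeds a threshold (here 20 pixels).
    gaps = np.diff(xs)
    breaks = np.where(gaps > 20)[0] + 1
    clusters = np.split(xs, breaks)

    # The mean of each cluster is taken as an estimated column border position.
    col_positions = [float(np.mean(c)) for c in clusters]
    print(col_positions)   # e.g. [101.67, 251.0, 400.67, 559.0]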

Read More →

Data Mining OCR PDFs – Getting Things Straight

The first article of my series about extracting tabular data from PDFs focused on rather simple cases: cases that allowed us to convert the PDFs to plain text documents and parse the extracted text line by line. We also learned from the first article that the only information we can access in PDFs is the textual data that is distributed across the pages in the form of individual text boxes, which have properties like position, width, height and the actual content (text). There’s usually no information stored about rows, columns or other table-like structures.

In the next two articles I want to focus on more complicated documents: PDFs that have complex table structures or are even scans of documents that were processed via Optical Character Recognition (OCR). Such documents are often “messy”: someone scanned hundreds of pages, and of course some pages are sloped or skewed and the margins differ. It is mostly impossible to extract structured information from such a messy data source by just converting the PDF to a plain text document as described in the previous article. Hence we must use the attributes of the OCR-processed text boxes (such as the texts’ positions) to recognize patterns from which we might infer the table layout.

So the basic goal is to analyze the text boxes and their properties, especially the distribution of their x- and y-coordinates on the page, and see whether we can construct a table layout from that, so that we can “fit” the text boxes into the calculated table cells. This is something I’ll explain in the third article of this series. Before we can do that, however, we need to clarify some prerequisites, which I’ll do in this article:

  1. What should we pay attention to when we use OCR?
  2. How can we extract the text boxes and their properties from a PDF document? (See the short sketch after this list.)
  3. How can we display and inspect the text boxes?
  4. How can we straighten skewed pages?
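
To give a first idea of the second question: the sketch below reads text boxes and their positions from an XML file as produced by poppler’s pdftohtml with the -xml option (the format that pdf2xml-viewer displays). The file name is a placeholder.

    # Sketch: read text boxes and their positions from a pdftohtml XML file,
    # e.g. created with `pdftohtml -xml input.pdf output.xml`.
    import xml.etree.ElementTree as ET

    tree = ET.parse("output.xml")   # placeholder file name
    root = tree.getroot()

    textboxes = []
    for page in root.iter("page"):
        for textbox in page.iter("text"):
            textboxes.append({
                "page": int(page.get("number")),
                "top": int(textbox.get("top")),
                "left": int(textbox.get("left")),
                "width": int(textbox.get("width")),
                "height": int(textbox.get("height")),
                "value": "".join(textbox.itertext()).strip(),
            })

    print(len(textboxes), "text boxes read")
    print(textboxes[:3])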

Read More →