During the last months I often had to deal with the problem of extracting tabular data from scanned documents. These documents included quite old sources like catalogs of German newspapers in the 1920s to 30s or newer sources like lists of schools in Germany from the 1990s. All sources were of mixed scanning quality (including rotated or skewed pages) and had very different table layouts. Some had visible table column borders, others only table header borders so the actual table cells were only visually separated by “white-space”. Automated data extraction with tools from ABBYY or using Tabula failed in most cases. Because of the big variety of scanning quality and table layouts, a general single-solution approach didn’t work out. Hence I created a set of common tools that allow to detect table layouts on scanned pages in OCR PDFs, enable visual verification of the detected layouts and finally allow the extraction of the data in the tables. To detect and extract the data I created a Python library named pdftabextract which is now published on PyPI and can be installed with pip. The detected layouts can be verified page by page using pdf2xml-viewer. This post will cover an introduction to both tools by showing all necessary steps in order to extract tabular data from an example page. The necessary files can be found in the examples directory of the pdftabextract github repository. A Jupyter Notebook for this example is also available there.
otreeutils – A package with common oTree utility/helper functions and classes
As a little side product from the experiments that I recently implemented with oTree, I created otreeutils. This package contains several utility functions and classes for use cases that I came across quite often in the recent weeks. So I decided to create this small package to re-use my code in several experiments and also to share it with others. Update: The package is now also available on PyPI. You can install it via pip install otreeutils
.
First and foremost, otreeutils allows easier creation of surveys, which are a part of many experiments. Instead of defining the Player
model fields, the view page(s) and HTML templates separately, it’s now possible to define the survey questions and their fields in a single data structure. Then you can generate the Player
model class automatically and just define the order of survey pages (if you distribute the survey questions across several pages) — no need to write any HTML.
Another use case that occurs quite often are pages with understanding questions. The participant can only proceed after answering a set of understanding questions correctly. Hints can be displayed for wrong answers. This is implemented in the UnderstandingQuestionsPage
base class. Again, the questions can be defined in a single data structure, no template needs to be edited.
Finally, we found that we needed a softer variant of oTree’s page timeouts, which automatically switch to the next page after a time limit. I implemented a small extension that allows to define a “timeout warning” which is displayed after the time limit is reached. The participant is not forced to the next page.
More features might come soon as we continue to use oTree at the WZB.
The github repository explains the API in the README file and contains fully documented example apps.
Using custom data models in oTree
I’m currently working on implementing some multiplayer decision strategy games for different experiments in the field of Experimental Economics. We decided to use the excellent oTree framework as basis for our implementations. Since oTree itself is based on Django and provides comprehensive documentation and some good tutorials, it was quite straight forward for me to learn. In most cases, it provides just what you need for implementing an experiment while hiding a lot of unnecessary technical stuff by exposing only a limited API. The full power and functionality of Django is hidden to the programmer for the sake of clearness and simplicity, which is basically a good thing.
However, in some cases oTree’s API is too restrictive for implementing advanced functionality, the main issue being the limited set of data models: By default, you can only record non-complex information (i.e. numeric values, strings, etc.) per subsessions (i.e. round), group or player. What if you need, for example, to record an arbitrary number of (more or less complex) decisions made by a player per round? This is not really supported by oTree’s data models so you need to define and handle your own, custom Django models. I will explain how to do this in this post.
Styling individual cells in Excel output files created with pandas
The Python Data Analysis Library pandas provides basic but reliable Excel in- and output. However, more advanced features for writing Excel files are missing. Some of these advanced things, like conditional formatting can be achieved with XlsxWriter (see also
Improving Pandas’ Excel Output). However, sometimes it is
necessary to set styles like font or background colors on individual cells on the “Python side”. In this scenario,
XlsxWriter won’t work, since “XlsxWriter and Pandas provide very little support for formatting the output data from a dataframe apart from default formatting such as the header and index cells and any cells that contain dates of datetimes.”
To achieve setting styles on individual cells on the Python side, I wrote a small extension for pandas and put it on github, along with some examples. It comes in quite handy, for example when you are running complicated data validation routines (which you probably don’t want to implement in VBA) and want to highlight the validation results by coloring
individual cells in the output Excel sheets.
A tip for the impatient: Simple caching with Python pickle and decorators
During testing and development, it is sometimes necessary to rerun tasks that take quite a long time. One option is to drink coffee in the mean time, the other is to use caching, i.e. save once calculated results to disk and load them from there again when necessary. The Python module pickle is perfect for caching, since it allows to store and read whole Python objects with two simple functions. I already showed in another article that it’s very useful to store a fully trained POS tagger and load it again directly from disk without needing to retrain it, which saves a lot of time.
Displaying translated ForeignKey objects in Django admin with django-hvad
For multilingual websites built with Django the extension django-hvad is a must, since it allows to specify, edit and fetch multilingual (i.e. translatable) objects very easily and is generally well integrated into the Django framework. However, some caveats for using django-hvad’s TranslatableModel in the Django model admin backend system exist, especially when dealing with relations to TranslatableModel objects. I want to address three specific problems in this post: First, it’s not possible to display a translatable field directly in a model admin list display. Secondly, related (and also translated) ForeignKey objects are only displayed by their primary key ID in such a list display. And lastly, a similar problem exists for the list display filter and drop-down selection boxes in the edit form.
Accurate Part-of-Speech tagging of German texts with NLTK
Part-of-speech tagging or POS tagging of texts is a technique that is often performed in Natural Language Processing. It allows to disambiguate words by lexical category like nouns, verbs, adjectives, and so on. This is useful in many cases, for example in order to filter large corpora of texts only for certain word categories. It is also often a prerequisite of lemmatization.
For English texts, POS tagging is implemented in the pos_tag()
function of the widely used Python library NLTK. However, if you’re dealing with other languages, things get trickier. You can try to find a specialized library for your language, for example the pattern library from CLiPS Research Center, which implements POS taggers for German, Spanish and other languages. But apart from this library being only available for Python 2.x, its accuracy is suboptimal — only 84% for German language texts.
Another approach is to use supervised classification for POS tagging, which means that a tagger can be trained with a large text corpus as training data like the TIGER corpus from the Institute for Natural Language Processing / University of Stuttgart. It contains a large set of annotated and POS-tagged German texts. After training with such a dataset, the POS tagging accuracy is about 96% with the mentioned corpora. In this post I will explain how to load a corpus into NLTK, train a tagger with it and then use the tagger with your texts. Furthermore I’ll show how to save the trained tagger and load it from disk in order not to re-train it every time you need to use it.
Autocorrecting misspelled Words in Python using HunSpell
When you’re dealing with natural language data, especially survey data, misspelled words occur quite often in free-text answers and might cause problems during later analyses. A fast and easy to implement approach to deal with these issues is to use a spellchecker and automatically correct misspelled words. I’ll show how to do this with PyHunSpell, a set of Python bindings for the open source spellchecker engine HunSpell which is also used in well-known software projects like Firefox, OpenOffice and works with many languages.
Data Mining OCR PDFs – Getting Things Straight
The first article of my series about extracting tabular data from PDFs focused on rather simple cases; cases that allowed us to convert the PDFs to plain text documents and parse the extracted text line-per-line. We also learned from the first article that the only information that we can access in PDFs is the textual data that is distributed across the pages in the form of individual text boxes, which have properties like a position, width, height and the actual content (text). There’s usualy no information stored about rows/columns or other table-like structures.
Now in the next two articles I want to focus on rather complicated documents: PDFs that have complex table structures or are even scans of documents that were processed via Optical Character Recognition (OCR). Such documents are often “messy” — someone scanned hundreds of pages and of course sometimes the pages are sloped or skewed and the margins differ. It is mostly impossible to extract structured information from such a messy data source by just converting the PDF to a plain text document as described in the previous article. Hence we must use the attributes of the OCR-procossed text boxes (such as the texts’ positions) to recognize patterns in them from which we might infer the table layout.
So the basic goal is to analyse the text boxes and their properties, especially their positions in form of the distribution of their x- and y-coordinates on the page and see if we can construct a table layout from that, so that we can “fit” the text boxes into the calculated table cells. This is something that I’ll explain in the third article of this series. Because before we can do that, we need to clarify some prerequisites which I’ll do in this article:
- When we use OCR, to what should we pay attention?
- How can we extract the text boxes and their properties from a PDF document?
- How can we display and inspect the text boxes?
- How can we straighten skewed pages?
Data Mining PDFs – The simple cases
Extracting data from PDFs can be a laborious task. When you only want to extract all text from a PDF and don’t care about which text is a headline or a paragraph or how text boxes relate to each other, you won’t have much headaches with PDFs, because this is quite straight forward to achieve. But if you want to extract structured information (especially tabular data) it really gets cumbersome, because unlike many other document formats, PDFs usually don’t carry any information about row-column-relationships, even if it looks like you have a table in front of you when you open a PDF document. From a technical point of view, the only information we usually have in PDFs is in forms of text boxes, which have some attributes like:
- position in relation to the page
- width and height
- font attributes (font family, size, etc.)
- the actual content (text) of the text box
So there’s no information in the document like “this text is in row 3, column 5” of a table. All we have is the above attributes from which we might infer a cell position in a table. In a short series of blog posts I want to explain how this can be done. In this first post I will focus on the “simple cases” of data extraction from PDFs, which means cases where we can extract tabular information without the need to calculate the table cells from the individual text box positions. In the upcoming posts I will explain how to handle the harder cases of PDFs: So called “sandwich” documents, i.e. PDFs that contain the scanned pages from some document together with “hidden” text from optical character recognition (OCR) of the scanned pages.
Recent Comments