Category Archives: Python

Autocorrecting misspelled words in Python using HunSpell

When you’re dealing with natural language data, especially survey data, misspelled words occur quite often in free-text answers and can cause problems during later analyses. A fast and easy-to-implement approach to deal with these issues is to use a spellchecker and automatically correct misspelled words. I’ll show how to do this with PyHunSpell, a set of Python bindings for the open source spellchecker engine HunSpell, which is also used in well-known software projects like Firefox and OpenOffice and supports many languages.
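A minimal sketch of this approach could look as follows (assuming PyHunSpell is installed as the hunspell module; the dictionary file locations are typical for Debian/Ubuntu systems and may differ on yours): each word that the spellchecker doesn’t know is simply replaced by its first suggestion.

import hunspell

# the dictionary paths are an assumption; adjust them for your system
hobj = hunspell.HunSpell('/usr/share/hunspell/en_US.dic',
                         '/usr/share/hunspell/en_US.aff')

def autocorrect(tokens):
    corrected = []
    for tok in tokens:
        if hobj.spell(tok):  # the word is known, keep it as it is
            corrected.append(tok)
        else:
            suggestions = hobj.suggest(tok)
            # fall back to the original token if there are no suggestions
            corrected.append(suggestions[0] if suggestions else tok)
    return corrected

print(autocorrect(['survye', 'data', 'with', 'mispelled', 'words']))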

Read More →

Data Mining OCR PDFs – Getting Things Straight

The first article of my series about extracting tabular data from PDFs focused on rather simple cases: cases that allowed us to convert the PDFs to plain text documents and parse the extracted text line by line. We also learned from the first article that the only information we can access in PDFs is the textual data that is distributed across the pages in the form of individual text boxes, which have properties like a position, width, height and the actual content (text). There is usually no information stored about rows/columns or other table-like structures.

Now in the next two articles I want to focus on rather complicated documents: PDFs that have complex table structures or are even scans of documents that were processed via Optical Character Recognition (OCR). Such documents are often “messy”: someone scanned hundreds of pages, and of course some pages are sloped or skewed and the margins differ. It is mostly impossible to extract structured information from such a messy data source by just converting the PDF to a plain text document as described in the previous article. Hence we must use the attributes of the OCR-processed text boxes (such as the texts’ positions) to recognize patterns in them from which we might infer the table layout.

So the basic goal is to analyse the text boxes and their properties, especially their positions in the form of the distribution of their x- and y-coordinates on the page, and to see if we can construct a table layout from that, so that we can “fit” the text boxes into the calculated table cells. This is something that I’ll explain in the third article of this series, because before we can do that, we need to clarify some prerequisites, which I’ll do in this article:

  1. What should we pay attention to when we use OCR?
  2. How can we extract the text boxes and their properties from a PDF document? (A minimal sketch follows this list.)
  3. How can we display and inspect the text boxes?
  4. How can we straighten skewed pages?
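
To give a first impression of question 2, here is a minimal sketch using pdfminer.six (an assumption for illustration; the tooling used in the article may differ, and ‘scanned_document.pdf’ is a hypothetical file name) that lists every text box on each page together with its position and size:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox

for page_layout in extract_pages('scanned_document.pdf'):
    for element in page_layout:
        if isinstance(element, LTTextBox):   # skip lines, figures etc.
            x0, y0, x1, y1 = element.bbox    # bounding box of the text box
            print(f'pos=({x0:.1f}, {y0:.1f}) '
                  f'size=({x1 - x0:.1f} x {y1 - y0:.1f}): '
                  f'{element.get_text().strip()!r}')

Collecting those x- and y-coordinates across a page is exactly the raw material for the layout analysis described above.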

Read More →

Data Mining PDFs – The simple cases

Extracting data from PDFs can be a laborious task. If you only want to extract all text from a PDF and don’t care about which text is a headline or a paragraph, or how text boxes relate to each other, you won’t have many headaches with PDFs, because this is quite straightforward to achieve. But if you want to extract structured information (especially tabular data), it really gets cumbersome, because unlike many other document formats, PDFs usually don’t carry any information about row-column relationships, even if it looks like you have a table in front of you when you open a PDF document. From a technical point of view, the only information we usually have in PDFs comes in the form of text boxes, which have some attributes like:

  • position in relation to the page
  • width and height
  • font attributes (font family, size, etc.)
  • the actual content (text) of the text box

So there’s no information in the document like “this text is in row 3, column 5” of a table. All we have are the above attributes, from which we might infer a cell position in a table. In a short series of blog posts I want to explain how this can be done. In this first post I will focus on the “simple cases” of data extraction from PDFs, i.e. cases where we can extract tabular information without the need to calculate the table cells from the individual text box positions. In the upcoming posts I will explain how to handle the harder cases of PDFs: so-called “sandwich” documents, i.e. PDFs that contain the scanned pages of some document together with “hidden” text from optical character recognition (OCR) of the scanned pages.
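To give an idea of such a simple case, here is a minimal sketch (assuming the pdftotext utility from the poppler tools is installed; ‘table.pdf’ is a hypothetical input file): we dump the PDF to plain text with preserved layout and split each line at runs of whitespace to get the column values.

import re
import subprocess

# convert the PDF to plain text, preserving the visual layout
subprocess.run(['pdftotext', '-layout', 'table.pdf', 'table.txt'], check=True)

rows = []
with open('table.txt', encoding='utf-8') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        # columns in '-layout' output are separated by runs of whitespace
        rows.append(re.split(r'\s{2,}', line.strip()))

for row in rows:
    print(row)

This only works when the PDF’s text boxes happen to line up nicely in the plain text output, which is what makes these the “simple cases”.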

Read More →

Reading textual data from CSV and Excel files correctly with pandas

The pandas library is great for data analysis with Python, but it has some caveats and gotchas. One of them is that textual data imported from CSV and Excel files is automatically converted to numeric values when it consists only of digits. This is mostly a nice feature, but sometimes it is not what you want, for example in the case of codes with leading zeros like a FIPS state code. If you have a column with FIPS state codes in your CSV or Excel file, it will show up as an integer series after importing it with pandas, so the FIPS code ‘03’ will become the integer 3.

To prevent pandas from doing this, a good first attempt would be to specify the dtype directly so that it doesn’t need to be guessed, but unfortunately this is not supported:

import pandas as pd

df = pd.read_excel('some_excelfile.xls', dtype=object)
# ValueError: The 'dtype' option is not supported with the 'python' engine

It also doesn’t work with other “engines” yet, so we need another solution: converters. You can pass a dict that specifies a conversion function for each column (either by column index or column name). For example, if we want to have strings instead of numeric values in the columns with indices 3 and 7, we could pass a dict with the conversion function str like this:

converters = {col: str for col in (3, 7)}
df = pd.read_excel('some_excelfile.xls', converters=converters)

pandas will not guess the data type of the columns where a conversion function is defined but will use the output type of the conversion function, so we will have a series of strings with the leading zeros, as we wanted.
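
The same converters parameter also works with read_csv, and the columns can just as well be referred to by name (the file and column names below are hypothetical):

import pandas as pd

# 'some_data.csv' and the column name 'fips' are made up for illustration
df = pd.read_csv('some_data.csv', converters={'fips': str})
print(df['fips'].head())  # leading zeros are preserved, e.g. '03'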

Creating a sparse Document Term Matrix for Topic Modeling via LDA

To do topic modeling with methods like Latent Dirichlet Allocation (LDA), it is necessary to build a Document Term Matrix (DTM) that contains the number of term occurrences per document. The rows of the DTM usually represent the documents and the columns represent the whole vocabulary, i.e. the set union of all terms that appear across the documents.

The DTM will contain mostly zero values when we deal with natural language documents, because only a small fraction of the vast overall vocabulary is used in each individual document (even after normalizing the vocabulary with stemming or lemmatization). Hence the DTM will be a sparse matrix in most cases, and this fact should be exploited to achieve good memory efficiency.
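A minimal sketch of how such a sparse DTM can be constructed with SciPy’s coo_matrix (the toy tokenized documents below are made up for illustration; the article may build the matrix somewhat differently):

from collections import Counter
from scipy.sparse import coo_matrix

# toy tokenized documents; real input would be tokenized and normalized text
docs = [['topic', 'modeling', 'with', 'lda'],
        ['sparse', 'matrix', 'for', 'lda'],
        ['sparse', 'sparse', 'matrix']]

# the vocabulary is the set union of all terms across the documents
vocab = sorted({t for d in docs for t in d})
vocab_idx = {t: i for i, t in enumerate(vocab)}

# collect only the nonzero entries as (row, column, count) triples
rows, cols, data = [], [], []
for doc_i, doc in enumerate(docs):
    for term, n in Counter(doc).items():
        rows.append(doc_i)
        cols.append(vocab_idx[term])
        data.append(n)

# documents in the rows, vocabulary in the columns
dtm = coo_matrix((data, (rows, cols)), shape=(len(docs), len(vocab)))
print(dtm.toarray())

Only the nonzero counts are stored, which is what makes the sparse representation so much more memory-efficient than a dense array.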

Read More →