Data Mining PDFs – The simple cases

Extracting data from PDFs can be a laborious task. If you only want to extract all text from a PDF and don’t care about which text is a headline or a paragraph, or how text boxes relate to each other, PDFs won’t give you many headaches, because this is quite straightforward to achieve. But if you want to extract structured information (especially tabular data) it really gets cumbersome, because unlike many other document formats, PDFs usually don’t carry any information about row-column relationships, even if it looks like you have a table in front of you when you open a PDF document. From a technical point of view, the only information we usually have in a PDF comes in the form of text boxes, which have some attributes like:

  • position in relation to the page
  • width and height
  • font attributes (font family, size, etc.)
  • the actual content (text) of the text box

So there’s no information in the document like “this text is in row 3, column 5 of a table”. All we have is the above attributes, from which we might infer a cell position in a table. In a short series of blog posts I want to explain how this can be done. In this first post I will focus on the “simple cases” of data extraction from PDFs, i.e. cases where we can extract tabular information without having to calculate the table cells from the individual text box positions. In the upcoming posts I will explain how to handle the harder cases of PDFs: so-called “sandwich” documents, i.e. PDFs that contain the scanned pages of some document together with “hidden” text from optical character recognition (OCR) of the scanned pages.
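Just to illustrate this: a library that exposes the PDF layout objects, for example pdfminer.six (used here only for illustration; it is not needed for the approach in this post), lets you inspect exactly these text box attributes:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox

for page_layout in extract_pages('input.pdf'):       # one layout object per page
    for element in page_layout:
        if isinstance(element, LTTextBox):           # we only care about text boxes here
            x0, y0, x1, y1 = element.bbox            # position on the page (in points)
            print('pos=(%.1f, %.1f), size=%.1f x %.1f' % (x0, y0, x1 - x0, y1 - y0))
            print(element.get_text().strip())        # the actual text content
            # font attributes (family, size) are available per character via LTChar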

But before I come to that, I want to make a quick side note: I’m still perplexed by the fact that many people and organizations still think that PDFs are good for archiving (tabular) information. They’re not. They’re a nightmare when it comes to automated information extraction. The only thing that PDFs are good at is displaying and printing, not processing.

Creating plain text documents from PDFs

That said, you still have to deal with PDFs because people want to make you suffer. So what can we do? That depends on the kind of PDFs that you have. If you have PDFs that were exported from some other digital document format and the tables from which you’ll need to extract data are not too complicated, chances are high that layout-preserving text extraction is enough. This means that the PDF’s text is converted to a plain-text document (“.txt”), preserving the positions of the text boxes and hence also the layout of any tables. Then you can parse the lines of the text document to extract the tabular information.

If the tables in your PDF are more complicated, or you have an OCR’ed document, you will have to extract the individual text boxes from the PDF and calculate the table layout yourself. This is something I’ll explain in the next posts.

So for now we have a PDF with simple and clean tables from which we only want to extract the textual information while preserving the table layout.

How can we achieve this? There are some recommendations for Python libraries, and someone even set up a full Tika content analysis server for this task, but there are simpler tools we can use, namely pdftotext from the package poppler-utils (“Poppler is a PDF rendering library based on the xpdf-3.0 code base”), which is part of most Linux distributions and is also available for OSX via Homebrew or MacPorts.
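Installing it is usually a one-liner (package names may differ slightly between systems):

# Debian/Ubuntu
sudo apt-get install poppler-utils

# OSX with Homebrew
brew install poppler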

After installing this package, we can use the pdftotext command to extract all text from a PDF document. In order to preserve the layout, we have to add the -layout option, and we also use -nopgbrk so that the output won’t contain page break (form feed) characters between pages, which would otherwise get in the way when parsing the file line by line. So with the following command we can create a plain text document from a PDF file:

pdftotext -layout -nopgbrk input.pdf output.txt
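If you prefer to run this step from within Python instead of the shell, a minimal sketch using the standard subprocess module could look like this (the file names are just placeholders):

import subprocess

# run pdftotext with layout preservation and without page break characters
subprocess.run(['pdftotext', '-layout', '-nopgbrk', 'input.pdf', 'output.txt'],
               check=True)   # raise an error if pdftotext fails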

Parsing plain text documents

All that’s left now is to read the file and parse it, for example with a Python script. How to do this really depends on what the produced plain text document looks like, so I can’t give a general solution here. A basic approach that works in most cases is to use regular expressions to detect page breaks, rows and columns. When you have rows that span several lines in the plain text document, you’ll usually work with a state variable that tracks which line within a row you are in.

Here’s a small example of how this might look:

import re

cur_page = None   # number of the page we are currently on
in_row = False    # state: are we inside a table row?

with open('output.txt') as f:
    for line in f:
        m_pagebrk = re.search(r'^\s+- (\d+) -\s*$', line)   # example for detecting page breaks like "  - 12 -"
        if m_pagebrk:   # page break occurred
            cur_page = int(m_pagebrk.group(1))
        if not in_row:
            m_row_start = re.search(r'...', line)   # define a Regular Expression for a row start
            if m_row_start:
                in_row = True
                # ... and so on
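Once a row line has been found, splitting it into its columns often works by simply splitting on runs of two or more spaces, because the -layout option pads the columns with whitespace. A small sketch (the sample line is made up):

import re

def split_columns(line):
    """Split a layout-preserved line into its columns on runs of 2+ spaces."""
    return re.split(r' {2,}', line.strip())

# e.g. 'Smith, John     42    Berlin'  ->  ['Smith, John', '42', 'Berlin']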

See the second article of this series: Data Mining OCR PDFs — Getting Things Straight
