Extracting data from PDFs can be a laborious task. If you only want to extract all text from a PDF and don’t care about which text is a headline or a paragraph or how text boxes relate to each other, PDFs won’t give you many headaches, because this is quite straightforward to achieve. But if you want to extract structured information (especially tabular data), it quickly gets cumbersome, because unlike many other document formats, PDFs usually don’t carry any information about row-column relationships, even if it looks like you have a table in front of you when you open a PDF document. From a technical point of view, the only information we usually have in PDFs comes in the form of text boxes, which have attributes such as:
- position in relation to the page
- width and height
- font attributes (font family, size, etc.)
- the actual content (text) of the text box
So there’s no information in the document like “this text is in row 3, column 5” of a table. All we have are the attributes above, from which we might infer a cell’s position in a table. In a short series of blog posts I want to explain how this can be done. In this first post I will focus on the “simple cases” of data extraction from PDFs, i.e. cases where we can extract tabular information without the need to calculate the table cells from the individual text box positions. In the upcoming posts I will explain how to handle the harder cases of PDFs: so-called “sandwich” documents, i.e. PDFs that contain the scanned pages of some document together with “hidden” text from optical character recognition (OCR) of the scanned pages.
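Just to make the “text boxes with attributes” point concrete, here is a minimal sketch of what such a dump can look like in Python. It assumes the pdfminer.six library and a file called input.pdf; both are only assumptions for illustration, not part of the workflow described below:

import re  # not needed here, just the standard pdfminer.six imports below
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

# "input.pdf" is a placeholder file name
for page_layout in extract_pages("input.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            x0, y0, x1, y1 = element.bbox              # position in relation to the page
            width, height = x1 - x0, y1 - y0           # width and height
            text = element.get_text().strip()          # the actual content of the text box
            # font attributes live on the individual characters
            fonts = {ch.fontname for line in element
                     for ch in line if isinstance(ch, LTChar)}
            print(f"({x0:.1f}, {y0:.1f}) {width:.1f}x{height:.1f} {fonts}: {text!r}")

Note that nothing in this output says anything about rows or columns; that information would have to be inferred from the coordinates.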
But before I come to that, I want to make a quick side note: I’m still perplexed by the fact that many people and organizations still think that PDFs are good for archiving (tabular) information. They’re not. They’re a nightmare when it comes to automated information extraction. The only thing that PDFs are good at is displaying and printing, not processing.
Creating plain text documents from PDFs
That said, you still have to deal with PDFs because people want to make you suffer. So what can we do? That depends on the kind of PDFs that you have. If you have PDFs that were exported from some other digital document format and the tables from which you’ll need to extract data are not too complicated, chances are high that layout-preserving text extraction is enough. This means that the PDF’s texts are converted to a plain-text document (“.txt”), preserving the location of the text boxes and hence also preserving the layout of a table. Then you can parse the lines of the text document to extract the tabular information.
If the tables in your PDF are more complicated or you have an OCR’ed document you will have to extract the individual text boxes from the PDF and calculate the table layout for yourself. This is something which I’ll explain in the next posts.
So for now we have a PDF with simple and clean tables from which we only want to extract the textual information but preserve the table layout like in these images:
How can we achieve this? There are some recommendations for Python libraries, and someone even used a full-stack Tika content analysis server for this task, but there are much simpler tools that we can use, namely pdftotext from the package poppler-utils (“Poppler is a PDF rendering library based on xpdf”), which is part of most Linux distributions and is also available for OS X via Homebrew or MacPorts.
After installing this package, we can use the pdftotext command to extract all text from a PDF document. In order to preserve the layout, we have to add the -layout option and also use -nopgbrk, because otherwise the document will contain illegal bytes at page breaks. So with the following command we can create a plain text document from a PDF file:
pdftotext -layout -nopgbrk input.pdf output.txt
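If you want to run this step from within a Python script, one way is to wrap the command in a subprocess call. This is only a sketch of the command shown above; the file names and the helper function name are made up:

import subprocess

def pdf_to_layout_text(pdf_path: str, txt_path: str) -> None:
    # wraps the pdftotext call shown above; paths are placeholders
    subprocess.run(
        ["pdftotext", "-layout", "-nopgbrk", pdf_path, txt_path],
        check=True,  # raise an exception if pdftotext reports an error
    )

pdf_to_layout_text("input.pdf", "output.txt")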
Parsing plain text documents
All that’s left now is to read the file and parse it, for example with a Python script. This really depends on what the produced plain text document looks like, so I can’t give a general solution here. A basic approach for most cases is to use Regular Expressions to detect page breaks, rows and columns. When you have rows that span several lines in the plain text document, you’ll usually work with a state variable that defines which line within a row you are in.
Here’s a small example of how this might look:
import re

cur_page = None
in_row = False

with open('output.txt') as f:
    for line in f:
        m_pagebrk = re.search(r'^\s+- (\d+) -\s*$', line)  # example for detecting page breaks like " - PAGE - "
        if m_pagebrk:  # page break occurred
            cur_page = int(m_pagebrk.group(1))
        if not in_row:
            m_row_start = re.search(r'...', line)  # define a Regular Expression for a row start
            if m_row_start:
                in_row = True
        # ... and so on
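Once you know you are inside a row, splitting a line into its columns is often as simple as splitting on runs of two or more spaces, since the -layout option keeps columns apart with whitespace. A small sketch with a made-up example row:

import re

row_line = "Alice      42     Berlin"          # made-up example row from output.txt
cells = re.split(r'\s{2,}', row_line.strip())  # split on runs of 2+ spaces
print(cells)                                   # ['Alice', '42', 'Berlin']

How robust this is depends entirely on your document, so treat it as a starting point, not a recipe.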
See the second article of this series: Data Mining OCR PDFs — Getting Things Straight