Data Mining OCR PDFs – Getting Things Straight

The first article of my series about extracting tabular data from PDFs focused on rather simple cases: cases that allowed us to convert the PDFs to plain text documents and parse the extracted text line by line. We also learned from the first article that the only information we can access in PDFs is the textual data that is distributed across the pages in the form of individual text boxes, which have properties like a position, width, height and the actual content (text). There’s usually no information stored about rows, columns or other table-like structures.

In the next two articles I want to focus on rather complicated documents: PDFs that have complex table structures or are even scans of documents that were processed via Optical Character Recognition (OCR). Such documents are often “messy”: someone scanned hundreds of pages, and of course some pages are sloped or skewed and the margins differ. It is mostly impossible to extract structured information from such a messy data source by just converting the PDF to a plain text document as described in the previous article. Hence we must use the attributes of the OCR-processed text boxes (such as the texts’ positions) to recognize patterns in them from which we might infer the table layout.

So the basic goal is to analyse the text boxes and their properties, especially the distribution of their x- and y-coordinates on the page, and to see whether we can construct a table layout from that, so that we can “fit” the text boxes into the calculated table cells. This is something that I’ll explain in the third article of this series. Before we can do that, we need to clarify some prerequisites, which I’ll do in this article:

  1. What should we pay attention to when we use OCR?
  2. How can we extract the text boxes and their properties from a PDF document?
  3. How can we display and inspect the text boxes?
  4. How can we straighten skewed pages?

OCR-processed PDF documents

You often get OCR-processed PDFs in the form of “sandwich” PDFs: they contain the original scanned pages as images, and on top of them is a hidden layer of the text that the OCR produced so that you can select and copy the text. This hidden layer of OCR text contains the information that we want to extract: a set of text boxes distributed across the page.

(A small side note: In terms of analysing OCR-processed documents, I’ll focus on techniques here that ignore the scanned page images. We’ll only harness information that’s extracted by the OCR. It might also be possible to use techniques of image processing to detect tables like Tabula does. However, this is more complicated and computationally much more expensive and only works if your tables contain ruling lines.)

What is most important at this stage is that the OCR is correct, because we can only extract what the OCR has produced (we won’t reinvent OCR here!). So if the hidden text overlay only contains garbage because the OCR was not set up to recognize the font correctly, you can stop here and reconfigure or train your OCR software (or use better dictionaries) so that it does a better job at text recognition. You can get a quick overview of the quality of your OCR by selecting all text in the PDF (press Ctrl-A). This will reveal the hidden layer and show you all recognized text as seen in the image below.


Most OCR software contains an option to straighten skewed pages. You should enable this so that your pages don’t look like the one above. But if they do and you can’t change it, you’re not lost: I’ll show how to straighten these pages in a later section of this post.

Extracting text boxes and their properties from a PDF document

Once we have made sure that the OCR quality is decent, we can extract the individual text boxes (the things that you can select in the OCR-PDF) in order to inspect and process them. For this, we’ll use the pdf2xml format because it allows us to parse the text boxes very efficiently later on. To produce an XML file in the pdf2xml format, we’ll use the poppler-utils once again, as already introduced in the first article. This time, we’ll use the pdftohtml command, which can also produce XML when we use the following options:

pdftohtml -c -i -hidden -xml input.pdf output.xml

The arguments input.pdf and output.xml are your input PDF file and the created XML file in pdf2xml format, respectively. It is important that you specify the -hidden parameter when you’re dealing with OCR-processed (“sandwich”) PDFs. You can furthermore add the parameters -f n and -l n (first and last page) to set only a range of pages to be converted.
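
If you want to script the conversion, the command line can be assembled in Python. This is only a minimal sketch: the helper name build_pdftohtml_cmd and the file names are made up for illustration, and the flags are the ones mentioned above. The resulting list could be handed to subprocess.run() on a machine where poppler-utils is installed.

```python
def build_pdftohtml_cmd(pdf_path, xml_path, first_page=None, last_page=None):
    """Assemble a pdftohtml call that writes pdf2xml output.

    Hypothetical helper for illustration; it only builds the argument list.
    """
    # -hidden is needed so that the hidden OCR text layer is exported.
    cmd = ["pdftohtml", "-c", "-i", "-hidden", "-xml"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to convert
    return cmd + [pdf_path, xml_path]


# Example: convert only pages 3 to 5 of input.pdf
print(" ".join(build_pdftohtml_cmd("input.pdf", "output.xml", 3, 5)))
```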

Displaying and inspecting the text boxes

Now that we have an XML file, we can have a look at it in a text editor, where we’ll see a nested structure of pages and text boxes. It’s cumbersome to inspect the text boxes and their properties like this. It would be better to display them and have the ability to select and inspect individual text boxes. For this, I created a small tool called pdf2xml-viewer, which lets you inspect the generated pdf2xml files in your browser (using your favorite browser inspection tools). You can download the tool from its GitHub page. Have a look at the instructions in the Readme file so that you can load and view a pdf2xml file. You will be able to inspect the elements like in the image below, which later helps in devising a strategy to extract the tabular information (which we’ll do in the third article of this series).
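
If you prefer to peek at the text boxes programmatically, the nested structure can also be read with Python’s standard library. The following is a minimal sketch, not part of pdf2xml-viewer or pdftabextract. It assumes the usual pdf2xml layout of page elements that contain text elements with top, left, width and height attributes. The sample document is made up and simplified, and the dictionary keys (including 'value' for the text) are chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified sample of pdf2xml output (real files contain more attributes).
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<text top="100" left="50" width="20" height="12" font="0">G</text>
<text top="100" left="120" width="80" height="12" font="0"><b>Name</b></text>
</page>
</pdf2xml>"""


def parse_text_boxes(xml_string):
    """Return one dict per text box with its page number, position, size and text."""
    root = ET.fromstring(xml_string)
    boxes = []
    for page in root.iter("page"):
        for t in page.iter("text"):
            boxes.append({
                "page": int(page.get("number")),
                "left": float(t.get("left")),
                "top": float(t.get("top")),
                "width": float(t.get("width")),
                "height": float(t.get("height")),
                # itertext() also collects text inside nested tags such as <b>
                "value": "".join(t.itertext()),
            })
    return boxes


for box in parse_text_boxes(SAMPLE):
    print(box)
```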


Straightening skewed pages

As I said, your pages shouldn’t look like mine because you watched out and selected the right option in your OCR software to straighten the pages automatically. However, sometimes you get documents from somewhere that are just like this and you can’t help it. So you have to help yourself and straighten them.

Luckily, I created a small package with tools for extracting tabular data from PDFs called pdftabextract. We’ll further explore this tool in the third article of this series, but for now we’ll use its ability to load a pdf2xml file and automatically straighten pages in it that are skewed.

The function fix_rotation() of the fixrotation module does that and it works like this:

  1. Select text boxes at the top left, top right, bottom right and bottom left corners of the page. They should define the corner text boxes of a table in each page. To select the correct text boxes, you can further pass in condition functions for each corner to say, for example, that the top left and bottom left corners always have to contain text of a certain format (that’s usually the case in the columns of your table). You can also choose to only select specific corners (e.g. only the corners on the left side and ignore the ones on the right side because they’re not “fixed” in the table).
  2. Construct the lines between the corners that would create a quad around your table (or only a single line depending on how many corners you chose to select). For the lines of this quad (or the single line) calculate their angles in relation to a rectangular page — these angles should be about the same when the page is skewed. Calculate the mean of the angles — this is our skew angle alpha.
  3. Straighten by applying -alpha to the positions of all text boxes on the page.
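
To make the geometry of these three steps concrete, here is a minimal sketch for the simplest case, where only the two left corners are used. It is not the pdftabextract implementation: the function names, the sign convention for alpha and the choice of the page origin as rotation pivot are assumptions made for this illustration. Coordinates follow the pdf2xml convention with the origin at the top left and y growing downwards. With more than one edge you would compute one angle per edge and take their mean.

```python
import math


def rotate(point, angle, pivot=(0.0, 0.0)):
    """Rotate a point (x, y) by `angle` radians around `pivot`."""
    x, y = point[0] - pivot[0], point[1] - pivot[1]
    c, s = math.cos(angle), math.sin(angle)
    return (pivot[0] + x * c - y * s, pivot[1] + x * s + y * c)


def skew_angle_left_edge(top_left, bottom_left):
    """Angle by which an ideal vertical edge was rotated to give the observed edge."""
    dx = bottom_left[0] - top_left[0]
    dy = bottom_left[1] - top_left[1]
    # A vertical edge (0, L) rotated by alpha becomes (-L*sin(alpha), L*cos(alpha)).
    return math.atan2(-dx, dy)


# Simulate a page skewed by 2 degrees: rotate the corners of a straight left edge.
skewed_tl = rotate((100.0, 200.0), math.radians(2))
skewed_bl = rotate((100.0, 900.0), math.radians(2))

alpha = skew_angle_left_edge(skewed_tl, skewed_bl)

# Straighten by applying -alpha to every position on the page.
straight_tl = rotate(skewed_tl, -alpha)
straight_bl = rotate(skewed_bl, -alpha)
print(math.degrees(alpha), straight_tl, straight_bl)
```

After straightening, both corners share the same x-coordinate again, which is what a vertical table edge should look like.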

The following example shows how to straighten the pages in an OCR-processed PDF file (the ones shown in the pictures in this article). It’s taken from the examples directory of the pdftabextract project. In our example, we only identify the text boxes at the top left and bottom left corners of a table in a page, which must contain “G” or “WS”. This is a criterion specific to these PDFs. You do not have to specify such criteria, but it helps to identify the correct corners of a table. Furthermore, we set options to divide each PDF page, because we actually have two real pages scanned per PDF page, which happens quite often.

First, import what we need:

import re

from pdftabextract import fixrotation

Now define functions to identify text boxes that mark table corners:

# Top left and bottom left text boxes must only have the text "G" or "WS" inside
def cond_topleft_text(t):
    text = t['value'].strip()
    return re.search(r'^(G|WS)$', text) is not None
cond_bottomleft_text = cond_topleft_text

# Define the functions as tuple from top left to bottom left in CW direction
# (Disable corners on the right side -- we don't need them)
cond_disabled = lambda t: False
corner_box_cond_fns = (cond_topleft_text, cond_disabled, cond_disabled, cond_bottomleft_text)

Set some options and straighten the pages:

# Fix the rotation
fixrotation.set_config_option('header_skip', 0.1)  # ignore top 10% of the page when inspecting the text box positions
fixrotation.set_config_option('footer_skip', 0.1)  # ignore bottom 10% of the page when inspecting the text box positions
fixrotation.set_config_option('divide', 0.5)  # two "real" pages per PDF page - divide page at 50% (in the middle of the page)
fixrotation.set_config_option('min_content_length_from_mean', 0.2)   # set minimum amount of content for processing
xmltree, xmlroot, rot_results = fixrotation.fix_rotation('examples/ocr-output.pdf.xml', corner_box_cond_fns)

Print the results (will print that some pages were not rotated because the detected rotation is only marginal) and save as XML file again.

for p_id in sorted(rot_results.keys(), key=lambda x: x[0]):
    print("Page %d/%s: %s" % (p_id[0], p_id[1], rot_results[p_id]))

# Write the straightened output XML (just for debugging reasons -- can be viewed with pdf2xml-viewer)
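
The listing above stops at the comment and does not show the write call itself. As a hedged sketch: assuming that the xmltree returned by fix_rotation() is a standard xml.etree.ElementTree.ElementTree object (an assumption, not confirmed here), saving it could look like the following. The small tree built below is only a stand-in so that the snippet runs on its own, and the output file name is hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Stand-in for the tree returned by fix_rotation().
# ASSUMPTION: the real object is a standard ElementTree as well.
root = ET.Element("pdf2xml")
page = ET.SubElement(root, "page", number="1", width="892", height="1263")
box = ET.SubElement(page, "text", top="100", left="50", width="20", height="12")
box.text = "G"
xmltree = ET.ElementTree(root)

# Hypothetical output location; use any path you like.
out_path = os.path.join(tempfile.mkdtemp(), "ocr-output-straightened.pdf.xml")
xmltree.write(out_path, encoding="utf-8", xml_declaration=True)
print("written to", out_path)
```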

The straightened pages are saved as a pdf2xml file, which we can view in the pdf2xml-viewer. Here you can see the difference between an input page and the straightened page:

We’ve learned how to extract text boxes from a PDF and inspect them with pdf2xml-viewer. When we have bad quality PDFs with skewed pages, we can automatically straighten them using the fixrotation submodule of pdftabextract. In the next and final article, I will show how to identify columns and rows in the distribution of text boxes and hence extract tabular information from PDFs, once again using functions from pdftabextract. Furthermore, I’ll show how to use some advanced features of the pdf2xml-viewer tool to overlay lines and grids for debugging the table layout detection.
