July | 2016 | WZB Data Science Blog

Linkdump #6

July 29, 2016 11:40 am , Markus Konrad

Python

R

Interesting articles, projects and news

Google’s Cloud Natural Language API and Speech API are now in open beta

Posted in: Linkdump

Linkdump #5

July 15, 2016 10:03 am , Markus Konrad

R

Python

Best practices for logging computational systems in R and Python

Interesting articles, projects and news

Posted in: Linkdump

Accurate Part-of-Speech tagging of German texts with NLTK

July 13, 2016 4:22 pm , Markus Konrad

Part-of-speech tagging, or POS tagging, is a technique that is often used in Natural Language Processing. It disambiguates words by lexical category, such as noun, verb or adjective. This is useful in many cases, for example to filter large text corpora for certain word categories only. It is also often a prerequisite for lemmatization.

For English texts, POS tagging is implemented in the pos_tag() function of the widely used Python library NLTK. However, if you’re dealing with other languages, things get trickier. You can try to find a specialized library for your language, for example the pattern library from CLiPS Research Center, which implements POS taggers for German, Spanish and other languages. But apart from this library being available only for Python 2.x, its accuracy is suboptimal: only 84% for German-language texts.

Another approach is to use supervised classification for POS tagging: a tagger is trained on a large annotated text corpus, such as the TIGER corpus from the Institute for Natural Language Processing at the University of Stuttgart. It contains a large set of annotated and POS-tagged German texts. After training with such a dataset, the POS tagging accuracy is about 96% with the mentioned corpus. In this post I will explain how to load a corpus into NLTK, train a tagger with it and then use the tagger with your texts. Furthermore, I’ll show how to save the trained tagger and load it from disk so that you don’t have to re-train it every time you need it.
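To make the train, tag and save cycle more concrete, here is a minimal sketch that uses only the Python standard library. It is not the NLTK-based approach from the full post: it implements a simple most-frequent-tag (unigram) baseline, and the tiny training sample, the function names and the default tag are assumptions made for illustration. A real tagger would be trained on a large corpus such as TIGER.

```python
import pickle
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_tokens(model, tokens, default="NN"):
    """Tag tokens with the learned tag; unknown words get the default tag."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]


# Tiny hand-made training sample with STTS-style tags (illustrative only).
TRAINING = [
    [("Das", "ART"), ("Haus", "NN"), ("ist", "VAFIN"), ("alt", "ADJD")],
    [("Der", "ART"), ("Hund", "NN"), ("ist", "VAFIN"), ("klein", "ADJD")],
]

model = train_unigram_tagger(TRAINING)

# Serialize and restore the model so it need not be re-trained on each run.
restored = pickle.loads(pickle.dumps(model))

print(tag_tokens(restored, ["Der", "Hund", "ist", "alt"]))
```

To persist the model on disk instead of in memory, pickle.dump() and pickle.load() work the same way with a file opened in binary mode.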

Posted in: NLP & Text Analysis, Python

Autocorrecting misspelled Words in Python using HunSpell

July 13, 2016 1:13 pm , Markus Konrad

When you’re dealing with natural language data, especially survey data, misspelled words occur quite often in free-text answers and might cause problems during later analyses. A fast and easy-to-implement approach to deal with these issues is to use a spellchecker and automatically correct misspelled words. I’ll show how to do this with PyHunSpell, a set of Python bindings for the open source spellchecker engine HunSpell, which works with many languages and is also used in well-known software projects like Firefox and OpenOffice.
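The basic idea of automatic correction can be sketched without HunSpell. The following example swaps in the standard library’s difflib for the spellchecker engine and uses a hypothetical miniature vocabulary. The function names and the similarity cutoff are assumptions for illustration, not part of PyHunSpell.

```python
import difflib

# Hypothetical miniature vocabulary; a real spellchecker ships full dictionaries.
VOCABULARY = ["answer", "language", "question", "survey", "spelling"]


def autocorrect(word, vocabulary=VOCABULARY, cutoff=0.75):
    """Return the word unchanged if known, else the closest vocabulary entry."""
    if word.lower() in vocabulary:
        return word
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    # Leave the word untouched when nothing is similar enough.
    return matches[0] if matches else word


def autocorrect_text(text):
    """Correct each whitespace-separated token independently."""
    return " ".join(autocorrect(token) for token in text.split())


print(autocorrect_text("a survay qestion about langauge"))
```

Words that have no sufficiently similar vocabulary entry are left as they are, which avoids replacing names or rare terms with unrelated words.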

Posted in: NLP & Text Analysis, Python

Data Mining OCR PDFs – Getting Things Straight

July 8, 2016 3:19 pm , Markus Konrad

The first article of my series about extracting tabular data from PDFs focused on rather simple cases: cases that allowed us to convert the PDFs to plain text documents and parse the extracted text line by line. We also learned from the first article that the only information we can access in PDFs is the textual data that is distributed across the pages in the form of individual text boxes, which have properties like a position, width, height and the actual content (text). There’s usually no information stored about rows, columns or other table-like structures.

In the next two articles I want to focus on rather complicated documents: PDFs that have complex table structures or are even scans of documents that were processed via Optical Character Recognition (OCR). Such documents are often “messy”: someone scanned hundreds of pages, and of course some of the pages are sloped or skewed and the margins differ. It is mostly impossible to extract structured information from such a messy data source by just converting the PDF to a plain text document as described in the previous article. Hence we must use the attributes of the OCR-processed text boxes (such as the texts’ positions) to recognize patterns in them from which we might infer the table layout.

So the basic goal is to analyse the text boxes and their properties, especially their positions in the form of the distribution of their x- and y-coordinates on the page, and see if we can construct a table layout from that, so that we can “fit” the text boxes into the calculated table cells. This is something that I’ll explain in the third article of this series. Before we can do that, we need to clarify some prerequisites, which I’ll do in this article:

When we use OCR, to what should we pay attention?
How can we extract the text boxes and their properties from a PDF document?
How can we display and inspect the text boxes?
How can we straighten skewed pages?
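As a rough illustration of the last question, the following sketch straightens text box positions with a plain 2D rotation. It assumes that the skew angle can be estimated from two boxes that should lie on the same horizontal line. The coordinates and function names are hypothetical, and this is not necessarily the method used in the full article.

```python
import math


def skew_angle(p_left, p_right):
    """Angle (radians) of the line through two points that should be horizontal."""
    dx = p_right[0] - p_left[0]
    dy = p_right[1] - p_left[1]
    return math.atan2(dy, dx)


def rotate(point, angle, origin=(0.0, 0.0)):
    """Rotate a point by `angle` radians around `origin` (2D rotation matrix)."""
    x, y = point[0] - origin[0], point[1] - origin[1]
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return (origin[0] + x * cos_a - y * sin_a,
            origin[1] + x * sin_a + y * cos_a)


def deskew(points, angle, origin=(0.0, 0.0)):
    """Undo a skew of `angle` radians by rotating all points by -angle."""
    return [rotate(p, -angle, origin) for p in points]


# Hypothetical positions of three text boxes from one skewed table row.
row = [(100.0, 200.0), (300.0, 210.0), (500.0, 220.0)]
angle = skew_angle(row[0], row[-1])
straight = deskew(row, angle, origin=row[0])

# After deskewing, the y-coordinates of the row should (nearly) coincide.
print([round(y, 6) for _, y in straight])
```

On real scans the angle would be estimated from many boxes rather than two, since individual OCR positions are noisy.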

Posted in: Data Mining, PDFs, Python

Linkdump #4

July 8, 2016 9:54 am , Markus Konrad

Python

R

Other

Interesting articles, projects and news

When Should Hacking Be Legal?

> […] a group of academic researchers and journalists is suing the government, challenging the constitutionality of part of CFAA. With the help of the American Civil Liberties Union, they’re targeting the portion of the law that makes it illegal to break private companies’ terms of service […]
> […] although [those terms] are an individual’s agreement with a company, CFAA makes violating them a federal crime. […]
> The four professors bringing the lawsuit are conducting research into racial and other discriminatory biases in online services.
> […] they’re creating an army of fake profiles and tweaking them to look like they belong to a diverse set of people. But using that tactic—one that’s very popular among researchers—could make the professors felons: The terms of most online services, including the largest employment and housing-search websites, prohibit creating multiple profiles, falsifying profile information, and scraping publicly available information with automated scripts.
How Big Data Harms Poor Communities
Fatal Force – US Police Shootings 2016
UK Police Accessed Civilian Data for Fun and Profit, New Report Says
Spies in the Skies
> America is being watched from above. Government surveillance planes routinely circle over most major cities — but usually take the weekends off.
> BuzzFeed News has assembled an unprecedented picture of the operation’s scale and sweep by analyzing aircraft location data collected by the flight-tracking website Flightradar24 from mid-August to the end of December last year, identifying about 200 federal aircraft. Day after day, dozens of these planes circled above cities across the nation.
New Service Sends Summaries of Your Social Media to Landlords, Employers to ‘Assess’ You
Google’s Revolving Door Explorer (US)
IBM’s Watson fed images to estimate water use efficiency in California — California water districts using new data service to estimate water efficiency
Brain research: Faulty MRI software casts doubt on tens of thousands of studies
Beneath the Cloud — Exploring what the Internet is made of
These Maps Show What the Dark Web Looks Like
Protests in Zimbabwe: Government apparently blocks WhatsApp

Posted in: Linkdump

Data Mining PDFs – The simple cases

July 4, 2016 2:57 pm , Markus Konrad

Extracting data from PDFs can be a laborious task. When you only want to extract all text from a PDF and don’t care about which text is a headline or a paragraph or how text boxes relate to each other, you won’t have many headaches with PDFs, because this is quite straightforward to achieve. But if you want to extract structured information (especially tabular data), it really gets cumbersome, because unlike many other document formats, PDFs usually don’t carry any information about row-column relationships, even if it looks like you have a table in front of you when you open a PDF document. From a technical point of view, the only information we usually have in PDFs comes in the form of text boxes, which have attributes like:

position in relation to the page
width and height
font attributes (font family, size, etc.)
the actual content (text) of the text box

So there’s no information in the document like “this text is in row 3, column 5” of a table. All we have are the above attributes, from which we might infer a cell position in a table. In a short series of blog posts I want to explain how this can be done. In this first post I will focus on the “simple cases” of data extraction from PDFs, which means cases where we can extract tabular information without the need to calculate the table cells from the individual text box positions. In the upcoming posts I will explain how to handle the harder cases of PDFs: so-called “sandwich” documents, i.e. PDFs that contain the scanned pages from some document together with “hidden” text from optical character recognition (OCR) of the scanned pages.
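To illustrate what inferring a cell position from such attributes can look like, here is a minimal sketch. The TextBox class, its field names and the tolerance value are assumptions for illustration (font attributes are omitted). It groups boxes into rows by their vertical position and orders each row from left to right.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    left: float    # x position relative to the page
    top: float     # y position relative to the page
    width: float
    height: float
    text: str      # the actual content of the text box


def group_into_rows(boxes, tolerance=3.0):
    """Group boxes whose top differs by at most `tolerance` from the row's first box."""
    rows = []
    for box in sorted(boxes, key=lambda b: (b.top, b.left)):
        if rows and abs(box.top - rows[-1][0].top) <= tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Order the cells of each row from left to right.
    return [sorted(row, key=lambda b: b.left) for row in rows]


# Hypothetical text boxes as they might come out of a PDF, in arbitrary order.
boxes = [
    TextBox(200, 101, 40, 10, "2016"),
    TextBox(50, 100, 60, 10, "Year"),
    TextBox(50, 120, 60, 10, "Count"),
    TextBox(200, 119, 40, 10, "42"),
]

table = [[box.text for box in row] for row in group_into_rows(boxes)]
print(table)
```

The tolerance absorbs small vertical offsets between boxes of the same row. It would need tuning per document, and strongly skewed pages need more than this simple grouping.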

Posted in: Data Mining, PDFs, Python

LATINNO Project Website launched

July 1, 2016 11:02 am , Markus Konrad

I’m happy to announce that the website for the LATINNO project was launched this week. The WZB project LATINNO, led by Thamy Pogrebinschi, collects and analyses data on democratic innovations in Latin America since 1990. Currently the website provides information about the project, the research design and publications, as well as news related to the project. In the near future, a database of coded cases of innovations will be published for open access.

The website was designed by Caroline della Croce and the frontend was implemented by Benedikt Hebeisen, while I implemented the backend and database. This multilingual website was developed in Python with the Django framework. We chose Django because it allows rapid website development, has a clear and well-documented programming model and features an easy-to-use administration backend. We additionally used Django hvad to enable multilingual database content.

Posted in: Databases, Web Development

Linkdump #3

July 1, 2016 11:02 am , Markus Konrad

Python

R

News

Posted in: Linkdump
