Markus Konrad | WZB Data Science Blog

Speeding up NLTK with parallel processing

When doing text processing with NLTK on large corpora, you often need a lot of patience since even simple methods like word tokenization take quite some time when you’re processing a large amount of text data. This is because NLTK does not often harness the power of modern multicore computers — the code will only run on a single core even if you have four processing cores in your machine. You will need to add parallel processing of your documents yourself. Fortunately this is quite straight forward to implement with Python’s multiprocessing module and I will show how to do this in this small post.

Lemmatization of German language text

Lemmatization is the process of finding the base (or dictionary) form of a possibly inflected word — its lemma. It is similar to stemming, which tries to find the “root stem” of a word, but such a root stem is often not a lexicographically correct word, i.e. a word that can be found in dictionaries. A lemma is always a lexicographically correct word.

When using text mining models that depend on term frequency, such as Bag of Words or tf-idf, accurate lemmatization is often crucial, because you might not want to count the occurrences of the terms “book”, and “books” separately; you might want to reduce “books” to its lemma “book” so that it is included in the term frequency of “book”.

For English, automatic lemmatization is supported in many Python packages, for example in NLTK (via WordNetLemmatizer) or spaCy. For German, however, I could only find the CLiPS pattern package which has limited use (e.g. it cannot handle declined nouns) and is not supported in Python 3. By using the annotated TIGER corpus of the University of Stuttgart, I will try to measure the accuracy of a lemmatizer based on the pattern.de module and will suggest an improved lemmatizer which improves pattern.de’s accuracy by about 10%.

Interactive Balloon Charts with d3-balloon

As explained before, balloon plots can be a good way to compare many observations with lots of variables. I have now created a small extension for d3.js v4 that allows to implement interactive balloon charts quite easily. It is available on GitHub and can be seen live in action for a recent news article about segregation in schools on the WZB website.

Author Archives: Markus Konrad

R

Python

Interesting articles, projects and news

R

Python

Interesting articles, projects and news

R

Interesting articles, projects and news

R

Python

Interesting articles, projects and news

R

Python

Interesting articles, projects and news

R

Python

Interesting articles, projects and news

R

Python

Interesting articles, projects and news

Post Navigation

Recent posts

Categories

Links

Links

Recent Posts

Recent Comments

Archives

Categories

Meta