Category Archives: NLP & Text Analysis

Slides on practical Topic Modeling: Preparation, evaluation, visualization

I gave a presentation on Topic Modeling from a practical perspective*, using data about the proceedings of plenary sessions of the 18th German Bundestag as provided by offenesparlament.de. The presentation covers preparation of the text data for Topic Modeling, evaluating models using a variety of model quality metrics and visualizing the complex distributions in the models. You can have a look at the slides here:

Probabilistic Topic Modeling with LDA – Practical topic modeling: Preparation, evaluation, visualization

The source code of the example project is available on GitHub. It shows how to perform the preprocessing and model evaluation steps with Python using tmtoolkit. The models can be inspected using pyLDAvis, and some example analyses of the data are performed.

* This presentation builds upon a first session on the theory behind Topic Modeling.

Slides on Topic Modeling – Background, Hyperparameters and common pitfalls

I just uploaded my slides on probabilistic Topic Modeling with LDA. They give an overview of the theory, the basic assumptions and prerequisites of LDA, and some notes on common pitfalls that people often run into when trying out this method for the first time. Furthermore, I added a Jupyter Notebook that contains a toy implementation of the Gibbs sampling algorithm for LDA with lots of comments and plots to illustrate each step of the algorithm.

Topic Model Evaluation in Python with tmtoolkit

Topic modeling is a method for finding abstract topics in a large collection of documents. With it, it is possible to discover the mixture of hidden or “latent” topics that varies from document to document in a given corpus. As an unsupervised machine learning approach, topic models are not easy to evaluate since there is no labelled “ground truth” data to compare with. However, since topic modeling typically requires defining some parameters beforehand (first and foremost the number of topics k to be discovered), model evaluation is crucial in order to find an “optimal” set of parameters for the given data.

Several metrics exist for this task and some of them will be covered in this post. Furthermore, as calculating many models on a large text corpus is a computationally intensive task, I introduce the Python package tmtoolkit, which lets you utilize all available CPU cores in your machine by computing and evaluating the models in parallel.


Slides on Text Preprocessing and Feature Extraction for Quantitative Text Analysis

I’ve recently given a small workshop on Text Preprocessing and Feature Extraction for Quantitative Text Analysis with Python at the WZB. In the first part, we discussed different methods for normalizing, parsing and filtering the raw input text, such as tokenization, Part-of-Speech tagging, stemming and lemmatization. The second part focused on feature extraction, explaining the Bag-of-Words model and the tf-idf approach as prominent examples. Both are the foundation for many text analysis algorithms used in text classification, topic modeling or clustering. The slides emphasize the importance of these processing steps that come before the actual text analysis algorithms are applied, because: garbage in, garbage out.
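To make the two feature extraction approaches more concrete, here is a minimal sketch using only the standard library. It is not taken from the slides: the toy documents and the function names `bag_of_words` and `tfidf` are made up for illustration, and the weighting used (relative term frequency times log(N / document frequency)) is only one of several common tf-idf variants.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return one Counter with raw term counts per tokenized document."""
    return [Counter(tokens) for tokens in docs]


def tfidf(docs):
    """Weight each term by relative frequency times log(N / document frequency)."""
    counts = bag_of_words(docs)
    n_docs = len(docs)
    # Iterating over a Counter yields each distinct term once,
    # so this counts in how many documents a term occurs.
    doc_freq = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append(
            {t: (n / total) * math.log(n_docs / doc_freq[t]) for t, n in c.items()}
        )
    return weights


# Toy corpus (illustrative only): already tokenized and lowercased.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
weights = tfidf(docs)
```

In this toy corpus, "the" occurs in every document and therefore gets a weight of zero, while a term that occurs in only one document, such as "dog", gets the highest weight in its document.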

The explanations on the slides are quite detailed, so I thought putting them online might be informative for others. So here we go:

Slides for Text Processing and Feature Extraction for Quantitative Text Analysis (WZB Python User Group Workshop)

I can recommend the following supplementary resources:

Speeding up NLTK with parallel processing

When doing text processing with NLTK on large corpora, you often need a lot of patience, since even simple methods like word tokenization take quite some time when you’re processing a large amount of text data. This is because NLTK generally does not harness the power of modern multicore computers: the code will only run on a single core even if you have four processing cores in your machine. You will need to add parallel processing of your documents yourself. Fortunately, this is quite straightforward to implement with Python’s multiprocessing module, and I will show how to do this in this small post.
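The basic pattern can be sketched as follows. This is a simplified illustration rather than the code from the full post: the regex-based `tokenize` function is a stand-in so that the example runs without NLTK (a function such as `nltk.word_tokenize` could be swapped in), and `tokenize_parallel` is a hypothetical helper name.

```python
import re
from multiprocessing import Pool

# Simple stand-in tokenizer: runs of word characters or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split a text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def tokenize_parallel(docs, processes=None):
    """Tokenize documents in worker processes; results keep the input order."""
    with Pool(processes=processes) as pool:
        return pool.map(tokenize, docs)


if __name__ == "__main__":
    # The guard prevents worker processes from re-running this part
    # on platforms that start workers by importing the main module.
    docs = ["A first document.", "Another, slightly longer document."]
    print(tokenize_parallel(docs, processes=2))
```

With `processes=None`, the pool uses as many worker processes as the machine reports CPU cores. For very small inputs, the overhead of starting the workers can outweigh the gain.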


Lemmatization of German language text

Lemmatization is the process of finding the base (or dictionary) form of a possibly inflected word — its lemma. It is similar to stemming, which tries to find the “root stem” of a word, but such a root stem is often not a lexicographically correct word, i.e. a word that can be found in dictionaries. A lemma is always a lexicographically correct word.

When using text mining models that depend on term frequency, such as Bag of Words or tf-idf, accurate lemmatization is often crucial, because you might not want to count the occurrences of the terms “book” and “books” separately; you might want to reduce “books” to its lemma “book” so that it is included in the term frequency of “book”.
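The effect on term frequencies can be shown with a tiny sketch. The lemma dictionary below is a hypothetical toy lookup table made up for this example; a real lemmatizer would replace the `lemmatize` function.

```python
from collections import Counter

# Hypothetical toy lemma dictionary, for illustration only.
LEMMAS = {"books": "book", "reads": "read", "reading": "read"}


def lemmatize(token):
    """Look up the lemma of a token; unknown tokens are only lowercased."""
    token = token.lower()
    return LEMMAS.get(token, token)


tokens = ["Books", "book", "reads", "reading", "book"]

# Without lemmatization, inflected forms are counted as separate terms.
raw_counts = Counter(t.lower() for t in tokens)

# With lemmatization, inflected forms add to the count of their lemma.
lemma_counts = Counter(lemmatize(t) for t in tokens)
```

Here `raw_counts` keeps "book" and "books" as two separate terms, whereas `lemma_counts` collects all three occurrences under "book".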

For English, automatic lemmatization is supported in many Python packages, for example in NLTK (via WordNetLemmatizer) or spaCy. For German, however, I could only find the CLiPS pattern package which has limited use (e.g. it cannot handle declined nouns) and is not supported in Python 3. By using the annotated TIGER corpus of the University of Stuttgart, I will try to measure the accuracy of a lemmatizer based on the pattern.de module and will suggest an improved lemmatizer which improves pattern.de’s accuracy by about 10%.


Accurate Part-of-Speech tagging of German texts with NLTK

Part-of-speech tagging or POS tagging of texts is a technique that is often performed in Natural Language Processing. It allows you to disambiguate words by lexical category, such as nouns, verbs and adjectives. This is useful in many cases, for example to filter large corpora of texts for certain word categories only. It is also often a prerequisite of lemmatization.

For English texts, POS tagging is implemented in the pos_tag() function of the widely used Python library NLTK. However, if you’re dealing with other languages, things get trickier. You can try to find a specialized library for your language, for example the pattern library from CLiPS Research Center, which implements POS taggers for German, Spanish and other languages. But apart from this library being only available for Python 2.x, its accuracy is suboptimal — only 84% for German language texts.

Another approach is to use supervised classification for POS tagging, which means that a tagger is trained on a large annotated text corpus such as the TIGER corpus from the Institute for Natural Language Processing / University of Stuttgart. It contains a large set of annotated and POS-tagged German texts. After training with such a dataset, the POS tagging accuracy is about 96% on the mentioned corpus. In this post I will explain how to load a corpus into NLTK, train a tagger with it and then use the tagger with your texts. Furthermore, I’ll show how to save the trained tagger and load it from disk so that you do not have to re-train it every time you need to use it.
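The train, save and load workflow can be sketched without any dependencies. Note that this is not the tagger from the post and will not reach the accuracy mentioned above: it is a simple most-frequent-tag baseline, the function names are made up for this example, and the two training sentences with STTS-style tags are toy data standing in for a real annotated corpus.

```python
import os
import pickle
import tempfile
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    tag_counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            tag_counts[word][tag] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}


def tag(model, tokens, default="NN"):
    """Tag tokens with the learned tags; unknown words get the default tag."""
    return [(tok, model.get(tok, default)) for tok in tokens]


def accuracy(model, tagged_sents, default="NN"):
    """Share of tokens whose predicted tag equals the annotated tag."""
    pairs = [(word, gold) for sent in tagged_sents for word, gold in sent]
    hits = sum(model.get(word, default) == gold for word, gold in pairs)
    return hits / len(pairs)


# Toy training data (illustrative only), standing in for an annotated corpus.
train = [
    [("Das", "ART"), ("Haus", "NN"), ("ist", "VAFIN"), ("alt", "ADJD")],
    [("Der", "ART"), ("Hund", "NN"), ("ist", "VAFIN"), ("klein", "ADJD")],
]
model = train_unigram_tagger(train)

# Save the trained model to disk and load it again instead of re-training.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "tagger.pickle")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

tagged = tag(restored, ["Das", "Auto", "ist", "alt"])
```

In practice, the accuracy should be measured on sentences that were held out from training, otherwise the result is overly optimistic.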


Autocorrecting misspelled Words in Python using HunSpell

When you’re dealing with natural language data, especially survey data, misspelled words occur quite often in free-text answers and might cause problems during later analyses. A fast and easy-to-implement approach to dealing with these issues is to use a spellchecker and automatically correct misspelled words. I’ll show how to do this with PyHunSpell, a set of Python bindings for the open source spellchecker engine HunSpell, which is also used in well-known software projects like Firefox and OpenOffice and works with many languages.
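The correction logic itself is short and can be sketched independently of the spellchecker. In the example below, `autocorrect` is a hypothetical helper that takes the check and suggestion functions as arguments. The toy vocabulary and suggestions are made up so that the example runs without any dictionary files; assuming PyHunSpell is installed, the `spell` and `suggest` methods of a `hunspell.HunSpell` object could be passed instead.

```python
def autocorrect(tokens, spell, suggest):
    """Replace each misspelled token with the first suggestion, if there is one.

    `spell(word)` should return True for correctly spelled words and
    `suggest(word)` should return a list of candidate corrections.
    """
    corrected = []
    for tok in tokens:
        if spell(tok):
            corrected.append(tok)
        else:
            candidates = suggest(tok)
            # Keep the original token when the checker has no suggestion.
            corrected.append(candidates[0] if candidates else tok)
    return corrected


# Toy stand-ins for a real spellchecker (illustrative only).
VOCAB = {"the", "survey", "was", "helpful"}
SUGGESTIONS = {"survay": ["survey"], "helpfull": ["helpful"]}


def toy_spell(word):
    return word.lower() in VOCAB


def toy_suggest(word):
    return SUGGESTIONS.get(word.lower(), [])


result = autocorrect(["the", "survay", "was", "helpfull", "xyzzy"], toy_spell, toy_suggest)
```

Blindly taking the first suggestion can introduce wrong corrections, for example for names or domain-specific terms, so it is worth inspecting a sample of the replacements before using them in an analysis.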


Creating a sparse Document Term Matrix for Topic Modeling via LDA

To do topic modeling with methods like Latent Dirichlet Allocation, it is necessary to build a Document Term Matrix (DTM) that contains the number of term occurrences per document. The rows of the DTM usually represent the documents and the columns represent the whole vocabulary, i.e. the union of the terms that appear across all documents.

The DTM will contain mostly zero values when we deal with natural language documents, because from the vast vocabulary of possible terms from all documents, only a few will be used in the individual documents (even after normalizing the vocabulary with stemming or lemmatization). Hence the DTM will be a sparse matrix in most cases — and this fact should be exploited to achieve good memory efficiency.
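A minimal sketch of the idea: instead of filling a dense documents-by-vocabulary array, only the non-zero counts are stored as (row, column, value) triplets. The function name `build_sparse_dtm` and the toy documents are made up for this example. The three lists correspond to the coordinate (COO) layout that sparse matrix libraries accept, for example `scipy.sparse.coo_matrix((data, (rows, cols)), shape=...)`, assuming SciPy is available.

```python
from collections import Counter


def build_sparse_dtm(docs):
    """Return the vocabulary and the non-zero DTM entries as COO triplets."""
    vocab = sorted({term for doc in docs for term in doc})
    col_index = {term: j for j, term in enumerate(vocab)}
    rows, cols, data = [], [], []
    for i, doc in enumerate(docs):
        # Only terms that actually occur in document i produce an entry.
        for term, n in sorted(Counter(doc).items()):
            rows.append(i)
            cols.append(col_index[term])
            data.append(n)
    return vocab, rows, cols, data


# Toy corpus (illustrative only): already tokenized documents.
docs = [
    ["topic", "model", "topic"],
    ["sparse", "matrix"],
    ["topic", "matrix"],
]
vocab, rows, cols, data = build_sparse_dtm(docs)
```

For this toy corpus, 6 of the 12 cells are stored. With a realistic vocabulary of tens of thousands of terms, the share of non-zero cells is typically far smaller, which is where the memory savings come from.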
