Category Archives: Machine Learning

Slides on practical Topic Modeling: Preparation, evaluation, visualization

I gave a presentation on Topic Modeling from a practical perspective*, using data about the proceedings of plenary sessions of the 18th German Bundestag as provided by offenesparlament.de. The presentation covers preparing the text data for Topic Modeling, evaluating models using a variety of model quality metrics, and visualizing the complex distributions in the models. You can have a look at the slides here:

Probabilistic Topic Modeling with LDA – Practical topic modeling: Preparation, evaluation, visualization

The source code of the example project is available on GitHub. It shows how to perform the preprocessing and model evaluation steps in Python using tmtoolkit. The models can be inspected with pyLDAvis, and a few example analyses of the data are included.

* This presentation builds upon a first session on the theory behind Topic Modeling.

Slides on Topic Modeling – Background, Hyperparameters and common pitfalls

I just uploaded my slides on probabilistic Topic Modeling with LDA. They give an overview of the theory, the basic assumptions and prerequisites of LDA, and some notes on common pitfalls that often occur when trying out this method for the first time. Furthermore, I added a Jupyter Notebook that contains a toy implementation of the Gibbs sampling algorithm for LDA with lots of comments and plots to illustrate each step of the algorithm.
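To give a rough idea of what such a toy implementation can look like, here is a minimal sketch of a collapsed Gibbs sampler for LDA in plain Python. It is not the code from the notebook: the function name lda_gibbs, the parameter defaults and the tiny example corpus are all made up for illustration, and documents are assumed to be given as lists of integer word IDs.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative sketch only).

    docs is a list of documents, each a list of word IDs in [0, vocab_size).
    Returns the topic assignments and the two count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                 # tokens per (document, topic)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                               # tokens per topic

    # Start with a random topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the current token out of all counts.
                k_old = z[d][i]
                doc_topic[d][k_old] -= 1
                topic_word[k_old][w] -= 1
                topic_total[k_old] -= 1

                # Unnormalized conditional probability of each topic for this token:
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]

                # Draw a new topic and put the token back into the counts.
                k_new = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][w] += 1
                topic_total[k_new] += 1

    return z, doc_topic, topic_word


# Made-up corpus: two documents use words 0-2, two use words 3-5.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 5], [0, 2, 1, 0], [4, 5, 3, 3]]
z, doc_topic, topic_word = lda_gibbs(docs, n_topics=2, vocab_size=6)
print(doc_topic)
```

The count tables are all the sampler needs to keep: smoothing a row of doc_topic with alpha and normalizing it gives an estimate of that document's topic mixture, and doing the same with beta on a row of topic_word gives an estimate of that topic's word distribution.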

Topic Model Evaluation in Python with tmtoolkit

Topic modeling is a method for finding abstract topics in a large collection of documents. It makes it possible to discover the mixture of hidden or “latent” topics that varies from document to document in a given corpus. As an unsupervised machine learning approach, topic models are not easy to evaluate since there is no labelled “ground truth” data to compare with. However, since topic modeling typically requires defining some parameters beforehand (first and foremost the number of topics k to be discovered), model evaluation is crucial in order to find an “optimal” set of parameters for the given data.

Several metrics exist for this task and some of them will be covered in this post. Furthermore, as calculating many models on a large text corpus is a computationally intensive task, I introduce the Python package tmtoolkit, which allows you to use all available CPU cores in your machine by computing and evaluating the models in parallel.
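As a small illustration of what a model quality metric can look like, the following sketch scores a model by the mean pairwise cosine similarity of its topic-word distributions, where lower values mean the topics are more distinct. This is only an assumed example of one simple metric, not necessarily one of those covered in the post, and it does not use the tmtoolkit API. The function names and the tiny distributions are made up.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_topic_similarity(topic_word_distrib):
    """Average cosine similarity over all pairs of topics (lower = more distinct)."""
    pairs = list(combinations(topic_word_distrib, 2))
    if not pairs:
        raise ValueError("need at least two topics to compare")
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Made-up topic-word distributions over a vocabulary of four words.
distinct_topics = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
overlapping_topics = [[0.4, 0.3, 0.2, 0.1], [0.3, 0.4, 0.1, 0.2]]

print(mean_pairwise_topic_similarity(distinct_topics))
print(mean_pairwise_topic_similarity(overlapping_topics))
```

Computing such a score for models fitted with different values of k gives one curve to compare candidate models by. Since each model can be fitted and scored independently of the others, this kind of evaluation lends itself to being run in parallel.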
