I just uploaded my slides on probabilistic Topic Modeling with LDA that give an overview of the theory, the basic assumptions and prerequisites of LDA and some notes on common pitfalls that often happen when trying out this method for the first time. Furthermore I added a Jupyter Notebook that contains a toy implementation of the Gibbs sampling algorithm for LDA with lots of comments and plots to illustrate each step of the algorithm.

## Topic Model Evaluation in Python with tmtoolkit

Topic modeling is a method for finding abstract topics in a large collection of documents. With it, it is possible to discover the mixture of hidden or “latent” topics that varies from document to document in a given corpus. As an *unsupervised* machine learning approach, topic models are not easy to evaluate since there is no labelled “ground truth” data to compare with. However, since topic modeling typically requires defining some parameters beforehand (first and foremost the number of topics *k* to be discovered), model evaluation is crucial in order to find an “optimal” set of parameters for the given data.

Several metrics exist for this task and some of them will be covered in this post. Furthermore, as calculating many models on a large text corpus is a computationally intensive task, I introduce the Python package tmtoolkit which allows to utilize all availabel CPU cores in your machine by computing and evaluating the models in parallel.