Topic modeling is a method for finding abstract topics in a large collection of documents. It makes it possible to discover the mixture of hidden or “latent” topics that varies from document to document in a given corpus. Because topic modeling is an *unsupervised* machine learning approach, topic models are not easy to evaluate: there is no labelled “ground truth” data to compare with. However, since topic modeling typically requires defining some parameters beforehand (first and foremost the number of topics *k* to be discovered), model evaluation is crucial in order to find an “optimal” set of parameters for the given data.

Several metrics exist for this task and some of them will be covered in this post. Furthermore, as calculating many models on a large text corpus is a computationally intensive task, I introduce the Python package tmtoolkit, which lets you use all available CPU cores in your machine by computing and evaluating the models in parallel.

### Evaluation methods for probabilistic LDA topic models

We will use topic models based on the Latent Dirichlet Allocation (LDA) approach by Blei et al., which is the most popular topic model to date.

Model evaluation is hard when using unlabeled data. The metrics described here all try to assess a model’s quality with theoretical methods in order to find the “best” model. It is still important to check whether this model makes sense in practice. A good way to do that is the human-in-the-loop approach, where a human has to spot manually inserted random “intruder” words or “topic intruders”. Nevertheless, it is wise to first narrow down the candidate models with a theoretical approach.

### Likelihood and perplexity

The *likelihood* measures the probability of the observed data given the model, i.e. how well a model fits the observed data. The higher the likelihood, the better the model for the given data. When using Gibbs sampling to find the topic models, the likelihood can be estimated as described in Griffiths and Steyvers 2004. The Python package *lda* implements this likelihood estimation function as `LDA.loglikelihood()`. Griffiths and Steyvers calculate the overall log-likelihood of a model by taking the harmonic mean of the likelihood values obtained in the Gibbs sampling iterations after a certain number of “burn-in” iterations. Wallach et al. (ICML 2009) raised concerns about the accuracy of this method but note that it reports the ranking of the models correctly, which should be enough for our use cases.
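The harmonic mean of likelihoods cannot be computed naively, because the likelihood values themselves underflow to zero. The following is a minimal sketch of how it could be evaluated in log space. It is not the code used by *lda* or tmtoolkit; the function name `log_harmonic_mean` and the sample values are assumptions made for illustration.

```python
import math

def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of likelihoods, given their logarithms.

    HM = n / sum(1 / p_i)  =>  log HM = log(n) - log(sum(exp(-log p_i)))
    """
    neg = [-ll for ll in log_likelihoods]
    m = max(neg)  # shift by the maximum so that exp() does not overflow
    log_sum = m + math.log(sum(math.exp(x - m) for x in neg))
    return math.log(len(neg)) - log_sum

# hypothetical per-iteration log-likelihoods collected after burn-in
samples = [-1000.0, -1002.0, -998.0]
estimate = log_harmonic_mean(samples)
```

Note that the result is dominated by the lowest likelihood values in the sample, which is one reason the estimator is considered unstable.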

*Perplexity* is also a measure of model quality and in natural language processing it is often used as “perplexity per word”. It describes how well a model predicts a sample, i.e. how much it is “perplexed” by a sample from the observed data. The lower the score, the better the model for the given data. It can be calculated as `exp(-L/N)`, where `L` is the log-likelihood of the model given the sample and `N` is the number of words in the data. Both scikit-learn and gensim have implemented methods to estimate the log-likelihood and also the perplexity of a topic model.
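As a small illustration of the formula, the sketch below converts a log-likelihood into a per-word perplexity. The helper name `perplexity` and the numbers passed to it are made up for this example and are not results from any model.

```python
import math

def perplexity(log_likelihood, n_tokens):
    """Per-word perplexity exp(-L/N) from a natural-log likelihood L."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return math.exp(-log_likelihood / n_tokens)

# hypothetical values, purely for illustration
ppl = perplexity(log_likelihood=-3_000_000.0, n_tokens=400_000)
```

A useful sanity check: a model that assigns each of *V* words the same probability has a perplexity of exactly *V*.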

### Evaluating the posterior distributions’ density or divergence

There are metrics that solely evaluate the posterior distributions (the topic-word and document-topic distributions) without comparing the model with the observed data. Juan et al. describe a method that relies on the pair-wise distances between all topics in the topic-word distribution of a model. They claim that the higher the pair-wise distance between the topics, the higher the *information density* captured by the model. The metric boils down to calculating the cosine similarity `u*v/(|u|*|v|)` (where `|u|` and `|v|` are the L2 norms of the respective vectors) for each pair of distributions `u` and `v` in the posterior topic-word distribution of the topic model and then taking the mean of these similarities. The lower the mean, the less similar the topics are and the better the model (at least by this metric).
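The computation described above can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not the code from tmtoolkit; the function names and the toy topic-word distributions are assumptions.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine similarity u*v / (|u| * |v|) with L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(topic_word):
    """Mean cosine similarity over all pairs of topic rows."""
    pairs = list(combinations(topic_word, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)

# toy topic-word distributions: one row per topic, one column per word
distinct = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]
similar = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]

score_distinct = mean_pairwise_cosine(distinct)  # close to 0
score_similar = mean_pairwise_cosine(similar)    # close to 1
```

With real models the rows would come from the fitted topic-word matrix, which has one row per topic and one column per vocabulary entry.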

Arun et al. criticize this metric for not taking into account the distribution of topics across the documents. They propose a metric that calculates the symmetric Kullback-Leibler divergence between the distribution of variance in the topic-word distribution and the marginal topic distribution. They observed “empirically [that] this divergence […] start[s] to increase once the right number of topics is reached”. Hence the lower the symmetric divergence, the better. Unfortunately, the paper is very vague when it comes to explaining why this empirical finding should work.
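For readers unfamiliar with the symmetric Kullback-Leibler divergence, here is a minimal sketch of that final step only. It does not show how Arun et al. derive the two distributions from the model, and it assumes both inputs are strictly positive and sum to one; the function names and example vectors are illustrative.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Symmetric variant: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# two made-up distributions over three topics
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
divergence = symmetric_kl(p, q)
```

The value is zero only when both distributions are identical, and it grows as they diverge.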

## Finding the best topic model with the Associated Press data

We will use a classic dataset, a collection of 2,246 Associated Press articles containing 10,473 unique words. It can be downloaded in raw format from David Blei’s website or directly as a zipped Python pickle file here. We will try to find an optimal value for the number of topics *k*.

### Computing and evaluating the topic models with *tmtoolkit*

The Python package tmtoolkit comes with a set of functions for evaluating topic models with different parameter sets in parallel, i.e. by utilizing all CPU cores. It uses (or implements) the above metrics for comparing the calculated models.

The main functions for topic modeling reside in the `tmtoolkit.lda_utils` module. Since we will use the lda package, we will need to install it first in order to use the evaluation functions specific for that package. We can start by importing the functions that we need:

```
import matplotlib.pyplot as plt # for plotting the results
plt.style.use('ggplot')
# for loading the data:
from tmtoolkit.utils import unpickle_file
# for model evaluation with the lda package:
from tmtoolkit.lda_utils import tm_lda
# for constructing the evaluation plot:
from tmtoolkit.lda_utils.common import results_by_parameter
from tmtoolkit.lda_utils.visualize import plot_eval_results
```

Next we load the data consisting of the document labels, the vocabulary (unique words) list and the document-term-matrix `dtm`. We make sure the `dtm` has the right dimensions:

```
doc_labels, vocab, dtm = unpickle_file('ap.pickle')
print('%d documents, %d vocab size, %d tokens' % (len(doc_labels), len(vocab), dtm.sum()))
assert len(doc_labels) == dtm.shape[0]
assert len(vocab) == dtm.shape[1]
```

Now we define the parameter sets that should be evaluated. We set a dictionary of constant parameters `const_params` which will be used for each topic model computation and will stay the same. We also set a list of varying parameters `varying_params` containing dictionaries with different parameter values:

```
const_params = dict(n_iter=2000)
ks = list(range(10, 100, 10)) + list(range(100, 300, 20)) + list(range(300, 500, 50)) + [500, 600, 700]
varying_params = [dict(n_topics=k, alpha=1.0/k) for k in ks]
```

Here, we want to calculate different topic models from a sequence of numbers of topics `ks = [10, 20, .. 100, 120, .. 300, 350, .. 500, 600, 700]`. Since we have 26 different values in `ks`, we will create and compare 26 topic models. Note that we also define an `alpha` parameter for each model as `1/k` (see below for a discussion about the alpha and beta hyperparameters in LDA). The parameter names must match the parameters for the respective topic modeling package that is used. Here, we will use `lda` and hence we pass parameters like `n_iter` or `n_topics`, whereas with other packages the parameter names would differ (e.g. `num_topics` instead of `n_topics` in gensim).
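Before starting a long computation it can be worth checking that the parameter grid looks as intended. The snippet below simply rebuilds the grid from above and inspects it; nothing here goes beyond plain Python.

```python
ks = (list(range(10, 100, 10)) + list(range(100, 300, 20))
      + list(range(300, 500, 50)) + [500, 600, 700])
varying_params = [dict(n_topics=k, alpha=1.0 / k) for k in ks]

print(len(ks))            # number of models that will be computed: 26
print(varying_params[0])  # {'n_topics': 10, 'alpha': 0.1}
```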

Now we can start evaluating our models using the `evaluate_topic_models` function in the `tm_lda` module and passing it our list of varying parameters and the dictionary with constant parameters:

```
eval_results = tm_lda.evaluate_topic_models(dtm,
                                            varying_params,
                                            const_params)
```

By default, this will use all your CPU cores to calculate the models and evaluate them in parallel. If we have 4 CPU cores and 26 models to evaluate, tmtoolkit will start 4 subprocesses and distribute the first 4 model calculations to them, leaving 22 in the queue. As soon as any subprocess finishes its model calculation, it picks up the fifth task, and so on. This ensures that all subprocesses (and hence all CPU cores) are busy all the time.
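The scheduling idea can be illustrated with a small simulation that needs no real subprocesses. This is only a sketch of the "next free worker takes the next task" principle, not tmtoolkit's implementation; the function `simulate_schedule` and the task durations are hypothetical.

```python
import heapq

def simulate_schedule(durations, n_workers):
    """Assign tasks in order to whichever worker becomes free first.

    Returns (assignments, makespan), where assignments[i] is the id of
    the worker that ran task i and makespan is the total elapsed time.
    """
    # heap of (time at which the worker becomes free, worker id)
    free_at = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(free_at)
    assignments = []
    for duration in durations:
        t, w = heapq.heappop(free_at)      # earliest available worker
        assignments.append(w)
        heapq.heappush(free_at, (t + duration, w))
    makespan = max(t for t, _ in free_at)
    return assignments, makespan

# hypothetical run times for 26 models; larger k is assumed to take longer
durations = [1.0 + 0.5 * i for i in range(26)]
assignments, makespan = simulate_schedule(durations, n_workers=4)
```

In this simulation the first four tasks go to workers 0 to 3, and the fifth task goes to worker 0 because it finishes its short first task earliest.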

The evaluation function will return a list `eval_results` that contains 2-tuples. Each of these tuples consists of a dictionary with the parameters that have been used to calculate the model and a dictionary of evaluation results returned by the respective metrics. With `results_by_parameter` we restructure the results for the parameter that we’re interested in and that we want to plot on the x-axis:

```
results_by_n_topics = results_by_parameter(eval_results, 'n_topics')
fig, ax = plt.subplots(figsize=(8, 6))
plot_eval_results(fig, ax, results_by_n_topics)
plt.tight_layout()
plt.show()
```

The `plot_eval_results` function creates the plot with all metrics that were calculated during evaluation. Afterwards, we could adjust the plot with matplotlib methods if necessary (e.g. adding a plot title) and finally we show and/or save the plot.
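To make the structure of the evaluation results more concrete, the following sketch builds a tiny list of 2-tuples by hand and regroups it by one parameter. The helper `group_by_parameter` is a hypothetical stand-in written for this example and should not be taken as the actual implementation of `results_by_parameter`; the metric names and values are invented.

```python
def group_by_parameter(eval_results, param):
    """Return (parameter value, metric results) pairs sorted by the value."""
    pairs = [(params[param], metrics) for params, metrics in eval_results]
    return sorted(pairs, key=lambda pair: pair[0])

# hand-written stand-in for the evaluation output: each entry is
# (parameters used for the model, results of the metrics)
eval_results = [
    ({'n_topics': 20, 'alpha': 0.05}, {'metric_a': 0.3, 'metric_b': -5.1}),
    ({'n_topics': 10, 'alpha': 0.1}, {'metric_a': 0.4, 'metric_b': -5.3}),
]

by_k = group_by_parameter(eval_results, 'n_topics')
x_values = [k for k, _ in by_k]                       # values for the x-axis
metric_a = [metrics['metric_a'] for _, metrics in by_k]
```

Sorting by the parameter value matters because the models finish in arbitrary order when they are computed in parallel.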

### Results

The plots show normalized values for the respective metrics, i.e. values scaled to [0, 1] for Arun and Juan, and to [-1, 0] for the log-likelihood. We can see that the log-likelihood reaches its maximum for values of k between 100 and 350. The Arun metric points to values between 200 and 400. The Juan metric starts to reach its minimum around k=100 but does not go up again within the further range of k. This is probably because this method only assesses the topic-word distribution. Since the corpus is quite large (with more than 400,000 words), it is likely that even for large k the calculated metric will be very low, because the “density” in the topic-word distribution (i.e. pair-wise distances between the word distributions per topic) will still be very high.

Please note that for the “loglikelihood” metric only the loglikelihood estimation for the final model is reported, which is not the same as the harmonic mean method as used by Griffiths and Steyvers. The Griffiths and Steyvers method could not be used since it requires a special Python package (*gmpy2*) that was not available on the CPU-cluster machine on which I ran the evaluation. However, the “loglikelihood” will report quite similar results.

### Alpha and beta parameters

Besides the number of topics, there are also the *alpha* and *beta* (sometimes *eta* in literature) parameters. Both are used to define Dirichlet priors that are used in the calculations for the respective posterior distributions. Alpha is the “concentration parameter” for a prior over the document-specific topic distributions and beta for a prior over the topic-specific word distributions. Both specify prior beliefs about the sparsity/homogeneity of topics and words in the corpus.

Alpha plays a role in the sparsity of topics in the documents. A high alpha value means a lower impact of topic sparsity, i.e. it is expected that a document contains a mixture of most topics, whereas a low alpha value means that we expect documents to cover only a few topics. This is also why alpha is often set to a fraction of the number of topics (like *1/k* in our evaluations): With more topics to discover, we expect that each document will contain fewer, but more specific topics. As extreme examples: If we wanted to discover only two topics (*k=2*) then it is very likely that all documents contain both topics (to a different amount) and hence we have a large value of *alpha=1/2*. If we wanted to discover *k=1000* topics, it is very likely that most of the documents will not cover all of the 1000 topics but only a small fraction of them (i.e. the sparsity is high) and hence we take a low value of *alpha=1/1000* to account for this expected sparsity.
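The effect of alpha on sparsity can be seen by sampling from a symmetric Dirichlet distribution directly. The sketch below uses only the standard library; the function names, the chosen alpha values and the "largest share" summary are assumptions made for illustration and are independent of any topic modeling package.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) of dimension k."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k, n_samples=2000, seed=0):
    """Average share of the dominant topic across sampled documents."""
    rng = random.Random(seed)
    total = sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples))
    return total / n_samples

# low alpha: one topic dominates each document (sparse mixtures)
sparse = mean_largest_share(alpha=0.1, k=10)
# high alpha: topics are spread evenly (dense mixtures)
dense = mean_largest_share(alpha=10.0, k=10)
```

With a low alpha the dominant topic takes up most of each sampled document, whereas with a high alpha its share stays close to the uniform value of 1/k.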

Likewise, beta plays a role in the sparsity of words in the topics. A high beta value means a lower impact of word sparsity, i.e. we expect that each topic will contain most of the words of the corpus. These topics will be more “general” and their word probabilities will be more uniform. A low beta value means the topics should be more specific, i.e. their word probabilities will be less uniform, placing higher probabilities on fewer words. Of course this is also connected to the number of topics to be discovered. A high beta means that few, but more general topics are to be discovered, whereas a low beta should be used for a larger number of more specific topics. Griffiths and Steyvers explain that beta “affects the granularity of the model: a corpus of documents can be sensibly factorized into a set of topics at different scales […]. [A] large value of beta would lead the model to find a relatively small number of topics […] whereas smaller values of beta will produce more topics.”

When we run the evaluation with the same alpha parameter and same range of *k* as above, but with beta=0.1 instead of beta=0.01, we see that the log-likelihood reaches its maximum for a lower range of *k*, namely about 70 to 300 (see figure above). The Arun et al. metric points to values for k between 70 and 240. So this confirms our assumptions on beta: a higher beta should be used when trying to find a smaller number of topics. Interestingly, the Juan metric this time also shows a valley in its curve within the range of the given *k* values. This means that a higher value for beta also leads to a lower information density in the topic-word distribution when using a model with many topics.

There are numerous ways of combining these parameters, but it is often not easy to interpret the interactions. The following figures show evaluation results of different scenarios: (1) a fixed value for alpha and a beta depending on *k*, (2) both alpha and beta fixed, (3) both alpha and beta depending on *k*.
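The three scenarios could be expressed as parameter lists in the same style as `varying_params` above. This is a sketch only: the concrete values are hypothetical and not the ones used for the figures, the grid is shortened, and `eta` is the name the lda package uses for the beta hyperparameter.

```python
ks = list(range(10, 100, 10))  # shortened grid, for illustration only

# (1) fixed alpha, beta depending on k
scenario_1 = [dict(n_topics=k, alpha=0.1, eta=1.0 / k) for k in ks]

# (2) both alpha and beta fixed
scenario_2 = [dict(n_topics=k, alpha=0.1, eta=0.01) for k in ks]

# (3) both alpha and beta depending on k
scenario_3 = [dict(n_topics=k, alpha=1.0 / k, eta=1.0 / k) for k in ks]
```

Each list can then be evaluated separately, so that the resulting plots can be compared side by side.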

The LDA hyperparameters alpha, beta and the number of topics are all connected with each other and the interactions are quite complex. It is wrong to think that there is a certain “correct” configuration of parameters for a given set of documents. First of all, it is important to make clear how granular a model should be: should it cover only a few, quite general topics, or should it capture a larger number of more specific topics? Alpha and beta can be set accordingly and a few example models can be calculated (for example by using the `compute_models_parallel` function in tmtoolkit). In most cases a fixed value for beta to define a model’s “granularity” seems reasonable and that’s also what Griffiths and Steyvers recommend. A more fine-tuned model evaluation with a varying alpha parameter (depending on *k*) to find a good number of topics can be done using the explained metrics.

Wallach et al. (NIPS 2009) showed that it can be beneficial to use asymmetric priors over the topics in the documents, which means that certain topics can be used more often than others (slide 25). Such topics usually consist of stopwords or very general terms, which can then be ignored in further analysis. They also show that such models with asymmetric priors perform better when setting the number of topics very high. Still, it is often sufficient in practice to use symmetric priors when removing stopwords beforehand and evaluating a reasonable number of topics with the metrics described.

### Validation on held-out data

Topic models can also be validated on held-out data. Unfortunately, none of the mentioned Python packages for topic modeling properly calculate perplexity on held-out data and tmtoolkit currently does not provide this either. Furthermore, this is even more computationally intensive, especially when doing cross-validation. Still, it would be interesting to compare these results with results from cross-validation, which could be done in future work.
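As a starting point for such future work, the documents would first have to be split into training and test folds. The sketch below shows one way this could be done with the standard library; it covers only the index split, not the model fitting or the held-out perplexity estimate, and the function name `k_fold_indices` is an assumption.

```python
import random

def k_fold_indices(n_docs, n_folds, seed=0):
    """Yield (train_indices, test_indices) for each of n_folds folds."""
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)
    # distribute the shuffled indices round-robin over the folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)

# 2,246 documents as in the Associated Press corpus, 5 folds
splits = list(k_fold_indices(n_docs=2246, n_folds=5))
```

The index lists can be used to select rows of the document-term matrix, so that each model is fitted on the training rows and assessed on the remaining ones.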