Creating a sparse Document Term Matrix for Topic Modeling via LDA

To do topic modeling with methods like Latent Dirichlet Allocation (LDA), it is necessary to build a Document Term Matrix (DTM) that contains the number of term occurrences per document. The rows of the DTM usually represent the documents and the columns represent the whole vocabulary, i.e. the set union of all terms that appear across all documents.
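For illustration, here is a minimal sketch in plain Python of what such a DTM looks like for a tiny made-up corpus (the documents and terms are just assumptions for this example, not part of the original post):

```python
from collections import Counter

# Hypothetical toy corpus: three already tokenized documents.
docs = [
    ['topic', 'model', 'lda'],
    ['sparse', 'matrix', 'lda', 'lda'],
    ['topic', 'sparse'],
]

# The vocabulary is the union of all terms across all documents.
vocab = sorted(set(term for doc in docs for term in doc))

# Dense DTM: one row per document, one column per vocabulary term,
# each cell holding how often that term occurs in that document.
dtm = []
for doc in docs:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocab])

print(vocab)  # ['lda', 'matrix', 'model', 'sparse', 'topic']
for row in dtm:
    print(row)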

When we deal with natural language documents, the DTM will contain mostly zero values, because out of the vast vocabulary collected from all documents, only a small fraction of terms occurs in any individual document (even after normalizing the vocabulary with stemming or lemmatization). Hence the DTM will be a sparse matrix in most cases, and this fact should be exploited to achieve good memory efficiency.
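Because most cells are zero, it usually pays off to store only the nonzero counts. Below is a minimal sketch of how such a sparse DTM could be built with SciPy's coo_matrix; the toy corpus and variable names are again just assumptions for illustration, not the original post's code:

```python
from collections import Counter

import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy corpus of already tokenized documents (in practice the
# tokens would come from a tokenization / normalization pipeline).
docs = [
    ['topic', 'model', 'lda'],
    ['sparse', 'matrix', 'lda', 'lda'],
    ['topic', 'sparse'],
]

# Map each vocabulary term to a column index.
vocab = sorted(set(term for doc in docs for term in doc))
term_index = {term: j for j, term in enumerate(vocab)}

# Record only the nonzero cells in COO (coordinate) format:
# a row index, a column index and a count per (document, term) pair.
rows, cols, counts = [], [], []
for i, doc in enumerate(docs):
    for term, n in Counter(doc).items():
        rows.append(i)
        cols.append(term_index[term])
        counts.append(n)

# Build the sparse DTM and convert it to CSR format, which topic modeling
# libraries that accept sparse matrices can typically work with directly.
dtm = coo_matrix((counts, (rows, cols)),
                 shape=(len(docs), len(vocab)), dtype=np.int64).tocsr()

print(dtm.shape)      # (3, 5)
print(dtm.toarray())  # dense view, only sensible for tiny examples
```

Since only the nonzero cells are stored, the memory footprint grows with the number of (document, term) pairs that actually occur, rather than with the full documents times vocabulary grid.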


About the WZB Data Science Blog

This blog collects some experiences from my daily work in the field of Data Science at the WZB. The posts will focus on the following topics:

  • Data extraction / data mining
  • Data visualization
  • Data analysis
