Creating a sparse Document Term Matrix for Topic Modeling via LDA

To do topic modeling with methods like Latent Dirichlet Allocation, it is necessary to build a Document Term Matrix (DTM) that contains the number of term occurrences per document. The rows of the DTM usually represent the documents and the columns represent the whole vocabulary, i.e. the set union of all terms that appear in all documents.

The DTM will contain mostly zero values when we deal with natural language documents, because from the vast vocabulary of possible terms from all documents, only a few will be used in the individual documents (even after normalizing the vocabulary with stemming or lemmatization). Hence the DTM will be a sparse matrix in most cases — and this fact should be exploited to achieve good memory efficiency.

The memory consumption of the DTM can be calculated as n_docs * n_vocab * n_bytes, i.e. the number of documents times the number of terms in the vocabulary times the number of bytes needed to store a single value. This number can become quite big even with a medium-sized corpus. In a recent project I processed a corpus of ~19,000 documents with a vocabulary of ~170,000 terms. If you store this in a DTM with 32-bit integers, you get a DTM that consumes about 12 GB. This might already be too much for a desktop computer, and furthermore it simply takes too much processing time to produce such a huge matrix.

In such a case, it's best to exploit the fact that the DTM is a sparse matrix and store only its non-zero values in memory. In Python this can be done with SciPy's coo_matrix ("coordinate list", or COO, format), which can later be used with Python's lda package for topic modeling. The memory and processing time savings can be huge: in my example, the DTM had less than 1% non-zero values. In COO format, such a matrix consumes only about 3 * n_nonzero * n_bytes bytes, which means only about 360 MB instead of 12 GB for the given example.
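The savings can be sketched with a quick back-of-the-envelope calculation, using the corpus size from above and the ~1% non-zero ratio mentioned in the text:

```python
n_docs = 19_000       # number of documents
n_vocab = 170_000     # number of terms in the vocabulary
n_bytes = 4           # bytes per value (32-bit integer)

# dense DTM: one stored value per (document, term) pair
dense_bytes = n_docs * n_vocab * n_bytes
print(dense_bytes / 1024**3)    # ~12 GiB

# sparse COO DTM: value + row index + column index per non-zero entry
n_nonzero = int(0.01 * n_docs * n_vocab)   # assume ~1% non-zero values
sparse_bytes = 3 * n_nonzero * n_bytes
print(sparse_bytes / 1024**2)   # ~370 MiB
```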

Creating a sparse matrix

In order to create a sparse matrix, we need to pass the data to coo_matrix() in a certain format, which is given as follows in the documentation:

coo_matrix((data, (i, j))), with data being an array of all non-zero values, i being an array of row indices for each entry in data and j being an array of column indices for each entry in data (hence the memory consumption of 3 * n_nonzero).

Let’s create a sparse 3×4 matrix which should look like this:

7 0 0 0
0 0 0 8
0 9 0 0

Our non-zero values are [7, 8, 9], the row indices are [0, 1, 2] (7 in row 0, 8 in row 1, etc.) and the column indices are [0, 3, 1] (7 in column 0, 8 in column 3, etc.). So we can create this matrix as follows:

import numpy as np
from scipy.sparse import coo_matrix

data = np.array([7, 8, 9])
rows = np.array([0, 1, 2])
cols = np.array([0, 3, 1])

m = coo_matrix((data, (rows, cols)), shape=(3, 4))

It is best to explicitly define the matrix's shape with the shape parameter, because otherwise it is "inferred from the index arrays", which might not be what you want. When we print this matrix, we'll get the stored non-zero values along with their matrix indices:

  (0, 0)        7
  (1, 3)        8
  (2, 1)        9
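To see why the explicit shape matters: without it, SciPy infers the dimensions from the largest row and column index, so trailing all-zero rows or columns are silently dropped. A small sketch:

```python
import numpy as np
from scipy.sparse import coo_matrix

data = np.array([7, 8, 9])
rows = np.array([0, 1, 2])
cols = np.array([0, 3, 1])

# inferred shape is (max(rows) + 1, max(cols) + 1)
inferred = coo_matrix((data, (rows, cols)))
print(inferred.shape)   # (3, 4) -- happens to be correct here

# but if the matrix were meant to have an extra all-zero row and column,
# only an explicit shape preserves them:
explicit = coo_matrix((data, (rows, cols)), shape=(4, 5))
print(explicit.shape)   # (4, 5)
```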

In order to verify that this is the matrix that we wanted, we can convert it back to a normal “dense” matrix and print it:

  [[7 0 0 0]
   [0 0 0 8]
   [0 9 0 0]]
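The dense output above can be produced with the matrix's toarray() method, which returns an ordinary NumPy array:

```python
import numpy as np
from scipy.sparse import coo_matrix

# same sparse 3x4 matrix as above
m = coo_matrix((np.array([7, 8, 9]),
                (np.array([0, 1, 2]), np.array([0, 3, 1]))),
               shape=(3, 4))

print(m.toarray())
# [[7 0 0 0]
#  [0 0 0 8]
#  [0 9 0 0]]
```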

Creating a sparse DTM

Suppose that we processed a corpus of documents and created a dictionary with their respective terms that looks like the following:

docs = {
    'doc1': ['python', 'text', 'data', 'nlp', 'data', 'matrix', 'mining'],
    'doc2': ['data', 'science', 'data', 'processing', 'cleaning', 'data'],
    'doc3': ['r', 'data', 'science', 'text', 'mining', 'nlp'],
    'doc4': ['programming', 'c', 'algorithms', 'data', 'structures'],
}

We want to create a sparse DTM from that. In order to do that, we need the following:

  • an array of document names — this will represent the rows of the DTM
  • an array with the complete vocabulary (all words used in all documents) — this will represent the columns of the DTM
  • the number of non-zero values in the DTM — think of this as the number of unique terms per document, since each unique term in a document will turn up as a non-zero value in a DTM row

We can compute these things quite easily by using Python’s set type which stores unique elements (i.e. all unique terms in a document) and lets us employ set operations:

vocab = set()   # the complete vocabulary, filled in the loop below
n_nonzero = 0
for docterms in docs.values():
    unique_terms = set(docterms)    # all unique terms of this doc
    vocab |= unique_terms           # set union: add unique terms of this doc
    n_nonzero += len(unique_terms)  # add count of unique terms in this doc

# make a list of document names
# the order will be the same as in the dict
docnames = list(docs.keys())
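For the toy corpus above, this loop yields a vocabulary of 14 terms and 21 non-zero values; a quick self-contained check (repeating the data and loop from above):

```python
docs = {
    'doc1': ['python', 'text', 'data', 'nlp', 'data', 'matrix', 'mining'],
    'doc2': ['data', 'science', 'data', 'processing', 'cleaning', 'data'],
    'doc3': ['r', 'data', 'science', 'text', 'mining', 'nlp'],
    'doc4': ['programming', 'c', 'algorithms', 'data', 'structures'],
}

vocab = set()
n_nonzero = 0
for docterms in docs.values():
    unique_terms = set(docterms)    # all unique terms of this doc
    vocab |= unique_terms           # set union: add unique terms of this doc
    n_nonzero += len(unique_terms)  # add count of unique terms in this doc

print(len(vocab))    # 14 terms in the whole vocabulary
print(n_nonzero)     # 21 non-zero values in the DTM
```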

Now we have everything set up to create our sparse DTM. First we’ll convert some lists/sets to NumPy arrays because we’ll use NumPy functions for fast processing:

docnames = np.array(docnames)
vocab = np.array(list(vocab))  

We will also need an array that holds the indices that would sort vocab. With this array, we can later identify the terms of each document by their index and also the count of how often these terms occur:

vocab_sorter = np.argsort(vocab)    # indices that sort "vocab"

The dimensions of our DTM will be len(docnames) by len(vocab):

ndocs = len(docnames)
nvocab = len(vocab)

We also need the three arrays for creating our coo_matrix as explained before. It's much faster to allocate the memory for these arrays once up front (we already know their size!) and fill them per document than to enlarge them in a loop for each document. It's also good to specify a data type explicitly, in this case np.intc (a "C int", typically a 32-bit integer). Otherwise NumPy would use its default data type of 64-bit floats, which is a waste of memory in our case.

data = np.empty(n_nonzero, dtype=np.intc)     # all non-zero term frequencies at data[k]
rows = np.empty(n_nonzero, dtype=np.intc)     # row index for kth data item (kth term freq.)
cols = np.empty(n_nonzero, dtype=np.intc)     # column index for kth data item (kth term freq.)

Now comes the most important part. We'll write a loop that goes through all documents and fills the three arrays step by step.

For each document's list of terms, we first find the vocabulary indices of those terms via term_indices = vocab_sorter[np.searchsorted(vocab, terms, sorter=vocab_sorter)]. Suppose vocab were ['c', 'd', 'b', 'a']; then vocab_sorter would be [3, 2, 0, 1] (the indices that sort vocab). If the terms in a document were ['d', 'a', 'a', 'b'], np.searchsorted() would locate them in the sorted vocabulary at positions [3, 0, 0, 1], and indexing vocab_sorter with these positions maps them back to indices into the unsorted vocab, so term_indices would be [1, 3, 3, 2].

By using np.unique() we then get only the unique values in term_indices and, as a bonus, also the count of each value (as in a histogram). In this small example, uniq_indices and counts would be [1, 2, 3] and [1, 1, 2] respectively ('d' once, 'b' once, 'a' twice). The former are the column indices in the DTM (remember that the columns represent the vocabulary) and the latter are the actual values in a DTM row (the term counts for this document).

The length of uniq_indices is the number of unique terms in the document (n_vals = len(uniq_indices)), so we fill our three sparse DTM arrays data, rows and cols from the current index ind up to ind_end = ind + n_vals: data[ind:ind_end] = counts and cols[ind:ind_end] = uniq_indices. The only thing left is to save the DTM row indices, which represent the documents. First we find out which index represents our document name: doc_idx = np.where(docnames == docname). Then we fill in this value repeatedly: rows[ind:ind_end] = np.repeat(doc_idx, n_vals).
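The index juggling described here can be checked in isolation with the small 'c, d, b, a' vocabulary:

```python
import numpy as np

vocab = np.array(['c', 'd', 'b', 'a'])
terms = ['d', 'a', 'a', 'b']

vocab_sorter = np.argsort(vocab)   # vocab[vocab_sorter] is sorted: a, b, c, d
print(vocab_sorter)                # [3 2 0 1]

# positions of the terms in the sorted vocabulary, mapped back through
# vocab_sorter to indices into the unsorted vocab
term_indices = vocab_sorter[np.searchsorted(vocab, terms, sorter=vocab_sorter)]
print(term_indices)                # [1 3 3 2]

# unique vocab indices and how often each term occurs in the document
uniq_indices, counts = np.unique(term_indices, return_counts=True)
print(uniq_indices, counts)        # [1 2 3] [1 1 2]  ->  'd' x1, 'b' x1, 'a' x2
```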

We do this all over for all documents. As a Python script this reads as follows:

ind = 0     # current index in the sparse matrix data
# go through all documents with their terms
for docname, terms in docs.items():
    # find the positions of this document's terms in the sorted vocabulary
    # and map them back through vocab_sorter
    # -> array of indices of the terms in (the unsorted) vocab
    term_indices = vocab_sorter[np.searchsorted(vocab, terms, sorter=vocab_sorter)]

    # count the unique terms of the document and get their vocabulary indices
    uniq_indices, counts = np.unique(term_indices, return_counts=True)
    n_vals = len(uniq_indices)  # = number of unique terms
    ind_end = ind + n_vals  # ind to ind_end is the slice that we will fill with data

    data[ind:ind_end] = counts                  # save the counts (term frequencies)
    cols[ind:ind_end] = uniq_indices            # save the column indices: indices into vocab
    doc_idx = np.where(docnames == docname)     # get the document index for the document name
    rows[ind:ind_end] = np.repeat(doc_idx, n_vals)  # save it as repeated value

    ind = ind_end  # resume with next document -> add data to the end

Finally, we can create the coo_matrix:

dtm = coo_matrix((data, (rows, cols)), shape=(ndocs, nvocab), dtype=np.intc)

Let's have a look whether the DTM is correct for our example:

  ['doc1', 'doc4', 'doc3', 'doc2']
  ['science', 'mining', 'c', 'text', 'nlp', 'structures',
   'processing', 'matrix', 'r', 'algorithms', 'data',
   'programming', 'python', 'cleaning']
  [[0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 2, 0, 1, 0],
   [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0],
   [1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
   [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 3, 0, 0, 1]]

Note that neither docnames nor vocab is sorted in any particular way, so don't get confused!
Let's look at the first "1" in the first DTM row: it says that for docnames[0] ("doc1"), vocab[1] ("mining") occurs once, which is correct. Next, the only "2" in the first DTM row: for docnames[0] ("doc1"), vocab[10] ("data") occurs twice, which is also correct. Finally, the only "3" in the fourth DTM row: for docnames[3] ("doc2"), vocab[10] ("data") occurs three times (correct).
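These spot checks can also be automated. The following sketch rebuilds the DTM from the snippets in this post (repeated here so it is self-contained) and verifies that every row sums to the document's token count and that "data" occurs three times in doc2:

```python
import numpy as np
from scipy.sparse import coo_matrix

docs = {
    'doc1': ['python', 'text', 'data', 'nlp', 'data', 'matrix', 'mining'],
    'doc2': ['data', 'science', 'data', 'processing', 'cleaning', 'data'],
    'doc3': ['r', 'data', 'science', 'text', 'mining', 'nlp'],
    'doc4': ['programming', 'c', 'algorithms', 'data', 'structures'],
}

# collect the vocabulary and count the non-zero entries
vocab = set()
n_nonzero = 0
for docterms in docs.values():
    unique_terms = set(docterms)
    vocab |= unique_terms
    n_nonzero += len(unique_terms)

docnames = np.array(list(docs.keys()))
vocab = np.array(list(vocab))
vocab_sorter = np.argsort(vocab)

data = np.empty(n_nonzero, dtype=np.intc)
rows = np.empty(n_nonzero, dtype=np.intc)
cols = np.empty(n_nonzero, dtype=np.intc)

ind = 0
for docname, terms in docs.items():
    term_indices = vocab_sorter[np.searchsorted(vocab, terms, sorter=vocab_sorter)]
    uniq_indices, counts = np.unique(term_indices, return_counts=True)
    ind_end = ind + len(uniq_indices)
    data[ind:ind_end] = counts
    cols[ind:ind_end] = uniq_indices
    rows[ind:ind_end] = np.repeat(np.where(docnames == docname), len(uniq_indices))
    ind = ind_end

dtm = coo_matrix((data, (rows, cols)), shape=(len(docnames), len(vocab)),
                 dtype=np.intc)
dense = dtm.toarray()

# each row must sum to the number of term tokens in that document
for i, docname in enumerate(docnames):
    assert dense[i].sum() == len(docs[str(docname)])

# "data" appears three times in doc2
doc2_row = int(np.where(docnames == 'doc2')[0][0])
data_col = int(np.where(vocab == 'data')[0][0])
assert dense[doc2_row, data_col] == 3
print('DTM passed all checks')
```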

Finally, let's plug this DTM into Python's lda package. A much bigger corpus would be needed for proper topic modeling, but for demonstration purposes let's extract three topics and show the three top words of each:

import lda

model = lda.LDA(n_topics=3, n_iter=1000, random_state=1)
model.fit(dtm)   # lda accepts SciPy sparse matrices as input

topic_word = model.topic_word_
n_top_words = 3

for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

Although the corpus was so small, LDA returns topics that actually make sense regarding the commonly used terms in the documents:

Topic 0: data science cleaning
Topic 1: programming algorithms structures
Topic 2: nlp text mining
