Speeding up NLTK with parallel processing

When doing text processing with NLTK on large corpora, you often need a lot of patience: even simple methods like word tokenization take quite some time on large amounts of text data. This is because NLTK usually does not harness the power of modern multi-core machines; the code runs on a single core even if your machine has four of them. You need to add parallel processing of your documents yourself. Fortunately, this is quite straightforward to implement with Python's multiprocessing module, and I will show how to do it in this small post.

Standard (single core) implementation

Let's say we want to word-tokenize a corpus of documents (i.e. split each document into a list of words) and also apply Part-of-Speech (POS) tagging to it. Our example corpus is the Project Gutenberg corpus shipped with NLTK, which contains 18 English-language documents ranging from ~38k to ~4.3m characters. We use NLTK's default functions for tokenization (nltk.word_tokenize) and POS tagging (nltk.pos_tag), so a basic script that loads the raw documents of the corpus and applies both methods looks as follows:

import nltk

# load the raw text of every Gutenberg document, keyed by its file ID
corpus = {f_id: nltk.corpus.gutenberg.raw(f_id)
          for f_id in nltk.corpus.gutenberg.fileids()}

# tokenize and POS-tag each document sequentially
tokens = {}
for f_id, doc in corpus.items():
    tokens[f_id] = nltk.pos_tag(nltk.word_tokenize(doc))
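
If you want to reproduce the measurement outside of IPython's timing magic, you can simply wrap the loop in a timer. This is my own small sketch continuing from the script above; time.perf_counter is just one option for measuring wall-clock time:

import time

start = time.perf_counter()
tokens = {}
for f_id, doc in corpus.items():
    tokens[f_id] = nltk.pos_tag(nltk.word_tokenize(doc))
print(f"single-core run took {time.perf_counter() - start:.1f} s")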

When I run this script in IPython with timing over 5 iterations (i.e. run -t -N5 singleproc.py), I get an average execution time of ~120 s on my machine. A process monitor (such as htop on Linux) shows that only one of the four cores in my machine is used during script execution. Let's build a multi-core version of this script so the other three lazy cores can get to work and hopefully speed up the processing!

Parallel processing (multi-core) implementation

There are two main modules in Python's standard library for parallel processing: threading and multiprocessing. In general, threading should be used for I/O-heavy operations, i.e. reading or writing large files or doing concurrent and/or slow network operations. With threading you could, for example, query multiple websites concurrently rather than one after another, which can improve overall performance. Combining threading with computation-heavy tasks, on the other hand, will not increase performance because of CPython's internal implementation: the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time. In that case you should use multiprocessing instead, which starts separate Python processes in your operating system that can run in parallel. This article has a good explanation of the multiprocessing vs. multithreading topic in Python.
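
To make that difference concrete, here is a small illustration of my own (using concurrent.futures rather than the modules discussed in this post) that runs the same CPU-bound function once with a thread pool and once with a process pool; because of the GIL, only the process-based version can be expected to keep several cores busy:

import math
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    # purely CPU-bound work: no I/O that would release the GIL
    return sum(math.sqrt(i) for i in range(n))

if __name__ == '__main__':
    work = [2_000_000] * 8

    # threads: the GIL allows only one thread to run Python bytecode at a time
    with ThreadPoolExecutor(max_workers=4) as ex:
        list(ex.map(cpu_bound, work))

    # processes: separate interpreters, so the work can run truly in parallel
    with ProcessPoolExecutor(max_workers=4) as ex:
        list(ex.map(cpu_bound, work))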

Since tokenization and POS tagging are computation-intensive tasks, we will use the multiprocessing module. Parallelism can become quite cumbersome if you need to transfer data between processes, synchronize memory access, etc. Luckily, we don't need to worry about any of that in our scenario, because the work can easily be divided and there are no dependencies between the documents. The main idea for improving performance is to distribute the computation evenly across the CPU cores of our machine: on a four-core machine we start four processes, and each process handles a certain number of documents from our corpus. Imagine four workers with four desks and a big stack of books, each book having a different thickness. We can distribute the work so that each desk ends up with a pile of roughly the same height. Luckily, the books are completely independent of each other, so there is no inter-process dependency where worker A would have to wait until worker B has finished a certain book.

In this scenario we could expect a four-fold speed-up with four cores compared to the single-core implementation. However, this is only a theoretical expectation: in practice there is some overhead and communication involved with multiprocessing, so the speed-up will be lower, as we will see later. Furthermore, the distribution of work might not be optimal; if it is done badly, one process might still be crunching numbers while the others are already finished. Distributing work evenly (scheduling) is not trivial; in fact, it is a partition problem and known to be NP-hard. We will not go into detail about this here and instead implement a simple solution using the Pool class of the multiprocessing module. First, let's implement the function that is executed for each document:

def tokenize_and_pos_tag(pair):
    # pair is a (file ID, raw document text) tuple
    f_id, doc = pair
    return f_id, nltk.pos_tag(nltk.word_tokenize(doc))

The function receives a pair tuple with the file/document ID f_id and the document text doc; the document is tokenized and the tokens are then POS-tagged. A tuple containing the file ID and the list of POS-tagged tokens is returned. So far, nothing new. Now we need to create a "pool of workers" to which we pass the workload (the corpus of documents) and the function to apply (tokenize_and_pos_tag). This can be done quite easily using the pool's map method:

import multiprocessing as mp

if __name__ == '__main__':
    # Pool() automatically uses mp.cpu_count() as the number of workers;
    # on my machine mp.cpu_count() is 4 -> 4 worker processes
    with mp.Pool() as pool:
        # dict() turns the returned list of (f_id, tagged tokens) pairs
        # back into a dictionary like in the single-core version
        tokens = dict(pool.map(tokenize_and_pos_tag, corpus.items()))

By default, a pool is created with a number of processes equal to the number of processor cores (you can adjust this with the processes argument). pool.map then takes a function that will be executed by these workers. The second argument is the data, which must be an iterable; it is split into chunks that are submitted to the workers, and each worker gets a chunk of the data to work on. The chunk size can be set explicitly via the optional chunksize argument of pool.map; otherwise CPython picks a default based on the length of the iterable and the number of worker processes (roughly the number of items divided by four times the number of processes).
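
If you want a finer-grained distribution, you can pass chunksize explicitly: with chunksize=1 every document is handed out individually, which adds a bit of communication overhead but avoids one worker receiving several long documents in a single chunk. A minimal variation of the call above:

if __name__ == '__main__':
    with mp.Pool() as pool:
        # hand out one document at a time instead of larger chunks
        tokens = dict(pool.map(tokenize_and_pos_tag, corpus.items(),
                               chunksize=1))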

The full script is then:

import multiprocessing as mp
import nltk

# load the raw text of every Gutenberg document, keyed by its file ID
corpus = {f_id: nltk.corpus.gutenberg.raw(f_id)
          for f_id in nltk.corpus.gutenberg.fileids()}

def tokenize_and_pos_tag(pair):
    # pair is a (file ID, raw document text) tuple
    f_id, doc = pair
    return f_id, nltk.pos_tag(nltk.word_tokenize(doc))


if __name__ == '__main__':
    # Pool() automatically uses mp.cpu_count() as the number of workers;
    # on my machine mp.cpu_count() is 4 -> 4 worker processes
    with mp.Pool() as pool:
        # dict() turns the returned list of (f_id, tagged tokens) pairs
        # back into a dictionary like in the single-core version
        tokens = dict(pool.map(tokenize_and_pos_tag, corpus.items()))

When this script is executed, a process monitor shows that all four processor cores are used most of the time.

Running this on my machine takes ~64 s, which is a speed-up of ~1.9x. This is already a big improvement, but still well below the theoretical four-fold speed-up. Besides the overhead introduced by multiprocessing, this is likely due to the widely varying document lengths, which lead to a sub-optimal distribution of work. Further improvements could therefore be achieved by tuning the scheduling, e.g. by making it dependent on the number of characters per document, as sketched below. For corpora whose documents are of roughly similar length, this simple parallelization approach should be sufficient.
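
One simple way to approximate such length-aware scheduling (my own sketch, not benchmarked here) is to sort the documents by length in descending order and hand them out one at a time, so the longest documents are started first and no worker is left crunching a huge book at the very end:

if __name__ == '__main__':
    # longest documents first, handed out one at a time
    items_by_length = sorted(corpus.items(),
                             key=lambda pair: len(pair[1]),
                             reverse=True)
    with mp.Pool() as pool:
        tokens = dict(pool.map(tokenize_and_pos_tag, items_by_length,
                               chunksize=1))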
