Accurate Part-of-Speech tagging of German texts with NLTK

Part-of-speech tagging (POS tagging) of texts is a common technique in Natural Language Processing. It disambiguates words by assigning them a lexical category such as noun, verb or adjective. This is useful in many cases, for example when you want to filter a large corpus of texts for certain word categories only. It is also often a prerequisite for lemmatization.

For English texts, POS tagging is implemented in the pos_tag() function of the widely used Python library NLTK. However, if you’re dealing with other languages, things get trickier. You can try to find a specialized library for your language, for example the pattern library from the CLiPS Research Center, which implements POS taggers for German, Spanish and other languages. But apart from this library only being available for Python 2.x, its accuracy is suboptimal: only 84% for German-language texts.
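For comparison, the built-in English tagger can be used right away. A minimal example (assuming the required NLTK resources such as 'punkt' and 'averaged_perceptron_tagger' are installed; the exact tags may vary slightly between NLTK versions):

import nltk

tokens = nltk.word_tokenize('This is a simple test')
print(nltk.pos_tag(tokens))
# e.g. [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('test', 'NN')]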

Another approach is to use supervised classification for POS tagging, which means that a tagger can be trained with a large annotated text corpus as training data, such as the TIGER corpus from the Institute for Natural Language Processing at the University of Stuttgart. It contains a large set of POS-tagged German texts. After training with such a dataset, the POS tagging accuracy is about 96% with this corpus. In this post I will explain how to load a corpus into NLTK, train a tagger with it and then use the tagger on your own texts. Furthermore, I’ll show how to save the trained tagger and load it from disk so that you don’t have to re-train it every time you need to use it.

Obtaining and loading the training corpus

In order to load a training corpus into NLTK, we need to obtain it in a format that NLTK understands. Fortunately, NLTK can read corpora in a wide variety of formats, as the list of corpus submodules shows. As we can see on the download page of the TIGER corpus, the data is available in CONLL09 format, which NLTK understands. So let’s download the latest corpus release in CONLL09 format and read it with NLTK:

import nltk
corp = nltk.corpus.ConllCorpusReader('.', 'tiger_release_aug07.corrected.16012013.conll09',
                                     ['ignore', 'words', 'ignore', 'ignore', 'pos'],
                                     encoding='utf-8')

Reading the file will take some time. It loads the tiger_release_....conll09 file from the current directory (denoted by “.”) and specifies the columns to use from the file (only “words” and “pos”; the rest is ignored). Now we can load the sentences from the corpus and split them into a training set and an evaluation set, as shown in chapter 6 of the NLTK book.

import random

tagged_sents = list(corp.tagged_sents())
random.shuffle(tagged_sents)

# set a split size: use 90% for training, 10% for testing
split_perc = 0.1
split_size = int(len(tagged_sents) * split_perc)
train_sents, test_sents = tagged_sents[split_size:], tagged_sents[:split_size]
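Each element of tagged_sents is one sentence, represented as a list of (word, POS tag) tuples, which is exactly the training format the tagger expects. You can quickly inspect the split like this (the output varies because the sentences were shuffled):

print(len(train_sents), len(test_sents))
print(train_sents[0][:5])   # e.g. something like [('Der', 'ART'), ('Mann', 'NN'), ...]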

Using a tagger and training it

As stated before, we want to use supervised classification in order to train our POS tagger. But in order to do so, we need to decide which features to use for tagging; that is, which properties of a word best describe the word category it belongs to. A classifier-based tagger, which inspects words for prefixes, suffixes and other attributes and also takes the sequence of words into account, provides good results. Such a tagger has been extended for German-language features and evaluated by Philipp Nolte. Using his Python class ClassifierBasedGermanTagger (which you can download from the GitHub page) we can create a tagger and train it with the data from the TIGER corpus:

from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
tagger = ClassifierBasedGermanTagger(train=train_sents)

This will also take some time. Now we can evaluate the tagger by using the other 10% of the data (the test sentences):

accuracy = tagger.evaluate(test_sents)

This should result in a POS tagging accuracy of about 96%. We can now simply try out our tagger with a list of words:

>>> tagger.tag(['Das', 'ist', 'ein', 'einfacher', 'Test'])
[('Das', 'ART'),
 ('ist', 'VAFIN'),
 ('ein', 'ART'),
 ('einfacher', 'ADJA'),
 ('Test', 'NN')]

The words are tagged with the STTS tagset used in the TIGER corpus. The tagger correctly identified the articles (‘ART’), the finite auxiliary verb “ist” (‘VAFIN’), the adjective (‘ADJA’) and the noun (‘NN’).

Note that you can use the tagger as it is now, but you can also train it once more with the complete TIGER corpus instead of just 90% of it, because we don’t need to hold out the 10% of evaluation data any more.
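For example, a final tagger trained on the complete corpus can be created just like before, reusing tagged_sents from above:

# train on all sentences once you're happy with the evaluation result
final_tagger = ClassifierBasedGermanTagger(train=tagged_sents)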

Saving and loading a trained tagger

Now you probably don’t want to re-train the tagger every time you want to use it, because that takes quite some time. Luckily, we can use Python’s pickle module, which can store (“serialize”) a complete Python object on disk and load (“deserialize”) it from there again. Let’s save the whole tagger object to disk: this completely freezes its state and also saves all the information it has “learned” from the training data:

import pickle

with open('nltk_german_classifier_data.pickle', 'wb') as f:
    pickle.dump(tagger, f, protocol=2)

Note that it is important to open the output file in “write binary” ('wb') mode. Also note that you need to specify a protocol number equal to or less than 2 if you want to deserialize the object in a Python 2.x environment later.

Now, when you want to use your tagger in a project, you can easily load it from disk with pickle.load() (again using “read binary” ('rb') mode):

with open('nltk_german_classifier_data.pickle', 'rb') as f:
    tagger = pickle.load(f)

Note that all the modules imported during serialization (nltk, ClassifierBasedGermanTagger) must also be available when loading the tagger from disk!
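After loading, the tagger can be used as before. For example, you can tokenize a German sentence with NLTK’s word_tokenize() and tag the tokens (a small sketch; the exact tags depend on the trained model):

import nltk   # the 'punkt' tokenizer models must be installed

tokens = nltk.word_tokenize('Das ist ein weiterer Test.', language='german')
print(tagger.tag(tokens))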

If you want to download a POS tagger trained with the TIGER corpus, I’ve provided the pickle file, which can be loaded with Python 2 and 3.

Bonus: Quick lemmatization with lemmata from the TIGER corpus and HunSpell

I noticed that the TIGER corpus also contains a lemma for each word. I couldn’t find a quick method for supervised lemmatization in German (something like what WordNet provides for English texts), but the corpus can be used for direct lemma lookup from a table, which is a simple but effective method if your corpus is big (which TIGER is). At first we need to extract the lemmata from the TIGER CONLL file, for which I created the following function:

def read_lemmata_from_tiger_corpus(tiger_corpus_file, valid_cols_n=15, col_words=1, col_lemmata=2):
    """Build a word -> lemma dictionary from a TIGER corpus file in CONLL09 format."""
    lemmata_mapping = {}

    with open(tiger_corpus_file, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            # only consider lines with the expected number of columns (i.e. token lines)
            if len(parts) == valid_cols_n:
                w, lemma = parts[col_words], parts[col_lemmata]
                # skip identical pairs, already known words and placeholder lemmata like '--'
                if w != lemma and w not in lemmata_mapping and not lemma.startswith('--'):
                    lemmata_mapping[w] = lemma

    return lemmata_mapping

This will create a dict with a word-to-lemma mapping. As a fallback we can also use PyHunSpell’s stem() method. I explained how to set up and use PyHunSpell in a previous post. Despite its name, the stem() method does not actually do stemming but lemmatization. It’s not very accurate, but it can serve as a fallback, so that we can lemmatize a list of POS-tagged words like this:

# assumes `spellchecker` is a HunSpell instance with German dictionaries loaded
# (see the previous post) and `tagged_words` is a list of (word, POS tag) tuples
# as returned by tagger.tag()
lemmata_mapping = read_lemmata_from_tiger_corpus('tiger_release_aug07.corrected.16012013.conll09')
spellchecker_enc = spellchecker.get_dic_encoding()

lemmata = []
for w, pos in tagged_words:
    w_lemma = lemmata_mapping.get(w, None)      # direct lookup in the TIGER lemmata
    if not w_lemma:                             # fall back to HunSpell's stem()
        lemmata_hunspell = spellchecker.stem(w)
        if lemmata_hunspell:
            w_lemma = lemmata_hunspell[-1].decode(spellchecker_enc)

    if w_lemma:
        lemmata.append((w_lemma, pos))
    else:
        lemmata.append((w, pos))    # keep the original word if no lemma was found
