Lemmatization of German language text

Lemmatization is the process of finding the base (or dictionary) form of a possibly inflected word — its lemma. It is similar to stemming, which tries to find the “root stem” of a word, but such a root stem is often not a lexicographically correct word, i.e. a word that can be found in dictionaries. A lemma is always a lexicographically correct word.

When using text mining models that depend on term frequency, such as Bag of Words or tf-idf, accurate lemmatization is often crucial, because you usually do not want to count the occurrences of the terms “book” and “books” separately; instead, you want to reduce “books” to its lemma “book” so that it is included in the term frequency of “book”.
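As a minimal illustration with Python’s collections.Counter:

from collections import Counter

tokens = ['book', 'books', 'books']
print(Counter(tokens))   # Counter({'books': 2, 'book': 1})

lemmata = ['book', 'book', 'book']   # after reducing 'books' to its lemma
print(Counter(lemmata))  # Counter({'book': 3})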

For English, automatic lemmatization is supported in many Python packages, for example in NLTK (via its WordNetLemmatizer) or spaCy. For German, however, I could only find the CLiPS pattern package, which is of limited use (e.g. it cannot handle declined nouns) and does not support Python 3. Using the annotated TIGER corpus of the University of Stuttgart, I will measure the accuracy of a lemmatizer based on the pattern.de module and suggest an improved lemmatizer which raises pattern.de’s accuracy by about 10 percentage points.
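For English, a minimal NLTK sketch looks like this (the WordNet data must be downloaded once via nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('books'))          # -> 'book'
print(lemmatizer.lemmatize('went', pos='v'))  # -> 'go' (a POS hint is needed for verbs)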

Of course, the texts normally need to be tokenized first, and all tokens need to be annotated with Part-of-Speech (POS) tags. This information is already included in the TIGER corpus, but I’ve previously written about how POS tagging can be done accurately for German language texts.

Lemmatization with the pattern.de module can be achieved by using its singularize, conjugate and predicative functions for the respective word classes. As the function below shows, pattern.de only supports singularization of plural nouns; it cannot be used to determine the lemma of a declined noun. In German, however, declension usually only applies to genitive nouns, which are rarely used. Hence we can find lemmata for plural nouns and for all kinds of inflected verbs and adjectives:

from pattern.de import singularize, conjugate, predicative

def lemma_via_patternlib(token, pos):
    if pos.startswith('N'):  # singularize noun (TIGER uses the STTS tags NN and NE)
        return singularize(token)
    elif pos.startswith('V'):  # get infinitive of verb
        return conjugate(token)
    elif pos.startswith('ADJ') or pos.startswith('ADV'):  # get baseform of adjective or adverb
        return predicative(token)

    return token
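For example, a few calls with STTS-tagged tokens (the exact outputs depend on pattern.de’s rules and verb lexicon):

print(lemma_via_patternlib('Autos', 'NN'))      # singularized noun, e.g. 'Auto'
print(lemma_via_patternlib('ging', 'VVFIN'))    # verb infinitive, e.g. 'gehen'
print(lemma_via_patternlib('schönes', 'ADJA'))  # adjective base form, e.g. 'schön'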

When using this function with nouns, adjectives and adverbs in the TIGER corpus, the correct lemma is found in about 74% of the cases. This is already quite good, but it can be improved further by using the TIGER corpus itself as a lemmata dictionary and adding two algorithms for better handling of composite nouns and inflected adjectives:

  1. German nouns can be formed by creating composita from other words, for example “Feinstaubbelastungen”, which consists of “Feinstaub” and “Belastungen”. We can successively split each noun at its possible hyphenation positions into a front and a back part, check whether the back part exists in the lemmata dictionary, look up the lemma for the back part and concatenate the front part with that lemma. In our example, the word hyphenates as “Fein-staub-be-last-ung-en”. The back part candidates that we try to find in our lemmata dictionary are then: “Staubbelastungen”, “Belastungen”, “Lastungen”, “Ungen”, “En”. Since only the first two are actual words, only those could be found in our dictionary. “Belastungen” exists in the lemmata dictionary and we find “Belastung” as its lemma, which we append to the front part “Feinstaub” to get the correct lemma “Feinstaubbelastung”. A sketch of this procedure is shown after the adjective suffix code below.

  2. There are several prevalent German adjective suffixes, and we can build a small dictionary that maps their inflected forms back to the base suffix, which we can use to find the lemma of an inflected adjective (the dictionary shown below is only an illustrative subset):

# illustrative subset: maps inflected suffix forms to their base form
ADJ_SUFFIXES_DICT = {'lichen': 'lich', 'liches': 'lich',
                     'igen': 'ig', 'iges': 'ig'}

def adj_lemma(w):
    for full, reduced in ADJ_SUFFIXES_DICT.items():
        if w.endswith(full):
            return w[:-len(full)] + reduced

    return w
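The compound splitting from step 1 could look roughly like the following minimal sketch. It assumes a lemmata_dict that maps known nouns to their lemmata (as extracted from the TIGER corpus) and uses the pyphen package for finding hyphenation positions; both are assumptions here — see the germalemma code for the actual implementation:

import pyphen

hyphenator = pyphen.Pyphen(lang='de_DE')  # German hyphenation patterns

def lemma_via_composita(token, lemmata_dict):
    # successively split the noun at each possible hyphenation position
    for split_pos in hyphenator.positions(token):
        front, back = token[:split_pos], token[split_pos:]
        # look up the capitalized back part in the lemmata dictionary
        back_lemma = lemmata_dict.get(back.capitalize())
        if back_lemma is not None:
            return front + back_lemma.lower()  # e.g. 'Feinstaub' + 'belastung'
    return None

lemmata_dict = {'Belastungen': 'Belastung'}  # tiny stand-in for the TIGER-based dictionary
print(lemma_via_composita('Feinstaubbelastungen', lemmata_dict))  # -> 'Feinstaubbelastung'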

With these improvements, we should achieve better results than with pattern.de alone and definitely better results than with a fixed lemmata dictionary lookup. Unfortunately, I could not find another German text corpus with POS and lemma annotations to check the results against. So in order to evaluate the improved lemmatizer, I split the TIGER corpus and used 90% as the lemmata dictionary and the remaining 10% as test data, doing ten iterations and shuffling the corpus tokens on each iteration. With this approach, 84% of the words were correctly lemmatized when combining pattern.de with the TIGER-corpus-based lemmatizer.
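The evaluation loop looks roughly like this sketch; corpus_tokens, build_lemmata_dict and lemmatize are hypothetical stand-ins for the actual data structures and functions:

import random

def evaluate(corpus_tokens, n_iter=10):
    # corpus_tokens: list of (token, pos, lemma) triples from the TIGER corpus
    accuracies = []
    for _ in range(n_iter):
        random.shuffle(corpus_tokens)
        split = int(len(corpus_tokens) * 0.9)
        train, test = corpus_tokens[:split], corpus_tokens[split:]
        lemmata_dict = build_lemmata_dict(train)  # hypothetical helper
        n_correct = sum(lemmatize(tok, pos, lemmata_dict) == lemma  # hypothetical lemmatizer
                        for tok, pos, lemma in test)
        accuracies.append(n_correct / len(test))
    return sum(accuracies) / n_iter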

The full code is available as the germalemma package on GitHub. After downloading the TIGER corpus and converting the corpus file (see the instructions in the README file), it can be used directly for lemmatization.
