Part-of-speech (POS) tagging is a technique that is often performed in Natural Language Processing. It assigns each word its lexical category, such as noun, verb or adjective, which helps to disambiguate words that can belong to more than one category. This is useful in many cases, for example to filter large text corpora for certain word categories. It is also often a prerequisite for lemmatization.
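To make the filtering use case concrete, here is a minimal sketch. It assumes the text has already been tagged; the (word, tag) pairs below are written by hand with Penn Treebank tags, and `filter_by_tag_prefix` is a hypothetical helper name, not part of any library.

```python
# Hand-written (word, tag) pairs using Penn Treebank tags. In practice
# these would come from a POS tagger instead of being typed in.
tagged = [
    ("The", "DT"), ("quick", "JJ"), ("dog", "NN"), ("chased", "VBD"),
    ("a", "DT"), ("cat", "NN"), ("in", "IN"), ("Berlin", "NNP"),
]


def filter_by_tag_prefix(tagged_tokens, prefixes):
    """Keep only the words whose tag starts with one of the given prefixes."""
    prefixes = tuple(prefixes)
    return [word for word, tag in tagged_tokens if tag.startswith(prefixes)]


# "NN" matches common nouns (NN, NNS) as well as proper nouns (NNP, NNPS).
nouns = filter_by_tag_prefix(tagged, ["NN"])
print(nouns)
```

Matching on a tag prefix rather than an exact tag keeps the filter short, because related tags in this tag set share their first letters.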
For English texts, POS tagging is implemented in the pos_tag() function of the widely used Python library NLTK. However, if you’re dealing with other languages, things get trickier. You can try to find a specialized library for your language, for example the pattern library from the CLiPS Research Center, which implements POS taggers for German, Spanish and other languages. But apart from this library being available only for Python 2.x, its accuracy is suboptimal: only 84% for German language texts.
Another approach is to use supervised classification for POS tagging: a tagger is trained on a large annotated text corpus, such as the TIGER corpus from the Institute for Natural Language Processing at the University of Stuttgart, which contains a large set of annotated and POS-tagged German texts. After training with such a dataset, the POS tagging accuracy is about 96% with the mentioned corpus. In this post I will explain how to load a corpus into NLTK, train a tagger with it and then use the tagger with your texts. Furthermore, I’ll show how to save the trained tagger and load it from disk so that you don’t have to re-train it every time you need it.
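The train, save, load and tag workflow can be sketched with nothing but the standard library. This is a toy stand-in, not the NLTK tagger this post is about: it simply remembers the most frequent tag for every word and falls back to a default tag for unknown words. The three training sentences are hand-written with STTS-style tags rather than taken from TIGER, and the names `train_tagger`, `tag_tokens` and `toy_tagger.pickle` are made up for illustration.

```python
import os
import pickle
import tempfile
from collections import Counter, defaultdict


def train_tagger(tagged_sents):
    """Learn the most frequent tag for every word seen in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag_tokens(model, tokens, default_tag="NN"):
    """Tag known words with their learned tag, unknown words with default_tag."""
    return [(token, model.get(token, default_tag)) for token in tokens]


# Tiny hand-made training set; a real corpus has many thousands of sentences.
train_sents = [
    [("Der", "ART"), ("Hund", "NN"), ("bellt", "VVFIN")],
    [("Die", "ART"), ("Katze", "NN"), ("rennt", "VVFIN")],
    [("Der", "ART"), ("Mann", "NN"), ("liest", "VVFIN")],
]
model = train_tagger(train_sents)

# Save the trained model to disk and load it again, so that it does not
# have to be re-trained on every run.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_tagger.pickle")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        loaded_model = pickle.load(f)

# "Vogel" never appeared in training, so it receives the default tag.
result = tag_tokens(loaded_model, ["Der", "Hund", "rennt", "Vogel"])
print(result)
```

The model here is a plain dictionary, which keeps pickling trivial. A tagger trained on a real corpus would also need a held-out test set to measure accuracy figures like the ones quoted above.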