Autocorrecting Misspelled Words in Python Using HunSpell

When you’re dealing with natural language data, especially survey data, misspelled words occur quite often in free-text answers and might cause problems during later analyses. A fast and easy-to-implement way to deal with this is to use a spellchecker and automatically correct misspelled words. I’ll show how to do this with PyHunSpell, a set of Python bindings for the open source spellchecker engine HunSpell. HunSpell is also used in well-known software projects like Firefox and OpenOffice, and it works with many languages.

Prerequisites

First you will need to install the packages python-dev and libhunspell-dev with your OS package manager (Linux) or with port/brew (macOS). Then make sure that the dictionaries for the language you’ll be using are installed. You can also do that with your package manager; the packages are named hunspell-en-us for US English, hunspell-de-de for German, etc. Finally, install the Python library PyHunSpell via pip from the Python Package Index.

Usage

Using the Python library is quite simple. First, create a HunSpell instance by specifying the paths to the dictionary (.dic) and affix (.aff) files for your language, e.g. for German:

import hunspell
spellchecker = hunspell.HunSpell('/usr/share/hunspell/de_DE.dic',
                                 '/usr/share/hunspell/de_DE.aff')

If you installed the dictionaries with your system’s package manager, these files usually reside in /usr/share/hunspell/.
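
If you’re not sure where the dictionary files ended up, a small helper can look for a matching .dic/.aff pair. This is only a sketch: find_dictionary is a hypothetical helper name, and apart from /usr/share/hunspell the directories in CANDIDATE_DIRS are my assumptions about other common locations, so adjust them for your system.

```python
import os

# Assumed search locations. Only /usr/share/hunspell is mentioned in this post,
# the other two are guesses at common alternatives.
CANDIDATE_DIRS = ('/usr/share/hunspell', '/usr/share/myspell', '/Library/Spelling')


def find_dictionary(lang, search_dirs=CANDIDATE_DIRS):
    """Return (dic_path, aff_path) for e.g. lang='de_DE', or None if no pair exists."""
    for directory in search_dirs:
        dic_path = os.path.join(directory, lang + '.dic')
        aff_path = os.path.join(directory, lang + '.aff')
        # HunSpell needs both files, so only accept a complete pair
        if os.path.isfile(dic_path) and os.path.isfile(aff_path):
            return dic_path, aff_path
    return None


print(find_dictionary('de_DE'))   # a tuple of two paths, or None if nothing was found
```

The returned tuple can be unpacked straight into the constructor, e.g. hunspell.HunSpell(*paths), once you have checked that it is not None.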

Now we can use it by checking the spelling of a word:

spellchecker.spell('Wörterbuch')   # "Wörterbuch" is "dictionary" in German
>>> True
spellchecker.spell('Wörterbuhc')   # Let's introduce a typo
>>> False

Please note that I’m using Python 3 here so all strings are Unicode strings, hence there’s no problem with the non-ASCII Umlaut. It’s definitely preferable to use Python 3 when you’re dealing with anything other than English texts, because otherwise you’ll have to deal with Python 2.x’s Unicode frustrations.

So we detected that the word “Wörterbuhc” is misspelled. Let’s correct it! We can use the suggest() method for this:

spellchecker.suggest('Wörterbuhc')
>>> [b'W\xf6rterbuch', b'Beiw\xf6rter']

The spellchecker returns a sorted list of words with the most likely correction first. The list can also be empty if it can’t suggest any word. You’ll notice some strange “\xf6” sequences in our suggestions. This is because HunSpell returns the suggestions as bytes objects (hence the little b prefix), not as Unicode strings (str type). The returned bytes are encoded in the dictionary’s encoding, so we need to decode them to get proper strings. First we need to find out the dictionary’s encoding:

spellchecker.get_dic_encoding()
>>> 'ISO8859-1'

So our dictionary is encoded as ISO-8859-1, and luckily we can pass this encoding label directly to the decode() method of a bytes object. We can get the auto-corrected word as follows:

enc = spellchecker.get_dic_encoding()
suggestions = spellchecker.suggest('Wörterbuhc')
autocorrected = suggestions[0].decode(enc)
autocorrected
>>> 'Wörterbuch'
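
If you want all suggestions as strings rather than just the first one, a tiny helper does the job. This is a minimal sketch: decode_suggestions is a hypothetical name, and the isinstance check is there only for the case that a different PyHunSpell version returns str objects instead of bytes (an assumption, I have only seen bytes in the session above).

```python
def decode_suggestions(suggestions, encoding):
    """Return the suggestions as a list of str, decoding any bytes entries."""
    return [s.decode(encoding) if isinstance(s, bytes) else s
            for s in suggestions]


raw = [b'W\xf6rterbuch', b'Beiw\xf6rter']    # the suggestions shown above
print(decode_suggestions(raw, 'ISO8859-1'))  # prints ['Wörterbuch', 'Beiwörter']
```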

Now a word of warning: Autocorrection is not a cure-all. You should definitely check the corrections that are being made to your texts, since every word that the spellchecker doesn’t know will be replaced by the word that most likely “fits”. Especially when you’re working with texts that use many domain-specific words, chances are high that the spellchecker doesn’t know them and produces garbage replacements. In such cases you need to add these words to the spellchecker’s dictionary. This can be done with the add() method of the spellchecker.
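
One way to check the corrections before applying them is to collect every unknown word together with its top suggestion and read through the result. The following is a rough sketch, not part of PyHunSpell: preview_corrections is a hypothetical helper that only relies on the spell(), suggest() and get_dic_encoding() methods shown above.

```python
def preview_corrections(spellchecker, words):
    """Map each unknown word to its top suggestion (None if there is none)."""
    enc = spellchecker.get_dic_encoding()
    preview = {}
    for w in set(words):              # look at each distinct word only once
        if spellchecker.spell(w):
            continue                  # known word, nothing to review
        suggestions = spellchecker.suggest(w)
        if suggestions:
            best = suggestions[0]
            if isinstance(best, bytes):
                best = best.decode(enc)
            preview[w] = best
        else:
            preview[w] = None         # unknown word without any suggestion
    return preview
```

Words in the result that are actually correct (product names, technical terms and so on) are candidates for add(); everything else shows you what the autocorrection would do to your data.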

When we put all this together, we can define a function like this, which takes a previously initialized spellchecker, a sequence of words and optionally a sequence of custom words that should be added to the dictionary first. It will return a list with the corrected words:

def correct_words(spellchecker, words, add_to_dict=()):
    enc = spellchecker.get_dic_encoding()   # get the encoding for later use in decode()

    # add custom words to the dictionary
    for w in add_to_dict:
        spellchecker.add(w)

    # auto-correct words
    corrected = []
    for w in words:
        ok = spellchecker.spell(w)   # check spelling
        if not ok:
            suggestions = spellchecker.suggest(w)
            if suggestions:   # there is at least one suggestion
                best = suggestions[0].decode(enc)   # best suggestion (decoded to str)
                corrected.append(best)
            else:
                corrected.append(w)   # no suggestion available, keep the word as it is
        else:
            corrected.append(w)   # this word is correct

    return corrected
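
The function above expects a sequence of single words, while survey answers usually come as whole sentences. As a rough sketch, you could let a regular expression pick out the words and pass each one to a correction function, leaving punctuation and spacing untouched. autocorrect_text and WORD_RE are hypothetical names, and the pattern is a simplifying assumption: it matches runs of letters only, so hyphenated words are treated as separate words and numbers are skipped.

```python
import re

# One or more Unicode letters (no digits, no underscores)
WORD_RE = re.compile(r'[^\W\d_]+')


def autocorrect_text(text, correct_word):
    """Replace every word in text by correct_word(word), keeping everything else."""
    return WORD_RE.sub(lambda match: correct_word(match.group(0)), text)


# Stand-in for a real spellchecker so that the example runs on its own
fixes = {'Wörterbuhc': 'Wörterbuch'}
print(autocorrect_text('Ein Wörterbuhc, bitte!', lambda w: fixes.get(w, w)))
# prints: Ein Wörterbuch, bitte!
```

With a real spellchecker, the second argument could be something like lambda w: correct_words(spellchecker, [w])[0].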
