I recently gave a small workshop on Text Preprocessing and Feature Extraction for Quantitative Text Analysis with Python at the WZB. In the first part, we discussed different methods for normalizing, parsing and filtering raw input text, such as tokenization, Part-of-Speech tagging, stemming and lemmatization. The second part focused on feature extraction, explaining the Bag-of-Words model and the tf-idf approach as prominent examples. Both are the foundation for many text analysis algorithms used in text classification, topic modeling or clustering. The slides emphasize the importance of the processing steps that come before the actual text analysis algorithms are applied, because: garbage in, garbage out.
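To give a rough idea of what the two feature extraction approaches do, here is a minimal sketch using only the Python standard library. It is not taken from the slides: the three example sentences, the simple regex tokenizer and all function names are made up for illustration, and the tf-idf variant shown (relative term frequency times `log(N / df)`) is only one of several common formulations.

```python
import math
import re
from collections import Counter

# Tiny made-up corpus, for illustration only
docs = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]

def tokenize(text):
    """Very simple tokenizer: lowercase, keep runs of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(d) for d in docs]

# Vocabulary: all distinct tokens in the corpus, in sorted order
vocab = sorted(set(tok for doc in tokenized for tok in doc))

def bag_of_words(tokens):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

# Bag-of-Words matrix: one row per document, one column per term
bow = [bag_of_words(doc) for doc in tokenized]

# Document frequency: in how many documents does each term occur?
n_docs = len(docs)
doc_freq = [sum(1 for row in bow if row[i] > 0) for i in range(len(vocab))]

# Inverse document frequency (assumed variant: plain log(N / df))
idf = [math.log(n_docs / df) for df in doc_freq]

def tfidf(row):
    """Weight relative term frequencies by inverse document frequency."""
    total = sum(row)
    return [(count / total) * weight for count, weight in zip(row, idf)]

tfidf_matrix = [tfidf(row) for row in bow]

# Show the non-zero tf-idf weights per document
for doc, weights in zip(docs, tfidf_matrix):
    print(doc)
    for term, w in zip(vocab, weights):
        if w > 0:
            print(f"  {term}: {w:.3f}")
```

Note that the sketch does no stemming or lemmatization, so "cat" and "cats" end up as two unrelated columns. This is exactly the kind of issue the preprocessing steps from the first part are meant to address.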
The explanations on the slides are quite detailed, so I thought putting them online might be informative for others. So here we go:
I can recommend the following supplementary resources:
- the free NLTK book (focused on English texts, also gives an introduction to working with Python, quite linguistics-heavy)
- D. Sarkar, Text Analytics with Python (Apress, 2016): a good overview of many different algorithms and models that also gives an introduction to working with Python, although the source code examples are often unnecessarily complicated for beginners (triple nested list comprehensions!)
- Gensim Tutorials