I recently gave a small workshop on Text Preprocessing and Feature Extraction for Quantitative Text Analysis with Python at the WZB. The first part covered different methods for normalizing, parsing and filtering raw input text, such as tokenization, Part-of-Speech tagging, stemming and lemmatization. The second part focused on feature extraction, explaining the Bag-of-Words model and the tf-idf approach as prominent examples. Both are the foundation for many algorithms used in text classification, topic modeling or clustering. The slides emphasize the importance of the processing steps that come before the actual text analysis algorithms are applied, because: garbage in, garbage out.
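To give a rough idea of what these steps look like in code, here is a minimal sketch using only the Python standard library. It is not taken from the slides: the toy corpus, the stopword list and the function names are made up for illustration, and the tf-idf variant used here (term frequency relative to document length, times the unsmoothed log(N / df)) is just one of several common definitions.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stopword list, for illustration only.
DOCS = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are animals.",
]
STOPWORDS = {"the", "on", "and", "are"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def bag_of_words(docs):
    """Return the sorted vocabulary and one term-count vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


def tfidf(matrix):
    """Weight counts by tf * idf (tf = count / doc length, idf = log(N / df))."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in matrix:
        total = sum(row) or 1  # guard against empty documents
        weighted.append(
            [(row[j] / total) * math.log(n_docs / df[j]) for j in range(n_terms)]
        )
    return weighted


vocab, counts = bag_of_words(DOCS)
weights = tfidf(counts)
```

Because this sketch does no stemming or lemmatization, "cat" and "cats" end up as separate vocabulary entries, which is exactly the kind of problem the normalization steps are meant to address. In the tf-idf weights, "sat" scores lower than "cat" in the first document because it also occurs in the second one.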
The explanations on the slides are quite detailed, so I thought putting them online might be useful for others. Here they are:
Slides for Text Processing and Feature Extraction for Quantitative Text Analysis (WZB Python User Group Workshop)
I can recommend the following supplementary resources: