July | 2017 | WZB Data Science Blog

Slides on Text Preprocessing and Feature Extraction for Quantitative Text Analysis

I recently gave a small workshop on Text Preprocessing and Feature Extraction for Quantitative Text Analysis with Python at the WZB. The first part covers different methods for normalizing, parsing and filtering the raw input text, such as tokenization, Part-of-Speech tagging, stemming and lemmatization. The second part focuses on feature extraction, explaining the Bag-of-Words model and the tf-idf approach as prominent examples. Both form the foundation for many text analysis algorithms used in text classification, topic modeling or clustering. The slides emphasize the importance of the processing steps that come before the actual text analysis algorithms are applied, because: garbage in, garbage out.
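To make the two parts a bit more concrete, here is a minimal sketch in plain Python. It is not taken from the slides: the example sentences, the tiny stopword set and the helper names (`tokenize`, `bag_of_words`, `tfidf`) are all made up for illustration, and the tf-idf weighting shown (term count divided by document length, times log(N / document frequency)) is only one of several common variants.

```python
import math
import re
from collections import Counter

# Toy corpus and stopword list, both invented for this example.
docs = [
    "The cats are chasing the mouse.",
    "The dog chased the cat!",
    "Dogs and cats are pets.",
]
STOPWORDS = {"the", "are", "and"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def bag_of_words(token_lists, vocab):
    """Return one row of term counts per document, columns ordered like vocab."""
    rows = []
    for doc_tokens in token_lists:
        counts = Counter(doc_tokens)  # missing terms count as 0
        rows.append([counts[term] for term in vocab])
    return rows


def tfidf(bow_rows):
    """Weight counts by relative term frequency times log(N / document frequency)."""
    n_docs = len(bow_rows)
    n_terms = len(bow_rows[0])
    # Document frequency: in how many documents does each term occur?
    doc_freq = []
    for j in range(n_terms):
        doc_freq.append(sum(1 for row in bow_rows if row[j] > 0))
    weighted = []
    for row in bow_rows:
        doc_len = sum(row)
        new_row = []
        for j, count in enumerate(row):
            tf = count / doc_len
            idf = math.log(n_docs / doc_freq[j])
            new_row.append(tf * idf)
        weighted.append(new_row)
    return weighted


tokens = [tokenize(d) for d in docs]
vocab = sorted(set(t for doc_tokens in tokens for t in doc_tokens))
bow = bag_of_words(tokens, vocab)
weights = tfidf(bow)

print(vocab)
for row in bow:
    print(row)
```

Note that "cat" and "cats" (and "dog" and "dogs") end up as separate columns in the vocabulary. This is exactly the kind of problem that stemming or lemmatization in the preprocessing step is meant to reduce. In the tf-idf weights, "mouse" (which occurs in only one document) gets a higher weight in the first document than "cats" (which occurs in two).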

The explanations on the slides are quite detailed, so I thought putting them online might be informative for others. So here we go:

Slides for Text Processing and Feature Extraction for Quantitative Text Analysis (WZB Python User Group Workshop)

I can recommend the following supplementary resources:

the free NLTK book (focused on English texts, also gives an introduction to working with Python, quite linguistics-heavy)
D. Sarkar, Text Analytics with Python (Apress, 2016) (good overview of many different algorithms and models, also gives an introduction to working with Python; the source code examples are often unnecessarily complicated for beginners, with triple nested list comprehensions!)
Gensim Tutorials
