Reading textual data from CSV and Excel files correctly with pandas

The pandas library is great for data analysis with Python, but it has some caveats and gotchas. One of it is importing textual data from CSV and Excel files that is automatically converted to numeric values when it only consists of digits. This is mostly a nice feature, but sometimes it is not what you want, for example in the case of codes with leading zeros like a FIPS state code. If you have a column with FIPS state codes in your CSV or Excel file, it will show up as an integer series after importing it with pandas, so the FIPS code of ’03’ will become the integer ‘3’.

To prevent pandas from doing this, a good guess would be specifying the dtype directly so that it doesn’t need to be guessed, but unfortunately this is not supported:

import pandas as pd

df = pd.read_excel('some_excelfile.xls', dtype=object)
>>> ValueError: The 'dtype' option is not supported with the 'python' engine

It also doesn’t work with other “engines” yet, so we need another solution: Converters. You can pass a dict that specifies a conversion function for each column (either by column index or column name). For example, if we want to have strings instead of numeric values in the columns with indices 3 and 7, we could pass a dict with the conversion function str() like this:

converters = {col: str for col in (3, 7)}
df = pd.read_excel('some_excelfile.xls', converters=converters)

pandas will not guess the data type of the columns where a conversion function is defined but will use the output type of the conversion function, so we will have a series of strings with the leading zeros as we wanted it.

Creating a sparse Document Term Matrix for Topic Modeling via LDA

To do topic modeling with methods like Latent Dirichlet Allocation, it is necessary to build a Document Term Matrix (DTM) that contains the number of term occurrences per document. The rows of the DTM usually represent the documents and the columns represent the whole vocabulary, i.e. the set union of all terms that appear in all documents.

The DTM will contain mostly zero values when we deal with natural language documents, because from the vast vocabulary of possible terms from all documents, only a few will be used in the individual documents (even after normalizing the vocabulary with stemming or lemmatization). Hence the DTM will be a sparse matrix in most cases — and this fact should be exploited to achieve good memory efficiency.

Read More →

About the WZB Data Science Blog

This blog collects some experiences from my daily work in the Data Science field of the WZB. The posts will focus around the following topics:

  • Data extraction / data mining
  • Data visualization
  • Data analysis

Read More →