Reading textual data from CSV and Excel files correctly with pandas

The pandas library is great for data analysis with Python, but it has some caveats and gotchas. One of it is importing textual data from CSV and Excel files that is automatically converted to numeric values when it only consists of digits. This is mostly a nice feature, but sometimes it is not what you want, for example in the case of codes with leading zeros like a FIPS state code. If you have a column with FIPS state codes in your CSV or Excel file, it will show up as an integer series after importing it with pandas, so the FIPS code of ’03’ will become the integer ‘3’.

To prevent pandas from doing this, a good guess would be specifying the dtype directly so that it doesn’t need to be guessed, but unfortunately this is not supported:

import pandas as pd

df = pd.read_excel('some_excelfile.xls', dtype=object)
>>> ValueError: The 'dtype' option is not supported with the 'python' engine

It also doesn’t work with other “engines” yet, so we need another solution: Converters. You can pass a dict that specifies a conversion function for each column (either by column index or column name). For example, if we want to have strings instead of numeric values in the columns with indices 3 and 7, we could pass a dict with the conversion function str() like this:

converters = {col: str for col in (3, 7)}
df = pd.read_excel('some_excelfile.xls', converters=converters)

pandas will not guess the data type of the columns where a conversion function is defined but will use the output type of the conversion function, so we will have a series of strings with the leading zeros as we wanted it.

Comments are closed.

Post Navigation