Caktus CTO and Co-founder Colin Copeland

Pandas is a powerful Python data analysis tool. It's used heavily in the data science community since its data structures make real-world data analysis significantly easier. At Caktus, in addition to using it for data exploration, we also incorporate it into Extract, Transform, and Load (ETL) processes.

The Southern Coalition for Social Justice’s Open Data Policing website uses Pandas in various capacities. Open Data Policing aggregates, visualizes, and publishes public records related to all known traffic stops in North Carolina, Maryland, and Illinois. The project must parse and clean data provided by state agencies, including the State of Maryland. Maryland provides data in Excel files, which can sometimes be difficult to parse. pandas.read_excel() is also quite slow compared to its _csv() counterparts.

By default, pandas.read_excel() reads the first sheet in an Excel workbook. However, Maryland's data is typically spread over multiple sheets. Luckily, it's fairly easy to extend this functionality to support a large number of sheets:

import pandas as pd

def read_excel_sheets(xls_path):
    """Read all sheets of an Excel workbook and return a single DataFrame"""
    print(f'Loading {xls_path} into pandas')
    xl = pd.ExcelFile(xls_path)
    df = pd.DataFrame()
    columns = None
    for idx, name in enumerate(xl.sheet_names):
        print(f'Reading sheet #{idx}: {name}')
        sheet = xl.parse(name)
        if idx == 0:
            # Save column names from the first sheet to match for append
            columns = sheet.columns
        sheet.columns = columns
        # Assume index of existing data frame when appended
        df = df.append(sheet, ignore_index=True)
    return df

The Maryland data is in the same format across all sheets, so we just stack the sheets together in a single data frame. Now we can load the entire Excel workbook:

stops = read_excel_sheets("data/PIALog_through-20171231.xlsx")

Now it's easy to write this data frame to a CSV:

stops.to_csv("data/stops.01end.csv", index=False)

We can add to this data, too. Maryland sends deltas of data, rather than an updated full data set. So it can be appended to the existing CSV data set by using mode=”a”:

stops2 = read_excel_sheets("data/PIANorthCarolina_02152019.xlsx")
stops2.to_csv("data/stops.01end.csv", mode="a", header=False, index=False)

Now we have a single CSV file with all of the data.

That's it! This certainly isn't a huge data set by any means, but since working with Excel files can be slow and sometimes painful, we hope this short how-to will be helpful!

New Call-to-action
blog comments powered by Disqus



You're already subscribed