Analyze data with SQL window functions

We regularly use tools like PostgreSQL, Pandas, and Jupyter Notebooks to analyze data here at Caktus. Recently, we were reviewing North Carolina traffic stop data for the NC CopWatch project and had the opportunity to use PostgreSQL's window functions, which are helpful when aggregating data.

To get an idea of why window functions are valuable, let's start with a simple example table:

SELECT
    stop_date
    , driver_race
    , driver_gender
FROM ncstops
ORDER BY stop_date DESC
LIMIT 10;

stop_date      | driver_race | driver_gender
---------------------+-------------+---------------
2022-10-31 22:28:00 | Black       | female
2022-10-31 20:04:00 | Hispanic    | male
2022-10-31 13:55:00 | Black       | male
2022-10-31 08:45:00 | Black       | male
2022-10-31 00:39:00 | Hispanic    | male
2022-10-30 20:00:00 | Black       | male
2022-10-30 19:53:00 | White       | male
2022-10-30 19:41:00 | Black       | female
2022-10-30 18:57:00 | Black       | female
2022-10-30 16:00:00 | Black       | female
(10 rows)

Using this data, say we want to get the percent of traffic stops by race, gender, and hour of day. For example, between 2-3a in the morning, what's the percentage breakdown of stopped drivers by race and gender.

First, cast the timestamp column to a time type to exclude the date portion. A simple SELECT example can help illustrate:

traffic_stops_nc=# SELECT '2022-10-31 22:28:00'::time;
   time
----------
 22:28:00
(1 row)

Then use date_trunc to truncate the time to the hour so it can be grouped together:

traffic_stops_nc=# SELECT date_trunc('hour', '2022-10-31 22:28:00'::time);
 date_trunc
------------
 22:00:00
(1 row)

Now combine this method with a count aggregate function and GROUP BY clause:

SELECT
    date_trunc('hour', stop_date::time) AS hour_of_day
    , driver_race
    , driver_gender
    , count(*) AS stop_count
FROM ncstops
GROUP BY 1, 2, 3
ORDER BY date_trunc('hour', stop_date::time)
LIMIT 15;

hour_of_day |   driver_race   | driver_gender | count
-------------+-----------------+---------------+-------
00:00:00    | Asian           | female        |   300
00:00:00    | Asian           | male          |   636
00:00:00    | Black           | female        | 12065
00:00:00    | Black           | male          | 21407
00:00:00    | Hispanic        | female        |  1314
00:00:00    | Hispanic        | male          |  6464
00:00:00    | Native American | female        |    31
00:00:00    | Native American | male          |    73
00:00:00    | Other           | female        |   160
00:00:00    | Other           | male          |   461
00:00:00    | White           | female        |  6161
00:00:00    | White           | male          | 10246
01:00:00    | Asian           | female        |    42
01:00:00    | Asian           | male          |   126
01:00:00    | Black           | female        |  2231
(15 rows)

So how do we now calculate the percentage breakdown by hour? To start, we want to divide the midnight count values above by the total midnight stops, for example:

SELECT
    count(*) AS stop_count
FROM ncstops
WHERE date_trunc('hour', stop_date::time) = '00:00:00';

However, we want that for every hour. This is where we can use PostgreSQL's window functions:

SELECT
    date_trunc('hour', stop_date::time) AS hour_of_day
    , driver_race
    , driver_gender
    , count(*) AS stop_count
    , sum(count(*)) OVER (PARTITION BY date_trunc('hour', stop_date::time)) AS total_hour_stops
FROM ncstops
GROUP BY 1, 2, 3
ORDER BY date_trunc('hour', stop_date::time)
LIMIT 15;

hour_of_day |   driver_race   | driver_gender | count | total_hour_stops
-------------+-----------------+---------------+-------+------------------
00:00:00    | Asian           | female        |   300 |            59318
00:00:00    | Asian           | male          |   636 |            59318
00:00:00    | Black           | female        | 12065 |            59318
00:00:00    | Black           | male          | 21407 |            59318
00:00:00    | Hispanic        | female        |  1314 |            59318
00:00:00    | Hispanic        | male          |  6464 |            59318
00:00:00    | Native American | female        |    31 |            59318
00:00:00    | Native American | male          |    73 |            59318
00:00:00    | Other           | female        |   160 |            59318
00:00:00    | Other           | male          |   461 |            59318
00:00:00    | White           | female        |  6161 |            59318
00:00:00    | White           | male          | 10246 |            59318
01:00:00    | Asian           | female        |    42 |            10993
01:00:00    | Asian           | male          |   126 |            10993
01:00:00    | Black           | female        |  2231 |            10993
(15 rows)

Let's break this down. A window function performs a calculation across a set of table rows that are somehow related to the current row. We already have the count of stops, so we'll need to sum those counts across multiple rows:

sum(count(*)) OVER ()

However, we only want to sum rows within the same hour. The PARTITION BY clause within OVER divides the rows into partitions that share the same values of the PARTITION BY expressions. So here we want to PARTITION BY the hour of the day:

sum(count(*)) OVER (PARTITION BY date_trunc('hour', stop_date::time)) AS total_hour_stops

That's it! Now the total_hour_stops contains the total of stops per hour, regardless of how the SELECT statement groups the data using GROUP BY. In our example, there are total 59,318 stops between 12:00a and 1:00a in the morning.

Now that we have the data we need, we can use Pandas to load and prepare the data:

import pandas as pd

df = pd.read_sql(
    """
    SELECT
        date_trunc('hour', stop_date::time) AS hour_of_day
        , driver_race
        , driver_gender
        , count(*) AS stop_count
        , sum(count(*)) OVER (PARTITION BY date_trunc('hour', stop_date::time)) AS total_hour_stops
    FROM ncstops
    GROUP BY 1, 2, 3
    ORDER BY date_trunc('hour', stop_date::time)
    """,
    pg_engine,
)
df["hour_of_day"] = df["hour_of_day"].dt.components['hours']
df["percent_stops_for_race_gender"] = df.stop_count / df.total_hour_stops

And Plotly Express to visualize the data:

import plotly.express as px

mask = df["driver_race"].isin(["Black", "White"])
fig = px.line(
    df[mask],
    x="hour_of_day",
    y='percent_stops_for_race_gender',
    color="driver_race",
    title="Percent of traffic stops by race, gender, and hour of day",
    labels={
        'percent_stops_for_race_gender': 'Percent of stops',
        'hour_of_day': 'Hour of day',
        "driver_race": "Driver race",
        "driver_gender": "Driver gender"
    },
    color_discrete_sequence=(px.colors.diverging.Picnic[1], px.colors.diverging.Picnic[7]),
    height=600,
    markers=True,
    line_dash="driver_gender",
    line_dash_map={"male": "solid", "female": "dot"},
    line_group="driver_gender",
)
fig.layout.yaxis.tickformat = ',.0%'
fig.update_traces(textposition="bottom right")

Graph of traffic stop data