A Guide to Using Window Functions

Create running totals, moving averages and rankings with ease in BigQuery.

Photo by Benjamin Voros on Unsplash

If you’ve ever searched for or stumbled upon something like ‘6 SQL skills you need to know to ace that interview’ or perhaps ‘SQL concepts I wish I knew years ago’. Chances are, window functions get their well-deserved mention somewhere in that list.

Window functions are fantastic.

My goal for this article is to help you understand these window functions and how to use them. Once we’ve covered the tutorial, I’ve prepared a few use cases you can run in your project to play around with, as I’ve used public data for these examples.

We will cover:

What is a window function?The syntax of a window function — namely the partition, order by and frame partsA close look at how you can create a 7-day moving average and how it worksWhat aggregate and window functions are at your disposal?Finally, we’ll run through a few use cases to demonstrate how window functions can be applied.

What is a window function?

The term ‘window’ may seem strange to use in SQL (or computing in general). Usually, the name of the function type gives you a glimpse of how they can be used, such as:

Aggregate Functions — take a bunch of things and give you one result that summarises them allString Functions — toolbox full of methods to manipulate words and sentencesArray Functions — work with a collection or group of items all at once

And so on…

Photo by Cosmic Timetraveler on Unsplash

So what is a window function in SQL? Much like a window in the real world, it allows you to view a particular area while the rest remains out of sight. You’re focused solely on what’s being shown through the window.

Returning to the world of data, let’s say you have a table containing monthly sales for liquor stores in IOWA.

The dataset used in this example is publicly accessible, provided by Google and already exists in BigQuery if you want to try these examples yourself. (link)

bigquery-public-data.iowa_liquour_sales.sales

The example below provides a simple view of sales by year and month.

I’ve saved the above as a view to keep our future queries as minimal as we can to focus on applying window functions.

If you want to use this view, you can use spreadsheep-20220603.Dashboard_Datasets.iowa_liqour_monthly_sales.

What if we also want the monthly average for each year as a separate column?

There are a few ways to achieve this, and if you’re new to window functions, you might try calculating the average as a sub-query and then joining to the original table, like below.

This works perfectly fine, but a window function will allow you to reach the same answer without a sub-query!

The window function above allows us to perform an aggregate function, in this case, the avg function, over a specific group of rows defined by partition by year.

Thinking back to the window analogy earlier, the partition by part is our window in that scenario. Sure, we have the entire dataset in front of us, but the partition limits our view to just the year.

Time to dive into the syntax.

Window Syntax

In the example above, we can split the function into two parts, the function name and the window.

In this scenario, the function name is the familiar aggregate function called AVG. The window part, however, is a little different.

Once you’ve specified your function, you start your window function with the over keyword, which must then be followed by parenthesis ().

Inside the parenthesis, you can specify what window we want to perform the aggregate using the partition by keyword, followed by a list of columns you wish to include in your window. Here, we’ve only included one column, year, but later we will bring another column into the mix.

Partition by is optional; if you don’t include a partition by, the aggregate will consist of all rows in your dataset. As this exists in the SELECT statement, it’s worth noting that the WHERE clause will be actioned before this window function.

What do I mean by this? Using the example I shared earlier, I specified a window by using partition by year. However, in my WHERE clause, I have a filter set only to return rows where the year = 2022.

This means the dataset only has one year in view — 2022, when the window function runs. Therefore my partition by year window is redundant, and using the line below would give the same result in this scenario.

Let’s re-run our query earlier and, this time, remove our WHERE clause.

Photo by Nik on Unsplash

Here we can see different values for 2023 and 2022. This is now showing the average monthly sales for each of the years offered.

For example, in line 7, we have 2022 with average monthly sales of 35.7 million, while the average monthly sales for 2023 (so far) is 35.8 million.

Accessing the monthly average data makes it easier to visualise and analyse sales trends. Specifically, it becomes evident that the year’s second half contributes significantly to sales.

We used a window function to figure out the average monthly sales for each year. Then, this function applied the result to all rows with that year. It’s like a left-join sub-query that we saw earlier.

Photo by Daniel K Cheung on Unsplash

Order By

We’ve focused mainly on aggregate functions and how to specify the window so far. We can also determine the order the window should carry out its task, which is a key part of ranking or running total/average solutions.

Returning to the Iowa dataset, let’s expand our view to include store_name and then give stores a numbered monthly ranking based on their total sales.

New View

spreadsheep-20220603.Dashboard_Datasets.iowa_liqour_monthly_sales_inc_store

Unlike aggregates, for the rank function, which is exclusive to window functions, you don’t specify a column inside the function itself.

However, if you try and run this as shown above, you’ll be greeted by an error.

The issue here is we’ve told Bigquery we want to rank our results, but we haven’t specified how they should be ranked, which we can achieve using ORDER BY.

This gives us a view of monthly sales on a store level, with a ranking for each store. You can then take this further and answer other questions, such as What were the top 3 stores for each month in 2022?

In one of the examples towards the end of this article, we’ll use a new clause called QUALIFY, which allows you to filter on results given by window functions easily.

So far, our window functions have applied to all rows in each partition, but what if we only want a subset of the partition? For example, the average daily sales for the last seven days? For this, we need to specify a window frame.

Window Frame

Time to pull in a new dataset, introducing Chicago taxies! This is another public dataset (CC0 license) you can use if you want to experiment. (link)

bigquery-public-data.chicago_taxi_trips.taxi_trips

The public dataset is large at 75GB, quickly eating into your free 100GB monthly query allowance. Therefore, I’ve created a new table that only holds data from 2023, so we can play around with the data without racking up a hefty bill.

I’ve made this table public, so I recommend you try my dataset for testing.

spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data

Anyway, back to the subject…what is a window frame? This clause allows us to define what rows or ranges we need to use inside the partition. A popular use case for this is to create moving averages.

SELECT
date(trip_start_timestamp) as trip_start_date,
round(sum(trip_total),2) as trip_total_revenue
FROM
`spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data`
WHERE
date(trip_start_timestamp) between “2023-05-01” and “2023-06-30”
GROUP BY
trip_start_date
ORDER BY
trip_start_date DESC

This query provides us with revenue by date between May and June 2023.

Moving averages are very common with time series data, as it allows you to easily compare the performance of a specific day or event month to what result you typically see for a given period.

First, let’s create a simple moving average and to save repeating date conversions and revenue rounding, I’ve put our initial query inside a CTE.

WITH daily_data as(
SELECT
date(trip_start_timestamp) as trip_start_date,
round(sum(trip_total),2) as trip_total_revenue
FROM
`spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data`
WHERE
date(trip_start_timestamp) between “2023-05-01” and “2023-06-30”
GROUP BY
trip_start_date
ORDER BY
trip_start_date DESC
)

SELECT
trip_start_date,
trip_total_revenue,
avg(trip_total_revenue) over (order by trip_start_date asc) as moving_average
FROM
daily_data

If we look at the first five rows, we can see that the first average equals trip_total_revenue. This is because it’s the start of the window as we’ve ordered our data by trip_start_date in asc order. Therefore there isn’t anything to average against yet.

However, we now have a daily average between rows 1 and 2 for the second row. We have a daily average for the third row between rows 1, 2 and 3.

It’s a good start which shows us that our moving average is working, but let’s take it a step further. Let’s create a moving average which only includes the last seven days of revenue, and if the window doesn’t contain seven days, then show a null value.

To specify your window range, there are three keywords you need to remember:

current rowprecedingfollowing

You then construct your window starting with either rows or ranges (I’ll explain the difference between the two later), followed by between <<start>> and <<end>>.

rows between 7 preceding and one preceding

The example above is the window frame we need for our problem. We’ve specified the window to begin seven rows before the current row and end one before the current row.

Here’s a simple example of how this window framework with a sum aggregate.

select
numbers,
sum(numbers) over
(
order by numbers asc
rows between 7 preceding and one preceding
)
as moving_sum_seven
from
test_data

As you can see, when we reach the 8th row, the value of the moving sum reaches 7, where the window now contains seven rows of data. If you switch the window to 6 preceding and current rows, you’ll see the window has shifted to include the current row.

At the end of this section, I’ll provide some use case examples to highlight how they can be used, but back to the task at hand for now!

Let’s put that window range into our moving average.

with daily_data as (
SELECT
date(trip_start_timestamp) as trip_start_date,
round(sum(trip_total),2) as trip_total_revenue
FROM
`spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data`
WHERE
date(trip_start_timestamp) between “2023-05-01” and “2023-06-30”
GROUP BY
trip_start_date
ORDER BY
trip_start_date DESC
)

SELECT
trip_start_date,
trip_total_revenue,
avg(trip_total_revenue) over (order by trip_start_date asc rows between 7 preceding and one preceding) as moving_average
FROM
daily_data
ORDER BY
trip_start_date DESC

Now we have one final challenge, how do we make the value null if the window contains less than seven rows of data? Well, we can use an IF statement to check.

if
(
COUNT(*) OVER (ORDER BY trip_start_date ASC ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING) = 7,
AVG(trip_total_revenue) OVER (ORDER BY trip_start_date ASC ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING),
NULL
) AS moving_average

We’ve brought in a second window function that counts how many rows exist in the window frame, which, if equal to 7, will provide the moving average result.

Difference between ROWS and RANGE

In SQL, both the ROWS and RANGE clauses help control which rows are used by a window function within a group.

The ROWS clause works with a fixed number of rows. It counts a specific number of rows before or after the current row, regardless of their values. These rows are included in the window function.

The RANGE clause works with rows based on their values. It considers the rows with values within a specific range relative to the current row. The actual values determine which rows are included in the window function calculation.

So, while the ROWS clause focuses on the physical position of the rows, the RANGE clause considers the logical value of the rows to determine their inclusion in the window function.

Try this as an example to see it in action

with sales_data as (
SELECT
‘2023-01-01’ AS DATE, 100 AS SALES
UNION ALL
SELECT
‘2023-01-02’ AS DATE, 50 AS SALES
UNION ALL
SELECT
‘2023-01-03’ AS DATE, 250 AS SALES
UNION ALL
SELECT
‘2023-01-03’ AS DATE, 200 AS SALES
UNION ALL
SELECT
‘2023-01-04’ AS DATE, 300 AS SALES
UNION ALL
SELECT
‘2023-01-05’ AS DATE, 150 AS SALES
)

SELECT
*,
SUM(SALES) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total_rows,
SUM(SALES) OVER (ORDER BY DATE RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total_range
FROM sales_data

Take a close look at rows 3 and 4 to compare the two clauses. The ROWS clause adds each row to the total, even if there are duplicate sale dates. But with the RANGE clause, rows with the same sale date are grouped together as one range. For example, in this case, all rows with the date 2023–01–03 will be considered as one range.

Photo by Christopher Gower on Unsplash

What are the window functions?

There are a bunch of functions you can use with a window function.

For aggregate functions, you can try:

SUM: Calculates the sum of a numerical column.AVG: Calculates the average of a numerical column.MIN: Retrieves the minimum value from a column.MAX: Retrieves the maximum value from a column.COUNT: Counts the number of rows in a column.COUNT DISTINCT: Counts the number of distinct values in a column.

Then you have a bunch of new functions exclusive to the window function, known as analytical functions:

ROW_NUMBER: Assigns a unique number to each row within the window frame.RANK: Assigns a rank to each row based on the order specified in the window frame.DENSE_RANK: Assigns a rank to each row, without gaps, based on the order specified in the window frame.LAG: Retrieves the value from a previous row within the window frame.LEAD: Retrieves the value from a subsequent row within the window frame.FIRST_VALUE: Retrieves the value from the first row within the window frame.LAST_VALUE: Retrieves the value from the last row within the window frame.The functions above are hyperlinked to the BigQuery documentation.Photo by Susan Holt Simpson on Unsplash

Working Examples

Daily Running Total

One of the more simple use cases of window functions is the running total. For the Chicago taxies dataset, we could have revenue on a monthly level but require a new column which tracks our total revenue across the year so far.

with daily_data as (
SELECT
date(timestamp_trunc(trip_start_timestamp,month)) as trip_month,
round(sum(trip_total),2) as trip_total_revenue
FROM
`spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data`
WHERE
date(trip_start_timestamp) between “2023-01-01” and “2023-06-30”
GROUP BY
trip_month
ORDER BY
trip_month DESC
)

SELECT
trip_month,
trip_total_revenue,
round(sum(trip_total_revenue) over (order by trip_month asc),2) AS running_total_revenue,
FROM
daily_data
ORDER BY
trip_month DESC

12-Week Moving Average

This article’s tutorial highlights that moving averages are very common when working with time series data.

with daily_data as (
SELECT
date(timestamp_trunc(trip_start_timestamp,week(monday))) as trip_week,
round(sum(trip_total),2) as trip_total_revenue
FROM
`spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data`
WHERE
date(trip_start_timestamp) between “2023-01-01” and “2023-06-30”
GROUP BY
trip_week
ORDER BY
trip_week DESC
)

SELECT
trip_week,
trip_total_revenue,
if
(
COUNT(*) OVER (ORDER BY trip_week ASC ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING) = 12,
AVG(trip_total_revenue) OVER (ORDER BY trip_week ASC ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING),
NULL
) AS moving_average
FROM
daily_data
ORDER BY
trip_week DESC

Plotting the revenue alongside the moving average indicates a positive trend, as the moving average has continually climbed each week since April. Without the moving average, our eyes might be drawn to the lower-performing weeks rather than the bigger picture.

Photo by Randy Fath on Unsplash

Calculating a Z-score for anomaly detection

Z-Score calculation = (x — mean) / standard deviation

A z-score is a way to measure how unusual or typical a number is compared to a group of other numbers. It tells you how far away a specific number is from the group’s average in terms of standard deviations.

if
(
COUNT(*) OVER (ORDER BY trip_start_date ASC ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING) = 30,
round
(
(
trip_total_revenue –
AVG(trip_total_revenue) OVER (ORDER BY trip_start_date ASC ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING)
) / stddev(trip_total_revenue) OVER (ORDER BY trip_start_date ASC ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING)
,1),
NULL
) AS z_score_30_day

In this example, we’ve taken the actual value of trip_total_revenue and subtracted the average daily revenue we’ve seen for the last 30 days.

We then divided that number by the standard deviation for those 30 days. This tells us how close a specific day of revenue is to the average or how many standard deviations that value is from the average.

This is a handy metric to plot on a chart, such as below, as it gives context to your data. Although we only have the last 30 days in view, the z-score compares to the previous 30 days effortlessly, and we can see areas where peaks and dips seem insignificant until the z-score highlights how different that day performed compared to the norm.

With these kinds of reports, you should set a value that suggests you have an abnormal event. I wouldn’t say any dates are abnormal in the chart above, but a typical go-to is a value of 3 (i.e., three standard deviations). However, this depends entirely on the volatility of your data.

Full Querywith daily_data as (
SELECT
(trip_start_timestamp) as trip_start_date,
round(sum(trip_total),2) as trip_total_revenue
FROM
`spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data`
WHERE
date(trip_start_timestamp) between “2023-01-01” and “2023-06-30”
GROUP BY
trip_start_date
ORDER BY
trip_start_date DESC
)

SELECT
trip_start_date,
trip_total_revenue,
if
(
COUNT(*) OVER (ORDER BY trip_start_date ASC ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING) = 30,
round
(
(
trip_total_revenue –
AVG(trip_total_revenue) OVER (ORDER BY trip_start_date ASC ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING)
) / stddev(trip_total_revenue) OVER (ORDER BY trip_start_date ASC ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING)
,1),
NULL
) AS z_score_30_day
FROM
daily_data
ORDER BY
trip_start_date DESCPhoto by Giorgio Trovato on Unsplash

Monthly Top Ranking Performers

There are numerous taxi companies in the Chicago taxies dataset, and we may ask ourselves who were the top 3 performing companies each month.

To achieve this, we can use the rank analytical function, partitioned by trip_month and ordered by trip_total_revenue in descending order.

rank() over (partition by trip_month order by trip_total_revenue desc) AS ranking

However, this will still provide results for all companies in the dataset for each month rather than just the top 3. Therefore, we can utilise the QUALIFY clause, which functions similarly to the WHERE clause, to allow you to filter your data.

The qualify clause can only be used with window functions and can reference window functions you’ve created in your select statement. More details here.

The results below make it clear three main companies are dominating the taxi ring.

with daily_data as (
SELECT
date(timestamp_trunc(trip_start_timestamp,month)) as trip_month,
company,
round(sum(trip_total),2) as trip_total_revenue
FROM
`spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data`
WHERE
date(trip_start_timestamp) between “2023-01-01” and “2023-06-30”
GROUP BY
trip_month,
company
ORDER BY
trip_month DESC
)

SELECT
trip_month,
company,
trip_total_revenue,
rank() over (partition by trip_month order by trip_total_revenue desc) AS ranking
FROM
daily_data
QUALIFY
ranking <= 3
ORDER BY
trip_month DESCPhoto by Towfiqu barbhuiya on Unsplash

Monthly/Quarterly Comparisons

Monthly and quarterly reporting is essential for tracking KPIs and to help gauge the direction the business is heading. However, creating a report in BigQuery, which provides month vs month changes, can prove tricky once you know how.

Once you have your data on the level you want, such as monthly in my example below, you can use the LAG or LEAD functions to return the previous month’s revenue which allows you to calculate the % difference.

You can use LAG or LEAD; both achieve the same depending on how you order your data. As we’re pulling the previous month’s revenue, it makes sense to use lag here.

with daily_data as (
SELECT
date(timestamp_trunc(trip_start_timestamp,month)) as trip_month,
round(sum(trip_total),2) as trip_total_revenue
FROM
`spreadsheep-case-studies-2023.chicago_taxies_2023.trip_data`
WHERE
date(trip_start_timestamp) between “2023-01-01” and “2023-06-30”
GROUP BY
trip_month
ORDER BY
trip_month DESC
)

SELECT
trip_month,
trip_total_revenue,
lead(trip_total_revenue) over (order by trip_total_revenue asc) AS previous_month_revenue,
round
(
(
(
trip_total_revenue – lag(trip_total_revenue) over (order by trip_total_revenue asc)
) / lag(trip_total_revenue) over (order by trip_total_revenue asc)
) * 100
, 1) || “%” AS perc_change
FROM
daily_data
ORDER BY
trip_month DESC

That brings this article to an end. If you have any questions or challenges, please feel free to comment, and I’ll answer as soon as possible.

I frequently write articles for BigQuery and Looker Studio. If you’re interested, consider following me here on Medium for more!

All images, unless otherwise noted, are by the author.

Stay classy folks!
Tom

A Guide to Using Window Functions was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Logo

Oh hi there 👋
It’s nice to meet you.

Sign up to receive awesome content in your inbox, every month.

We don’t spam!

Leave a Comment

Scroll to Top