How to Fix Missing Dates for Time Series Analysis

Learn how to use TVFs in BigQuery to effortlessly generate date ranges for your time series analysis.

My goal for this article is to help you understand TVFs and how to use them by walking you through an example which addresses the missing date issue common in time series analysis.

In some situations, dates with zero data are important and must be shown/included in your dataset. For example:

Businesses can benefit from identifying days with no sales in retail. These days are influenced by holidays or changes in customer behaviour.Finding missing dates in data helps improve data quality by revealing anomalies or outliers caused by system failures or incomplete data capture. Showing missing dates is a useful tool for achieving this goal.

These missing dates can cause issues for analysis and visualisation. Therefore, you need a solution that ensures all dates are present in the output, even if there is no corresponding data.

By the end of this article, you’ll have your very own TVF, which can generate this…

From a single line of code!

We will cover:

How to generate dates to plug missing gaps in your dataHow you can create a TVF and the usage of parametersHow to call a TVFWe’ll look at expanding our date generator for greater flexibility.Finally, I’ll share how you can access my TVF and introduce you to an open-source project called BigFunctions.

The Problem

Consider this scenario: you’ve run a query which provides results of total survey responses by date for the last four weeks. Then, you bring the results into Google Sheets to quickly visualise the data.

The chart above doesn’t highlight any missing data; it looks exactly as expected. Even if you choose to display all dates in the x-axis, you can be forgiven for not noticing the two missing days in July.

How Can We Solve This Problem

Before we get into the TVF topic, let’s talk about my method for solving such a problem and why I bundled it into a TVF.

To solve this problem, I create what I like to call a Date Axis. This column of dates/weeks/months, whatever period you need, is built separately from the dataset you’re analysing. This ensures that the dates are independent and do not rely on the data to exist.

Creating the date axis is reasonably straightforward, although a chore if you frequently need to create one.

Below is a simple example which generates dates between the 19th of June and the 16th of July in 2023.

WITH date_axis as (SELECT
dates
FROM
UNNEST(generate_date_array(“2023-06-19″,”2023-07-16”)) as dates
)

SELECT
dates
FROM
date_axis

The generate_date_array function is the key part of this, but as the function name implies, the output is returned as an array. Therefore, we must unnest (flatten) this array for the next step.

The date axis exists in a CTE as we need to treat it like a separate table to left join our actual data to the list of dates.

WITH date_axis as (SELECT
dates
FROM
UNNEST(generate_date_array(“2023-06-19″,”2023-07-16”)) as dates
)

SELECT
dates,
responses as original_responses,
ifnull(responses,0) as new_responses
FROM
date_axis as axis
LEFT JOIN
`spreadsheep-20220603.Case_Studies.survey_responses` as survey
ON axis.dates = survey.date

As you can see above, we have null values for the 1st and 2nd of July in our survey_responses table because those dates don’t exist. Using the date axis, we can easily spot these and handle them appropriately, which in this scenario, null values are replaced with 0.

Re-plotting our updated data, we now capture the lack of responses at the beginning of July.

What even is a TVF?

TVF is an abbreviation for Table-Valued Function. Much like UDFs (User-Defined Functions), they allow you to specify a series of tasks which will be run whenever your custom function is called.

The difference between the two is the UDF returns a result for each row in your dataset, whereas the TVF returns an entire table.

You may wonder what’s the point if the CTE approach does the job perfectly. Well, in a TVF, we can expand on the features and reusability of the date axis function and streamline our code.

There are many creative and useful ways to use TVFs, and in this article, we’ll be using one to generate a date axis.

Creating a TVF

CREATE OR REPLACE TABLE FUNCTION `spreadsheep-20220603.Case_Studies.generate_dates`(start_date DATE, end_date DATE)
AS (
SELECT
dates
FROM
UNNEST(generate_date_array(start_date,end_date)) as dates
)

Creating the TVF is nice and easy; start with create or replace table function followed by where in your project you want to save your TVF. Then, you can add parameters, which we’ve added two in this example.

start_date DATE, end_date DATE

As shown below, these two parameters replace the static values we added to the generate_date_array function.

unnest(generate_date_array(start_date,end_date)) as dates

When your TVF is created, you can call your new function as if it was a table. Note that I’ve added parenthesis at the end of the FROM clause to specify what values I want the TVF to use, with the 1st of July as the start date and the 7th of July as the end date.

SELECT
dates
FROM
`spreadsheep-20220603.Case_Studies.generate_dates`(“2023-07-01”, “2023-07-07”)

We can now update our original query to use the new TVF.

WITH date_axis as (
SELECT
dates
FROM
`spreadsheep-20220603.Case_Studies.generate_dates`(“2023-06-19”, “2023-07-16”)
)

SELECT
dates,
responses as original_responses,
ifnull(responses,0) as new_responses
FROM
date_axis as axis
LEFT JOIN
`spreadsheep-20220603.Case_Studies.survey_responses` as survey
ON axis.dates = survey.date

Expanding on the TVF

The function so far is quite limiting because it only provides dates. What if we wanted week start dates where the week started on Sundays, or do we want quarterly start and end dates for the last few years?

Although we could add that logic into our CTE which calls the TVF, let’s handle that in the TVF instead so it’s there whenever we need it.

My final version adds a few other possibilities depending on whether you need weekly, monthly, or quarterly date ranges.

CREATE OR REPLACE TABLE FUNCTION `spreadsheep-20220603.Case_Studies.generate_dates`(start_date DATE, end_date DATE)
OPTIONS (description=”Generate a table of dates”) AS (
(
select
date,
format_date(“%a”, date) as day_of_week,
date_trunc(date, week(monday)) as week_start_monday,
date_trunc(date, week(monday)) + 6 as week_end_monday,
date_trunc(date, week(sunday)) as week_start_sunday,
date_trunc(date, week(sunday)) + 6 as week_end_sunday,
date_trunc(date, month) as month_start,
date_add(date_trunc(date, month), interval 1 month) – 1 as month_end,
date_trunc(date, quarter) as quarter_start,
date_add(date_trunc(date, quarter), interval 1 quarter) – 1 as quarter_end,
from unnest(
generate_date_array(
start_date,
end_date
)
) as date
)
);

This gets us the output we saw at the beginning of the article, where a single query line can generate years worth of dates, along with their week, month, and quarterly parts.

As a bonus, this function we created doesn’t query any actual data. Meaning it’s completely free to run and also lightning-fast.

Even generating dates from 1820 to current took just 1 second.

SELECT * FROM `spreadsheep-20220603.Case_Studies.generate_dates`(“1820-07-01″,”2023-07-15”)Photo by Benjamin Davies on Unsplash

Accessing my TVF

To save yourself some time, you don’t have to create this TVF in your project; you can use the public version, which exists in the BigFunctions open-source project.

To add BigFunctions to your project, you can use the explorer add feature and then ‘star a project by name’ as shown below.

These functions are available in every region, and inside each dataset, you will find generate_dates under Routines. Give the code below a try!

SELECT * FROM `bigfunctions.europe_west2.generate_dates`(“2022-01-01”, “2023-01-01”);

More details about BigFunctions can be found here, which is chocked full of great custom functions, some of which even use Python to run all kinds of neat stuff. Check it out if you use BigQuery in your day-to-day role.

That brings this article to an end. If you have any questions, please feel free to comment, and I’ll answer as soon as possible.

I frequently write articles for BigQuery and Looker Studio. If you’re interested, consider following me here on Medium for more!

All images, unless otherwise noted, are by the author.

Stay classy folks!
Tom

How to Fix Missing Dates for Time Series Analysis was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Logo

Oh hi there 👋
It’s nice to meet you.

Sign up to receive awesome content in your inbox, every month.

We don’t spam!

Leave a Comment

Scroll to Top