Anomaly Root Cause Analysis 101

How to find the explanation for every anomaly on your metrics

Photo by Markus Winkler on Unsplash

We use metrics and KPIs to monitor the health of our products: to ensure that everything is stable or the product is growing as expected. But sometimes, metrics change suddenly. Conversions may rise by 10% on one day, or revenue may drop slightly for a few quarters. In such situations, it’s critical for businesses to understand not only what is happening but also why and what actions we should take. And this is where analysts come into play.

My first data analytics role was as a KPI analyst. Anomaly detection and root cause analysis have been my main focus for almost three years. I’ve found key drivers for dozens of KPI changes and developed a methodology for approaching such tasks.

In this article, I would like to share my experience with you, so that the next time you face unexpected metric behaviour, you will have a guide to follow.

What to focus on?

Before moving on to analysis, let’s define our main goal: what we would like to achieve. So what is the purpose of our anomaly root cause analysis?

The most straightforward answer is understanding key drivers for metric change. And it goes without saying that it’s a correct answer from an analyst’s point of view.

But let’s look at it from the business side. The main reason to spend resources on this research is to minimize the potential negative impact on our customers. For example, if the conversion has dropped because of a bug in the new app version released yesterday, it is better to find it today rather than in a month, when hundreds of customers will have already churned.

Our main goal is to minimise the potential negative impact on our customers.

As an analyst, I like having optimization metrics even for my work tasks. Minimizing potential adverse effects sounds like a proper mindset to help us focus on the right things.

So keeping the main goal in mind, I would try to find answers to the following questions:

- Is it a real problem affecting our customers’ behaviour or just a data issue?
- If our customers’ behaviour actually changed, could we do anything about it? What will be the potential effect of different options?
- If it’s a data issue, could we use other tools to monitor the same process? How could we fix the broken process?

Step 1: Do It Yourself

From my experience, the best first action is to reproduce the affected customer journey. For example, suppose the number of orders in the e-commerce app decreased by 10% on iOS. In that case, it’s worth trying to purchase something and double-check whether there are any product issues: buttons are not visible, the banner can’t be closed, etc.

Also, remember to look at logging to ensure that information is captured correctly. Everything may be ok with customer experience, but we may lose data about purchases.

I believe it’s an essential step to start your anomaly investigation. First of all, after DIY, you will better understand the affected part of the customer journey: what are the steps, how data is logged. Secondly, you may find the root cause and save yourself hours of analysis.

Tip: You are more likely to reproduce the issue if the anomaly magnitude is significant, because that means the problem impacts many customers.

Step 2: Check The Data

As we discussed earlier, it’s essential to understand first whether customers are actually affected or it’s just a data anomaly.

I definitely advise you to check that the data is up-to-date. You may see a 50% decrease in yesterday’s revenue because the report captured only the first half of the day. You can look at the raw data or talk to your Data Engineering team.

If there are no known data-related problems, you can double-check the metric using different data sources. In many cases, the products have client-side (for example, Google Analytics or Amplitude) and back-end data (for example, application logs, access logs or logs of API gateway). So we can use different data sources to verify KPI dynamics. If you see an anomaly only in one data source, your problem is likely data-related and doesn’t affect customers.
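As an illustration, here is a minimal sketch of such a cross-check with pandas. It assumes you can export daily order counts from both sources; the column names (`client_orders`, `backend_orders`), the numbers and the 5% tolerance are all hypothetical.

```python
import pandas as pd

# Hypothetical daily order counts from two independent sources
daily = pd.DataFrame(
    {
        "client_orders": [1000, 1020, 990, 700, 710],
        "backend_orders": [1010, 1030, 1000, 1005, 1015],
    },
    index=pd.date_range("2023-05-01", periods=5, freq="D"),
)

# Relative gap between the sources; the 5% tolerance is an assumption to tune
gap = (daily["client_orders"] - daily["backend_orders"]).abs()
daily["rel_gap"] = gap / daily["backend_orders"]

# Days where the sources disagree point to a logging problem rather than a customer problem
suspicious_days = daily[daily["rel_gap"] > 0.05]
print(suspicious_days)
```

If the drop shows up only in the client-side series, as in this synthetic example, the first suspect is tracking rather than customer behaviour.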

The other thing to keep in mind is time windows and data delays. Once, a product manager came to me saying activation was broken because conversion from registration to the first successful action (e.g. a purchase in the case of e-commerce) had been decreasing for three weeks. However, it turned out to be a perfectly normal situation.

Example by author based on synthetic data

The root cause of the decrease was the time window. We track activation within the first 30 days after registration. So cohorts registered 4+ weeks ago had the whole month to make the first action. But customers from the last cohort had only one week to convert, so conversion for them is expected to be much lower. If you want to compare conversions for these cohorts, change the time window to one week or wait.
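One way to make cohorts comparable is to fix the window and drop cohorts that have not had the full window yet. The sketch below assumes a user-level table with hypothetical columns `registered_at` and `first_purchase_at`, a 7-day window and a made-up report date.

```python
import pandas as pd

# Hypothetical user-level data: registration date and first purchase date (missing if none)
users = pd.DataFrame(
    {
        "registered_at": pd.to_datetime(
            ["2023-04-03", "2023-04-03", "2023-04-24", "2023-04-24", "2023-05-01", "2023-05-01"]
        ),
        "first_purchase_at": pd.to_datetime(
            ["2023-04-05", "2023-04-20", "2023-04-26", None, "2023-05-02", None]
        ),
    }
)
report_date = pd.Timestamp("2023-05-05")
window = pd.Timedelta(days=7)

# Keep only users who have had the full window to convert
mature = users[users["registered_at"] + window <= report_date].copy()

# A user counts as converted only if the purchase happened inside the window
days_to_purchase = mature["first_purchase_at"] - mature["registered_at"]
mature["converted"] = days_to_purchase <= window  # missing purchases compare as False

conversion_by_cohort = mature.groupby("registered_at")["converted"].mean()
print(conversion_by_cohort)
```

The newest cohort is excluded here because it has had fewer than seven days, so the remaining cohorts are measured on equal terms.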

In case of data delays, you may have a similar decreasing trend in recent days. For example, our mobile analytical system used to send events in batches when the device was using a Wi-Fi network. So on average, it took 3–4 days to get all events from all devices. So seeing fewer active devices for the last 3–4 days was usual.

A good practice for such cases is trimming the last period from your graphs. It will prevent your team from making wrong decisions based on incomplete data. However, people may still accidentally bump into such inaccurate metrics, so you should spend some time checking how methodologically sound the metric is before diving deep into root cause analysis.
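Trimming can be as simple as dropping the most recent days before plotting. In this sketch the series and the three-day delay are assumptions; in practice the delay should come from what you know about your own pipeline.

```python
import pandas as pd

# Hypothetical daily active devices; the last days are incomplete because events arrive late
active_devices = pd.Series(
    [500, 510, 505, 515, 520, 480, 400, 250],
    index=pd.date_range("2023-05-01", periods=8, freq="D"),
)
delay_days = 3  # assumed maximum delay before a day's data is complete

# Keep only the days for which all events should already have arrived
complete = active_devices.iloc[:-delay_days]
print(complete)
```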

Step 3: Helicopter view

The next step is to look at trends more globally. First, I prefer to zoom out and look at longer trends to get the whole picture.

For example, let’s look at the number of purchases. The number of orders has been growing steadily week after week, with an expected decrease at the end of December (Christmas and New Year time). But then, at the beginning of May, the KPI dropped significantly and continued decreasing. Should we start panicking?

Example by author based on synthetic data

Actually, most likely, there’s no reason to panic. We can look at metric trends for the last three years and notice that the number of purchases decreases every single summer. So it’s a case of seasonality. For many products, we can see lower engagement during the summertime because customers go on vacation. However, this seasonality pattern isn’t ubiquitous: for example, travel or summer festival sites may have an opposite seasonality trend.

Example by author based on synthetic data
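A quick way to check for seasonality is to put the same calendar months of different years side by side. The sketch below uses synthetic monthly data with an artificial summer dip; the column names and numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly purchases for three years with a dip every summer
months = pd.period_range("2021-01", "2023-12", freq="M")
seasonal_factor = np.where(np.isin(months.month, [6, 7, 8]), 0.8, 1.0)
data = pd.DataFrame(
    {"year": months.year, "month": months.month, "purchases": 1000 * seasonal_factor}
)

# One row per calendar month, one column per year
by_year = data.pivot(index="month", columns="year", values="purchases")
print(by_year)

# Comparing with the same month of the previous year removes the seasonal pattern
yoy_change = by_year[2023] / by_year[2022] - 1
print(yoy_change)
```

If the year-over-year change stays flat while the raw series drops, the drop is most likely seasonal.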

Let’s look at one more example: the number of active customers for another product. We could see a decrease since June: monthly active users used to be 380–400K, and now it’s only 340–360K (a decrease of around 10%). We’ve already checked that there were no such changes in summer during several previous years. Should we conclude that something is broken in our product?

Example by author based on synthetic data

Wait, not yet. In this case, zooming out can also help. Taking into account long-term trends, we can see that the last three weeks’ values are close to the ones in February and March. The true anomaly is the 1.5 months of unusually high customer numbers from the beginning of April till mid-May. We might have wrongly concluded that the KPI had dropped, when it had just returned to the norm. Considering that it was spring 2020, higher traffic on our site is likely due to COVID isolation: customers were sitting at home and spending more time online.

Example by author based on synthetic data

Last but not least in your initial analysis, define the exact time when the KPI changed. In some cases, the change may happen suddenly, within 5 minutes, while in others it can be a very slight shift in trend. For example, active users used to grow +5% WoW (week-over-week), but now it’s just +3%.

It’s worth trying to define the change point as accurately as possible (even with minute precision) because it will help you pick the most plausible hypothesis later.

How fast the metric has changed can give you some clues. For example, if conversion changed within 5 minutes, it can’t be due to the rollout of a new app version (it usually takes days for customers to update their apps) and is more likely due to back-end changes (for example, API).
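For a sudden shift, even a simple heuristic can locate the change point. The sketch below is not a method described in this article: it tries every split and picks the one where fitting a constant level before and after gives the lowest squared error. The function name, the `min_size` parameter and the data are hypothetical, and it will not help with slow trend changes.

```python
import numpy as np
import pandas as pd

def find_change_point(series: pd.Series, min_size: int = 3):
    """Return the index label that starts the second segment of the best two-level fit."""
    values = series.to_numpy(dtype=float)
    if len(values) < 2 * min_size:
        raise ValueError("series is too short for the requested min_size")
    best_pos, best_cost = None, np.inf
    for pos in range(min_size, len(values) - min_size + 1):
        left, right = values[:pos], values[pos:]
        # Squared error of describing each side by its own mean
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_pos, best_cost = pos, cost
    return series.index[best_pos]

# Hypothetical conversion per 5-minute bucket with a sudden drop
idx = pd.date_range("2023-05-10 10:00", periods=12, freq="5min")
conversion = pd.Series(
    [0.40, 0.41, 0.39, 0.40, 0.41, 0.40, 0.30, 0.31, 0.29, 0.30, 0.31, 0.30], index=idx
)
change_point = find_change_point(conversion)
print(change_point)
```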

Step 4: Get the context

Understanding the whole context (what’s going on) may be crucial for our investigation.

What I usually check to see the whole picture:

- Internal changes. It goes without saying that internal changes can influence KPIs, so I usually look up all releases, experiments, infrastructure incidents, product changes (e.g. a new design or price changes) and vendor updates (for example, an upgrade to the latest version of the BI tool we are using for reporting).
- External factors. These may differ depending on your product. Currency exchange rates in fintech can affect customers’ behaviour, while big news or weather changes can influence search engine market share. You can brainstorm similar factors for your product. Try to be creative in thinking about external factors. For example, once we discovered that the decrease in traffic on our site was due to network issues in our most significant region.
- Competitors’ activities. Try to find out whether your main competitors are doing something right now: an extensive marketing campaign, an incident when their product is unavailable or a market closure. The easiest way to do it is to look for mentions on Twitter, Reddit or in the news. Also, there are a lot of sites monitoring services’ issues and outages (for example, DownDetector or DownForEveryoneOrJustMe) where you could check your competitors’ health.
- Customers’ voice. You can learn about problems with your product from your customer support team. So don’t hesitate to ask them whether there are any new complaints or an increase in customer contacts of a particular type. However, please remember that few people may contact customer support (especially if your product is not essential for everyday life). For example, once, many years ago, our search engine was wholly broken for ~100K users of old versions of the Opera browser. The problem persisted for a couple of days, but fewer than ten customers reached out to support.

Since we’ve already defined the anomaly time, it’s pretty easy to get all events that happened nearby. These events are your hypotheses.
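If the events are kept in one table, picking the candidates is a simple time filter. In this sketch the event list, the anomaly time and the 24-hour lookback are all hypothetical; slow-moving causes such as an app rollout may need a much wider window.

```python
import pandas as pd

# Hypothetical log of internal and external events
events = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2023-05-08 09:00", "2023-05-10 10:20", "2023-05-10 14:00", "2023-05-15 12:00"]
        ),
        "event": [
            "iOS app release",
            "API gateway config change",
            "Competitor campaign start",
            "Price change",
        ],
    }
)
anomaly_time = pd.Timestamp("2023-05-10 10:30")
lookback = pd.Timedelta(hours=24)  # assumed window, adjust to how fast the metric changed

# A cause has to precede its effect, so look only backwards from the change point
in_window = (events["timestamp"] <= anomaly_time) & (
    events["timestamp"] >= anomaly_time - lookback
)
candidates = events[in_window].sort_values("timestamp", ascending=False)
print(candidates)
```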

Tip: If you suspect internal changes (a release or an experiment) are the root cause of your KPI drop-off, the best practice is to revert these changes (if possible) and then try to understand the exact problem. It will help you reduce the potential negative effects on customers.

Step 5: Slicing & Dicing

At this moment, you hopefully already have an understanding of what is going on around the time of the anomaly and some hypotheses about the root causes.

Let’s start by looking at the anomaly from a higher level. For example, if there’s an anomaly in conversion on Android for the USA customers, it’s worth checking iOS and web and customers from other regions. Then you will be able to understand the scale of the problem adequately.

After that, it’s time to dive deep and try to localize the anomaly, that is, to define the narrowest possible segment or segments affected by the KPI change. The most straightforward way is to look at your product’s KPI trends in different dimensions.

The list of such meaningful dimensions can differ significantly depending on your product, so it’s worth brainstorming with your team. I would suggest looking at the following groups of factors:

- technical features: for example, platform, operating system, app version;
- customer features: for example, new or existing customer (cohorts), age, region;
- customer behaviour: for example, product features adopted, experiment flags, marketing channels.

When examining KPI trends split by different dimensions, it’s better to look only at significant enough segments. For example, if revenue has dropped by 10%, there’s no reason to look at countries that contribute less than 1% to total revenue. Metrics tend to be more volatile in smaller groups, so insignificant segments may add too much noise. I prefer to group all small slices into the `other` group to avoid losing this signal completely.
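Grouping small slices can be done in a few lines of pandas. The sketch below assumes a table with one row per day and one column per country; the country codes, the numbers and the 1% threshold are hypothetical.

```python
import pandas as pd

# Hypothetical daily revenue by country (rows are days, columns are countries)
revenue = pd.DataFrame(
    {"US": [500, 520], "DE": [300, 310], "FR": [190, 160], "IS": [5, 6], "MT": [4, 5]},
    index=pd.date_range("2023-05-01", periods=2, freq="D"),
)
min_share = 0.01  # segments below 1% of total revenue are treated as small

share = revenue.sum() / revenue.sum().sum()
small = share[share < min_share].index

# Collapse small segments into a single `other` column instead of dropping them
grouped = revenue.drop(columns=small)
grouped["other"] = revenue[small].sum(axis=1)
print(grouped)
```

The total stays the same, so nothing is lost, but the noisy tiny series no longer clutter the chart.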

For example, we can look at revenue split by platforms. The absolute numbers for different platforms can differ significantly, so I normed all series on the first point to compare dynamics over time. Sometimes, it’s better to normalize on average for the first N points. For example, average the first seven days to capture weekly seasonality.

That’s how you could do it in Python.

import plotly.express as px

# df: one row per date, one column per platform (revenue)
# Norm each platform on its average over the first seven days
norm_value = df[:7].mean()
norm_df = df / norm_value
px.line(norm_df, title='Revenue by platform, normed on the first 7 days')

The graph tells us the whole story: before May, revenue trends for different platforms were pretty close, but then something happened on iOS, and iOS revenue decreased by 10–20%. So iOS platform is mainly affected by this change, while others are pretty stable.

Example by author based on synthetic data

Step 6: Understand your metric

After determining the main segments affected by the anomaly, let’s try to decompose our KPI. It may give us a better understanding of what’s going on.

We usually use two types of KPIs in analytics: absolute numbers and ratios. So let’s discuss the approach for decomposition in each case.

We can decompose an absolute number by norming it. For example, let’s look at the total time spent in service (a standard KPI for content products). We can decompose it into two separate metrics: the number of active customers and the average time spent per customer, so that total time spent = number of active customers × time spent per customer.

Then we can look at the dynamics for both metrics. In the example below, we can see that the number of active customers is stable while the time spent per customer dropped, which means we haven’t lost customers entirely, but for some reason, they started to spend less time on our service.

Example by author based on synthetic data
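The decomposition above can be sketched as follows. The weekly numbers and column names are hypothetical; the point is only to compare how each component moves relative to the first week.

```python
import pandas as pd

# Hypothetical weekly totals
weekly = pd.DataFrame(
    {
        "total_hours": [10000.0, 10100.0, 9000.0, 8900.0],
        "active_customers": [2000, 2020, 2000, 1980],
    },
    index=pd.date_range("2023-04-03", periods=4, freq="W-MON"),
)

# total time = active customers x time per customer
weekly["hours_per_customer"] = weekly["total_hours"] / weekly["active_customers"]

# Change of each component relative to the first week
relative = weekly / weekly.iloc[0] - 1
print(relative.round(3))
```

In this synthetic data the number of active customers barely moves, while the time per customer falls by roughly the same amount as the total.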

For ratio metrics, we can look at the numerator and denominator dynamics separately. For example, let’s use conversion from registration to the first purchase within 30 days. We can decompose it into two metrics:

- the number of customers who made a purchase within 30 days after registration (numerator),
- the number of registrations (denominator).

In the example below, the conversion rate decreased from 43.5% to 40% in April. Both the number of registrations and the number of converted customers increased. It means there are additional customers with lower conversion. This can happen for different reasons:

- new marketing channel or marketing campaign with lower-quality users;
- technical changes in data (for example, we changed the definition of regions, and now we are taking into account more customers);
- fraud or bot traffic on site.

Example by author based on synthetic data

Tip: If we saw a drop-off in converted users while total users were stable, that would indicate problems in a product or data regarding the fact of conversion.
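To make the "additional customers with lower conversion" argument concrete, here is a small sketch. The counts are made up to match the 43.5% and 40% rates above, and the calculation assumes the original traffic kept converting at its old rate.

```python
import pandas as pd

# Hypothetical counts chosen to match the rates in the text
monthly = pd.DataFrame(
    {"registrations": [1000, 1200], "converted": [435, 480]},
    index=["March", "April"],
)
monthly["conversion"] = monthly["converted"] / monthly["registrations"]

# Conversion of the incremental traffic, assuming the old traffic converts as before
extra = monthly.loc["April"] - monthly.loc["March"]
extra_conversion = extra["converted"] / extra["registrations"]

print(monthly)
print(f"Implied conversion of additional customers: {extra_conversion:.1%}")
```

A much lower implied conversion for the extra registrations is a hint to look at where they came from.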

For conversions, it also may be helpful to turn it into a funnel. For example, in our case, we can look at the conversions for the following steps:

- completed registration
- product catalogue
- adding an item to the basket
- placing order
- successful payment.

Conversion dynamics for each step may show us the stage in a customer journey where the change happened.
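A step-by-step comparison could look like the sketch below. The step names and user counts are hypothetical; each step conversion is the number of users at a step divided by the number at the previous step.

```python
import pandas as pd

steps = ["registration", "catalogue", "add_to_basket", "order_placed", "payment_success"]

# Hypothetical number of users reaching each step, before and after the anomaly
funnel = pd.DataFrame(
    {"before": [1000, 800, 400, 300, 270], "after": [1000, 790, 395, 220, 198]},
    index=steps,
)

# Step-to-step conversion; the first step has no previous step and is dropped
step_conversion = (funnel / funnel.shift(1)).dropna()
step_conversion["change"] = step_conversion["after"] - step_conversion["before"]

print(step_conversion.round(3))
print("Most affected step:", step_conversion["change"].idxmin())
```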

Step 7: Coming to a conclusion

As a result of all the analysis stages mentioned above, you should have a fairly complete picture of the current situation:

- what exactly changed;
- what segments are affected;
- what is going on around.

Now it’s time to sum it up. I prefer to put all the information down in a structured way: the hypotheses we tested, the conclusions we made, the current understanding of the primary root cause, and next steps (if they are needed).

Tip: It’s worth writing down all tested hypotheses (not only proven ones) because it will help you avoid duplicating work later.

The essential thing to do now is to verify that our primary root cause can completely explain the KPI change. I usually model what the metric would have looked like without the known effects.

For example, in the case of conversion from registration to the first purchase, we might have discovered a fraud attack, and we know how to identify bot traffic using IP addresses and user agents. So we could look at the conversion rate without the effect of the known primary root cause — fraud traffic.

Example by author based on synthetic data

As you can see, the fraud traffic explains only around 70% of the drop-off, and there could be other factors affecting the KPI. That’s why it’s better to double-check that you’ve found all significant factors.
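The share explained by a known factor can be estimated by recomputing the metric without it. The sketch below uses made-up counts, and assumes the fraud registrations have already been flagged (for example, by IP address and user agent) and that none of them converted.

```python
baseline_conversion = 0.435  # conversion before the anomaly

# Hypothetical April numbers, with registrations flagged as bots
registrations, converted = 1200, 480
fraud_registrations, fraud_converted = 70, 0

observed = converted / registrations
without_fraud = (converted - fraud_converted) / (registrations - fraud_registrations)

# Share of the drop that disappears once fraud traffic is excluded
explained_share = (without_fraud - observed) / (baseline_conversion - observed)

print(f"observed: {observed:.1%}")
print(f"without fraud: {without_fraud:.1%}")
print(f"share of drop explained: {explained_share:.0%}")
```

Anything noticeably below 100% means there is still an unexplained part of the drop worth investigating.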

Sometimes, it may be challenging to prove your hypothesis, for example, when the suspected cause is a change in price or design that you couldn’t A/B test appropriately. We all know that correlation doesn’t imply causation.

The possible ways to check the hypothesis in such cases:

- Look at similar situations in the past, for example, previous price changes, and check whether there was a similar correlation with the KPI.
- Try to identify customers whose behaviour changed, such as those who started spending much less time in our app, and conduct a survey.

After this analysis, you may still have doubts about the effects, but it can increase your confidence that you’ve found the correct answer.

Tip: The survey could also help if you are stuck: you’ve checked all hypotheses and still haven’t found an explanation.

How to be prepared for the next root cause analysis?

At the end of the extensive investigation, it’s time to think about how to make it easier and better next time.

My best practices after ages of dealing with anomaly investigations:

- It’s super-helpful to have a checklist specific to your product: it can save you and your colleagues hours of work. It’s worth putting together a list of hypotheses and tools to check them (links to dashboards, external sources of information on your competitors etc.). Please keep in mind that writing down the checklist is not a one-time activity: you should add new knowledge to it once you face new types of anomalies so it stays up-to-date.
- The other valuable artifact is a changelog with all meaningful events for your product, for example, changes in price, launches of competitive products or new feature releases. The changelog will allow you to find all significant events in one place without looking through multiple chats and wiki pages. It’s easy to forget to update the changelog, so you could make it part of analytical on-call duties to establish clear ownership.
- In most cases, you need input from different people to understand the situation’s whole context. A working group prepared in advance and a channel for KPI anomaly investigations can save precious time and keep all stakeholders updated.
- Last but not least, to minimize the potential negative impact on customers, we should have a monitoring system in place to learn about anomalies as soon as possible and start looking for root causes. So set aside some time for establishing and improving your alerting and monitoring.
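As a starting point for alerting, even a simple rule can catch large deviations. The sketch below is one possible heuristic, not a description of any specific monitoring system: it flags points that are far from the trailing average. The function name, the 7-day window and the threshold of 3 standard deviations are assumptions, and the rule ignores seasonality and trends.

```python
import pandas as pd

def flag_anomalies(series: pd.Series, window: int = 7, threshold: float = 3.0) -> pd.Series:
    """Flag points that deviate from the trailing mean by more than `threshold` trailing std."""
    # shift(1) keeps the current point out of its own baseline
    baseline = series.shift(1).rolling(window)
    z_score = (series - baseline.mean()) / baseline.std()
    return z_score.abs() > threshold

# Hypothetical daily orders with a sudden drop on the last day
orders = pd.Series(
    [100, 102, 98, 101, 99, 103, 100, 101, 70],
    index=pd.date_range("2023-05-01", periods=9, freq="D"),
)
alerts = flag_anomalies(orders)
print(orders[alerts])
```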

TL;DR

The key messages I would like you to keep in mind:

- When dealing with root cause analysis, you should focus on minimizing the potential negative impact on customers.
- Try to be creative and look broadly: get all the context of what’s going on inside your product and infrastructure, and what the potential external factors are.
- Dig deep: look at your metrics from different angles, trying to examine different segments and decompose your metrics.
- Be prepared: it’s much easier to deal with such research if you already have a checklist for your product, a changelog and a working group to brainstorm.

Thank you a lot for reading this article. I hope now you won’t be stuck facing a root cause analysis task since you already have a guide at hand. If you have any follow-up questions or comments, please don’t hesitate to leave them in the comments section.

Anomaly Root Cause Analysis 101 was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
