Figuring out the most unusual segments in data

How to find segments to focus on using common sense and machine learning

Photo by Klara Kulikova on Unsplash

Analysts are often asked to find the “interesting” segments: those where focusing our efforts could have the maximum potential impact. For example, it may be interesting to determine which customer segments have the most significant effect on churn. Or you could try to understand which types of orders affect customer support workload and the company’s revenue.

Of course, we could look at graphs to find such outstanding features. But that can be time-consuming because we usually track dozens or even hundreds of customer characteristics. On top of that, we need to look at combinations of different factors, which may lead to a combinatorial explosion. With such tasks, a framework is really helpful because it can save you hours of analysis.

In this article, I would like to share with you two approaches for finding the most outstanding slices of data:

- based on common sense and basic maths,
- based on machine learning: our data science team at Wise has open-sourced a library, Wise Pizza, that gives you answers in three lines of code.

Example: Churn for bank customers

You can find the complete code for this example on GitHub.

We will be using data for bank customers’ churn as an example. This dataset can be found on Kaggle under CC0: Public Domain license.

We will try to find the segments with the most significant impact on churn using different approaches: graphs, common sense and machine learning. But let’s start with data preprocessing.

The dataset lists customers and their characteristics: credit score, country of residency, age and gender, how much money customers have on their balance, etc. For each customer, we also know whether they churned or not (the exited parameter).

Our main goal is to find the customer segments with the highest impact on the number of churned customers. After that, we could try to understand the problems specific to these user groups. If we focus on fixing issues for these segments, we will have the most significant effect on the number of churned customers.

To simplify calculations and interpretations, we will define segments as sets of filters, for example, gender = Male or gender = Male, country = United Kingdom.
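To make the idea of a segment as a set of filters concrete, here is a minimal sketch. The helper name `apply_filters` and the toy rows are illustrative assumptions and are not part of the original code.

```python
import pandas as pd

def apply_filters(df, filters):
    """Return the rows of df that match every column = value pair in filters."""
    mask = pd.Series(True, index=df.index)
    for column, value in filters.items():
        mask &= df[column] == value
    return df[mask]

# Toy data, invented for illustration only
toy_df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male', 'Male'],
    'country': ['United Kingdom', 'United Kingdom', 'France', 'United Kingdom'],
    'exited': [1, 0, 0, 1],
})

# The segment gender = Male, country = United Kingdom
segment = apply_filters(toy_df, {'gender': 'Male', 'country': 'United Kingdom'})
segment_size = len(segment)
segment_churned = int(segment['exited'].sum())
```

An empty filter dictionary returns the whole dataset, which is a natural starting point for exploring segments.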

We will be working with discrete characteristics, so we have to transform continuous metrics, such as age or balance. For this, we could look at distributions and define suitable buckets. For example, let’s look at age.

Graph by author

Code example for bucketing continuous characteristic

def get_age_group(a):
    # map an age in years to a labelled bucket; upper bounds are exclusive
    if a < 25:
        return '18 - 24'
    if a < 35:
        return '25 - 34'
    if a < 45:
        return '35 - 44'
    if a < 55:
        return '45 - 54'
    if a < 65:
        return '55 - 64'
    return '65+'

raw_df['age_group'] = raw_df.age.map(get_age_group)
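As an alternative to a hand-written function, the same kind of bucketing could probably be expressed with `pandas.cut`. This is only a sketch: the sample ages and the exact label strings are illustrative assumptions, and the first bin starts at 0 rather than 18.

```python
import pandas as pd

# Sample ages, invented for illustration
ages = pd.Series([18, 24, 25, 34, 44, 45, 64, 65, 92])

# right=False makes each bucket include its left edge and exclude its right edge,
# so an age of 25 lands in '25 - 34', mirroring the `a < 25` style checks
age_groups = pd.cut(
    ages,
    bins=[0, 25, 35, 45, 55, 65, float('inf')],
    labels=['18 - 24', '25 - 34', '35 - 44', '45 - 54', '55 - 64', '65+'],
    right=False,
).astype(str)
```

Casting to `str` turns the categorical result into plain labels, which keeps later filtering by value straightforward.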

The most straightforward way to find intriguing segments in data is to look at visualisations. We can look at churn rates split by one or two dimensions using bar charts or heat maps.

Let’s look at the relationship between age and churn. Churn rates are low for customers under 35: less than 10%. For customers between 45 and 64, retention is the worst: almost half of them have churned.

Graph by author

Let’s add one more parameter (gender) to try to find more complex relations. A bar chart can’t show two-dimensional relationships well, so let’s switch to a heatmap.

Churn rates for females are higher for all age groups, so gender is an influential factor.

Graph by author
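The numbers behind such charts can be computed in a couple of lines. The sketch below uses toy data invented for illustration; only the column names `age_group`, `gender` and `exited` follow the article.

```python
import pandas as pd

# Toy data, invented for illustration; the real dataset has one row per customer
toy_df = pd.DataFrame({
    'age_group': ['25 - 34', '25 - 34', '45 - 54', '45 - 54', '45 - 54', '45 - 54'],
    'gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'exited': [0, 0, 0, 1, 1, 1],
})

# One dimension: share of churned customers per age group (input for a bar chart)
churn_by_age = toy_df.groupby('age_group')['exited'].mean()

# Two dimensions: age group x gender matrix of churn rates (input for a heatmap)
churn_heatmap = toy_df.pivot_table(
    index='age_group', columns='gender', values='exited', aggfunc='mean'
)
```

Because `exited` is a 0/1 flag, its mean within a group is exactly the churn rate of that group.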

Such visualisations can be pretty insightful, but there are a couple of problems with this approach:

- we don’t take into account the size of segments,
- it may be time-consuming to look at all possible combinations of the characteristics you have,
- it’s challenging to visualize more than two dimensions in one graph.

So let’s move on to more structured approaches that will help us to get a prioritized list of interesting segments with estimated effects.

Common sense approach

Assumptions

How could we calculate the potential impact of fixing problems for a specific segment? We can compare it to the “ideal” scenario with a lower churn rate.

You may wonder how we could estimate the benchmark for churn rate. There are several ways to do it:

- benchmarks from the market: you can search for typical churn rate levels for products in your domain,
- high-performing segments in your product: usually, some segments perform a bit better (for example, when you split by country or platform), and you can use them as a benchmark,
- average value: the most conservative approach is to look at the global mean and estimate the potential effect of reaching the average churn rate for all segments.

Let’s play safe and use the average churn rate from our dataset as a benchmark — 20.37%.
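As a worked example of this estimate, the sketch below computes how many churned customers we would avoid if a segment reached the benchmark. The function name and the segment numbers are hypothetical; only the 20.37% benchmark comes from the dataset.

```python
def estimate_churn_reduction(segment_churned, segment_total, benchmark_rate):
    """Churned customers we would avoid if the segment's churn rate fell to the benchmark."""
    segment_rate = segment_churned / segment_total
    return (segment_rate - benchmark_rate) * segment_total

# Hypothetical segment: 1,000 customers, 300 of whom churned (30% churn rate)
reduction = estimate_churn_reduction(300, 1000, benchmark_rate=0.2037)

# A segment that already churns less than the benchmark gets a negative estimate
negative_reduction = estimate_churn_reduction(100, 1000, benchmark_rate=0.2037)
```

For the hypothetical segment, the estimate is about 96 customers: the gap of 9.63 percentage points multiplied by the segment size.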

Listing all possible segments

The next step is to build all possible segments. Our dataset has ten dimensions with 3–6 unique values for each. The total number of combinations is around 1.2M. It looks computationally costly even though we have just a few dimensions and different values for them. In actual tasks, you usually have dozens of characteristics and unique values.
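One plausible way to arrive at a number of this magnitude is to note that each dimension is either left unfiltered or fixed to one of its values. The sketch below assumes that counting rule and uses hypothetical cardinalities, since the exact ones are not listed here.

```python
from math import prod

def count_segments(unique_values_per_dimension):
    """Count all filter combinations: each dimension is either unused or fixed to one value."""
    return prod(n + 1 for n in unique_values_per_dimension)

# Hypothetical case: ten dimensions with three unique values each
n_segments = count_segments([3] * 10)
```

With these assumed cardinalities the count is 4 to the power of 10, slightly over a million, and it grows quickly as dimensions gain more unique values. The count includes the segment with no filters at all.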

We definitely need to think about some performance optimizations. Otherwise, we may have to spend hours waiting for results. Here are a couple of tips on reducing computations:

- First of all, we don’t need to build all possible combinations. It is reasonable to limit the depth to 4–6. It is unlikely that your product team should focus on a user segment defined by 42 different filters.
- Secondly, we can define the size of the effect we are interested in. Let’s say we would like to increase the retention rate by at least 1 percentage point. It means we are not interested in segments smaller than 1% of all users. So we can stop splitting a segment further once its size is below this threshold, which reduces the number of operations.
- Last but not least, in real-life datasets you can significantly reduce the data size and the resources spent on calculations by grouping all small values of each dimension into an other group. For example, there are hundreds of countries, and the share of users per country usually follows Zipf’s law, like many other real-world distributions. So you will have many countries that each account for less than 1% of all users. As discussed above, we are not interested in such small user groups, so we can group them all into one segment, country = other, to make calculations easier.

Graph by author
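The last tip could be sketched as follows. The helper name `group_small_values`, the 1% threshold as a default and the toy country list are assumptions made for illustration.

```python
import pandas as pd

def group_small_values(series, min_share=0.01, other_label='other'):
    """Replace values whose share of rows is below min_share with other_label."""
    shares = series.value_counts(normalize=True)
    small_values = shares[shares < min_share].index
    return series.where(~series.isin(small_values), other_label)

# Toy data, invented for illustration: two large countries and two tiny ones
countries = pd.Series(['UK'] * 120 + ['France'] * 78 + ['Malta'] + ['Iceland'])
grouped = group_small_values(countries, min_share=0.01)
```

Here Malta and Iceland each account for 0.5% of rows, so both end up in the other group, while the two large countries stay untouched.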

We will be using recursion to build all combinations of filters up to max_depth. I like this computer science concept because, in many cases, it allows you to solve complex problems elegantly. Unfortunately, data analysts rarely need to write recursive code: I can remember only three such tasks in 10 years of data analysis experience.

The idea of recursion is pretty straightforward — it’s when your function calls itself during the execution. It’s handy when you are working with hierarchies or graphs. If you would like to learn more about recursion in Python, read this article.
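If recursion is new to you, a tiny example may help before we get to segments. The sketch below walks a toy hierarchy (invented for illustration) with a generator that calls itself for every subtree.

```python
def iter_paths(tree, path=()):
    """Yield the path to every node of a nested dict, parents before children."""
    for name, children in tree.items():
        current = path + (name,)
        yield current
        # the function calls itself to walk the subtree below this node
        yield from iter_paths(children, current)

# Toy hierarchy, invented for illustration
hierarchy = {'Europe': {'UK': {'London': {}}, 'France': {}}, 'Asia': {}}
paths = list(iter_paths(hierarchy))
```

The recursion stops on its own when a node has no children, because looping over an empty dictionary yields nothing.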

The high-level concept in our case is the following:

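1. We start with the entire dataset and no filters.
2. Then we try to add one more filter (if the segment size is big enough and we haven’t reached the maximum depth) and apply our function to the resulting segment.
3. We repeat the previous step for as long as these conditions hold.

import pandas as pd

num_metric = 'exited'
denom_metric = 'total'
max_depth = 4
# `dimensions` (the list of dimension columns) and `min_segment_size`
# (the minimum number of customers in a segment) are assumed to be defined beforehand

def convert_filters_to_str(f):
    # build a canonical segment name with filters sorted by dimension
    lst = []
    for k in sorted(f.keys()):
        lst.append(str(k) + ' = ' + str(f[k]))

    if len(lst) != 0:
        return ', '.join(lst)
    return ''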
def raw_deep_dive_segments(tmp_df, filters):
    # return the current segment
    yield {
        'filters': filters,
        'numerator': tmp_df[num_metric].sum(),
        'denominator': tmp_df[denom_metric].sum()
    }

    # if we haven't reached max_depth then we can dive deeper
    if len(filters) < max_depth:
        for dim in dimensions:
            # check if this dimension has already been used
            if dim in filters:
                continue

            # deduplication: add dimensions only in alphabetical order,
            # so each combination of filters is built exactly once
            if (filters != {}) and (dim < max(filters.keys())):
                continue

            for val in tmp_df[dim].unique():
                next_tmp_df = tmp_df[tmp_df[dim] == val]

                # checking if segment size is big enough
                if next_tmp_df[denom_metric].sum() < min_segment_size:
                    continue

                next_filters = filters.copy()
                next_filters[dim] = val

                # executing function for subsequent segment
                yield from raw_deep_dive_segments(next_tmp_df, next_filters)

# aggregating all segments into a dataframe
segments_df = pd.DataFrame(list(raw_deep_dive_segments(df, {})))

As a result, we got around 10K segments. Now we can calculate the estimated effect for each, filter out segments with negative effects and look at the user groups with the highest potential impact.

baseline_churn = 0.2037

# readable segment name built from the filters, used as the index below
segments_df['segment'] = segments_df.filters.map(convert_filters_to_str)
segments_df['churn_share'] = segments_df.numerator / segments_df.denominator

# churned customers we would avoid if the segment reached the baseline churn rate
segments_df['churn_est_reduction'] = (
    (segments_df.churn_share - baseline_churn) * segments_df.denominator
).round().astype(int)

filt_segments_df = (
    segments_df[segments_df.churn_est_reduction > 0]
    .sort_values('churn_est_reduction', ascending=False)
    .set_index('segment')
)

It should be a Holy Grail that gives all the answers. But wait, there are too many duplicates and segments nested within one another. Could we reduce duplication and keep only the most informative user groups?

Grooming

Let’s look at a couple of examples.

The churn rate for the child segment age_group = 45–54, gender = Male is lower than age_group = 45–54. Adding a gender = Male filter doesn’t bring us closer to the specific problem. So we can eliminate such cases.

The example below shows the opposite situation: the churn rate for the child segment is significantly higher, and, more than that, the child segment includes 80% of churned customers from the parent node. In this case, it’s reasonable to eliminate the parent segment credit_score_group = poor, tenure_group = 8+ because the main problem is within the is_active_member = 0 group.
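Both rules boil down to checking whether the difference between two churn rates is statistically significant. The filtering code further down delegates this to statsmodels; as a dependency-free stand-in, the sketch below uses a plain Wald interval, which is a simpler method, so its bounds will differ slightly from the statsmodels output. The function name and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def diff_proportions_ci(count1, nobs1, count2, nobs2, alpha=0.05):
    """Wald confidence interval for p1 - p2, where p = count / nobs."""
    p1, p2 = count1 / nobs1, count2 / nobs2
    std_err = sqrt(p1 * (1 - p1) / nobs1 + p2 * (1 - p2) / nobs2)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p1 - p2
    return diff - z * std_err, diff + z * std_err

# Hypothetical parent (900 churned of 2,000) vs. child (400 churned of 1,000)
low, high = diff_proportions_ci(900, 2000, 400, 1000)

# the whole interval is above zero: the parent churns significantly more than the child
child_is_less_informative = low > 0
```

If the interval included zero, we could not tell the two churn rates apart, and neither elimination rule would fire.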

Let’s filter all those not-so-interesting segments.

import statsmodels.stats.proportion
import tqdm

# getting all parent - child pairs
def get_all_ancestors_recursive(filt):
    # drop one filter at a time to get every less specific segment
    if len(filt) > 1:
        for dim in filt:
            cfilt = filt.copy()
            cfilt.pop(dim)
            yield cfilt
            for f in get_all_ancestors_recursive(cfilt):
                yield f

def get_all_ancestors(filt):
    tmp_data = []
    for f in get_all_ancestors_recursive(filt):
        tmp_data.append(convert_filters_to_str(f))
    return list(set(tmp_data))

tmp_data = []

for f in tqdm.tqdm(filt_segments_df['filters']):
    segment = convert_filters_to_str(f)
    for af in get_all_ancestors(f):
        tmp_data.append(
            {
                'parent_segment': af,
                # despite the column name, this holds the descendant (child) segment
                'ancestor_segment': segment
            }
        )

full_ancestors_df = pd.DataFrame(tmp_data)

# filter child nodes where churn rate is lower

filt_child_segments = []
seen_child_segments = set()

for parent_segment in tqdm.tqdm(filt_segments_df.index):
    for child_segment in full_ancestors_df[full_ancestors_df.parent_segment == parent_segment].ancestor_segment:
        # skip children that have already been marked for removal
        if child_segment in seen_child_segments:
            continue

        # confidence interval for (parent churn rate - child churn rate)
        churn_diff_ci = statsmodels.stats.proportion.confint_proportions_2indep(
            filt_segments_df.loc[parent_segment, 'numerator'],
            filt_segments_df.loc[parent_segment, 'denominator'],
            filt_segments_df.loc[child_segment, 'numerator'],
            filt_segments_df.loc[child_segment, 'denominator']
        )

        # the whole interval is above zero: the child churns significantly less
        if churn_diff_ci[0] > 0:
            seen_child_segments.add(child_segment)
            filt_child_segments.append(
                {
                    'parent_segment': parent_segment,
                    'child_segment': child_segment
                }
            )

filt_child_segments_df = pd.DataFrame(filt_child_segments, columns=['parent_segment', 'child_segment'])
filt_segments_df = filt_segments_df[~filt_segments_df.index.isin(filt_child_segments_df.child_segment.values)]

# filter parent nodes where churn rate is lower

filt_parent_segments = []

for child_segment in tqdm.tqdm(filt_segments_df.index):
    for parent_segment in full_ancestors_df[full_ancestors_df.ancestor_segment == child_segment].parent_segment:
        # the parent may have been filtered out already
        if parent_segment not in filt_segments_df.index:
            continue

        # confidence interval for (parent churn rate - child churn rate)
        churn_diff_ci = statsmodels.stats.proportion.confint_proportions_2indep(
            filt_segments_df.loc[parent_segment, 'numerator'],
            filt_segments_df.loc[parent_segment, 'denominator'],
            filt_segments_df.loc[child_segment, 'numerator'],
            filt_segments_df.loc[child_segment, 'denominator']
        )
        # share of the parent's churned customers that the child covers
        child_coverage = filt_segments_df.loc[child_segment, 'numerator'] / filt_segments_df.loc[parent_segment, 'numerator']

        # the child churns significantly more and covers most of the parent's churn
        if (churn_diff_ci[1] < 0) and (child_coverage >= 0.8):
            filt_parent_segments.append(
                {
                    'parent_segment': parent_segment,
                    'child_segment': child_segment
                }
            )

filt_parent_segments_df = pd.DataFrame(filt_parent_segments, columns=['parent_segment', 'child_segment'])
filt_segments_df = filt_segments_df[~filt_segments_df.index.isin(filt_parent_segments_df.parent_segment.values)]

Now we have around 4K interesting segments. With this toy dataset, we see little difference after this grooming for the top ones. However, with real-life data, these efforts often pay off.

Root causes

The last thing we can do to keep only the most meaningful slices is to retain just the root nodes of our segments. These segments are the root causes, and the others are included in them. If you would like to dig deeper into one of the root causes, look at its child nodes.

To get only the root causes, we need to eliminate all segments for which we have a parent node in our final list of interesting ones.

root_segments_df = filt_segments_df[
    ~filt_segments_df.index.isin(
        full_ancestors_df[
            full_ancestors_df.parent_segment.isin(filt_segments_df.index)
        ].ancestor_segment
    )
]

So here it is, now we have a list of user groups to focus on. We got only one-dimensional segments at the top since there are few complex relations in data where a couple of characteristics explain the full effect.

It’s crucial to discuss how we could interpret the results. We got a list of customer segments with the estimated impact. Our estimation is based on the hypothesis that we could decrease the churn rate for the whole segment to reach the benchmark level (in our example — the average value). So we estimated the impact of fixing the problems for each user group.

You must keep in mind that this approach only gives you a high-level view of what user groups to focus on. It doesn’t take into account whether it’s possible to fix these problems entirely or not.

We’ve written quite a lot of code to get results. Maybe there’s another approach to solving this task using data science and machine learning that won’t require so much effort.

Pizza time

Actually, there is another way. Our data science team at Wise has developed a library, Wise Pizza, that can find the most intriguing segments in the blink of an eye. It’s open-sourced under the Apache 2.0 license, so you can also use it for your tasks.

If you are interested in learning more about the Wise Pizza library, don’t miss Egor’s presentation at the Data Science Festival.

Applying Wise Pizza

The library is easy to use. You need to write just a couple of lines and specify the dimensions and number of segments you want in a result.

# to install the library: pip install wise_pizza
import wise_pizza

# building a model
sf = wise_pizza.explain_levels(
    df=df,
    dims=dimensions,
    total_name="exited",
    size_name="total",
    max_depth=4,
    min_segments=15,
    solver="lasso"
)

# making a plot
sf.plot(width=700, height=100, plot_is_static=False)

Graph by author

As a result, we also got a list of the most interesting segments and their potential impact on our product churn. Segments are similar to the ones we’ve obtained using the previous approach. However, the impact estimations differ a lot. To interpret Wise Pizza results correctly and understand the differences, we need to discuss how it works in more detail.

How it works

The library is based on Lasso and LP solvers. If we simplify it, the library does something similar to one-hot-encoding, adding flags for segments (the same ones we’ve calculated before) and then uses Lasso regression with churn rate as a target variable.

As you may remember from machine learning, the Lasso regression tends to have many zero coefficients, selecting a few significant factors. Wise Pizza finds the appropriate alpha coefficient for Lasso regression so that you will get a specified number of segments as a result.
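To illustrate both points (sparsity and picking alpha to hit a target number of segments), here is a toy sketch. It relies on the textbook special case of an orthonormal design, where the Lasso solution is simply the soft-thresholded least-squares coefficient, up to how alpha is scaled. It is not Wise Pizza’s actual implementation, and all names and numbers are hypothetical.

```python
def soft_threshold(coefficient, alpha):
    """Lasso solution for one coefficient under an orthonormal design."""
    magnitude = max(abs(coefficient) - alpha, 0.0)
    return magnitude if coefficient >= 0 else -magnitude

def lasso_coefficients(ols_coefficients, alpha):
    return [soft_threshold(c, alpha) for c in ols_coefficients]

def find_alpha(ols_coefficients, n_segments, steps=50):
    """Bisect on alpha until at most n_segments coefficients stay non-zero."""
    low, high = 0.0, max(abs(c) for c in ols_coefficients)
    for _ in range(steps):
        mid = (low + high) / 2
        non_zero = sum(1 for c in lasso_coefficients(ols_coefficients, mid) if c != 0)
        if non_zero > n_segments:
            low = mid   # too many segments survive: penalise harder
        else:
            high = mid  # few enough segments: try a weaker penalty
    return high

# Hypothetical unpenalised segment effects
ols = [0.30, -0.22, 0.05, 0.02, -0.01]
alpha = find_alpha(ols, n_segments=2)
selected = lasso_coefficients(ols, alpha)
```

With these numbers the search settles at an alpha of about 0.05: the three smallest effects are shrunk to exactly zero, and the two largest survive with reduced magnitude.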

To revise Lasso (L1) and Ridge (L2) regularisation, you could consult the article.

How to interpret results

Impact is estimated as the coefficient multiplied by the segment size.

So as you can see, it’s completely different from what we estimated before. The common sense approach estimates the impact of completely fixing the problems for a user group, while Wise Pizza’s impact shows the incremental effect relative to the other selected segments.

The advantage of this approach is that you can sum up different effects. However, you need to be careful when interpreting the results because the impact of each segment depends on the other selected segments, since they may be correlated. For example, in our case, we have three correlated segments:

- age_group = 45–54,
- num_of_products = 1, age_group = 45–54,
- is_active_member = 1, age_group = 45–54.

The impact for age_group = 45–54 captures the potential effect for the whole age group, while the others estimate the additional impact of specific subgroups. Such dependencies may lead to significantly different results depending on the min_segments parameter, because you will get different sets of final segments and different correlations between them.
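Under the simplified linear-model picture described earlier, the effects of overlapping segments add up for customers who belong to several of them. The sketch below illustrates this with invented coefficients. It ignores any intercept and is an assumption about how to read the output, not Wise Pizza’s code.

```python
# Hypothetical Lasso coefficients for three overlapping segments (invented numbers)
coefficients = {
    frozenset({'age_group = 45-54'}): 0.20,
    frozenset({'age_group = 45-54', 'num_of_products = 1'}): 0.08,
    frozenset({'age_group = 45-54', 'is_active_member = 1'}): -0.10,
}

def combined_effect(customer_attributes):
    """Sum the coefficients of every selected segment the customer belongs to."""
    return sum(
        coefficient
        for segment, coefficient in coefficients.items()
        if segment <= customer_attributes  # all filters of the segment match
    )

customer = {'age_group = 45-54', 'num_of_products = 1', 'is_active_member = 0'}
effect = combined_effect(customer)
```

This customer picks up the age group effect plus the extra effect of having one product, about 0.28 in total. If a different set of segments were selected, the same customer’s effect would be split across them differently.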

It’s crucial to pay attention to the whole picture and interpret Wise Pizza results correctly. Otherwise, you may jump to the wrong conclusions.

I appreciate this library as an invaluable tool for getting quick insights from data and the first segment candidates to dive deeper into. However, when I need to do opportunity sizing and a more robust analysis to share the potential impact of our focus with my product team, I still use the common sense approach with a reasonable benchmark because it’s much easier to interpret.

TL;DR

- Finding interesting slices in your data is a common task for analysts (especially at the discovery stage). Luckily, you don’t need to make dozens of graphs to answer such questions. There are frameworks that are more comprehensive and easier to use.
- You can use the Wise Pizza ML library to get quick insights on the segments with the most significant impact on the average (it also allows you to look at the difference between two datasets). I usually use it to get the first list of meaningful dimensions and segments.
- The ML approach can give you a high-level view and prioritization in the blink of an eye. However, I recommend paying attention to how the results are interpreted and making sure you and your stakeholders fully understand them.
- If you need a robust estimate of the potential effect on KPIs of fixing problems for a whole user group, it’s worth using the good old common sense approach based on arithmetic.

Thank you very much for reading this article. I hope it was insightful to you. If you have any follow-up questions or comments, please don’t hesitate to leave them in the comments section.

Figuring out the most unusual segments in data was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
