#### CAUSAL DATA SCIENCE

#### How to best target policies in the presence of churn

Cover, image by Author

A very common task in data science is *churn prediction*. However, predicting churn is often just an intermediate step and rarely the final objective. Usually, what we actually care about is **reducing churn**, which is a separate objective, not necessarily related. In fact, for example, knowing that long-term customers are less likely to churn than new customers is not an actionable insight since we cannot increase customers’ tenure. What we would like to know instead is how one (or more) treatment impacts churn. This is often referred to as **churn uplift**.

In this article, we will be going **beyond** both churn prediction and churn uplift and consider instead the ultimate goal of churn-prevention campaigns: **increasing revenue**. First of all, a policy that reduces churn might also have an impact on revenue, which should be taken into account. However, and more importantly, increasing revenue is relevant only if the customer does not churn. Vice-versa, decreasing churn is more relevant for high-revenue customers. This **interaction** between churn and revenue is critical in understanding the profitability of any treatment campaign and should not be overlooked.

### Gifts and Subscriptions

For the rest of the article, we are going to use a **toy example** to illustrate the main idea. Suppose we were a company interested in reducing our customer’s churn and ultimately increasing our revenue. Suppose we have decided to test a new idea: sending a **gift** of *1$* to our users. In order to test whether the treatment works, we have randomly sent it only to a subsample of our customer base.

cost = 1

Let’s have a look at the data we have at our disposal. I import the data-generating process dgp_gift() from src.dgp. I also import some plotting functions and libraries from src.utils.

from src.utils import *

from src.dgp import dgp_gift

dgp = dgp_gift(n=100_000)

df = dgp.generate_data()

df.head()Data snapshot, image by Author

We have information on 100_000 customers for which we observe the number of months they have been active customers, the revenue they generated in the last month (rev_old), the change in revenue between the last month and the previous one (rev_change), whether they were randomly sent a gift and the two outcomes of interest: churn, i.e. whether they are not active customers anymore and the revenue they have generated in the current month. We denote the outcomes with the letter *Y*, the treatment with the letter *W* and the other variables with the letter *X*.

Y = [‘churn’, ‘revenue’]

W = ‘gift’

X = [‘months’, ‘rev_old’, ‘rev_change’]

**Note** that, for simplicity, we consider a single-period snapshot of the data, and we summarize the panel structure of the data in just a couple of variables. Usually, we would have a longer time series but also a longer time horizon for what concerns the outcome (e.g. customer lifetime value).

We can represent the underlying data-generating process with the following **Directed Acyclic Graph (DAG)**. Nodes represent variables, and arrows represent potential causal relationships. I have highlighted in green the two relationships of interest: the effect of the gift on churn and revenue. Note that churn is related to revenue since churned customers, by definition, generate zero revenue.

DAG of the data generating process, image by Author

Importantly, past revenue and revenue change are predictors of churn and revenue but are not related to our intervention. On the contrary, the intervention affects churn and revenue differentially depending on the customer’s total active months.

While simplistic, this data-generating process aims at capturing an important **insight**: variables that are good predictors of churn or revenue, are not necessarily variables that predict churn or revenue **lift**. We will see later how this impacts our analysis.

Let’s start first by exploring the data.

#### Exploratory Data Analysis

Let’s start with churn. How many customers did the company lose last month?

df.churn.mean()0.19767

The company lost almost *20%* of its customers last month! Did the gift help in preventing churn?

We want to compare the churn frequency of customers that received the gift with the churn frequency of customers that did not receive the gift. Since the gift was randomized, the difference-in-means estimator is an unbiased estimator for the average treatment effect (ATE) of the gift on churn.

Average Treatment Effect, image by Author

We compute the difference-in-means estimate by linear regression. We also include other covariates to improve the efficiency of the estimator.

smf.ols(“churn ~ ” + W + ” + ” + ” + “.join(X), data=df).fit().summary().tables[1]Churn regression table, image by Author

It looks like the gift decreased churn by around *11* percentage points, i.e. almost one-third of the baseline level of *32%*! Did it also have an impact on revenue?

As for churn, we regress revenue on gift, our treatment variable, to estimate the average treatment effect.

smf.ols(“revenue ~ ” + W + ” + ” + ” + “.join(X), data=df).fit().summary().tables[1]Revenue regression table, image by Author

It looks like the gift on average increased revenue by *0.63$*, which means that it was **not profitable**. Does it mean that we should stop sending gifts to our customers? It depends. In fact, the gift might be effective for certain customer segments. We just need to identify them.

### Targeting Policies

In this section, we will try to understand if there is a data-informed profitable way to send the gift, by targeting specific customers. In particular, we will **compare** different targeting **policies** with the objective of increasing revenue.

Throughout this section, we will need some algorithms to either predict revenue, or churn, or the probability of receiving the gift. We use gradient-boosted tree models from the lightgbm library. We use the same models for all policies so that we cannot attribute differences in performance to prediction accuracy.

from lightgbm import LGBMClassifier, LGBMRegressor

To **evaluate** each policy denoted with *τ*, we compare its profits with the policy *Π⁽¹⁾*, with its profits without the policy *Π⁽⁰⁾*, over every single individual, in a separate validation dataset. Note that this is usually not possible since, for each customer, we only observe one of the two **potential outcomes**, with or without the gift. However, since we are working with synthetic data, we can do oracle evaluation. If you want to know more about how to evaluate uplift models with real data, I recommend my introductory article.

First of all, let’s define profits *Π* as the net revenue *R* when the customer does not churn *C*.

Profit formula, image by Author

Therefore, the overall effect on profits for treated individuals is given by the difference between the profits when treated *Π⁽¹⁾* minus the profits when not treated *Π⁽⁰⁾*.

Profit lift formula, image by author

The effect for untreated individuals is zero.

def evaluate_policy(policy):

data = dgp.generate_data(seed_data=4, seed_assignment=5, keep_po=True)

data[‘profits’] = (1 – data.churn) * data.revenue

baseline = (1-data.churn_c) * data.revenue_c

effect = policy(data) * (1-data.churn_t) * (data.revenue_t-cost) + (1-policy(data)) * (1-data.churn_c) * data.revenue_c

return np.sum(effect – baseline)

#### 1. Target Churning Customers

A first policy could be to just target **churning customers**. Let’s say we send the gift only to customers with above-average predicted churn.

model_churn = LGBMClassifier().fit(X=df[X], y=df[‘churn’])

policy_churn = lambda df : (model_churn.predict_proba(df[X])[:,1] > df.churn.mean())

evaluate_policy(policy_churn)-5497.46

The policy is not profitable and would lead to an aggregate **loss** of more than *5000$*.

You might think that the problem is the **arbitrary threshold**, but this is not the case. Below I plot the aggregate effect for all possible policy thresholds.

x = np.linspace(0, 1, 100)

y = [evaluate_policy(lambda df : (model_churn.predict_proba(df[X])[:,1] > p)) for pin x]

fig, ax = plt.subplots(figsize=(10, 3))

sns.lineplot(x=x, y=y).set(xlabel=’Churn Policy Threshold’, title=’Aggregate Effect’);

ax.axhline(y=0, c=’k’, lw=3, ls=’–‘);Aggregate effect by churn threshold, image by Author

As we can see, no matter the threshold, it is basically impossible to make any profit.

The problem is that the fact that a customer is likely to churn does not imply that the gift will have any impact on their churn probability. The two measures are not completely unrelated (e.g. we cannot decrease the churning probability of customers that have a 0% probability of churning), but they are not the same thing.

#### 2. Target revenue customers

Let’s now try a different policy: we send the gift only to **high-revenue customers**. For example, we might send the gift only to the top 10% of customers by revenue. The idea is that if the policy indeed decreases churn, these are the customers for whom decreasing churn is more profitable.

model_revenue = LGBMRegressor().fit(X=df[X], y=df[‘revenue’])

policy_revenue = lambda df : (model_revenue.predict(df[X]) > np.quantile(df.revenue, 0.9))

evaluate_policy(policy_revenue)-4730.82

The policy is again unprofitable, leading to substantial **losses**. As before, this is not a problem of selecting the threshold, as we can see in the plot below. The best we can do is set a threshold so high that we do not treat anyone, and we make zero profits.

x = np.linspace(0, 100, 100)

y = [evaluate_policy(lambda df : (model_revenue.predict(df[X]) > c)) for c in x]

fig, ax = plt.subplots(figsize=(10, 3))

sns.lineplot(x=x, y=y).set(xlabel=’Revenue Policy Threshold’, title=’Aggregate Effect’);

ax.axhline(y=0, c=’k’, lw=3, ls=’–‘);Aggregate effect by revenue threshold, image by Author

The problem is that, in our setting, the churn probability of high-revenue customers does not decrease enough to make the gift profitable. This is also partially due to the fact, often observed in reality, that high-revenue customers are also the least likely to churn, to begin with.

Let’s now consider a more relevant set of policies: policies based on **uplift**.

#### 3. Target churn uplift customers

A more sensible approach would be to target customers whose churn probability decreases the most when receiving the *1$* gift. We estimate churn uplift using the double-robust estimator, one of the best-performing uplift models. If you are unfamiliar with meta-learners, I recommend starting from my introductory article.

We import the doubly-robust learner from econml, a Microsoft library.

from econml.dr import DRLearner

DR_learner_churn = DRLearner(model_regression=LGBMRegressor(), model_propensity=LGBMClassifier(), model_final=LGBMRegressor())

DR_learner_churn.fit(df[‘churn’], df[W], X=df[X]);

Now that we have estimated churn uplift, we might be tempted to just target customers with a high negative uplift (negative, since we want to *decrease* churn). For example, we might send the gift to all customers with an estimated uplift larger than the average churn.

policy_churn_lift = lambda df : DR_learner_churn.effect(df[X]) < – np.mean(df.churn)

evaluate_policy(policy_churn_lift)-3925.24

The policy is still unprofitable, leading to almost *4000$* in losses.

The problem is that we haven’t considered the cost of the policy. In fact, decreasing the churn probability is **only profitable for high-revenue customers**. Take the extreme case: avoiding churn of a customer that does not generate any revenue is not worth any intervention.

Therefore, let’s only send the gift to customers whose churn probability weighted by revenue decreases more than the cost of the gift.

model_revenue_1 = LGBMRegressor().fit(X=df.loc[df[W] == 1, X], y=df.loc[df[W] == 1, ‘revenue’])

policy_churn_lift = lambda df : – DR_learner_churn.effect(df[X]) * model_revenue_1.predict(df[X]) > cost

evaluate_policy(policy_churn_lift)318.03

This policy is finally profitable!

However, we still have not considered one channel: the intervention might also affect the revenue of existing customers.

#### 4. Target revenue uplift customers

A symmetric approach to the previous one would be to consider only the impact on revenue, ignoring the impact on churn. We could estimate the revenue uplift for non-churning customers and treat only customers whose incremental effect on revenue, net of churn, is greater than the cost of the gift.

DR_learner_netrevenue = DRLearner(model_regression=LGBMRegressor(), model_propensity=LGBMClassifier(), model_final=LGBMRegressor())

DR_learner_netrevenue.fit(df.loc[df.churn==0, ‘revenue’], df.loc[df.churn==0, W], X=df.loc[df.churn==0, X]);

model_churn_1 = LGBMClassifier().fit(X=df.loc[df[W] == 1, X], y=df.loc[df[W] == 1, ‘churn’])

policy_netrevenue_lift = lambda df : DR_learner_netrevenue.effect(df[X]) * (1-model_churn_1.predict(df[X])) > cost

evaluate_policy(policy_netrevenue_lift)50.80

This policy is profitable as well but ignores the effect on churn. How do we combine this policy with the previous one?

#### 5. Target revenue uplift customers

The best way to efficiently **combine** both the effect on churn and the effect on net revenue is simply to estimate **total revenue uplift**. The implied optimal policy is to treat customers whose total revenue uplift is greater than the cost of the gift.

DR_learner_revenue = DRLearner(model_regression=LGBMRegressor(), model_propensity=LGBMClassifier(), model_final=LGBMRegressor())

DR_learner_revenue.fit(df[‘revenue’], df[W], X=df[X]);

policy_revenue_lift = lambda df : (DR_learner_revenue.effect(df[X]) > cost)

evaluate_policy(policy_revenue_lift)2028.21

It looks like this is by far the best policy, generating an aggregate profit of more than *2000$*!

The result is starking if we compare all the different policies.

policies = [policy_churn, policy_revenue, policy_churn_lift, policy_netrevenue_lift, policy_revenue_lift]

df_results = pd.DataFrame()

df_results[‘policy’] = [‘churn’, ‘revenue’, ‘churn_L’, ‘netrevenue_L’, ‘revenue_L’]

df_results[‘value’] = [evaluate_policy(policy) for policy in policies]

fig, ax = plt.subplots()

sns.barplot(df_results, x=’policy’, y=’value’).set(title=’Overall Incremental Effect’)

plt.axhline(0, c=’k’);Comparing policies, image by Author

#### Intuition and Decomposition

If we compare the different policies, it is clear that targeting high-revenue or high-churn probability customers directly were the **worst choices**. This is not necessarily always the case, but it happened in our simulated data because of two facts that are also common in many real scenarios:

Revenue and churn probability are negatively correlatedThe effect of the gift on churn (or revenue) was not strongly negatively (or positively for revenue) correlated with the baseline values

Either one of these two facts can be enough to make targeting revenue or churn a bad strategy. What one should target instead is customers with a high **incremental** effect. And it’s best to directly use as outcome the variable of interest, revenue in this case, whenever available.

To better understand the mechanism, we can **decompose** the aggregate effect of a policy on profits into three parts.

Profit lift decomposition, image by Author

This implies that there are **three channels** that make treating a customer profitable.

If it’s a *high-revenue* customer and the treatment *decreases* its churn probabilityIf it’s a *non-churning* customer and the treatment *increases* its revenueIt the treatment has a strong impact on *both* its revenue and churn probability

Targeting by churn uplift exploits only the first channel, targeting by net revenue uplift exploits only the second channel, and targeting by total revenue uplift exploits all three channels, making it the **most effective** method.

#### Bonus: weighting

As highlighted by Lemmens, Gupta (2020), sometimes it might be worth **weighting** observations when estimating model uplift. In particular, it might be worth giving more weight to observations close to the treatment policy threshold.

The **idea** is that weighting generally decreases the efficiency of the estimator. However, we are not interested in having correct estimates for all the observations, but rather we are interested in estimating the **policy threshold** correctly. In fact, whether you estimate a net profit of *1$* or *1000$* it does not matter: the implied policy is the same: send the gift. However, estimating a net profit of *1$* rather than *-1$* reverses the policy implications. Therefore, a large loss in accuracy away from the threshold sometimes is worth a small gain in accuracy at the threshold.

Let’s try using negative exponential weights, decreasing in distance from the threshold.

DR_learner_revenue_w = DRLearner(model_regression=LGBMRegressor(), model_propensity=LGBMClassifier(), model_final=LGBMRegressor())

w = np.exp(1 + np.abs(DR_learner_revenue.effect(df[X]) – cost))

DR_learner_revenue_w.fit(df[‘revenue’], df[W], X=df[X], sample_weight=w);

policy_revenue_lift_w = lambda df : (DR_learner_revenue_w.effect(df[X]) > cost)

evaluate_policy(policy_revenue_lift_w)1398.19

In our case, weighting is not worth it: the implied policy is still profitable but less than the one obtained with the unweighted model, *2028$*.

**Conclusion**

In this article, we have why and how one should go **beyond** churn prediction and churn uplift modeling. In particular, one should concentrate on the final business objective of increasing profitability. This implies shifting the focus from *prediction* to *uplift* but also combining churn and revenue into a single outcome.

An important caveat concerns the dimension of the data available. We have used a toy dataset that highly simplifies the problem in at least **two dimensions**. First of all, backward, we normally have longer time series that can (and should) be used for both prediction and modeling purposes. Second, forward, one should combine churn with a longer-run estimate of customer profitability, usually referred to as *customer lifetime value*.

#### References

Kennedy (2022), “Towards Optimal Doubly Robust Estimation of Heterogeneous Causal Effects”Bonvini, Kennedy, Keele (2021), “Minimax Optimal Subgroup Identification”Lemmens, Gupta (2020), “Managing Churn to Maximize Profits”

#### Related Articles

Evaluating Uplift ModelsUnderstanding Meta LearnersUnderstanding AIPW, the Doubly-Robust Estimator

#### Code

You can find the original Jupyter Notebook here:

Blog-Posts/notebooks/beyond_churn.ipynb at main · matteocourthoud/Blog-Posts

#### Thank you for reading!

*I really appreciate it! *🤗* If you liked the post and want to see more, consider **following me**. I post once a week on topics related to causal inference and data analysis. I try to keep my posts simple but precise, always providing code, examples, and simulations.*

*Also, a small **disclaimer**: I write to learn, so mistakes are the norm, even though I try my best. Please, when you spot them, let me know. I also appreciate suggestions on new topics!*

Beyond Churn Prediction and Churn Uplift was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.