CPSC 330 Lecture 20: Survival analysis

Varada Kolhatkar

Focus on the breath!

Announcements

  • HW9 has been released (due on December 5th)
    • Almost there! You’ve got this! 😊
  • Midterm 2 grades were released last week.

Recap: iClicker questions

(iClicker) Exercise 20.1

Select all of the following statements which are TRUE.

    1. We need to be careful when splitting the data when working with time series data.
    2. Cross-validation in time series can be applied in the same way as in other machine learning tasks.
    3. In time series forecasting, the future value of a series can only be predicted based on its past values and cannot incorporate other variables.
    4. When we used a RandomForestRegressor model on the POSIX time feature, it predicted a straight line on the test data because tree-based models are inherently unable to extrapolate (i.e., make predictions outside the range of the training data).

Customer churn

Customer churn, also known as customer attrition, refers to the phenomenon where customers or subscribers stop doing business with a company or service.

Monthly subscriber churn rates for various streaming services

Question: Is a smaller or a larger churn rate more desirable for a subscription-based company?

  • A smaller churn rate is better (means fewer customers leaving)
  • Lower churn = higher customer retention = more stable revenue

The challenge: Predicting when, not just whether

Imagine you work for a subscription-based telecom company.

  • Your team wants to predict when a customer will churn, not just whether they churn.
  • This helps the company:
    • Target retention strategies at the right time
    • Allocate resources efficiently to high-risk customers
    • Understand which factors accelerate or delay churn
  • Our goal: model time to churn while accounting for customers who haven’t churned yet.

Customer Churn Dataset

If you wanted to predict whether a customer churns, what kind of model from your ML toolbox would you use?

customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
6464 4726-DLWQN Male 1 No No 50 Yes Yes DSL Yes ... No No Yes No Month-to-month Yes Bank transfer (automatic) 70.35 3454.6 No
5707 4537-DKTAL Female 0 No No 2 Yes No DSL No ... No No No No Month-to-month No Electronic check 45.55 84.4 No
3442 0468-YRPXN Male 0 No No 29 Yes No Fiber optic No ... Yes Yes Yes Yes Month-to-month Yes Credit card (automatic) 98.80 2807.1 No
3932 1304-NECVQ Female 1 No No 2 Yes Yes Fiber optic No ... Yes No No No Month-to-month Yes Electronic check 78.55 149.55 Yes
6124 7153-CHRBV Female 0 Yes Yes 57 Yes No DSL Yes ... Yes Yes No No One year Yes Mailed check 59.30 3274.35 No

5 rows × 21 columns

Churn prediction as binary classification

  • When we treat churn as a binary classification problem, we only ask: Has the customer churned by the time of data collection?

  • Limitations of this approach:

    • Answers only “Yes/No” and discards when churn occurred
    • Treats a customer who churned after 1 month the same as one who churned after 5 years
    • Ignores the time dimension entirely
  • Is that what we want? Not if timing matters for business decisions!

Predicting tenure

tenure Churn
6464 50 No
5707 2 No
3442 29 No
3932 2 Yes
6124 57 No
301 4 Yes
3552 68 No
2874 64 No

In our dataset, the tenure column is the number of months the customer has stayed with the company. Can we use the techniques you have learned so far (e.g., regression models) to predict the time to churn (tenure, in our case)?

The problem: Incomplete information

tenure Churn
6464 50 No
5707 2 No
3442 29 No
3932 2 Yes
6124 57 No
301 4 Yes
3552 68 No
2874 64 No
  • We only have information about tenure up to the point we collected the data.
  • For customers who haven’t churned:
    • We don’t know their true “time to churn”
    • We only know they lasted at least this long
    • Their actual churn time is unknown (incomplete information)
  • This is called right-censoring: the event had not yet occurred when we stopped observing the customer

Types of censoring

  • Right-censoring: Event hasn’t occurred yet (most common in practice)
  • Left-censoring: Event occurred before observation started
  • Interval-censoring: Event occurred within a time interval
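
One way to picture the three types is to store, for each subject, a lower and an upper bound on the unknown event time. The sketch below is only an illustration: the numbers (10 and 14 months) and the dictionary layout are hypothetical and do not come from the churn dataset.

```python
import math

# Each entry is (lower, upper): the true event time lies somewhere between the two bounds.
# The values 10 and 14 (months) are made-up illustration values.
observations = {
    "exact": (14.0, 14.0),               # event observed at month 14
    "right-censored": (14.0, math.inf),  # still event-free at month 14; event comes later, if ever
    "left-censored": (0.0, 14.0),        # event already happened before observation began at month 14
    "interval-censored": (10.0, 14.0),   # event happened between two check-ins
}

for kind, (lower, upper) in observations.items():
    print(f"{kind:18s} event time in [{lower}, {upper}]")
```

An exact observation is the special case where both bounds coincide; the wider the interval, the less we know about the event time.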

Time to event and censoring

tenure Churn
6464 50 No
5707 2 No
3442 29 No
3932 2 Yes
6124 57 No
301 4 Yes
3552 68 No
2874 64 No
  • Many customers in the dataset are still active (Churn = “No”).
  • They are right-censored: we only know their tenure so far, not their final tenure.
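
Survival methods typically expect each customer to be described by two values: a duration and an event indicator. A minimal sketch of that encoding, using only the eight sample rows shown in the table (the variable names are illustrative assumptions, not part of the dataset):

```python
# Sample rows from the tenure/Churn table: (row index, tenure in months, Churn).
rows = [
    (6464, 50, "No"),
    (5707, 2, "No"),
    (3442, 29, "No"),
    (3932, 2, "Yes"),
    (6124, 57, "No"),
    (301, 4, "Yes"),
    (3552, 68, "No"),
    (2874, 64, "No"),
]

# duration: observed time in months.
# event: True if churn was observed, False if the customer is right-censored.
durations = [tenure for _, tenure, _ in rows]
events = [churn == "Yes" for _, _, churn in rows]

n_censored = events.count(False)
censored_fraction = n_censored / len(rows)
print(f"{n_censored} of {len(rows)} customers are right-censored ({censored_fraction:.0%})")
# prints: 6 of 8 customers are right-censored (75%)
```

For a censored customer the duration is a lower bound on the time to churn, not the time to churn itself.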

Time-to-event problems

Time-to-event problems appear everywhere. Examples include:

  • Time until a customer leaves a subscription service
  • Time until a disease causes death
  • Time until equipment fails
  • Duration of unemployment until finding a job
  • Waiting time until a scheduled surgery

These all follow the same pattern: the event happens once, and we care about how long it takes.

Approaches

Approach 1: Only use churned customers ⛔️

Suppose we only consider rows where Churn == “Yes” and throw away all right-censored customers.

customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
3932 1304-NECVQ Female 1 No No 2 Yes Yes Fiber optic No ... Yes No No No Month-to-month Yes Electronic check 78.55 149.55 Yes
301 8098-LLAZX Female 1 No No 4 Yes Yes Fiber optic No ... No No Yes Yes Month-to-month Yes Electronic check 95.45 396.1 Yes
5540 3803-KMQFW Female 0 Yes Yes 1 Yes No No No internet service ... No internet service No internet service No internet service No internet service Month-to-month No Mailed check 20.55 20.55 Yes
4084 2777-PHDEI Female 0 No No 1 Yes No Fiber optic No ... No No Yes No Month-to-month No Electronic check 78.05 78.05 Yes

4 rows × 21 columns

This throws away valuable information from active customers!

iClicker

If we consider only rows where Churn == “Yes” and throw away all right-censored customers, would we overestimate or underestimate the average survival time?

    1. Overestimate
    2. Underestimate
    3. Cannot tell based on the provided information

Approach 2: Assume everyone churns now ⛔️

Treat all current tenure values as final, even for active customers.

tenure Churn
6464 50 No
5707 2 No
3442 29 No
3932 2 Yes
6124 57 No

This assumes active customers (Churn = “No”) will churn immediately.

iClicker

Would assuming everyone churns now underestimate or overestimate the true time to churn?

    1. Overestimate
    2. Underestimate
    3. Cannot tell based on the provided information

  • Key insight: Dropping censored customers, or treating their tenure so far as their final churn time, biases our estimates.
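
The bias from both naive approaches can be seen in a small simulation where, unlike in real data, the true churn times are known. This is a sketch under stated assumptions: exponentially distributed churn times with a mean of 24 months and a fixed 12-month observation window are made-up choices for illustration, not properties of the telecom dataset.

```python
import random

random.seed(0)

MEAN_TIME_TO_CHURN = 24.0  # assumed true mean, in months
OBSERVATION_WINDOW = 12.0  # assumed: data collected 12 months after sign-up
N_CUSTOMERS = 10_000

# True churn times, which we would never fully observe in practice.
true_times = [random.expovariate(1.0 / MEAN_TIME_TO_CHURN) for _ in range(N_CUSTOMERS)]

# What we actually observe: tenure so far, and whether churn has already happened.
observed = [min(t, OBSERVATION_WINDOW) for t in true_times]
churned = [t <= OBSERVATION_WINDOW for t in true_times]


def mean(values):
    return sum(values) / len(values)


true_mean = mean(true_times)
approach_1 = mean([o for o, c in zip(observed, churned) if c])  # churned customers only
approach_2 = mean(observed)  # treat every observed tenure as final

print(f"true mean time to churn:       {true_mean:.1f} months")
print(f"Approach 1 (churned only):     {approach_1:.1f} months")
print(f"Approach 2 (everyone churns):  {approach_2:.1f} months")
```

In this setup neither naive estimate can exceed the 12-month window, so both fall well below the true mean.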

Approach 3: Survival analysis ✅

Survival analysis explicitly models time until an event and properly handles censoring.

Common methods:

  • Kaplan–Meier estimator: Non-parametric method for estimating survival curves
  • Cox proportional hazards model: Semi-parametric regression model for survival data
  • Survival forests: Random forest variant adapted for censored data

Survival analysis

These methods allow estimation of:

Survival function:

  • \(S(t) = P(T > t)\) (probability of surviving past time \(t\))
  • Imagine starting with 100 customers on Day 0. The survival function tells you: What fraction of them are still active (have not churned) at time \(t\)?
  • Example: \(S(3) = 0.80\) means 80% of customers haven’t churned by month 3
  • Example: \(S(12) = 0.40\) means 40% of customers haven’t churned by month 12
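
To make the survival function concrete, here is a minimal from-scratch sketch of the Kaplan–Meier estimate of \(S(t)\). At each time where churn was observed, it multiplies the running estimate by the fraction of at-risk customers who did not churn. The function name `kaplan_meier` is our own, and the data are just the eight sample rows from the tenure/Churn table, so the numbers are illustrative only.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time where at least one event was observed."""
    survival = 1.0
    curve = []
    for t in sorted({d for d, e in zip(durations, events) if e}):
        # Customers still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Customers whose churn was observed exactly at time t.
        n_events = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1 - n_events / at_risk
        curve.append((t, survival))
    return curve


# Eight sample rows (True = churned, False = right-censored).
durations = [50, 2, 29, 2, 57, 4, 68, 64]
events = [False, False, False, True, False, True, False, False]

for t, s in kaplan_meier(durations, events):
    print(f"S({t}) = {s:.3f}")
# prints:
# S(2) = 0.875
# S(4) = 0.729
```

Censored customers are never counted as events, but they stay in the at-risk count up to their last observed month. That is how the estimator uses their partial information instead of discarding it.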

Hazard function: \(h(t)\) (instantaneous risk of the event at time \(t\))

  • Imagine a customer who has not churned yet and is still active at month 10.
  • The hazard tells us: How likely are they to churn right now, at month 10?
  • Analogy with human mortality: the hazard of death is low at age 20 and much higher at age 80
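
For reference, the standard textbook definition for a continuous event time \(T\) is given below. It is not stated on the slides, and the symbols \(\Delta t\) and \(u\) are introduced here only for this formula. The second identity shows that the hazard and the survival function carry the same information:

\[
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t},
\qquad
S(t) = \exp\left(-\int_0^t h(u)\,du\right)
\]

The conditioning on \(T \ge t\) is what "has not churned yet" means: the hazard at month 10 only concerns customers who are still active at month 10.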

iClicker

Select all of the following statements which are TRUE.

    1. Right censoring occurs when the event of interest has not been observed for all study subjects by the end of the study period.
    2. Right censoring implies that the data is missing completely at random.
    3. In the presence of right-censored data, binary classification models can be applied directly without any modifications or special considerations.
    4. If we apply the Ridge regression model to predict tenure in right-censored data, we are likely to underestimate it because the tenure observed in our data is shorter than what it would be in reality.

Class demo

What did we learn today?

  • Censoring and incorrect approaches to handling it
    • Throw away people who haven’t churned
    • Assume everyone churns today
  • Predicting tenure (when) vs. predicting churn (whether)
  • Survival analysis encompasses both of these, and deals with censoring
  • And it can make rich and interesting predictions!
  • Kaplan–Meier (KM) estimator -> doesn’t look at features
  • Cox proportional hazards (CPH) model -> like linear regression, does look at the features
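
The comparison with linear regression can be made concrete with the standard form of the Cox proportional hazards model. The notation is ours rather than from the slides: \(h_0(t)\) is an unspecified baseline hazard, \(x_1, \dots, x_p\) are a customer's features, and \(\beta_1, \dots, \beta_p\) are learned coefficients.

\[
h(t \mid x) = h_0(t)\,\exp(\beta_1 x_1 + \dots + \beta_p x_p)
\]

As in linear regression, the features enter through a weighted sum. Here a positive \(\beta_j\) raises the hazard (earlier expected churn), and \(\exp(\beta_j)\) is the factor by which the hazard is multiplied when \(x_j\) increases by one unit.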