CPSC 330 Lecture 20: Survival analysis
Announcements
- HW9 has been released (due on December 5th)
- Almost there! You’ve got this! 😊
- Midterm 2 grades were released last week.
Recap: iClicker questions
(iClicker) Exercise 20.1
Select all of the following statements which are TRUE.
- We need to be careful when splitting the data when working with time series data.
- Cross-validation in time series can be applied like in other machine learning tasks.
- In time series forecasting, the future value of a series can only be predicted based on its past values and cannot incorporate other variables.
- When we used the RandomForestRegressor model on the POSIX time feature, it predicted a straight line on the test data because tree-based models are inherently unable to extrapolate (i.e., make predictions outside the range of the training data).
Customer churn
Customer churn, also known as customer attrition, refers to the phenomenon where customers or subscribers stop doing business with a company or service.
Monthly subscriber churn rates for various streaming services
Question: Is a smaller or a larger churn rate more desirable for a subscription-based company?
- A smaller churn rate is better (means fewer customers leaving)
- Lower churn = higher customer retention = more stable revenue
The challenge: Predicting when, not just whether
Imagine you work for a subscription-based telecom company.
- Your team wants to predict when a customer will churn, not just whether they churn.
- This helps the company:
- Target retention strategies at the right time
- Allocate resources efficiently to high-risk customers
- Understand which factors accelerate or delay churn
- Our goal: model time to churn while accounting for customers who haven’t churned yet.
Churn prediction as binary classification
When we treat churn as a binary classification problem, we only ask: Has the customer churned by the time of data collection?
Limitations of this approach:
- Answers only “Yes/No” and discards when churn occurred
- Treats a customer who churned after 1 month the same as one who churned after 5 years
- Ignores the time dimension entirely
Is that what we want? Not if timing matters for business decisions!
Predicting tenure
| | tenure | Churn |
|---|---|---|
| 6464 | 50 | No |
| 5707 | 2 | No |
| 3442 | 29 | No |
| 3932 | 2 | Yes |
| 6124 | 57 | No |
| 301 | 4 | Yes |
| 3552 | 68 | No |
| 2874 | 64 | No |
In our dataset, the tenure column is the number of months the customer has stayed with the company. Can we use the techniques you learned so far (e.g., regression models) to predict the time (tenure in our case)?
Types of censoring
- Right-censoring: Event hasn’t occurred yet (most common in practice)
- Left-censoring: Event occurred before observation started
- Interval-censoring: Event occurred within a time interval
Time to event and censoring
| | tenure | Churn |
|---|---|---|
| 6464 | 50 | No |
| 5707 | 2 | No |
| 3442 | 29 | No |
| 3932 | 2 | Yes |
| 6124 | 57 | No |
| 301 | 4 | Yes |
| 3552 | 68 | No |
| 2874 | 64 | No |
- Many customers in the dataset are still active (Churn = “No”).
- They are right-censored: we only know their tenure so far, not their final tenure.
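One common way to represent right-censored data is a pair of values per customer: a duration and a flag saying whether the event was actually observed. The sketch below encodes the eight sample rows from the table this way. The variable names `durations` and `event_observed` are illustrative choices, not names from the lecture's code.

```python
# Sample rows from the table above: (row index, tenure in months, Churn).
rows = [
    (6464, 50, "No"),
    (5707, 2, "No"),
    (3442, 29, "No"),
    (3932, 2, "Yes"),
    (6124, 57, "No"),
    (301, 4, "Yes"),
    (3552, 68, "No"),
    (2874, 64, "No"),
]

# duration: how long we have observed the customer so far.
# event_observed: True if churn was seen, False if the row is right-censored.
durations = [tenure for _, tenure, _ in rows]
event_observed = [churn == "Yes" for _, _, churn in rows]

n_censored = event_observed.count(False)
print(f"{n_censored} of {len(rows)} customers are right-censored")
```

For a censored row, the duration is only a lower bound on the customer's final tenure.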
Time-to-event problems
Time-to-event problems appear everywhere. Examples include:
- Time until a customer leaves a subscription service
- Time until a disease causes death
- Time until equipment fails
- Duration of unemployment until finding a job
- Waiting time until a scheduled surgery
These all follow the same pattern: the event happens once, and we care about how long it takes.
Approach 1: Only use churned customers ⛔️
Suppose we only consider rows where Churn == “Yes” and throw away all right-censored customers.
| 3932 | 1304-NECVQ | Female | 1 | No | No | 2 | Yes | Yes | Fiber optic | No | ... | Yes | No | No | No | Month-to-month | Yes | Electronic check | 78.55 | 149.55 | Yes |
| 301 | 8098-LLAZX | Female | 1 | No | No | 4 | Yes | Yes | Fiber optic | No | ... | No | No | Yes | Yes | Month-to-month | Yes | Electronic check | 95.45 | 396.1 | Yes |
| 5540 | 3803-KMQFW | Female | 0 | Yes | Yes | 1 | Yes | No | No | No internet service | ... | No internet service | No internet service | No internet service | No internet service | Month-to-month | No | Mailed check | 20.55 | 20.55 | Yes |
| 4084 | 2777-PHDEI | Female | 0 | No | No | 1 | Yes | No | Fiber optic | No | ... | No | No | Yes | No | Month-to-month | No | Electronic check | 78.05 | 78.05 | Yes |
4 rows × 21 columns
This throws away valuable information from active customers!
iClicker
Would only considering rows where Churn == “Yes” and throwing away all right-censored customers overestimate or underestimate the average survival time?
- Overestimate
- Underestimate
- Cannot tell based on the provided information
Approach 2: Assume everyone churns now ⛔️
Treat all current tenure values as final, even for active customers.
| | tenure | Churn |
|---|---|---|
| 6464 | 50 | No |
| 5707 | 2 | No |
| 3442 | 29 | No |
| 3932 | 2 | Yes |
| 6124 | 57 | No |
This assumes active customers (Churn = “No”) will churn immediately.
iClicker
Would assuming everyone churns now underestimate or overestimate tenure?
- Overestimate
- Underestimate
- Cannot tell based on the provided information
- Key insight: Ignoring or removing censored cases biases our estimates. Both Approach 1 and Approach 2 tend to underestimate how long customers stay.
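A small simulation can make the bias concrete. This is a sketch under assumed conditions, not the lecture's data: true lifetimes are drawn from an exponential distribution with a hypothetical mean of 24 months, customers sign up uniformly over a hypothetical 36-month window, and anyone still active at the end of the window is right-censored.

```python
import random

random.seed(0)

n = 10_000
mean_lifetime = 24.0   # assumed true mean tenure in months (hypothetical)
study_length = 36.0    # assumed observation window in months (hypothetical)

true_lifetimes, observed, churned = [], [], []
for _ in range(n):
    lifetime = random.expovariate(1 / mean_lifetime)  # true time to churn
    signup = random.uniform(0, study_length)          # when the customer joined
    follow_up = study_length - signup                 # how long we could observe them
    true_lifetimes.append(lifetime)
    observed.append(min(lifetime, follow_up))         # tenure recorded in the data
    churned.append(lifetime <= follow_up)             # False means right-censored


def mean(values):
    return sum(values) / len(values)


true_mean = mean(true_lifetimes)
approach_1 = mean([t for t, c in zip(observed, churned) if c])  # churned customers only
approach_2 = mean(observed)                                     # treat every tenure as final

print(f"true mean tenure:          {true_mean:.1f}")
print(f"Approach 1 (churned only): {approach_1:.1f}")
print(f"Approach 2 (all as final): {approach_2:.1f}")
```

In this setup both naive averages fall below the true mean. Approach 1 keeps only customers whose lifetimes were short enough to be observed, and Approach 2 records a tenure that can never exceed the true lifetime.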
Approach 3: Survival analysis ✅
Survival analysis explicitly models time until an event and properly handles censoring.
Common methods:
- Kaplan–Meier estimator: Non-parametric method for estimating survival curves
- Cox proportional hazards model: Semi-parametric regression model for survival data
- Survival forests: Random forest variant adapted for censored data
Survival analysis
These methods allow estimation of:
Survival function:
- \(S(t) = P(T > t)\) (probability of surviving past time \(t\))
- Imagine starting with 100 customers on Day 0. The survival function tells you: What fraction of them are still active (have not churned) at time \(t\)?
- \(S(3) = 0.80\) (80% haven’t churned by month 3)
- \(S(12) = 0.40\) (40% haven’t churned by month 12)
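To show how a survival curve can be estimated from censored data, here is a minimal from-scratch sketch of the Kaplan–Meier estimator applied to the eight sample rows shown earlier. The function name `kaplan_meier` is an illustrative choice, and eight rows are far too few for a meaningful estimate; in practice a library implementation would be used on the full dataset.

```python
def kaplan_meier(durations, event_observed):
    """Return [(t, S(t))] at each time t where at least one event was observed."""
    event_times = sorted({t for t, e in zip(durations, event_observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Customers still under observation just before t. A customer censored
        # at exactly t is counted as at risk at t (the usual convention).
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, event_observed) if e and d == t)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Tenure in months and whether churn was observed (False = right-censored).
durations = [50, 2, 29, 2, 57, 4, 68, 64]
event_observed = [False, False, False, True, False, True, False, False]

curve = kaplan_meier(durations, event_observed)
for t, s in curve:
    print(f"S({t}) = {s:.3f}")
# prints:
# S(2) = 0.875
# S(4) = 0.729
```

Censored customers never count as events, but they stay in the at-risk count until their last observed month. That is how the estimator uses their information without treating their tenure as final.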
Hazard function: \(h(t)\) (instantaneous risk of the event at time \(t\))
- Imagine a customer who has not churned yet and is still active at month 10.
- The hazard tells us: How likely are they to churn right now, at month 10?
- Human mortality: hazard is low at age 20, higher at 80
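For reference, the standard continuous-time definition of the hazard, and its link to the survival function \(S(t)\) defined above, can be written as follows. Here \(T\) is the time of the event, as in \(S(t) = P(T > t)\).

\[
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t},
\qquad
S(t) = \exp\left(-\int_0^t h(u)\,du\right)
\]

The conditioning on \(T \ge t\) is what matches the "still active at month 10" framing: the hazard only concerns customers who have survived up to time \(t\).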
iClicker
Select all of the following statements which are TRUE.
- Right censoring occurs when the event of interest has not been observed for every study subject by the end of the study period.
- Right censoring implies that the data is missing completely at random.
- In the presence of right-censored data, binary classification models can be applied directly without any modifications or special considerations.
- If we apply the Ridge regression model to predict tenure in right-censored data, we are likely to underestimate it because the tenure observed in our data is shorter than what it would be in reality.
What did we learn today?
- Censoring and incorrect approaches to handling it
- Throw away people who haven’t churned
- Assume everyone churns today
- Predicting tenure vs. churned
- Survival analysis encompasses both of these, and deals with censoring
- And it can make rich and interesting predictions!
- KM model -> doesn’t look at features
- CPH model -> like linear regression, does look at the features
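To illustrate how the Cox proportional hazards (CPH) model uses features: in its standard form, the hazard is a baseline hazard \(h_0(t)\) multiplied by \(\exp(\beta \cdot x)\), so features scale the risk up or down. The sketch below uses made-up coefficients and feature names (`monthly_charges`, `has_contract`) purely for illustration; they are not fitted to the lecture's dataset.

```python
import math

# Hypothetical coefficients for illustration only; NOT fitted to any data.
beta = {"monthly_charges": 0.01, "has_contract": -1.2}


def relative_hazard(features):
    """exp(beta . x): the multiplier applied to the baseline hazard h0(t)."""
    return math.exp(sum(beta[name] * value for name, value in features.items()))


# Two hypothetical customers who differ only in whether they have a contract.
customer_a = {"monthly_charges": 80.0, "has_contract": 0}
customer_b = {"monthly_charges": 80.0, "has_contract": 1}

# The baseline hazard cancels in the ratio, so it does not depend on t.
hazard_ratio = relative_hazard(customer_a) / relative_hazard(customer_b)
print(f"A's hazard is {hazard_ratio:.2f}x B's at every time t")
```

The ratio being constant over time is the "proportional hazards" assumption. As in linear regression, the sign and size of each coefficient indicate the direction and strength of a feature's effect.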