CPSC 330 Lecture 9: Classification Metrics

Varada Kolhatkar

Focus on the breath!

Announcements

  • Important information about midterm 1
    • https://piazza.com/class/mekbcze4gyber/post/162
    • Good news for you: You’ll have access to our course notes in the midterm!
  • HW4 was due on Monday, Oct 6th 11:59 pm.
  • HW5 has been released. It’s a project-type assignment, and you have until Oct 27th to work on it.

ML workflow

Accuracy

  • So far, we’ve been measuring model performance using Accuracy.
  • Accuracy is the proportion of all predictions that were correct — whether positive or negative.

\[ \text{Accuracy} = \frac{\text{correct classifications}}{\text{total classifications}} \]

  • But is accuracy always the right metric to evaluate a model? 🤔

A fraud classification example

(139554, 29)
Class Time Amount V1 V2 V3 V4 V5 V6 V7 ... V19 V20 V21 V22 V23 V24 V25 V26 V27 V28
64454 0 51150.0 1.00 -3.538816 3.481893 -1.827130 -0.573050 2.644106 -0.340988 2.102135 ... -1.509991 1.345904 0.530978 -0.860677 -0.201810 -1.719747 0.729143 -0.547993 -0.023636 -0.454966
37906 0 39163.0 18.49 -0.363913 0.853399 1.648195 1.118934 0.100882 0.423852 0.472790 ... 0.810267 -0.192932 0.687055 -0.094586 0.121531 0.146830 -0.944092 -0.558564 -0.186814 -0.257103
79378 0 57994.0 23.74 1.193021 -0.136714 0.622612 0.780864 -0.823511 -0.706444 -0.206073 ... 0.258815 -0.178761 -0.310405 -0.842028 0.085477 0.366005 0.254443 0.290002 -0.036764 0.015039
245686 0 152859.0 156.52 1.604032 -0.808208 -1.594982 0.200475 0.502985 0.832370 -0.034071 ... -1.009429 -0.040448 0.519029 1.429217 -0.139322 -1.293663 0.037785 0.061206 0.005387 -0.057296
60943 0 49575.0 57.50 -2.669614 -2.734385 0.662450 -0.059077 3.346850 -2.549682 -1.430571 ... 0.157993 -0.430295 -0.228329 -0.370643 -0.211544 -0.300837 -1.174590 0.573818 0.388023 0.161782

5 rows × 31 columns
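
The loading code isn’t shown on this slide. Below is a minimal sketch of how a dataset like this could be read and split; the file name creditcard.csv (the Kaggle credit card fraud dataset) and the split parameters are assumptions, not the exact course setup.

import pandas as pd
from sklearn.model_selection import train_test_split

cc_df = pd.read_csv("creditcard.csv")                 # Time, V1..V28, Amount, Class
train_df, test_df = train_test_split(cc_df, test_size=0.3, random_state=123)
X_train = train_df.drop(columns=["Class"])            # features
y_train = train_df["Class"]                           # 1 = fraud, 0 = legitimate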

DummyClassifier

Let’s try a DummyClassifier, which makes predictions without learning any patterns.

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
dummy = DummyClassifier()  # predicts without learning any patterns from the features
cross_val_score(dummy, X_train, y_train).mean()
np.float64(0.9983017327649726)
  • The accuracy looks surprisingly high!
  • Should we be happy with this model and deploy it?

Problem: Class imbalance

y_train.value_counts()
Class
0    139317
1       237
Name: count, dtype: int64
  • In many real-world problems, some classes are much rarer than others.

  • A model that always predicts “no fraud” could still achieve >99% accuracy!

  • This is why accuracy can be misleading in imbalanced datasets.

  • We need metrics that differentiate types of errors.
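
As a quick sanity check (a sketch, assuming the X_train and y_train from the fraud example), the dummy’s accuracy is exactly the proportion of the majority class:

print((y_train == 0).mean())                            # fraction of non-fraud rows, ~0.998

from sklearn.dummy import DummyClassifier
majority_dummy = DummyClassifier(strategy="most_frequent")   # always predicts class 0 here
majority_dummy.fit(X_train, y_train)
print(majority_dummy.score(X_train, y_train))            # same number: accuracy of "never fraud"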

DummyClassifier: Confusion matrix

Which type of error would be most critical for the bank to address: missing a fraud case, or flagging a legitimate transaction as fraud?

LogisticRegression: Confusion matrix

Are we doing better with logistic regression?
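
Confusion matrices like these can be produced along the following lines; this is a sketch rather than the exact course code, and it assumes the X_train and y_train from the fraud example.

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

for name, model in [("dummy", DummyClassifier()),
                    ("logistic regression", LogisticRegression(max_iter=1000))]:
    preds = cross_val_predict(model, X_train, y_train)   # cross-validated predictions
    print(name)
    print(confusion_matrix(y_train, preds))              # rows: true class, columns: predicted class

ConfusionMatrixDisplay.from_estimator can draw the same counts as a plot, given an already-fitted estimator.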

Understanding the confusion matrix

  • TN \(\rightarrow\) True negatives
  • FP \(\rightarrow\) False positives
  • FN \(\rightarrow\) False negatives
  • TP \(\rightarrow\) True positives
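
sklearn’s confusion_matrix uses this layout (rows are true classes, columns are predicted classes), so for a binary 0/1 problem the four counts can be unpacked directly. A tiny self-contained example:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)   # 2 1 1 2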

Practice: confusion matrix terminology

Confusion matrix questions

Imagine a spam filter model where emails labeled 1 = spam, 0 = not spam.

If a spam email is incorrectly classified as not spam, what kind of error is this?

    1. A false positive
    2. A true positive
    3. A false negative
    4. A true negative

Confusion matrix questions

In an intrusion detection system, 1 = intrusion, 0 = safe.

If the system misses an actual intrusion and classifies it as safe, this is a:

    1. A false positive
    2. A true positive
    3. A false negative
    4. A true negative

Confusion matrix questions

In a medical test for a disease, 1 = diseased, 0 = healthy.

If a healthy patient is incorrectly diagnosed as diseased, that’s a:

    1. A false positive
    2. A true positive
    3. A false negative
    4. A true negative

Metrics other than accuracy

Now that we understand the different types of errors, we can explore metrics that better capture model performance when accuracy falls short, especially for imbalanced datasets.

We’ll start with three key ones:

  • Precision
  • Recall
  • F1-score

Precision and recall

Let’s revisit our fraud detection scenario. The circle below represents all transactions predicted as fraud by an imaginary toy model designed to detect fraudulent activity.

Intuition behind the two metrics

  • Precision: Of all the transactions predicted as fraud, how many were actually fraud?
    • High precision \(\rightarrow\) few false alarms (low false positives).
  • Recall: Of all the actual fraud cases, how many did the model catch?
    • High recall \(\rightarrow\) few missed frauds (low false negatives).
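
In terms of the confusion matrix counts:

\[ \text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \]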

Trade-off between precision and recall

  • Increasing recall often decreases precision, and vice versa.
  • Example:
    • Predict “fraud” for every transaction \(\rightarrow\) perfect recall, terrible precision.
    • Predict “fraud” only when 100% sure \(\rightarrow\) high precision, low recall.

The right balance depends on the application and cost of errors.

F1-score

  • Sometimes, we want a single metric that balances precision and recall.
  • The F1-score is the harmonic mean of the two:

\[ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

  • High F1 means both precision and recall are strong.
  • Useful when we care about both false positives and false negatives.
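
A short sketch of computing all three metrics with sklearn, reusing the tiny toy labels from the confusion matrix example so the numbers are easy to verify by hand:

from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]
print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 2/3
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 2/3
print(f1_score(y_true, y_pred))          # harmonic mean of the two, also 2/3 here
print(classification_report(y_true, y_pred))   # all of the above, per class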

Summary

Metric | What it measures | High value means
Accuracy | Overall correctness | Model gets most predictions right
Precision | Quality of positive predictions | Few false alarms
Recall | Quantity of true positives caught | Few missed positives
F1-score | Balance of precision & recall | Both precision and recall are high

iClicker Exercise 9.1

Select all of the following statements which are TRUE.

    1. In medical diagnosis, false positives are more damaging than false negatives (assume “positive” means the person has a disease, “negative” means they don’t).
    2. In spam classification, false positives are more damaging than false negatives (assume “positive” means the email is spam, “negative” means it’s not).
    3. If method A gets a higher accuracy than method B, that means its precision is also higher.
    4. If method A gets a higher accuracy than method B, that means its recall is also higher.

Counter examples

Method A - higher accuracy but lower precision

Negative Positive
90 5
5 0

Method B - lower accuracy but higher precision

Negative Positive
80 15
0 5
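
Working through the numbers (assuming rows are the actual classes and columns are the predicted classes):

  • Method A: accuracy = (90 + 0) / 100 = 0.90, but precision = 0 / (0 + 5) = 0.
  • Method B: accuracy = (80 + 5) / 100 = 0.85, but precision = 5 / (5 + 15) = 0.25.

So A wins on accuracy while B wins on precision.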

Takeaway

  • Accuracy summarizes overall correctness but hides class-specific behaviour.
  • You can have high accuracy but poor precision or recall,
    especially in imbalanced datasets.
  • Always check multiple metrics before deciding which model is better.

Threshold-based classification

Predicting with logistic regression

  • Most classification models don’t directly predict labels. They predict scores or probabilities.
  • To get a label (e.g., “fraud” or “non fraud”), we choose a threshold (often 0.5). If the threshold changes, predictions change, and so do the errors.
  • What happens to precision and recall if we change the probability threshold?
  • Play with classification thresholds
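
A sketch of thresholding by hand; it assumes the fraud-detection X_train and y_train from earlier, and the names lr and fraud_probs are introduced here just for illustration.

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
fraud_probs = lr.predict_proba(X_train)[:, 1]      # predicted probability of fraud

preds_default = (fraud_probs >= 0.5).astype(int)   # what predict() does for this binary case
preds_lenient = (fraud_probs >= 0.1).astype(int)   # lower threshold: more flagged transactions,
                                                   # higher recall, lower precision
print(preds_default.sum(), preds_lenient.sum())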

PR curve

  • Compute precision and recall (recall is the same as the true positive rate, TPR) at every possible threshold and plot them against each other.
  • Top left \(\rightarrow\) very high threshold (strict model = high precision).
  • Bottom right \(\rightarrow\) very low threshold (lenient model = high recall).

PR curve different thresholds

  • Which of the red dots are reasonable trade-offs?

Average Precision (AP) Score

  • The AP score summarizes the PR curve by calculating the area under it.
  • It measures the ranking ability of a model: how well it assigns higher probabilities to positive examples than to negative ones, regardless of any specific threshold.
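
A sketch of computing the PR curve and AP with sklearn, assuming y_train and the fraud_probs scores from the thresholding sketch above:

from sklearn.metrics import precision_recall_curve, average_precision_score, PrecisionRecallDisplay

precision, recall, thresholds = precision_recall_curve(y_train, fraud_probs)
print(average_precision_score(y_train, fraud_probs))            # summarizes the whole PR curve
PrecisionRecallDisplay.from_predictions(y_train, fraud_probs)   # plots the curve in one call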

iClicker Exercise

Choose the appropriate evaluation metric for the following scenarios:

Scenario 1: Balance between precision and recall for a threshold.

Scenario 2: Assess performance across all thresholds.

    1. F1 for Scenario 1, AP for Scenario 2
    2. AP for Scenario 1, F1 for Scenario 2
    3. AP for both
    4. F1 for both

iClicker Exercise 9.2

Select all of the following statements which are TRUE.

    1. If we increase the classification threshold, both true and false positives are likely to decrease.
    2. If we increase the classification threshold, both true and false negatives are likely to decrease.
    3. Lowering the classification threshold generally increases the model’s recall.
    4. Raising the classification threshold can improve the precision of the model if it effectively reduces the number of false positives without significantly affecting true positives.

ROC Curve

  • Compute the True Positive Rate (TPR) and False Positive Rate (FPR) at every possible threshold, and plot TPR vs FPR.
  • How well does the model separate positive and negative classes in terms of predicted probability?
  • A good choice when the dataset is reasonably balanced; with extreme class imbalance (e.g., fraud detection, rare disease diagnosis), the PR curve is often more informative.
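
The ROC/AUC counterparts, under the same assumptions (y_train and fraud_probs from the thresholding sketch):

from sklearn.metrics import roc_curve, roc_auc_score, RocCurveDisplay

fpr, tpr, thresholds = roc_curve(y_train, fraud_probs)
print(roc_auc_score(y_train, fraud_probs))               # area under the ROC curve
RocCurveDisplay.from_predictions(y_train, fraud_probs)   # plots TPR vs FPR in one call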

ROC Curve example

  • Bottom-left \(\rightarrow\) very high threshold (almost everything predicted negative: low recall, low FPR).
  • Top-right \(\rightarrow\) very low threshold (almost everything predicted positive: high recall, high FPR).

AUC

  • The area under the ROC curve (AUC) represents the probability that the model, if given a randomly chosen positive and negative example, will rank the positive higher than the negative.

ROC AUC questions

Consider the points A, B, and C in the following diagram, each representing a threshold. Which threshold would you pick in each scenario?

    1. If false positives (false alarms) are highly costly
    2. If false positives are cheap and false negatives (missed true positives) highly costly
    3. If the costs are roughly equivalent

Source

What did we learn?

  • Why accuracy is not always a good metric
  • Confusion matrix
  • Precision, recall, & F1-score
  • Precision-recall curves & average precision
  • Receiver Operating Characteristic (ROC) curves & AUC