CPSC 330 Lecture 8: Hyperparameter Optimization

Varada Kolhatkar

Focus on the breath!

Check-in

How are you feeling today?

    1. Things are more or less under control!
    2. Excited about hyperparameter optimization!
    3. Lost 😞
    4. Tired / Sleepy 😴
    5. Secretly thinking of lunch 🥗

Announcements

  • Important information about midterm 1
    • https://piazza.com/class/mekbcze4gyber/post/162
  • HW3 was due on Monday, Sept 29th 11:59 pm.
  • HW4 has been released

Recap: iClicker Logistic Regression 1

Which of the following are True?

    1. Logistic regression can be used for binary as well as multi-class classification tasks.
    2. Logistic regression computes a weighted sum of features and applies the sigmoid function.
    3. The sigmoid function ensures outputs between 0 and 1, interpreted as probabilities.
    4. The decision boundary in logistic regression is linear, even though the sigmoid is applied.
    5. When the weighted sum is 0, \(\hat{p}\) = 0.5.

Recap: iClicker Logistic Regression 1

Which of the following are True?

    1. Logistic regression coefficients always have to be positive.
    2. Larger coefficients (in absolute value) indicate stronger feature influence on the prediction.
    3. For \(d\) features, the decision boundary is a \(d-1\) dimensional hyperplane.
    4. In sklearn, a very small C value shrinks the coefficients, often leading to underfitting.
    5. A larger C value allows larger coefficients and a more complex model.

Recap: Logistic regression

  • A linear model used for binary classification tasks.
    • (Optional) There is an extension of logistic regression called multinomial logistic regression for multiclass classification.
  • Parameters:
    • Coefficients (Weights): The model learns a coefficient or a weight associated with each feature that represents its importance.
    • Bias (Intercept): A constant term added to the linear combination of features and their coefficients.

Recap: Logistic regression

  • The model computes a weighted sum of the input features’ values, adjusted by their respective coefficients and the bias term.
  • This weighted sum is passed through a sigmoid function to transform it into a probability score, indicating the likelihood of the input belonging to the “positive” class (see the numeric sketch below).

\[ \hat{p} = \sigma\left(\sum_{i=1}^d w_i x_i + b\right) \]

  • \(\hat{p}\) is the predicted probability of the example belonging to the positive class.
  • \(\sigma\) is the sigmoid function
  • \(w_i\) is the learned weight (coefficient) associated with feature \(i\)
  • \(x_i\) is the value of input feature \(i\)
  • \(b\) is the bias term
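
To make the formula concrete, here is a minimal numeric sketch with made-up weights and feature values (illustrative assumptions, not the lecture data):

import numpy as np

def sigmoid(z):
    # squashes any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

w = np.array([0.5, -1.2, 0.8])   # made-up learned coefficients for d = 3 features
b = 0.1                          # made-up bias (intercept)
x = np.array([1.0, 0.3, 2.0])    # feature values of one example

z = np.dot(w, x) + b             # weighted sum of features plus bias
p_hat = sigmoid(z)               # predicted probability of the positive class
print(round(z, 2), round(p_hat, 2))   # 1.84 0.86; note that sigmoid(0) = 0.5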

Recap: Logistic regression

  • For a dataset with \(d\) features, the decision boundary that separates the classes is a \(d-1\) dimensional hyperplane.
  • Complexity hyperparameter: C in sklearn (see the sketch below).
    • Higher C \(\rightarrow\) a more complex model, i.e., larger coefficients
    • Lower C \(\rightarrow\) a less complex model, i.e., smaller coefficients
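
A small sketch of this effect on a synthetic dataset (the dataset and numbers here are illustrative assumptions, not the lecture's spam data):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=123)

for C in [0.01, 1.0, 100.0]:
    lr = LogisticRegression(C=C).fit(X_toy, y_toy)
    # smaller C -> stronger regularization -> smaller coefficients (simpler model)
    print(C, np.round(np.abs(lr.coef_).max(), 3))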

Data

sms_df = pd.read_csv(DATA_DIR + "spam.csv", encoding="latin-1")
sms_df = sms_df.drop(columns = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
sms_df = sms_df.rename(columns={"v1": "target", "v2": "sms"})
train_df, test_df = train_test_split(sms_df, test_size=0.10, random_state=42)
X_train, y_train = train_df["sms"], train_df["target"]
X_test, y_test = test_df["sms"], test_df["target"]
train_df.head(4)
target sms
3130 spam LookAtMe!: Thanks for your purchase of a video...
106 ham Aight, I'll hit you up when I get some cash
4697 ham Don no da:)whats you plan?
856 ham Going to take your babe out ?
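
A quick sanity check that is often useful before modelling is the class balance of the target (the exact proportions depend on the split):

y_train.value_counts(normalize=True)  # fraction of "ham" vs. "spam" messages in the training set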

Model building

  • Let’s define a pipeline
pipe_svm = make_pipeline(CountVectorizer(), SVC())
  • Suppose we want to try out different hyperparameter values.
parameters = {
    "max_features": [100, 200, 400],
    "gamma": [0.01, 0.1, 1.0],
    "C": [0.01, 0.1, 1.0],
}
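
Note that when the hyperparameters belong to steps inside a pipeline, sklearn's search objects expect names of the form stepname__parameter (e.g., svc__C rather than C). If unsure, the valid names can be listed from the pipeline itself (a quick check, not part of the original slide):

# all tunable parameter names, e.g. 'countvectorizer__max_features', 'svc__C', 'svc__gamma'
sorted(pipe_svm.get_params().keys())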

Hyperparameter optimization with loops

  • Define a parameter space.
  • Iterate through possible combinations.
  • Evaluate model performance.
  • What are some limitations of this approach? (A sketch of the loop-based approach in code follows below.)
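
A minimal sketch of this loop-based approach for two of the hyperparameters above (gamma and C); a third nested loop would be needed for max_features:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

best_score, best_params = -np.inf, None
for gamma in [0.01, 0.1, 1.0]:                 # iterate through the parameter space
    for C in [0.01, 0.1, 1.0]:
        pipe = make_pipeline(CountVectorizer(), SVC(gamma=gamma, C=C))
        score = cross_val_score(pipe, X_train, y_train, cv=5).mean()
        if score > best_score:                 # keep the best combination seen so far
            best_score, best_params = score, {"gamma": gamma, "C": C}
print(best_params, best_score)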

sklearn methods

  • sklearn provides two main methods for hyperparameter optimization
    • Grid Search
    • Random Search

Grid search example

from sklearn.model_selection import GridSearchCV

pipe_svm = make_pipeline(CountVectorizer(), SVC())

param_grid = {
    "countvectorizer__max_features": [100, 200, 400],
    "svc__gamma": [0.01, 0.1, 1.0],
    "svc__C": [0.01, 0.1, 1.0],
}
grid_search = GridSearchCV(pipe_svm, 
                  param_grid = param_grid, 
                  n_jobs=-1, 
                  return_train_score=True
                 )
grid_search.fit(X_train, y_train)
grid_search.best_score_
np.float64(0.9782606272997375)
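
To see which values won and how all combinations compared, the fitted grid_search object exposes best_params_ and cv_results_ (a quick sketch; the exact winners depend on the data and the grid):

import pandas as pd

print(grid_search.best_params_)   # best hyperparameter combination found on the grid
results = pd.DataFrame(grid_search.cv_results_)
results[["param_svc__C", "param_svc__gamma", "param_countvectorizer__max_features",
         "mean_test_score", "mean_train_score"]].sort_values("mean_test_score", ascending=False).head()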

Random search example

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, randint, uniform

pipe_svm = make_pipeline(CountVectorizer(), SVC())

param_dist = {
    "countvectorizer__max_features": randint(100, 2000),
    "svc__C": uniform(0.1, 1e4),  # alternatively: loguniform(1e-3, 1e3)
    "svc__gamma": loguniform(1e-5, 1e3),
}
random_search = RandomizedSearchCV(pipe_svm,
                  param_distributions = param_dist, 
                  n_iter=10, 
                  n_jobs=-1, 
                  return_train_score=True)

# Carry out the search
random_search.fit(X_train, y_train)
random_search.best_score_
np.float64(0.9818506556179762)
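
Why loguniform for gamma (and often C)? Unlike uniform, it spreads samples evenly across orders of magnitude, which suits hyperparameters whose useful values range from tiny to huge. A quick sketch of the difference (one possible draw):

from scipy.stats import loguniform, uniform

print(loguniform(1e-5, 1e3).rvs(size=5, random_state=42))  # values spread across many orders of magnitude
print(uniform(0.1, 1e4).rvs(size=5, random_state=42))      # values mostly clustered in the thousands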

Optimization bias

Pizza baking competition example

Imagine that you participate in pizza baking competition.

  • Training phase: Collecting recipes and practicing at home
  • Validation phase: Inviting a group of friends for tasting and feedback
  • Test phase (competition day): Serving the judges

Overfitting on the validation set

  • Your friends loved your pineapple pizza.
  • You fine-tune your recipe for the same group of friends, perfecting it for their tastes.

Pineapple Pizza

  • On the competition day, you confidently present your perfected pineapple pizza.
  • Judges are not impressed: “This doesn’t appeal to a broad audience.”

This is similar to reusing the same validation set again and again to perfect the model for it!

Lesson: Overfitting on the validation set

  • You tailored your recipe too closely to your friends’ tastes.
  • They were not representative of the broader audience (the judges).
  • The pizza, while perfect for your validation group, failed to generalize.
  • Over many iterations, the validation set no longer gives an unbiased estimate of performance.
  • That’s why we need a separate test set (like a group of tasters who never influenced your pizza).

Optimization bias

  • Why do we need separate validation and test datasets?

Mitigating optimization bias

  • Cross-validation
  • Ensembles
  • Regularization and choosing a simpler model

(iClicker) Exercise 8.1

Select all of the following statements which are TRUE.

    1. If you get best results at the edges of your parameter grid, it might be a good idea to adjust the range of values in your parameter grid.
    2. Grid search is guaranteed to find the best hyperparameter values.
    3. It is possible to get different hyperparameters in different runs of RandomizedSearchCV.

Questions for you

  • You have a dataset and you give me 1/10th of it. The dataset given to me is rather small, so I split it into a 96% train and 4% validation split. I carry out hyperparameter optimization using this single 4% validation split and report a validation accuracy of 0.97. Would my model classify the rest of your data with similar accuracy?
    • Probably
    • Probably not

Questions for class discussion

  • Suppose you have 10 hyperparameters, each with 4 possible values. If you run GridSearchCV with this parameter grid, how many cross-validation experiments will be carried out?
  • Suppose you have 10 hyperparameters and each takes 4 values. If you run RandomizedSearchCV with this parameter grid with n_iter=20, how many cross-validation experiments will be carried out?
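
A back-of-the-envelope calculation, assuming the default cv=5 folds in both searches (just the arithmetic, as a sketch):

n_folds = 5                      # sklearn's default number of CV folds
grid_combos = 4 ** 10            # 10 hyperparameters x 4 values each = 1,048,576 combinations
random_combos = 20               # n_iter=20 sampled combinations
print(grid_combos * n_folds, random_combos * n_folds)   # 5,242,880 fits vs. 100 fits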

Class Demo