Lecture 4: \(k\)-nearest neighbours and SVM RBFs

Varada Kolhatkar

Announcements

  • hw2 is due tonight.
  • Syllabus quiz due date is September 19th, 11:59 pm.
  • Homework 3 (hw3) has been released (Due: Sept 29th, 11:59 pm)
    • You can work in pairs for this assignment.
  • If you were on the waitlist, you should now know your course enrollment status.
  • The lecture notes here align with the content presented in the videos. Even though we do not cover all the content from these notebooks during lectures, it’s your responsibility to go through them on your own.

Recap: iClicker overfitting 1

Which of the following scenarios do NOT necessarily imply overfitting?

    1. Training accuracy is very high (0.98) while validation accuracy is much lower (0.60).
    1. In a wildlife classifier, the model predicts “wolf” whenever there”s snow in the background, because all wolf photos were taken in snowy regions.
    1. The decision boundary of a classifier is wiggly and highly irregular.
    1. Training and validation accuracies are both approximately 0.88.
    1. A cancer detection model learns that “a ruler in the corner of the X-ray” means positive, because doctors tended to measure suspicious cases.

Recap: iClicker overfitting 2

Which of the following statements about overfitting is true?

    1. Overfitting makes the model more accurate on both training and unseen data.
    1. Overfitting means the model captures noise or irrelevant details from the training data.
    1. Overfitting is desirable because it reduces both training and test error.
    1. In real-world problems, models are always at risk of overfitting if not properly validated.

Recap: iClicker underfitting

How might one address the issue of underfitting in a machine learning model.

    1. Introduce more noise to the training data.
    1. Remove features that might be relevant to the prediction.
    1. Increase the model’s complexity (e.g., more parameters, features, or deeper trees)
    1. Use a smaller dataset for training.

The fundamental tradeoff

  • As you increase the model complexity, training score tends to go up and the gap between train and validation scores tends to go up.
  • How to pick a model?

iClicker 4.1

Select all of the following statements which are TRUE.

    1. Analogy-based models find examples from the test set that are most similar to the query example we are predicting.
    1. Euclidean distance will always have a non-negative value.
    1. With \(k\)-NN, setting the hyperparameter \(k\) to larger values typically reduces training error.
    1. Similar to decision trees, \(k\)-NNs finds a small set of good features.
    1. In \(k\)-NN, with \(k > 1\), the classification of the closest neighbour to the test example always contributes the most to the prediction.

iClicker 4.2

Select all of the following statements which are TRUE.

    1. \(k\)-NN may perform poorly in high-dimensional space (say, d > 1000).
    1. In sklearn’s SVC classifier, large values of gamma tend to result in higher training score but probably lower validation score.
    1. If we increase both gamma and C, we can’t be certain if the model becomes more complex or less complex.

Similarity-based algorithms

  • Use similarity or distance metrics to predict targets.
  • Examples: \(k\)-nearest neighbors, Support Vector Machines (SVMs) with RBF Kernel.

\(k\)-nearest neighbours

  • Classifies an object based on the majority label among its \(k\) closest neighbors.
  • Main hyperparameter: \(k\) or n_neighbors in sklearn
  • Distance Metrics: Euclidean
  • Strengths: simple and intuitive, can learn complex decision boundaries
  • Challenges: Sensitive to the choice of distance metric and scaling (coming up).

Curse of dimensionality

  • As dimensionality increases, the volume of the space increases exponentially, making the data sparse.
  • Distance metrics lose meaning
    • Accidental similarity swamps out meaningful similarity
    • All points become almost equidistant.
  • Overfitting becomes likely: Harder to generalize with high-dimensional data.
  • How to deal with this?
    • Dimensionality reduction (PCA) (not covered in this course)
    • Feature selection techniques.

SVMs with RBF kernel

  • RBF Kernel: Radial Basis Function, a way to transform data into higher dimensions implicitly.
  • Strengths
    • Effective in high-dimensional and sparse data
    • Good performance on non-linear problems.
  • Hyperparameters:
    • \(C\): Regularization parameter (trade-off between correct classification of training examples and maximization of the decision margin).
    • gamma (\(\gamma\)): controls how fast the similarity decays with distance

Intuition of C and gamma in SVM RBF

  • C (Regularization): Controls the trade-off between perfect training accuracy and having a simpler decision boundary.
    • High C: Strict, complex boundary (overfitting risk).
    • Low C: More errors allowed, smoother boundary (generalizes better).
  • Gamma (Kernel Width): Controls the influence of individual data points.
    • High Gamma: Points have local impact, complex boundary.
    • Low Gamma: Points affect broader areas, smoother boundary.
  • Key trade-off: Proper balance between C and gamma is crucial for avoiding overfitting or underfitting.

Class demo

Models

Supervised models we have seen

  • Decision trees: Split data into subsets based on feature values to create decision rules
  • \(k\)-NNs: Classify based on the majority vote from \(k\) nearest neighbors
  • SVM RBFs: Create a boundary using an RBF kernel to separate classes

Comparison of models (activity)

Model Parameters and hyperparameters Strengths Weaknesses
Decision Trees
KNNs
SVM RBF