Lecture 2: Terminology, Baselines, Decision Trees

Varada Kolhatkar

🎯 Learning Outcomes

By the end of this lesson, you will be able to:

Announcements

  • Things due this week
    • Homework 1 (hw1): Due Sept 09 11:59pm
  • Homework 2 (hw2) has been released (Due: Sept 15, 11:59pm)
    • There is some autograding in this homework.
  • You can find the tentative due dates for all deliverables here.
  • Please monitor Piazza (especially pinned posts and instructor posts) for announcements.
  • I’ll assume that you’ve watched the pre-lecture videos.

Recap: What is ML?

  • ML uses data to build models that find patterns, make predictions, or generate content.
  • It helps computers learn from data to make decisions.
  • No one model works for every situation.

iClicker 2.1: ML or not

Select all of the following statements which are suitable problems for machine learning.

    1. Identifying objects within digital images, such as facial recognition in security systems or categorizing images based on content.
    2. Determining if individuals meet the necessary criteria for government or financial services based on strict guidelines.
    3. Identifying unusual patterns that may indicate fraudulent transactions in banking and finance.
    4. Automatically analyzing images from MRIs, CT scans, or X-rays to detect abnormalities like tumors or fractures.
    5. Addressing mental health issues where human empathy, understanding, and adaptability are key.

Therapists using ChatGPT secretly 😔

Recap: When is ML suitable?

  • ML excels when the problem involves identifying complex patterns or relationships in large datasets that are difficult for humans to discern manually.
  • Rule-based systems are suitable where clear and deterministic rules can be defined. Good for structured decision making.
  • Human experts are good with problems which require deep contextual understanding, ethical judgment, creative input, or emotional intelligence.

Recap: Supervised learning

  • We wish to find a model function f that relates X to y.
  • We use the model function to predict targets of new examples.

In the first part of this course, we’ll focus on supervised machine learning.

iClicker 2.2: Supervised vs unsupervised

iClicker cloud join link:

Select all of the following statements which are examples of supervised machine learning

    1. Finding groups of similar properties in a real estate data set.
    2. Predicting whether someone will have a heart attack or not on the basis of demographic, diet, and clinical measurement.
    3. Grouping articles on different topics from different news sources (something like the Google News app).
    4. Detecting credit card fraud based on examples of fraudulent and non-fraudulent transactions.
    5. Given some measure of employee performance, identify the key factors which are likely to influence their performance.

iClicker 2.3: Classification vs. Regression

iClicker cloud join link:

Select all of the following statements which are examples of regression problems

    1. Predicting the price of a house based on features such as number of bedrooms and the year built.
    2. Predicting if a house will sell or not based on features like the price of the house, number of rooms, etc.
    3. Predicting percentage grade in CPSC 330 based on past grades.
    4. Predicting whether you should bicycle tomorrow or not based on the weather forecast.
    5. Predicting appropriate thermostat temperature based on the wind speed and the number of people in a room.

Today’s focus

  • ML Terminology
  • Using sklearn to build a simple supervised ML model
  • Intuition of Decision Trees

Framework

Running example

Imagine you’re in the fortunate situation where, after graduating, you have a few job offers and need to decide which one to choose. You want to pick the job that will likely make you the happiest. To help with your decision, you collect data from like-minded people.

  • Can you think of relevant features for this problem?

Toy job happiness dataset

Here are the first few rows of a toy dataset.

import pandas as pd # DATA_DIR (the path to the data folder) is assumed to be defined earlier
toy_happiness_df = pd.read_csv(DATA_DIR + 'toy_job_happiness.csv')
toy_happiness_df
   supportive_colleagues  salary  free_coffee  boss_vegan   happy?
0                      0   70000            0           1  Unhappy
1                      1   60000            0           0  Unhappy
2                      1   80000            1           0    Happy
3                      1  110000            0           1    Happy
4                      1  120000            1           0    Happy
5                      1  150000            1           1    Happy
6                      0  150000            1           0  Unhappy

Terminology

Features, target, example

  • What are the features X?
    • features = inputs = predictors = explanatory variables = regressors = independent variables = covariates
  • What’s the target y?
    • target = output = outcome = response variable = dependent variable = labels
  • What is an example?

Classification vs. Regression

  • Is this a classification problem or a regression problem?
   supportive_colleagues  salary  free_coffee  boss_vegan   happy?
0                      0   70000            0           1  Unhappy
1                      1   60000            0           0  Unhappy
2                      1   80000            1           0    Happy
3                      1  110000            0           1    Happy
4                      1  120000            1           0    Happy
5                      1  150000            1           1    Happy
6                      0  150000            1           0  Unhappy

Inference vs. Prediction

  • Inference is about understanding why something happens.
    • Goal: Understanding and quantifying the relationship between variables.
    • Involves estimating the parameters of the underlying distribution and testing hypotheses about these parameters.
    • Example: Why do certain factors influence happiness?
  • Prediction is about determining what will happen.
    • Goal: Accurately predicting the target without necessarily understanding the relationship between variables.
    • Example: Will you be happy in a particular job or not?

Of course these goals are related, and in many situations we need both.

Training

  • In supervised ML, the goal is to learn a function that maps input features (X) to a target (y).
  • The relationship between X and y is often complex, making it difficult to define mathematically.
  • We use algorithms to approximate this complex relationship between X and y.
  • Training is the process of applying an algorithm to learn the best function (or model) that maps X to y.
  • In this course, I’ll help you develop an intuition for how these models work and demonstrate how to use them in a machine learning pipeline.

Separating X and y

  • In order to train a model we need to separate X and y from the dataframe.
X = toy_happiness_df.drop(columns=["happy?"]) # Extract the feature set by removing the target column "happy?"
y = toy_happiness_df["happy?"] # Extract the target variable "happy?"

Baseline

  • Let’s try the simplest algorithm: always predicting the most frequent target!
from sklearn.dummy import DummyClassifier
model = DummyClassifier(strategy="most_frequent") # Initialize the DummyClassifier to always predict the most frequent class
model.fit(X, y) # Train the model on the feature set X and target variable y
toy_happiness_df['dummy_predictions'] = model.predict(X) # Add the predicted values as a new column in the dataframe
toy_happiness_df
   supportive_colleagues  salary  free_coffee  boss_vegan   happy?  dummy_predictions
0                      0   70000            0           1  Unhappy              Happy
1                      1   60000            0           0  Unhappy              Happy
2                      1   80000            1           0    Happy              Happy
3                      1  110000            0           1    Happy              Happy
4                      1  120000            1           0    Happy              Happy
5                      1  150000            1           1    Happy              Happy
6                      0  150000            1           0  Unhappy              Happy
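
To put a number on how well this baseline does, one option is to call score, which returns accuracy for classifiers. The sketch below rebuilds the toy data by hand from the rows shown above instead of reading the CSV; the names toy_df, X_toy, y_toy, and baseline are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy data typed in from the table above (assumption: these 7 rows are the whole dataset).
toy_df = pd.DataFrame({
    "supportive_colleagues": [0, 1, 1, 1, 1, 1, 0],
    "salary": [70000, 60000, 80000, 110000, 120000, 150000, 150000],
    "free_coffee": [0, 0, 1, 0, 1, 1, 1],
    "boss_vegan": [1, 0, 0, 1, 0, 1, 0],
    "happy?": ["Unhappy", "Unhappy", "Happy", "Happy", "Happy", "Happy", "Unhappy"],
})
X_toy = toy_df.drop(columns=["happy?"])
y_toy = toy_df["happy?"]

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

# score() returns accuracy: the fraction of examples whose prediction matches the target.
baseline_accuracy = baseline.score(X_toy, y_toy)
print("Predicted class for every example:", baseline.predict(X_toy)[0])
print(f"Baseline training accuracy: {baseline_accuracy:.3f}")  # 4 of the 7 targets are "Happy"
```

Any model we build later should at least beat this number; otherwise it has learned nothing useful from the features.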

Decision trees

Intuition

  • Decision trees find the “best” way to split data to make predictions.
  • Each split is based on a question, like ‘Are the colleagues supportive?’
  • The goal is to group data by similar outcomes at each step.

Decision trees intuition

  • What would be the most effective question to ask in order to split the data in our toy example?
  • How many possible questions could we ask in this context?
   supportive_colleagues  salary  free_coffee  boss_vegan   happy?
0                      0   70000            0           1  Unhappy
1                      1   60000            0           0  Unhappy
2                      1   80000            1           0    Happy
3                      1  110000            0           1    Happy
4                      1  120000            1           0    Happy
5                      1  150000            1           1    Happy
6                      0  150000            1           0  Unhappy
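
One way to count the possible questions is sketched below. It assumes every question has the form "feature <= threshold" and that the only thresholds worth trying lie halfway between consecutive distinct values of a feature. This is a common convention, not the only one. The feature values are typed in from the table above, and the names X_toy and candidate_questions are illustrative.

```python
import pandas as pd

# Feature columns typed in from the table above.
X_toy = pd.DataFrame({
    "supportive_colleagues": [0, 1, 1, 1, 1, 1, 0],
    "salary": [70000, 60000, 80000, 110000, 120000, 150000, 150000],
    "free_coffee": [0, 0, 1, 0, 1, 1, 1],
    "boss_vegan": [1, 0, 0, 1, 0, 1, 0],
})

candidate_questions = []
for feature in X_toy.columns:
    distinct_values = sorted(X_toy[feature].unique().tolist())
    # One candidate threshold halfway between each pair of neighbouring distinct values.
    for low, high in zip(distinct_values[:-1], distinct_values[1:]):
        candidate_questions.append((feature, (low + high) / 2))

for feature, threshold in candidate_questions:
    print(f"Is {feature} <= {threshold}?")
print("Number of candidate questions:", len(candidate_questions))
```

Under these assumptions each binary feature contributes one question, and salary contributes one question per gap between its distinct values.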

Training (high level)

  • Decision tree learning is a search process to find the “best” tree among many possible ones.
  • We evaluate questions using measures like information gain or the Gini index to find the most effective split.
  • At each step, we aim to split the data into groups with more certainty in their outcomes.
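
To make "more certainty in their outcomes" concrete, here is a minimal sketch that scores a few candidate questions on the toy data with Gini impurity (1 minus the sum of squared class proportions), averaged over the two resulting groups and weighted by group size. Lower is better. The helper names gini and weighted_gini are made up for this illustration, and the data is typed in from the table shown earlier.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "supportive_colleagues": [0, 1, 1, 1, 1, 1, 0],
    "salary": [70000, 60000, 80000, 110000, 120000, 150000, 150000],
    "free_coffee": [0, 0, 1, 0, 1, 1, 1],
    "boss_vegan": [1, 0, 0, 1, 0, 1, 0],
    "happy?": ["Unhappy", "Unhappy", "Happy", "Happy", "Happy", "Happy", "Unhappy"],
})

def gini(labels):
    """Gini impurity of a group: 0 means all labels in the group agree."""
    proportions = labels.value_counts(normalize=True)
    return 1.0 - (proportions ** 2).sum()

def weighted_gini(df, mask, target="happy?"):
    """Size-weighted average impurity of the 'yes' and 'no' groups of a question."""
    yes_group, no_group = df.loc[mask, target], df.loc[~mask, target]
    n = len(df)
    return len(yes_group) / n * gini(yes_group) + len(no_group) / n * gini(no_group)

questions = {
    "supportive_colleagues <= 0.5": toy_df["supportive_colleagues"] <= 0.5,
    "salary <= 75000": toy_df["salary"] <= 75000,
    "free_coffee <= 0.5": toy_df["free_coffee"] <= 0.5,
    "boss_vegan <= 0.5": toy_df["boss_vegan"] <= 0.5,
}
scores = {question: weighted_gini(toy_df, mask) for question, mask in questions.items()}

print(f"Impurity before any split: {gini(toy_df['happy?']):.3f}")
for question, score in scores.items():
    print(f"{question}: weighted impurity after split = {score:.3f}")
```

On this toy data the supportive_colleagues and salary questions tie for the lowest weighted impurity, so either is an equally good first split by this measure.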

Decision tree with sklearn

Let’s train a simple decision tree on our toy dataset using sklearn.

from sklearn.tree import DecisionTreeClassifier # import the classifier
from sklearn.tree import plot_tree

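model = DecisionTreeClassifier(max_depth=2, random_state=1) # Create a classifier object
model.fit(X, y) # Learn the splits from the features X and the target y
# class_names must follow the sorted order of the labels ("Happy" comes before "Unhappy")
plot_tree(model, filled=True, feature_names=X.columns.tolist(), class_names=["Happy", "Unhappy"], impurity=False, fontsize=12);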

Prediction

  • Given a new example, how does a decision tree predict the class of this example?
  • What would be the prediction for the example below using the tree above?
    • supportive_colleagues = 1, salary = 60000, free_coffee = 0, boss_vegan = 1

Prediction with sklearn

  • What would be the prediction for the example below using the tree above?
    • supportive_colleagues = 1, salary = 60000, free_coffee = 0, boss_vegan = 1
test_example = pd.DataFrame([[1, 60000, 0, 1]], columns=X.columns) # Same column names and order as X
print("Model prediction: ", model.predict(test_example))
plot_tree(model, filled=True, feature_names=X.columns.tolist(), class_names=["Happy", "Unhappy"], impurity=False, fontsize=9);
Model prediction:  ['Unhappy']
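
If the plot is hard to read, the same routing can be inspected as text. The sketch below refits a tree with the same settings on a hand-typed copy of the toy data, prints the learned questions with export_text, and uses decision_path to list the nodes the new example passes through. The names toy_df, X_toy, y_toy, tree, and new_example are illustrative.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

toy_df = pd.DataFrame({
    "supportive_colleagues": [0, 1, 1, 1, 1, 1, 0],
    "salary": [70000, 60000, 80000, 110000, 120000, 150000, 150000],
    "free_coffee": [0, 0, 1, 0, 1, 1, 1],
    "boss_vegan": [1, 0, 0, 1, 0, 1, 0],
    "happy?": ["Unhappy", "Unhappy", "Happy", "Happy", "Happy", "Happy", "Unhappy"],
})
X_toy = toy_df.drop(columns=["happy?"])
y_toy = toy_df["happy?"]

tree = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X_toy, y_toy)

# Print the learned questions as indented text rules.
print(export_text(tree, feature_names=list(X_toy.columns)))

new_example = pd.DataFrame([[1, 60000, 0, 1]], columns=X_toy.columns)
prediction = tree.predict(new_example)[0]

# decision_path records the nodes visited on the way from the root (node 0) to a leaf.
visited_nodes = tree.decision_path(new_example).indices
print("Visited node ids:", list(visited_nodes))
print("Prediction:", prediction)
```

The prediction is the majority class of the training examples that ended up in the leaf where the new example lands.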

Parameters vs. Hyperparameters

  • Parameters
    • The questions (features and thresholds) used to split the data at each node.
    • Example: salary <= 75000 at the root node
  • Hyperparameters
    • Settings that control tree growth, like max_depth, which limits how deep the tree can go.
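
The distinction can be seen directly on a fitted sklearn tree. In this sketch the hyperparameters are read with get_params before training, and the learned questions are read from the tree_ attribute after training. The data is typed in from the toy table, and the names tree and learned_questions are illustrative.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

toy_df = pd.DataFrame({
    "supportive_colleagues": [0, 1, 1, 1, 1, 1, 0],
    "salary": [70000, 60000, 80000, 110000, 120000, 150000, 150000],
    "free_coffee": [0, 0, 1, 0, 1, 1, 1],
    "boss_vegan": [1, 0, 0, 1, 0, 1, 0],
    "happy?": ["Unhappy", "Unhappy", "Happy", "Happy", "Happy", "Happy", "Unhappy"],
})
X_toy = toy_df.drop(columns=["happy?"])
y_toy = toy_df["happy?"]

tree = DecisionTreeClassifier(max_depth=2, random_state=1)

# Hyperparameters: set by us, available before the model has seen any data.
print("Hyperparameter max_depth:", tree.get_params()["max_depth"])

tree.fit(X_toy, y_toy)

# Parameters: the feature and threshold at each internal node, learned during fit.
fitted = tree.tree_
learned_questions = []
for node_id in range(fitted.node_count):
    if fitted.children_left[node_id] != -1:  # -1 marks a leaf, which asks no question
        feature_name = X_toy.columns[fitted.feature[node_id]]
        learned_questions.append((node_id, feature_name, float(fitted.threshold[node_id])))

for node_id, feature_name, threshold in learned_questions:
    print(f"Learned at node {node_id}: {feature_name} <= {threshold}")
print("Depth actually reached:", tree.get_depth())
```

Changing max_depth and refitting generally changes which questions are learned, which is why hyperparameters have to be chosen before training.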

Decision boundary with max_depth=1

Decision boundary with max_depth=2
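
One way to see the effect of max_depth without a plot is to compare training accuracy for the two depths. This is only a sketch on a hand-typed copy of the toy data; the names X_toy, y_toy, and train_accuracy are illustrative.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

toy_df = pd.DataFrame({
    "supportive_colleagues": [0, 1, 1, 1, 1, 1, 0],
    "salary": [70000, 60000, 80000, 110000, 120000, 150000, 150000],
    "free_coffee": [0, 0, 1, 0, 1, 1, 1],
    "boss_vegan": [1, 0, 0, 1, 0, 1, 0],
    "happy?": ["Unhappy", "Unhappy", "Happy", "Happy", "Happy", "Happy", "Unhappy"],
})
X_toy = toy_df.drop(columns=["happy?"])
y_toy = toy_df["happy?"]

train_accuracy = {}
for depth in [1, 2]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_toy, y_toy)
    # Accuracy on the same data the tree was trained on.
    train_accuracy[depth] = tree.score(X_toy, y_toy)
    print(f"max_depth={depth}: training accuracy = {train_accuracy[depth]:.3f}")
```

A depth-1 tree can ask only one question, so on these seven rows one example remains in a mixed group; a second level of questions separates it. Training accuracy alone does not tell us how either tree would do on new examples.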

iClicker 2.4: Baselines and Decision trees

iClicker cloud join link:

Select all of the following statements which are TRUE.

    1. Change in features (i.e., binarizing features above) would change DummyClassifier predictions.
    2. predict takes only X as argument whereas fit and score take both X and y as arguments.
    3. For the decision tree algorithm to work, the feature values must be binary.
    4. The prediction in a decision tree works by routing the example from the root to the leaf.

Summary

  • Terminology
  • sklearn basic steps
  • Decision tree intuition