Lecture 1: Introduction to CPSC 330

Varada Kolhatkar

Learning outcomes

From this lecture, you will be able to

  • Explain the motivation behind studying machine learning.
  • Briefly describe supervised learning.
  • Differentiate between traditional programming and machine learning.
  • Assess whether a given problem is suitable for a machine learning solution.
  • Navigate through the course material.
  • Be familiar with the policies and how the class is going to run.

QR code of CPSC 330 website


  • Course Jupyter book: https://ubc-cs.github.io/cpsc330-2024W1
  • Course GitHub repository: https://github.com/UBC-CS/cpsc330-2024W1

🤝 Introductions 🤝

Meet your instructor

  • Varada Kolhatkar [ʋəɾəda kɔːlɦəʈkər]
  • You can call me Varada, V, or Ada.
  • I am an Assistant Professor of Teaching in the Department of Computer Science.
  • I did my Ph.D. in Computational Linguistics at the University of Toronto.
  • I primarily teach machine learning courses in the Master of Data Science (MDS) program.
  • Contact information
    • Email: kvarada@cs.ubc.ca
    • Office: ICCS 237

Meet Eva (a fictitious persona)!

Eva is one of you. She has some experience in Python programming. She knows machine learning as a buzzword. During her recent internship, she developed some interest and curiosity in the field. She wants to learn what it is and how to use it. She is a curious person and usually has a lot of questions!

You all

  • Introduce yourself to your neighbour.
  • Since we’re going to spend the semester with each other, I would like to know you a bit better.
  • Please fill out the “Getting to know you” survey when you get a chance.

Asking questions during class

You are welcome to ask questions by raising your hand. There is also a reflection Google Document for this course for your questions/comments/reflections. It would be great if you could write about your takeaways, struggle points, and general comments in this document so that I can try to address those points in the next lecture.

Activity 1: https://shorturl.at/CteOU


  • Write your answers to the questions below in this Google doc: https://shorturl.at/CteOU

  • What do you know about machine learning?

  • What would you like to get out of this course?

  • Are there any particular topics or aspects of this course that you are especially excited or anxious about? Why?

What is Machine Learning (ML)?

Spam prediction

  • Suppose you are given some data with labeled spam and non-spam messages:
import pandas as pd
from sklearn.model_selection import train_test_split

# DATA_DIR is assumed to be defined earlier and to point at the folder containing spam.csv.
sms_df = pd.read_csv(DATA_DIR + "spam.csv", encoding="latin-1")
sms_df = sms_df.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"])
sms_df = sms_df.rename(columns={"v1": "target", "v2": "sms"})
train_df, test_df = train_test_split(sms_df, test_size=0.10, random_state=42)
target  sms
spam    LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323.
ham     Aight, I'll hit you up when I get some cash
ham     Don no da:)whats you plan?
ham     Going to take your babe out ?
ham     No need lar. Jus testing e phone card. Dunno network not gd i thk. Me waiting 4 my sis 2 finish bathing so i can bathe. Dun disturb u liao u cleaning ur room.

Traditional programming vs. ML

  • Imagine writing a Python program for spam identification, i.e., whether a text message or an email is spam or non-spam.
  • Traditional programming
    • Come up with rules using human understanding of spam messages.
    • Time-consuming, and it is hard to come up with a robust set of rules.
  • Machine learning
    • Collect a large amount of labeled spam and non-spam messages and let the machine learning algorithm figure out the rules.
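
To make the traditional-programming side concrete, here is a minimal sketch of a rule-based spam check. The keyword list, the phone-number pattern, and the function name `is_spam_by_rules` are all made up for illustration; they are not part of the course code.

```python
import re

# Hypothetical hand-written rules; a realistic rule set would need many more.
SPAM_KEYWORDS = {"free", "winner", "prize", "claim", "urgent"}
# Assumed pattern: 11-digit numbers starting with 09, like the one in the spam example.
SUSPICIOUS_NUMBER = re.compile(r"\b09\d{9}\b")


def is_spam_by_rules(message: str) -> bool:
    """Return True if any hand-written rule fires for this message."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    if words & SPAM_KEYWORDS:
        return True
    if SUSPICIOUS_NUMBER.search(message):
        return True
    return False


print(is_spam_by_rules("URGENT! Claim your free prize now"))
print(is_spam_by_rules("Aight, I'll hit you up when I get some cash"))
# Obfuscated spelling slips past the keyword rule.
print(is_spam_by_rules("Fr33 pr1ze waiting 4 u, reply now"))
```

The third message shows why such rules are hard to keep robust: each new trick a spammer uses needs another hand-written rule.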

Let’s train a model

  • There are several packages that help us perform machine learning.
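from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X_train, y_train = train_df["sms"], train_df["target"]
X_test, y_test = test_df["sms"], test_df["target"]
clf = make_pipeline(CountVectorizer(max_features=5000), LogisticRegression(max_iter=5000))
clf.fit(X_train, y_train);  # Training the model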
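
The first step of the pipeline, CountVectorizer, turns each message into a row of word counts. The sketch below imitates that idea in plain Python on two made-up messages. It is only an illustration, not scikit-learn's implementation: the real CountVectorizer also lowercases text, tokenizes differently, and with max_features keeps only the most frequent terms.

```python
from collections import Counter

# Made-up messages, only for illustration.
messages = [
    "call now to claim your prize",
    "call me when you get home",
]

# Vocabulary: every distinct word across all messages, in sorted order.
vocabulary = sorted({word for msg in messages for word in msg.split()})

# One row per message, one column per vocabulary word; values are word counts.
rows = []
for msg in messages:
    counts = Counter(msg.split())
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

The classifier never sees raw text; it only sees numeric rows like these.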

Unseen messages

  • Now use the trained model to predict targets of unseen messages:
sms
3245 Funny fact Nobody teaches volcanoes 2 erupt, tsunamis 2 arise, hurricanes 2 sway aroundn no 1 teaches hw 2 choose a wife Natural disasters just happens
944 I sent my scores to sophas and i had to do secondary application for a few schools. I think if you are thinking of applying, do a research on cost also. Contact joke ogunrinde, her school is one m...
1044 We know someone who you know that fancies you. Call 09058097218 to find out who. POBox 6, LS15HB 150p
2484 Only if you promise your getting out as SOON as you can. And you'll text me in the morning to let me know you made it in ok.

Predicting on unseen data

The model is accurately predicting labels for the unseen text messages above!

  sms spam_predictions
3245 Funny fact Nobody teaches volcanoes 2 erupt, tsunamis 2 arise, hurricanes 2 sway aroundn no 1 teaches hw 2 choose a wife Natural disasters just happens ham
944 I sent my scores to sophas and i had to do secondary application for a few schools. I think if you are thinking of applying, do a research on cost also. Contact joke ogunrinde, her school is one me the less expensive ones ham
1044 We know someone who you know that fancies you. Call 09058097218 to find out who. POBox 6, LS15HB 150p spam
2484 Only if you promise your getting out as SOON as you can. And you'll text me in the morning to let me know you made it in ok. ham
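
For readers without the spam.csv file at hand, the following self-contained sketch repeats the fit-then-predict steps on a handful of made-up messages. The names `toy_messages`, `toy_labels`, and `toy_clf` are hypothetical, and a model trained on four messages is of course not reliable; the point is only to show how `predict` and `predict_proba` are called.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy messages, only for illustration.
toy_messages = [
    "win a free prize call now",
    "free entry claim your cash prize",
    "are you coming home for dinner",
    "see you at the library tomorrow",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
toy_clf.fit(toy_messages, toy_labels)

new_messages = ["claim your free prize", "are you home"]
predictions = toy_clf.predict(new_messages)          # one label per message
probabilities = toy_clf.predict_proba(new_messages)  # one column per class

print(toy_clf.classes_)  # column order used by predict_proba
print(predictions)
print(probabilities.round(2))
```

`predict` returns a single label per message, while `predict_proba` returns a score for every class, which is one way to see how confident the model is.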

A different way to solve problems

Machine learning uses computer programs to model data. It can be used to extract hidden patterns, make predictions in new situations, or generate novel content.

A field of study that gives computers the ability to learn without being explicitly programmed.
– Arthur Samuel (1959)

ML vs. traditional programming

  • With machine learning, you’re likely to
    • Save time
    • Customize and scale products

Prevalence of ML

Let’s look at some examples.

Activity: For what types of problems is ML appropriate? (~5 mins)

Discuss with your neighbour which of the following problems you would use machine learning for:

  • Finding a list of prime numbers up to a limit
  • Given an image, automatically identifying and labeling objects in the image
  • Finding the distance between two nodes in a graph

Types of machine learning

Here are some typical learning problems.

  • Supervised learning (Gmail spam filtering)
    • Training a model from input data and its corresponding targets to predict targets for new examples.
  • Unsupervised learning (Google News)
    • Training a model to find patterns in a dataset, typically an unlabeled dataset.
  • Reinforcement learning (AlphaGo)
    • A family of algorithms for finding suitable actions to take in a given situation in order to maximize a reward.
  • Recommendation systems (Amazon item recommendation system)
    • Predict the “rating” or “preference” a user would give to an item.

What is supervised learning?

  • Training data comprises a set of observations (X) and their corresponding targets (y).
  • We wish to find a model function f that relates X to y.
  • We use the model function to predict targets of new examples.
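
The three points above can be sketched in a few lines of plain Python. The example uses a very simple learner (predict the target of the closest training example) on made-up numbers; the data and the helper name `fit_nearest_neighbour` are assumptions for illustration, and this is not the model used elsewhere in these notes.

```python
# Toy training data: each row of X is (hours_studied, hours_slept),
# and y holds the corresponding targets. All numbers are made up.
X = [(1.0, 4.0), (2.0, 5.0), (8.0, 7.0), (9.0, 8.0)]
y = ["fail", "fail", "pass", "pass"]


def fit_nearest_neighbour(X_train, y_train):
    """Return a model function f that maps a new example to a target."""

    def f(x_new):
        # Squared Euclidean distance from x_new to every training example.
        distances = [
            sum((a - b) ** 2 for a, b in zip(x_new, x_train))
            for x_train in X_train
        ]
        nearest = distances.index(min(distances))
        return y_train[nearest]

    return f


f = fit_nearest_neighbour(X, y)  # "training" produces the model function f
print(f((1.2, 4.1)))             # predict the target of a new example
print(f((8.8, 7.9)))
```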

🤔 Eva’s questions


At this point, Eva is wondering about many questions.

  • How exactly are we “learning” whether a message is spam or ham?
  • Are we expected to get correct predictions for all possible messages? How does it predict the label for a message it has not seen before?
  • What if the model mislabels an unseen example? For instance, what if the model incorrectly predicts a non-spam message as spam? What would be the consequences?
  • How do we measure the success or failure of spam identification?
  • If you want to use this model in the wild, how do you know how reliable it is?
  • Would it be useful to know how confident the model is about the predictions rather than just a yes or a no?

It’s great to think about these questions right now. But Eva has to be patient. By the end of this course you’ll know answers to many of these questions!
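
As a small preview of the measurement question, here is a sketch that compares predicted labels with true labels and counts the two kinds of mistakes. The label lists are made up, and the variable names are hypothetical; proper evaluation comes later in the course.

```python
# Made-up true labels and model predictions for six messages.
y_true = ["ham", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["ham", "spam", "spam", "ham", "ham", "ham"]

pairs = list(zip(y_true, y_pred))

# Fraction of messages whose predicted label matches the true label.
accuracy = sum(t == p for t, p in pairs) / len(pairs)

# A legitimate message flagged as spam (it would land in the spam folder).
false_spam = sum(t == "ham" and p == "spam" for t, p in pairs)
# A spam message that slipped through as ham.
missed_spam = sum(t == "spam" and p == "ham" for t, p in pairs)

print(f"accuracy={accuracy:.2f}, false_spam={false_spam}, missed_spam={missed_spam}")
```

A single accuracy number hides the fact that the two kinds of mistakes can have very different consequences.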

Predicting labels of a given image

  • We can also use machine learning to predict labels of given images using a technique called transfer learning.

                         Class  Probability score
                     tiger cat              0.636
              tabby, tabby cat              0.174
Pembroke, Pembroke Welsh corgi              0.081
               lynx, catamount              0.011
--------------------------------------------------------------

                                     Class  Probability score
         cheetah, chetah, Acinonyx jubatus              0.994
                  leopard, Panthera pardus              0.005
jaguar, panther, Panthera onca, Felis onca              0.001
       snow leopard, ounce, Panthera uncia              0.000
--------------------------------------------------------------

                                   Class  Probability score
                                 macaque              0.885
patas, hussar monkey, Erythrocebus patas              0.062
      proboscis monkey, Nasalis larvatus              0.015
                       titi, titi monkey              0.010
--------------------------------------------------------------

                        Class  Probability score
Walker hound, Walker foxhound              0.582
             English foxhound              0.144
                       beagle              0.068
                  EntleBucher              0.059
--------------------------------------------------------------

Predicting housing prices

Suppose we want to predict housing prices given a number of attributes associated with houses. The target here is continuous and not discrete.

target bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
509000.0 2 1.50 1930 3521 2.0 0 0 3 8 1930 0 1989 0 98007 47.6092 -122.146 1840 3576
675000.0 5 2.75 2570 12906 2.0 0 0 3 8 2570 0 1987 0 98075 47.5814 -122.050 2580 12927
420000.0 3 1.00 1150 5120 1.0 0 0 4 6 800 350 1946 0 98116 47.5588 -122.392 1220 5120
680000.0 8 2.75 2530 4800 2.0 0 0 4 7 1390 1140 1901 0 98112 47.6241 -122.305 1540 4800
357823.0 3 1.50 1240 9196 1.0 0 0 3 8 1240 0 1968 0 98072 47.7562 -122.094 1690 10800

Building a regression model

from lightgbm.sklearn import LGBMRegressor

X_train, y_train = train_df.drop(columns=["target"]), train_df["target"]
X_test, y_test = test_df.drop(columns=["target"]), test_df["target"]

model = LGBMRegressor()
model.fit(X_train, y_train);
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000759 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2333
[LightGBM] [Info] Number of data points in the train set: 17290, number of used features: 18
[LightGBM] [Info] Start training from score 539762.702545

Predicting prices of unseen houses

Predicted_target bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
345831.740542 4 2.25 2130 8078 1.0 0 0 4 7 1380 750 1977 0 98055 47.4482 -122.209 2300 8112
601042.018745 3 2.50 2210 7620 2.0 0 0 3 8 2210 0 1994 0 98052 47.6938 -122.130 1920 7440
311310.186024 4 1.50 1800 9576 1.0 0 0 4 7 1800 0 1977 0 98045 47.4664 -121.747 1370 9576
597555.592401 3 2.50 1580 1321 2.0 0 2 3 8 1080 500 2014 0 98107 47.6688 -122.402 1530 1357

We are predicting continuous values here as opposed to discrete values in the spam vs. ham example.
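
To see what a continuous prediction looks like without installing LightGBM, here is a minimal sketch that fits a straight line to made-up data with ordinary least squares. This is a much simpler model than LGBMRegressor and is not what the notes use; the numbers and the name `predict_price` are assumptions for illustration.

```python
# Made-up toy data: living area in square feet and sale price in dollars.
sqft = [1000.0, 1500.0, 2000.0, 2500.0]
price = [300000.0, 400000.0, 500000.0, 600000.0]

n = len(sqft)
mean_x = sum(sqft) / n
mean_y = sum(price) / n

# Closed-form least-squares fit of price = intercept + slope * sqft.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price)) / sum(
    (x - mean_x) ** 2 for x in sqft
)
intercept = mean_y - slope * mean_x


def predict_price(area_sqft: float) -> float:
    """Predict a price (a real number, not a class label) for a given area."""
    return intercept + slope * area_sqft


print(predict_price(1800.0))
```

Unlike the spam classifier, whose output is one of two labels, this model can output any real number.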

Machine learning workflow

Supervised machine learning is quite flexible; it can be used on a variety of problems and different kinds of data. Here is a typical workflow of a supervised machine learning system.

We will build machine learning pipelines in this course, focusing on some of the steps above.



❓❓ Questions for you

iClicker cloud join link: https://join.iclicker.com/VYFJ

Select all of the following statements which are True (iClicker)

    1. Predicting spam is an example of machine learning.
    2. Predicting housing prices is not an example of machine learning.
    3. For problems such as spelling correction, translation, face recognition, spam identification, if you are a domain expert, it’s usually faster and more scalable to come up with a robust set of rules manually rather than building a machine learning model.
    4. If you are asked to write a program to find all prime numbers up to a limit, it is better to implement one of the algorithms for doing so rather than using machine learning.
    5. Google News is likely using machine learning to organize news.



Surveys

  • Please complete the “Getting to know you” survey on Canvas.
  • Also, please complete the anonymous restaurant survey on Qualtrics here.
    • We will try to analyze this data set in the coming weeks.

About this course

Important

Course website: https://github.com/UBC-CS/cpsc330-2024W1 is the most important link. Please read everything on this GitHub page!

Important

Make sure you go through the syllabus thoroughly and complete the syllabus quiz before Monday, Sept 19th at 11:59pm.

CPSC 330 vs. 340

Read https://github.com/UBC-CS/cpsc330-2024W1/blob/main/docs/330_vs_340.md which explains the difference between the two courses.

TLDR:

  • 340: how do ML models work?
  • 330: how do I use ML models?
  • CPSC 340 has many prerequisites.
  • CPSC 340 goes deeper but has a narrower scope.
  • I think CPSC 330 will be more useful if you just plan to apply basic ML.

Registration, waitlist and prerequisites

Important

Please go through this document carefully before contacting your instructors about these issues. Even then, we are very unlikely to be able to help with registration, waitlist or prerequisite issues.

  • If you are on the waitlist and would like to try your chances, you should be able to access Canvas and Piazza.
  • If you’re unable to make it this time, there will be two sections of this course offered next semester and then again in the summer.

Lecture format

  • In person lectures T/Th.
  • Sometimes there will be videos to watch before lecture. You will find the list of pre-watch videos in the schedule on the course webpage.
  • We will also try to work on some questions and exercises together during the class.
  • All materials will be posted in this GitHub repository.
  • Weekly tutorials will be run by the TAs in an office-hour format and are completely optional.
    • You do not need to be registered in a tutorial.
    • You can attend whichever tutorials or office hours you want, regardless of which tutorial you’re registered in, or whether you’re registered in one at all.

Homework assignments

  • First homework assignment is due this coming Tuesday, September 10, midnight. This is a relatively straightforward assignment on Python. If you struggle with this assignment then that could be a sign that you will struggle later on in the course.
  • You must do the first two homework assignments on your own.

Exams

  • We’ll have two self-scheduled midterms and one final in the Computer-based Testing Facility (CBTF).

Course calendar

Here is our course calendar. Make sure you check it on a regular basis:

https://htmlpreview.github.io/?https://github.com/UBC-CS/cpsc330-2024W1/blob/main/docs/calendar.html

Course structure

  • Introduction
    • Week 1
  • Part I: ML fundamentals, preprocessing, midterm 1
    • Weeks 2, 3, 4, 5, 6, 7, 8
  • Part II: Unsupervised learning, transfer learning, common special cases, midterm 2
    • Weeks 8, 9, 10, 11, 12
  • Part III: Communication and ethics
    • ML skills are not beneficial if you can’t use them responsibly and communicate your results. In this module we’ll talk about these aspects.
    • Weeks 13, 14

Code of conduct

  • Our main forum for getting help will be Piazza.

Important

Please read this entire document about asking for help. TLDR: Be nice.

Homework format: Jupyter notebooks

  • Our notes are created in a Jupyter notebook, with file extension .ipynb.
  • Also, you will complete your homework assignments using Jupyter notebooks.
  • Confusingly, “Jupyter Notebook” is also the name of the original application that opens .ipynb files, which has since been replaced by JupyterLab.
    • I am using JupyterLab; some things might not work with the Jupyter Notebook application.
    • You can also open these files in Visual Studio Code.

Jupyter notebooks

  • Notebooks contain a mix of code, code output, markdown-formatted text (including LaTeX equations), and more.
  • When you open a Jupyter notebook in one of these apps, the document is “live”, meaning you can run the code.

For example:

1 + 1
2
x = [1, 2, 3]
x[0] = 9999
x
[9999, 2, 3]

Jupyter

  • By default, Jupyter prints out the result of the last line of code, so you don’t need as many print statements.
  • In addition to the “live” notebooks, Jupyter notebooks can be statically rendered in the web browser, e.g. this.
    • This can be convenient for quick read-only access, without needing to launch the Jupyter notebook/lab application.
    • But you need to launch the app properly to interact with the notebooks.

Lecture notes

  • All the lectures from last year are available here.
  • We cannot promise anything will stay the same from last year to this year, so read them in advance at your own risk.
  • A “finalized” version will be pushed to GitHub and the Jupyter book right before each class.
  • Each instructor will have slightly adapted versions of notes to present slides during lectures.
  • You will find the link to these slides in our repository: https://github.com/UBC-CS/cpsc330-2024W1/tree/main/lectures/102-Varada-lectures

Grades

  • The grading breakdown is here.
  • The policy on challenging grades is here.

Setting up your computer for the course

Course conda environment

  • Follow the setup instructions here to create a course conda environment on your computer.
  • If you do not have your computer with you, you can partner up with someone and set up your own computer later.

Python requirements/resources

We will primarily use Python in this course.

Here is the basic Python knowledge you’ll need for the course:

  • Basic Python programming
  • NumPy
  • Pandas
  • Basic matplotlib
  • Sparse matrices

Homework 1 is all about Python.

Note

We do not have time to teach all the Python we need but you can find some useful Python resources here.



Checklist for you before the next class