Large Language Models

Learning outcomes

By the end of this module, you will be able to:

Explain language models and large language models
Describe the concept of self-attention
Distinguish between decoder-only, encoder-only, and encoder-decoder models
Apply pre-trained large language models for zero-shot learning

Take a guess

What do you think is the vocabulary size of young adult speakers of American English?

Language models activity

Each of you will receive a sticky note with a word on it at some point. Here’s what you’ll do:

Carefully remove the sticky note to see the word. This word is for your eyes only, so don’t show it to your neighbours!
Think quickly: what word would logically follow the word on the sticky note? Write this next word on a new sticky note.
You have about 20 seconds for this step, so trust your instincts!
Pass your predicted word to the person next to you. Do not pass the word you received from your neighbour forward. Keep the chain going!
Stop after the last person in your row/table has finished.

Markov model of language

You’ve just created a simple Markov model of language!
In predicting the next word from a minimal context, you likely used your linguistic intuition and familiarity with common two-word phrases or collocations.
You could create more coherent sentences by taking more context into account, e.g., the previous two words, four words, or 100 words.

Language model

A language model computes a probability distribution over sequences (of words or characters). Intuitively, this probability tells us how “good” or plausible a sequence of words is.
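
One standard way to make this concrete is the chain rule of probability, which breaks the probability of a whole sequence into a product of next-word probabilities. The symbols w_1, ..., w_n (the words of the sequence) and k (the context length) are notation introduced here for illustration.

```latex
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})

% A Markov model of order k approximates each factor using only the last k words:
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \dots, w_{i-1})
```

With k = 1, each word depends only on the word immediately before it.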

Check out this recent BMO ad.

Smart compose

A common application for predicting the next word is the ‘smart compose’ feature in your emails, text messages, and search engines.

Why should we care about predicting the next word?

Many practical language-related tasks can be cast as word prediction.

Sentiment analysis as word prediction

We can cast sentiment analysis as language modeling by giving a language model a context like:

The sentiment of the sentence “I like machine learning” is:

We then compare the probability of the word “positive” with that of the word “negative”. If “positive” is more probable, we say the sentiment is positive; otherwise, we say it is negative.
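
A minimal sketch of this comparison is shown below. It assumes we already have next-word probabilities for the prompt; the function name `classify_by_next_word` and the numbers in `toy_next_word_probs` are made up for illustration and do not come from a real model.

```python
def classify_by_next_word(next_word_probs, labels):
    """Return the label whose word is rated most probable as the next word.

    next_word_probs: dict mapping candidate next words to probabilities
    (a hypothetical stand-in for a language model's output on the prompt).
    """
    return max(labels, key=lambda label: next_word_probs.get(label, 0.0))


prompt = 'The sentiment of the sentence "I like machine learning" is:'

# Made-up next-word probabilities standing in for a real model's output.
toy_next_word_probs = {"positive": 0.62, "negative": 0.07, "neutral": 0.05}

prediction = classify_by_next_word(toy_next_word_probs, ["positive", "negative"])
print(prompt, prediction)
```

With a real language model, the dictionary would be replaced by the model's predicted distribution over its vocabulary for this prompt.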

Question answering as word prediction

We can cast question answering as language modeling by giving a language model a context like:

Q: Who won the Nobel Prize in 2024 for their work in deep learning? A:

We might expect to see that “Geoffrey” is very likely. If we continue and ask:

Q: Who won the Nobel Prize in 2024 for their work in deep learning? A: Geoffrey

We might expect to see that “Hinton” is very likely.
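
The word-by-word process above can be sketched as a loop that keeps appending the most probable next word (often called greedy decoding). `TOY_MODEL` is a hand-written lookup table with made-up probabilities standing in for a real language model, and `<end>` is a hypothetical stop marker.

```python
QUESTION = "Q: Who won the Nobel Prize in 2024 for their work in deep learning? A:"

# Made-up conditional probabilities: each context maps to a next-word distribution.
TOY_MODEL = {
    QUESTION: {"Geoffrey": 0.6, "The": 0.2},
    QUESTION + " Geoffrey": {"Hinton": 0.9, "<end>": 0.05},
    QUESTION + " Geoffrey Hinton": {"<end>": 0.8, "and": 0.1},
}


def generate(context, model, max_words=10):
    """Greedy decoding: repeatedly append the most probable next word."""
    for _ in range(max_words):
        next_word_probs = model.get(context)
        if next_word_probs is None:
            break  # the toy table has no entry for this context
        next_word = max(next_word_probs, key=next_word_probs.get)
        if next_word == "<end>":
            break  # the model signals that the answer is complete
        context = context + " " + next_word
    return context


answer = generate(QUESTION, TOY_MODEL)
print(answer)
```

Real systems often sample from the distribution instead of always taking the top word, which gives more varied output.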

Text summarization cast as word prediction

Input: Long text such as a full length article
Output: Effective shorter summary of it
We can follow the text of the article with a token like tl;dr (too long; didn’t read).
Since this token has become sufficiently common in recent years, a language model has seen many texts in which it occurs right before a summary. So the model will likely interpret it as an instruction to generate a summary.

A simple model of language

Calculate the co-occurrence frequencies and probabilities based on these frequencies
Predict the next word based on these probabilities
This is a Markov model of language.
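
The two steps above can be sketched as a small bigram model. The corpus is a tiny made-up example, and the function names `train_bigram_model` and `predict_next` are chosen here for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count how often each word follows each other word, then normalise."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for previous, current in zip(words, words[1:]):
        counts[previous][current] += 1
    # Turn raw co-occurrence counts into probabilities P(next | previous).
    return {
        previous: {word: n / sum(followers.values()) for word, n in followers.items()}
        for previous, followers in counts.items()
    }


def predict_next(model, word):
    """Return the most probable next word, or None for an unseen word."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return max(followers, key=followers.get)


# Tiny made-up corpus, for illustration only.
corpus = "i like machine learning . i like deep learning . i like machine translation ."
model = train_bigram_model(corpus)
print(model["like"])
print(predict_next(model, "like"))
```

In this corpus, "like" is followed by "machine" twice and "deep" once, so the model prefers "machine".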

Long-distance dependencies

What are some reasonable predictions for the next word in the sequence?

I am studying law at the University of British Columbia Point Grey campus in Vancouver because I want to work as a ___

A Markov model that conditions only on the previous few words is unable to capture such long-distance dependencies in language.
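
To see how far back the useful clue sits, the sketch below checks what a Markov model with a context of n words would actually see for this sentence. The helper names are made up for illustration, and splitting on whitespace is a simplifying assumption about tokenization.

```python
sentence = (
    "I am studying law at the University of British Columbia Point Grey "
    "campus in Vancouver because I want to work as a"
)
words = sentence.split()


def markov_context(words, n):
    """Return the last n words, which is all an order-n Markov model can see."""
    return words[-n:]


def smallest_context_containing(words, clue):
    """Smallest n for which the order-n context includes the clue word."""
    for n in range(1, len(words) + 1):
        if clue in markov_context(words, n):
            return n
    return None


print(markov_context(words, 1))   # ['a']
print(markov_context(words, 2))   # ['as', 'a']
print(smallest_context_containing(words, "law"))  # 19
```

A model would need a context of at least 19 words to see "law", and the number of distinct contexts to count grows very quickly with n.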

Transformer models

Enter attention and transformer models! Transformer models are at the core of most state-of-the-art generative AI models (e.g., BERT, GPT-3, GPT-4, Gemini, DALL-E, Llama, GitHub Copilot).

Transformer models

Source: GPT-4 Technical Report

Self-attention

An important innovation which makes these models work so well is self-attention.
Count how many times the players wearing white pass the basketball.

Self-attention

When we process information, we often selectively focus on specific parts of the input, giving more attention to relevant information and less attention to irrelevant information. This is the core idea of attention.

Consider the examples below:

Example 1: She left a brief note on the kitchen table, reminding him to pick up groceries.
Example 2: The diplomat’s speech struck a positive note in the peace negotiations.
Example 3: She plucked the guitar strings, ending with a melancholic note.

The word note in these examples has quite distinct meanings, each tied to a different context. To capture varying word meanings across different contexts, we need a mechanism that considers the wider context to compute each word’s contextual representation.

Self-attention is just that mechanism!
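
Below is a minimal sketch of the idea in plain Python, assuming scaled dot-product attention in which the queries, keys, and values are simply the input vectors. Real transformer layers first apply learned projection matrices and use several attention heads; both are left out here. The 2-dimensional embeddings are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def self_attention(vectors):
    """Simplified self-attention with queries = keys = values = the inputs."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this word against every word in the sequence, including itself.
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # The contextual representation is a weighted average of all word vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up 2-dimensional embeddings for three words, for illustration only.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
contextual, weights = self_attention(embeddings)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how much one word attends to every word in the sequence; similar vectors receive larger weights.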

What is an LLM?

A large language model learns knowledge about language and the world from vast amounts of text.
It learns the complexities of language simply by being immersed in it, without any textbook to learn from.
At a high level, training LLMs works as follows:
- we feed the model batches of text
- it tries to predict what comes next
- we check the answers and based on how well it does, the model changes its internal settings (parameters)
- it’s learning and improving
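
The loop above can be sketched with a toy next-word model trained by gradient descent. This is not how a real LLM is implemented: the "model" here is just a table of scores (one row per previous word), the corpus is a made-up sentence pair, and the learning rate and number of passes are arbitrary choices.

```python
import math

# Tiny made-up corpus; a real LLM trains on vastly more text.
text = "the cat sat on the mat . the cat sat on the sofa ."
words = text.split()
vocab = sorted(set(words))
index = {word: i for i, word in enumerate(vocab)}
pairs = [(index[a], index[b]) for a, b in zip(words, words[1:])]

# Parameters: one row of next-word scores (logits) per previous word.
logits = [[0.0] * len(vocab) for _ in vocab]


def softmax(row):
    """Convert a row of scores into a probability distribution."""
    largest = max(row)
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def average_loss():
    """Mean cross-entropy of the true next word under the current parameters."""
    return -sum(math.log(softmax(logits[p])[n]) for p, n in pairs) / len(pairs)


learning_rate = 0.5
losses = [average_loss()]
for epoch in range(20):
    for previous, actual_next in pairs:
        predicted = softmax(logits[previous])  # the model's guess
        for j in range(len(vocab)):
            # Gradient of cross-entropy w.r.t. logit j: predicted - one_hot(actual).
            gradient = predicted[j] - (1.0 if j == actual_next else 0.0)
            logits[previous][j] -= learning_rate * gradient  # adjust parameters
    losses.append(average_loss())

print(round(losses[0], 3), "->", round(losses[-1], 3))
```

The loss printed at the end should be lower than at the start, which is what "learning and improving" means in practice.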

Using LLMs in your applications

There are several Python libraries available which allow us to use pre-trained LLMs in our applications.
- 🤗 Transformers library
- OpenAI GPT
- Haystack
- LangChain
- spacy-transformers
- …

Types of LLMs

If you want to use pre-trained LLMs, it’s useful to know that there are three main types of LLMs.

Feature	Decoder-only (e.g., GPT-3)	Encoder-only (e.g., BERT, RoBERTa)	Encoder-decoder (e.g., T5, BART)
Output Computation is based on	Information earlier in the context	Entire context (bidirectional)	Encoded input context
Text Generation	Can naturally generate text completion	Cannot directly generate text	Can generate outputs naturally
Example	Our ML workshop audience is ___	Our ML workshop audience is the best! → positive	Input: Translate to Mandarin: Long but productive day! Output: 漫长而富有成效的一天！
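
One common way the "Output computation is based on" row shows up in practice is the attention mask. The sketch below builds the two basic masks, where 1 means "this position may look at that position"; the function names are made up for illustration.

```python
def causal_mask(n):
    """Decoder-only: position i may look at positions 0..i (earlier context)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder-only: every position may look at the entire context."""
    return [[1] * n for _ in range(n)]


tokens = ["Our", "ML", "workshop", "audience"]
for name, mask in [("decoder-only", causal_mask(len(tokens))),
                   ("encoder-only", bidirectional_mask(len(tokens)))]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:<9}", row)
```

An encoder-decoder model combines the two: the encoder reads the input bidirectionally, and the decoder generates the output left to right while also attending to the encoded input.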

Pipelines before LLMs

Text preprocessing: Tokenization, stopword removal, stemming/lemmatization.
Feature extraction: Bag of Words or word embeddings.
Training: Supervised learning on a labeled dataset (e.g., with positive, negative, and neutral sentiment categories for sentiment analysis).
Evaluation: Performance typically measured using accuracy, F1-score, etc.
Main challenges:
- Extensive feature engineering required for good performance.
- Difficulty in capturing the nuances and context of sentiment, especially in complex sentences.
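
As a rough sketch of the first two steps of such a pipeline, the code below preprocesses two made-up sentences and turns them into bag-of-words count vectors. The stopword list is a tiny hand-picked assumption, and stemming/lemmatization is omitted to keep the example short.

```python
import re
from collections import Counter

# A tiny hand-picked stopword list; real pipelines use much longer lists.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "our"}


def preprocess(text):
    """Lowercase, tokenize on letters/apostrophes, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


documents = ["Our ML workshop audience is the best!", "The worst workshop of the year."]
tokenized = [preprocess(doc) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
features = [bag_of_words(tokens, vocabulary) for tokens in tokenized]
print(vocabulary)
print(features)
```

These count vectors, paired with sentiment labels, would then be passed to a supervised classifier. Because word order is discarded, much of the context is lost, which is one source of the challenges listed above.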

Pipelines after LLMs

from transformers import pipeline

# Sentiment analysis pipeline using a DistilBERT model fine-tuned on SST-2
analyzer = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
analyzer(["I asked my model to predict my future, and it said '404: Life not found.'",
          '''Machine learning is just like cooking—sometimes you follow the recipe, 
            and other times you just hope for the best!.'''])

[{'label': 'NEGATIVE', 'score': 0.995707631111145},
 {'label': 'POSITIVE', 'score': 0.9994770884513855}]

Zero-shot learning

['i left with my bouquet of red and yellow tulips under my arm feeling slightly more optimistic than when i arrived',
 'i was feeling a little vain when i did this one',
 'i cant walk into a shop anywhere where i do not feel uncomfortable',
 'i felt anger when at the end of a telephone call',
 'i explain why i clung to a relationship with a boy who was in many ways immature and uncommitted despite the excitement i should have been feeling for getting accepted into the masters program at the university of virginia',
 'i like to have the same breathless feeling as a reader eager to see what will happen next',
 'i jest i feel grumpy tired and pre menstrual which i probably am but then again its only been a week and im about as fit as a walrus on vacation for the summer',
 'i don t feel particularly agitated',
 'i feel beautifully emotional knowing that these women of whom i knew just a handful were holding me and my baba on our journey',
 'i pay attention it deepens into a feeling of being invaded and helpless',
 'i just feel extremely comfortable with the group of people that i dont even need to hide myself',
 'i find myself in the odd position of feeling supportive of']

Zero-shot learning for emotion detection

from transformers import pipeline

# Load a pretrained NLI model for zero-shot classification
model_name = "facebook/bart-large-mnli"
classifier = pipeline("zero-shot-classification", model=model_name)

# `dataset` is assumed to be loaded earlier, with a "test" split and a "text" column
exs = dataset["test"]["text"][:10]
candidate_labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]
outputs = classifier(exs, candidate_labels)
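
As background on why no emotion-specific training is needed: a zero-shot classifier built on a natural language inference (NLI) model treats each text as a premise and turns each candidate label into a hypothesis, then scores how strongly the premise entails it. The sketch below only builds those premise/hypothesis pairs and does not call any model. The helper name `build_nli_pairs` is made up, and the template string is assumed to match the Transformers default, so check the library documentation before relying on it.

```python
def build_nli_pairs(text, candidate_labels, hypothesis_template="This example is {}."):
    """Pair the text (premise) with one hypothesis per candidate label."""
    return [(text, hypothesis_template.format(label)) for label in candidate_labels]


candidate_labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]
pairs = build_nli_pairs("i felt anger when at the end of a telephone call", candidate_labels)
for premise, hypothesis in pairs:
    print(f"{premise!r} -> {hypothesis!r}")
```

The label whose hypothesis receives the strongest entailment score becomes the top prediction.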

Zero-shot learning for emotion detection

	sequence	labels	scores
0	im feeling rather rotten so im not very ambiti...	[sadness, anger, surprise, fear, joy, love]	[0.7367963194847107, 0.10041721910238266, 0.09...
1	im updating my blog because i feel shitty	[sadness, surprise, anger, fear, joy, love]	[0.7429746985435486, 0.13775986433029175, 0.05...
2	i never make her separate from me because i do...	[love, sadness, surprise, fear, anger, joy]	[0.3153638243675232, 0.22490324079990387, 0.19...
3	i left with my bouquet of red and yellow tulip...	[surprise, joy, love, sadness, fear, anger]	[0.42182087898254395, 0.3336702883243561, 0.21...
4	i was feeling a little vain when i did this one	[surprise, anger, fear, love, joy, sadness]	[0.5639430284500122, 0.17000176012516022, 0.08...
5	i cant walk into a shop anywhere where i do no...	[surprise, fear, sadness, anger, joy, love]	[0.37033382058143616, 0.36559492349624634, 0.1...
6	i felt anger when at the end of a telephone call	[anger, surprise, fear, sadness, joy, love]	[0.9760521054267883, 0.01253431849181652, 0.00...
7	i explain why i clung to a relationship with a...	[surprise, joy, love, sadness, fear, anger]	[0.4382022023200989, 0.232231006026268, 0.1298...
8	i like to have the same breathless feeling as ...	[surprise, joy, love, fear, anger, sadness]	[0.7675782442092896, 0.13846899569034576, 0.03...
9	i jest i feel grumpy tired and pre menstrual w...	[surprise, sadness, anger, fear, joy, love]	[0.7340186834335327, 0.11860235780477524, 0.07...
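
To read off a single predicted emotion per text, we can take the highest-scoring label from each result. The sketch below assumes each result is a dict with `sequence`, `labels`, and `scores` keys, as in the table above; `toy_outputs` is a hand-written stand-in with made-up scores, not real model output.

```python
def top_predictions(outputs):
    """Return (sequence, best_label, best_score) for each classifier output."""
    results = []
    for out in outputs:
        # Pick the position of the highest score; labels and scores are aligned.
        best = max(range(len(out["scores"])), key=lambda i: out["scores"][i])
        results.append((out["sequence"], out["labels"][best], out["scores"][best]))
    return results


# Hand-written stand-in with made-up scores, shaped like the pipeline output.
toy_outputs = [
    {"sequence": "i felt anger when at the end of a telephone call",
     "labels": ["anger", "surprise", "fear"],
     "scores": [0.97, 0.02, 0.01]},
]
predictions = top_predictions(toy_outputs)
for sequence, label, score in predictions:
    print(f"{label:<8} {score:.2f}  {sequence}")
```

Comparing these top labels with the dataset's true labels would give a quick estimate of zero-shot accuracy.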

Fun tools

Harms of large language models

While these models are super powerful and useful, be mindful of the harms they can cause. Some of the harms, as summarized here, are:

performance disparities
social biases and stereotypes
toxicity
misinformation
security and privacy risks
copyright and legal protections
environmental impact
centralization of power

Thank you!

That’s it for the modules! Now, let’s work on exercises.