Large Language Models

Learning outcomes


By the end of this module, you will be able to:

  • Explain language models and large language models
  • Describe the concept of self-attention
  • Distinguish between decoder-only, encoder-only, and encoder-decoder models
  • Apply pre-trained large language models for zero-shot learning

Take a guess

  • What do you think is the vocabulary size of young adult speakers of American English?

Language models activity


Each of you will receive a sticky note with a word on it at some point. Here’s what you’ll do:

  • Carefully remove the sticky note to see the word. This word is for your eyes only, so don’t show it to your neighbours!
  • Think quickly: what word would logically follow the word on the sticky note? Write this next word on a new sticky note.
  • You have about 20 seconds for this step, so trust your instincts!
  • Pass your predicted word to the person next to you. Do not pass the word you received from your neighbour forward. Keep the chain going!
  • Stop after the last person in your row/table has finished.





Markov model of language


  • You’ve just created a simple Markov model of language!
  • In predicting the next word from a minimal context, you likely used your linguistic intuition and familiarity with common two-word phrases or collocations.
  • You could create more coherent sentences by taking more context into account, e.g., the previous two words, four words, or 100 words.

Language model

  • A language model computes a probability distribution over sequences (of words or characters). Intuitively, this probability tells us how “good” or plausible a sequence of words is.
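
One standard way to make this concrete is the chain rule of probability, which breaks the probability of a whole sequence into a series of next-word predictions. The notation below (w_t for the t-th word, n for the sequence length) is introduced here for illustration and is not part of the original slides.

```latex
P(w_1, w_2, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})
```

A Markov model like the one from the sticky-note activity approximates each factor using only the immediately preceding word:

```latex
P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-1})
```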

Check out this recent BMO ad.

Smart compose

A common application for predicting the next word is the ‘smart compose’ feature in your emails, text messages, and search engines.

Why should we care about predicting the next word?


  • Many practical language-related tasks can be cast as word prediction.

Sentiment analysis as word prediction

  • We can cast sentiment analysis as language modeling by giving a language model a context like:

The sentiment of the sentence “I like machine learning” is:

  • And comparing the probability of the word “positive” with that of the word “negative”. If “positive” is more probable, we say the sentiment is positive; otherwise, negative.
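
A minimal sketch of this comparison, assuming the Hugging Face transformers library with PyTorch and the small "gpt2" checkpoint (none of these choices come from the slides; the function names are made up for illustration). It scores only the first token of each label, which is a simplification, and a small model like GPT-2 may not give reliable answers.

```python
def label_probabilities(prompt, labels, model_name="gpt2"):
    """Return P(first token of label | prompt) for each candidate label.

    The heavy imports live inside the function so that nothing is
    downloaded until the function is actually called.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Logits for the token that would come right after the prompt
        next_token_logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(next_token_logits, dim=-1)
    # GPT-2 style tokenizers encode a word differently with a leading space,
    # so the space is added before looking up the label's first token.
    return {
        label: probs[tokenizer.encode(" " + label, add_special_tokens=False)[0]].item()
        for label in labels
    }


def most_probable(label_probs):
    """Pick the label whose probability is highest."""
    return max(label_probs, key=label_probs.get)
```

Calling `label_probabilities('The sentiment of the sentence "I like machine learning" is:', ["positive", "negative"])` and passing the result to `most_probable` would carry out the comparison described above; the first call downloads the model.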

Question answering as word prediction

  • We can cast question answering as language modeling by giving a language model a context like:

Q: Who won the Nobel Prize in 2024 for their work in deep learning? A:

  • We might expect to see that “Geoffrey” is very likely. If we continue and ask:

Q: Who won the Nobel Prize in 2024 for their work in deep learning? A: Geoffrey

  • We might expect to see that “Hinton” is very likely.
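
The "predict a word, append it, ask again" loop above can be sketched as greedy decoding. The probability table below is hypothetical and merely stands in for a real language model; the names `greedy_decode`, `toy_model`, and the `<end>` marker are illustrative choices.

```python
def greedy_decode(next_word_probs, context, max_new_words=5, stop_word="<end>"):
    """Repeatedly append the most probable next word to the context."""
    generated = []
    for _ in range(max_new_words):
        probs = next_word_probs(context)
        if not probs:
            break
        best = max(probs, key=probs.get)
        if best == stop_word:
            break
        generated.append(best)
        context = context + " " + best  # the prediction becomes part of the context
    return " ".join(generated)


QUESTION = "Q: Who won the Nobel Prize in 2024 for their work in deep learning? A:"

# Hypothetical probabilities, NOT the output of a real model
TOY_TABLE = {
    QUESTION: {"Geoffrey": 0.6, "The": 0.3, "It": 0.1},
    QUESTION + " Geoffrey": {"Hinton": 0.9, "Smith": 0.1},
    QUESTION + " Geoffrey Hinton": {"<end>": 1.0},
}


def toy_model(context):
    """Look up the next-word distribution for a context (empty if unknown)."""
    return TOY_TABLE.get(context, {})


print(greedy_decode(toy_model, QUESTION))  # Geoffrey Hinton
```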

Text summarization as word prediction

  • Input: Long text such as a full length article
  • Output: Effective shorter summary of it
  • We can follow the text of the article with a token like tl;dr (too long; didn’t read)
  • Since this token has become common in recent years, a language model has seen many texts in which it occurs before a summary, so it tends to interpret it as an instruction to generate a summary.

A simple model of language


  • Count how often each word follows another (co-occurrence frequencies) and turn these counts into probabilities
  • Predict the next word based on these probabilities
  • This is a Markov model of language.
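
The steps above can be sketched as a bigram model in a few lines of Python. The three-sentence corpus is made up for this example, and the function names are illustrative; a real model would be estimated from far more text.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, word):
    """Convert the follower counts for `word` into probabilities."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}


def predict_next(counts, word):
    """Return the most probable next word, or None for an unseen word."""
    probs = next_word_probs(counts, word)
    return max(probs, key=probs.get) if probs else None


# Toy corpus (made up for illustration)
corpus = [
    "i like machine learning",
    "i like deep learning",
    "i study machine learning",
]
model = train_bigram_model(corpus)
print(next_word_probs(model, "i"))
print(predict_next(model, "machine"))  # learning
```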

Long-distance dependencies


What are some reasonable predictions for the next word in the sequence?

I am studying law at the University of British Columbia Point Grey campus in Vancouver because I want to work as a ___

A Markov model that conditions on only the previous word (or the previous few words) is unable to capture such long-distance dependencies in language.

Transformer models


Enter attention and transformer models! Transformer models are at the core of all state-of-the-art Generative AI models (e.g., BERT, GPT-3, GPT-4, Gemini, DALL-E, Llama, GitHub Copilot).


Transformer models


Source: GPT-4 Technical Report

Self-attention


  • An important innovation which makes these models work so well is self-attention.
  • Count how many times the players wearing white pass the basketball.

Self-attention


When we process information, we often selectively focus on specific parts of the input, giving more attention to relevant information and less attention to irrelevant information. This is the core idea of attention.

Consider the examples below:

  • Example 1: She left a brief note on the kitchen table, reminding him to pick up groceries.

  • Example 2: The diplomat’s speech struck a positive note in the peace negotiations.

  • Example 3: She plucked the guitar strings, ending with a melancholic note.

The word note in these examples carries quite distinct meanings, each tied to a different context. To capture varying word meanings across different contexts, we need a mechanism that considers the wider context to compute each word’s contextual representation.

  • Self-attention is just that mechanism!
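
As a rough illustration of the idea, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is deliberately simplified: the queries, keys, and values are all the raw input vectors, whereas real transformer layers apply learned projection matrices and use several attention heads. The two-dimensional "embeddings" are made up for this example.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def self_attention(X):
    """Simplified self-attention with queries = keys = values = X."""
    d = len(X[0])
    outputs, all_weights = [], []
    for query in X:
        # How relevant is every token (including itself) to this token?
        scores = [dot(query, key) / math.sqrt(d) for key in X]
        weights = softmax(scores)
        # Contextual representation: weighted average of all token vectors
        output = [sum(w * value[j] for w, value in zip(weights, X)) for j in range(d)]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Made-up 2-dimensional embeddings for three tokens
tokens = ["melancholic", "note", "guitar"]
X = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
contextual, weights = self_attention(X)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each row of `weights` says how much one token attends to every token in the sequence, so the same word can end up with a different representation in a different sentence.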

What is an LLM?

  • A large language model learns knowledge about language and the world from vast amounts of text.
  • It learns the complexities of language simply by being immersed in it, without any textbook.
  • At a high level, training LLMs works as follows:
    • we feed the model batches of text
    • it tries to predict what comes next
    • we check the answers and based on how well it does, the model changes its internal settings (parameters)
    • it’s learning and improving
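
The training steps above can be sketched at toy scale. The example below is not how LLMs are actually built: it assumes a tiny next-word model with one row of scores (logits) per current word, a single made-up sentence as training text, and plain gradient descent on the cross-entropy loss. All names and hyperparameters are illustrative.

```python
import math
import random

text = "i like machine learning and i like deep learning".split()
vocab = sorted(set(text))
index = {word: i for i, word in enumerate(vocab)}
# Training examples: (current word, next word) pairs
pairs = [(index[a], index[b]) for a, b in zip(text, text[1:])]

random.seed(0)
V = len(vocab)
# Parameters ("internal settings"): one row of next-word logits per current word
logits = [[random.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(V)]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def average_loss():
    """Average negative log-probability assigned to the true next words."""
    return -sum(math.log(softmax(logits[a])[b]) for a, b in pairs) / len(pairs)


learning_rate = 0.5
loss_before = average_loss()
for epoch in range(100):
    for a, b in pairs:                    # feed the model some text
        probs = softmax(logits[a])        # it predicts what comes next
        for j in range(V):                # check the answer, adjust the parameters
            grad = probs[j] - (1.0 if j == b else 0.0)
            logits[a][j] -= learning_rate * grad
loss_after = average_loss()
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The loss cannot reach zero here because "like" is followed by two different words in the training text, so the best the model can do is split its probability between them.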

Using LLMs in your applications


Types of LLMs

If you want to use pre-trained LLMs, it’s useful to know that there are three main types of LLMs.

Feature | Decoder-only (e.g., GPT-3) | Encoder-only (e.g., BERT, RoBERTa) | Encoder-decoder (e.g., T5, BART)
Output computation is based on | Information earlier in the context | Entire context (bidirectional) | Encoded input context
Text generation | Can naturally generate text completions | Cannot directly generate text | Can generate outputs naturally
Example | Our ML workshop audience is ___ | Our ML workshop audience is the best! → positive | Input: Translate to Mandarin: Long but productive day! Output: 漫长而富有成效的一天!
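
One way to picture the "output computation is based on" row is as an attention mask: decoder-only models let each token look only at itself and earlier tokens, while encoder-only models let every token look at the whole input. The sketch below is illustrative only (the function name is made up); encoder-decoder models combine both, with a bidirectional encoder over the input and a causal decoder that also attends to the encoded input.

```python
def attention_mask(n_tokens, causal):
    """Return a mask where mask[i][j] == 1 means token i may look at token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(n_tokens)]
        for i in range(n_tokens)
    ]


tokens = ["Our", "ML", "workshop", "audience"]
for name, causal in [("decoder-only (causal)", True),
                     ("encoder-only (bidirectional)", False)]:
    print(name)
    for token, row in zip(tokens, attention_mask(len(tokens), causal)):
        print(f"  {token:<9}", row)
```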

Pipelines before LLMs


  • Text preprocessing: Tokenization, stopword removal, stemming/lemmatization.
  • Feature extraction: Bag of Words or word embeddings.
  • Training: Supervised learning on a labeled dataset (e.g., with positive, negative, and neutral sentiment categories for sentiment analysis).
  • Evaluation: Performance typically measured using accuracy, F1-score, etc.
  • Main challenges:
    • Extensive feature engineering required for good performance.
    • Difficulty in capturing the nuances and context of sentiment, especially in complex sentences.
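
The first two pipeline stages can be sketched with the standard library alone. The stopword list below is a tiny made-up sample, and the function names are illustrative; real pipelines typically relied on libraries such as NLTK or scikit-learn.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (real lists are much longer)
STOPWORDS = {"i", "the", "a", "an", "is", "of", "and", "to"}


def preprocess(text):
    """Lowercase, tokenize, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def bag_of_words(text):
    """Represent a text by its word counts, ignoring word order."""
    return Counter(preprocess(text))


print(bag_of_words("I like machine learning"))
# Word order is lost, so these two texts get identical features:
print(bag_of_words("good, not bad") == bag_of_words("bad, not good"))  # True
```

The last line hints at the second challenge listed above: once word order is discarded, the features cannot tell apart sentences whose meaning depends on context.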

Pipelines after LLMs

 

from transformers import pipeline

# Sentiment analysis pipeline (DistilBERT fine-tuned on SST-2)
analyzer = pipeline("sentiment-analysis", model='distilbert-base-uncased-finetuned-sst-2-english')
analyzer(["I asked my model to predict my future, and it said '404: Life not found.'",
          '''Machine learning is just like cooking—sometimes you follow the recipe, 
            and other times you just hope for the best!.'''])

# Output:
[{'label': 'NEGATIVE', 'score': 0.995707631111145},
 {'label': 'POSITIVE', 'score': 0.9994770884513855}]

Zero-shot learning


['i left with my bouquet of red and yellow tulips under my arm feeling slightly more optimistic than when i arrived',
 'i was feeling a little vain when i did this one',
 'i cant walk into a shop anywhere where i do not feel uncomfortable',
 'i felt anger when at the end of a telephone call',
 'i explain why i clung to a relationship with a boy who was in many ways immature and uncommitted despite the excitement i should have been feeling for getting accepted into the masters program at the university of virginia',
 'i like to have the same breathless feeling as a reader eager to see what will happen next',
 'i jest i feel grumpy tired and pre menstrual which i probably am but then again its only been a week and im about as fit as a walrus on vacation for the summer',
 'i don t feel particularly agitated',
 'i feel beautifully emotional knowing that these women of whom i knew just a handful were holding me and my baba on our journey',
 'i pay attention it deepens into a feeling of being invaded and helpless',
 'i just feel extremely comfortable with the group of people that i dont even need to hide myself',
 'i find myself in the odd position of feeling supportive of']

Zero-shot learning for emotion detection


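from datasets import load_dataset
from transformers import pipeline

# Assumption: the slides do not show how `dataset` was created. The examples
# look like the Hugging Face "dair-ai/emotion" dataset, so it is loaded here.
dataset = load_dataset("dair-ai/emotion")

# Load the pretrained NLI model used for zero-shot classification
model_name = "facebook/bart-large-mnli"
classifier = pipeline('zero-shot-classification', model=model_name)

# Score the first 10 test examples against the candidate emotion labels
exs = dataset["test"]["text"][:10]
candidate_labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]
outputs = classifier(exs, candidate_labels)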

Zero-shot learning for emotion detection


  | sequence | labels | scores
0 | im feeling rather rotten so im not very ambiti... | [sadness, anger, surprise, fear, joy, love] | [0.7367963194847107, 0.10041721910238266, 0.09...
1 | im updating my blog because i feel shitty | [sadness, surprise, anger, fear, joy, love] | [0.7429746985435486, 0.13775986433029175, 0.05...
2 | i never make her separate from me because i do... | [love, sadness, surprise, fear, anger, joy] | [0.3153638243675232, 0.22490324079990387, 0.19...
3 | i left with my bouquet of red and yellow tulip... | [surprise, joy, love, sadness, fear, anger] | [0.42182087898254395, 0.3336702883243561, 0.21...
4 | i was feeling a little vain when i did this one | [surprise, anger, fear, love, joy, sadness] | [0.5639430284500122, 0.17000176012516022, 0.08...
5 | i cant walk into a shop anywhere where i do no... | [surprise, fear, sadness, anger, joy, love] | [0.37033382058143616, 0.36559492349624634, 0.1...
6 | i felt anger when at the end of a telephone call | [anger, surprise, fear, sadness, joy, love] | [0.9760521054267883, 0.01253431849181652, 0.00...
7 | i explain why i clung to a relationship with a... | [surprise, joy, love, sadness, fear, anger] | [0.4382022023200989, 0.232231006026268, 0.1298...
8 | i like to have the same breathless feeling as ... | [surprise, joy, love, fear, anger, sadness] | [0.7675782442092896, 0.13846899569034576, 0.03...
9 | i jest i feel grumpy tired and pre menstrual w... | [surprise, sadness, anger, fear, joy, love] | [0.7340186834335327, 0.11860235780477524, 0.07...
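
In each row the labels are sorted from most to least probable, so the predicted emotion is simply the first label. Below is a small sketch of pulling out that top prediction. The helper name `top_predictions` is made up, and the two sample results copy only the first two labels and scores visible in the table above (the remaining scores are truncated there).

```python
def top_predictions(outputs):
    """Return (sequence, best_label, best_score) for each classifier result."""
    return [(o["sequence"], o["labels"][0], o["scores"][0]) for o in outputs]


# Results in the shape returned by the zero-shot pipeline, truncated to the
# first two labels and scores shown in the table
sample_outputs = [
    {"sequence": "i felt anger when at the end of a telephone call",
     "labels": ["anger", "surprise"],
     "scores": [0.9760521054267883, 0.01253431849181652]},
    {"sequence": "im updating my blog because i feel shitty",
     "labels": ["sadness", "surprise"],
     "scores": [0.7429746985435486, 0.13775986433029175]},
]

for sequence, label, score in top_predictions(sample_outputs):
    print(f"{label:<8} {score:.2f}  {sequence}")
```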

Fun tools

Harms of large language models

While these models are super powerful and useful, be mindful of the harms caused by these models. Some of the harms as summarized here are:

  • performance disparities
  • social biases and stereotypes
  • toxicity
  • misinformation
  • security and privacy risks
  • copyright and legal protections
  • environmental impact
  • centralization of power

Thank you!

  • That’s it for the modules! Now, let’s work on exercises.