What are LLMs Transcript
This is an autogenerated transcript of the video "What are LLMs".
Ivan: Hello, my name is Ivan. Welcome to "Getting Started with GenAI in Research." In this video, we will continue talking about the technologies behind GenAI. We have already discussed what artificial intelligence is and what machine learning is. Now, let's finally talk about large language models.
Essentially, language models are machine learning models that solve one very specific task: predicting the next word in a given text. Because we can predict the next word, we can also generate text word by word, creating content of any length we want.
Gabriella: Well, strictly speaking, that’s not quite what language models do...
Ivan: Oh! Hi Gabriella!
Gabriella: Hi Ivan!
Ivan: Well, actually, you're right. Maybe you want to talk about this topic?
Gabriella: I’d love to! Large language models are fascinating to talk about.
Ivan: Then I will invite you to talk about this topic.
Gabriella: Thank you, Ivan. To be more precise, the formal definition of a language model is that it's a probability distribution over all possible sentences, documents, spoken utterances, or any other linguistic units. In simpler terms, this means that for each possible sentence, language models assign a probability showing how likely it is to appear in natural text.
Let’s look at some examples. On this slide, you can see five sentences:
- For the first one, a language model would assign a very small probability as it's just a random set of characters.
- For the second sentence, the probability should be higher but still very small because while these are all real English words, they’re just strung together randomly without any grammatical structure.
- "The statistical analysis ate the conference proceedings" would get an even higher probability because it's grammatically correct, even though it's semantically nonsensical.
- "The PhD student solved differential equations by juggling tennis balls" - the probability would be higher still. This sentence is both grammatically correct and could technically happen in the real world, though it's very unlikely.
- Finally, “The statistical analysis showed significant results” would receive the highest probability because not only is it a completely natural sentence, but it’s one you might commonly find in academic writing.
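To make this ranking concrete, here is a minimal sketch (an editorial illustration, not part of the original video) that scores the quoted example sentences with the small, publicly available GPT-2 model via the Hugging Face transformers library; a lower average negative log-likelihood means the model considers the sentence more probable:

```python
# Minimal sketch: rank the quoted example sentences by how probable a small
# language model finds them. GPT-2 via Hugging Face transformers is used here
# purely for illustration; it is not one of the models discussed in the video.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

sentences = [
    "The statistical analysis ate the conference proceedings",
    "The PhD student solved differential equations by juggling tennis balls",
    "The statistical analysis showed significant results",
]

for sentence in sentences:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids makes the model return the average next-token
        # cross-entropy, i.e. the negative log-likelihood per token.
        nll = model(ids, labels=ids).loss.item()
    print(f"avg. negative log-likelihood {nll:6.2f} | {sentence}")
```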
Why might it be useful to know all these probabilities? Well, because they can be applied to real-world tasks. Consider a fill-in-the-blank example: “The student ___ ___ the library.” Because the language model knows the probabilities of all sentences that start with “the student” and end with “the library”, it can suggest the most likely words to fill those spaces, such as “studied at” or “walked into.”
The most common application, however, is for the model to return the probabilities for the next possible word in a given text. For example, for the sentence "The students open their", the model might suggest "books" or "laptops" as likely continuations, with every other word in the vocabulary receiving its own, much lower probability.
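As a small illustration of what asking for this next-word distribution looks like in code, here is a sketch that again uses GPT-2 through transformers purely as an example model (an assumption for illustration; the video does not name a specific model):

```python
# Sketch: the probability distribution over the next token for a given prefix,
# using GPT-2 via Hugging Face transformers (an illustrative choice of model).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prefix = "The students open their"
inputs = tokenizer(prefix, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # shape: (1, seq_len, vocab_size)
probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the next token

for p, token_id in zip(*torch.topk(probs, k=5)):
    print(f"{tokenizer.decode(int(token_id))!r}: {p.item():.3f}")
# Every other token in the vocabulary also receives a probability,
# just a much smaller one.
```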
Ivan: At this point, it still might be a bit too abstract for you, so let me provide you with an example where we try to create a toy language model. We will use some training data - for this, we will use my favourite book, which is “To the Lighthouse” by Virginia Woolf.
The very naive approach would be to just pick words randomly from this book. If we do something like that, we get a sentence that would sound like "Mill and what through greatness insensibly crown would yourself listen" - which doesn't make any sense.
Now, let's try a smarter approach. Let's not just take words randomly, but also take their context into account. In our case, the context would be just the previous word. How could this work? Let's imagine I want to create a sentence that will start with the word "she". I will go to this book and find all instances of "she" and see what words follow it in this book. I discovered that the word "had" appeared 190 times after "she", "was" 21 times, etc.
It would be natural to select the most probable word, which would be “had”. So now I would have my longer sentence: “she had”. Then I go again to this book and look for all instances of “had” and find that after the word “had”, the most common word is “been”, which is probably not surprising.
Now I have “she had been”, and I can repeat this process over and over. I can create “she had been a”, then “she had been a little”, and finally “she had been a little boy”.
I think we achieved something quite remarkable here, because we used a very simple model but were able to produce a sentence that sounds quite natural. Well, it's a bit weird, but it's certainly much more natural than our first random attempt. We were able to do this because we used training data, but at the same time, we didn't copy anything directly from the training data - there is no sentence "she had been a little boy" in "To the Lighthouse". It's a completely new and unique sentence.
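Here is a compact sketch of this toy model, assuming the text of the novel has been saved in a local file (the file name is illustrative, and the exact counts and output will depend on the edition used):

```python
# Toy bigram language model in the spirit of the example above:
# count which word follows which, then greedily pick the most frequent follower.
from collections import Counter, defaultdict

# Assumes the novel's text is available locally; the file name is illustrative.
# Punctuation handling is skipped for brevity.
with open("to_the_lighthouse.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

followers = defaultdict(Counter)
for current_word, next_word in zip(words, words[1:]):
    followers[current_word][next_word] += 1

def generate(start: str, length: int = 6) -> str:
    sentence = [start]
    for _ in range(length - 1):
        counts = followers[sentence[-1]]
        if not counts:
            break
        # Greedy choice: always take the most frequent follower,
        # exactly like picking "had" after "she" in the example.
        sentence.append(counts.most_common(1)[0][0])
    return " ".join(sentence)

print(generate("she"))  # e.g. "she had been a little ..." depending on the edition
```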
Gabriella: The intuition is that with much more training data and a larger context size, we could achieve much better results. Let's think about why more training data might be useful. Remember when we had "she had been a little" as the beginning of our sentence? At that point, we had very limited choices - the word "boy" occurred 11 times, "man" occurred 4 times, there were a few other options such as "book", and the word "girl" wasn't in the list at all. It is reasonable to assume that with more training data, we'd have a broader, more natural set of choices.
Another limitation was that when generating the word “boy”, we were only looking at the previous word “little”. If we could use the whole context, we’d know there was the word “she” earlier in the sentence, and this might suggest reducing the probability for “boy” and increasing it for “girl”.
This is essentially what modern large language models do. They also learn to predict the next word based on training data, but they use much larger context sizes and much larger training datasets. While our example had a context of just one word, the context size of modern LLMs can be more than 100,000 words. And instead of training on a single book with around 70,000 words, they're trained on roughly 10 trillion words - basically the whole internet and all the data available to the companies training these models.
Interestingly, there are estimates suggesting we might soon hit a ceiling because we'll be using all publicly available human-generated text data. Looking at this plot, you can see how much training data different models use. If this trend continues at the same pace, we'll soon reach that ceiling - we'll simply run out of human-generated text to train on.
At some point, researchers realised that a very good language model could serve as a sort of general AI tool. For example, if you have a sentence that starts with "The capital of France is", a good language model would predict "Paris" as the next word. Suddenly, it can answer geography questions. Similarly, if you have "The sentiment of the sentence 'I'm sad' is", the model should predict "negative" with higher probability than "positive". Now we can use it for sentiment analysis.
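One way to see this "general tool" behaviour in code is to compare the probabilities the model assigns to the candidate continuations. The sketch below does this with GPT-2 via transformers (again only an illustrative assumption about the model), comparing just the first sub-token of each label for simplicity:

```python
# Sketch: using next-word probabilities for a simple task, here sentiment.
# GPT-2 via transformers is only an illustrative choice of model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The sentiment of the sentence 'I'm sad' is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

for label in [" negative", " positive"]:
    # Compare only the first sub-token of each label, which is enough for a sketch.
    first_id = tokenizer.encode(label)[0]
    print(f"{label.strip()}: {next_token_probs[first_id].item():.4f}")
```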
What's particularly interesting is that not many people believed this would work. Richard Zellers, one of the leading researchers in this field, once shared on Twitter a review of his paper where he introduced this idea. The reviewer was quite skeptical, saying that this approach would never work. And the reviewer wasn't being unreasonable - they were actually being quite logical. Before 2018, before the first GPT model was introduced, language models struggled even with producing coherent text.
Ivan told me how he and his colleague were sitting in the same office when they were testing GPT-1. They typed something like “Two postdoctoral researchers, Ivan and Max, entered a bar” and let the model generate a story. While the story included some absurd details that made them laugh, the mere fact that the model could produce coherent stories was remarkable - far beyond what previous generations of models could achieve.
Once we have these powerful models, they can correctly continue phrases like “The capital of France is” with “Paris”, but that’s not exactly what we want when training them as assistants. What we actually want them to do is to be able to answer questions like “What is the capital of France?”
Here's where it gets interesting: if you ask the base model to continue the sentence "What is the capital of France?", it won't say "Paris". Think about it - in natural text on the internet, you rarely see "What is the capital of France? Paris." Instead, you're more likely to find lists of questions about France or something similar. So the model would continue with something like "What is France's largest city? What is France's population? What is the currency of France?" etc.
This is why the base model training is just the first step. Though it’s worth noting that this step consumes the vast majority of time and computational resources - about 99% of them go to training the base model. After this, we need additional steps to transform the model into a smart assistant.
The second step is called supervised fine-tuning. Here we continue training the model with much smaller amounts of data, but of much higher quality. Unlike internet data, this data is already structured in the way we want the model to work: starting with a human question, followed by a correct answer.
Here is an example from an open dataset that contains these question-answer pairs. There is a question: “Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?” There is also an answer, and please note that a human wrote it entirely. By learning from such datasets, models start to behave as assistants and actually answer questions posed to them.
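Schematically, one such supervised fine-tuning record might look like the sketch below; the field names, the prompt template, and the truncated answer text are illustrative assumptions, not the actual format or content of the dataset mentioned above:

```python
# Illustrative sketch of a single supervised fine-tuning example.
# Field names, the template, and the shortened answer are assumptions made
# for illustration only, not any specific dataset's actual format.
sft_example = {
    "question": (
        "Can you explain contrastive learning in machine learning "
        "in simple terms for someone new to the field of ML?"
    ),
    "answer": "Contrastive learning trains a model to ...",  # placeholder for a human-written answer
}

def to_training_text(example: dict) -> str:
    # Render the pair in the "human question, then correct answer" shape
    # that the model learns to imitate during fine-tuning.
    return f"User: {example['question']}\nAssistant: {example['answer']}"

print(to_training_text(sft_example))
```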
The problem with this approach is that it isn't very scalable: as you can imagine, generating a good answer is quite a time-consuming task for humans. That's where the third step comes in. The idea behind it is that it's generally much easier to evaluate whether an answer is correct or wrong than to create an answer from scratch.
Think about it like finding rhymes - if you ask me to find a rhyme for a word, I might struggle, especially in a non-native language. But if you ask me whether two words rhyme, that’s much simpler.
So what companies do now is let the model generate answers, and then humans rate these answers as good or bad. They're also training other models to evaluate answer quality, since for machines, just as for humans, it's easier to distinguish between good and bad answers than to generate them from scratch.
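Schematically, the data collected in this third step boils down to a prompt plus a human preference between candidate answers; everything in the sketch below (field names and contents) is made up for illustration:

```python
# Illustrative sketch of a single preference record from this third step.
# All field names and contents here are invented for illustration.
preference_example = {
    "prompt": "Explain why the sky is blue.",
    "answer_a": "Because shorter (blue) wavelengths of sunlight scatter more in the atmosphere ...",
    "answer_b": "The sky is blue because it reflects the ocean.",
    "preferred": "answer_a",  # judgement supplied by a human rater (or a rater model)
}

# Such records can be used to train a separate model that scores answers,
# so that preferred answers receive higher scores than rejected ones.
```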
In this step, we continue training LLMs as we discussed in our machine learning video, providing data and updating the model's parameters to better match our desired output. There are also some innovative approaches to annotation. For example, OpenAI mentioned using what they call "CriticGPT". Let's say humans need to evaluate ChatGPT's answer that contains programming code. This can be challenging because it's easy to miss errors in code. So they use another model to highlight likely sources of error. The human still makes the final decision, but they're less likely to miss important issues.
One final note: when you're chatting with these AI assistants like ChatGPT, it's probably not just one model working alone. While companies don't always disclose these details, it could be several models running multiple times with another model selecting the best answer. So there are all these additional layers to improve quality.
Now let me summarise what we've discussed in this video. The basic idea behind language models is that they simply predict the next word. However, when you have enormous amounts of data and computational resources, these models become remarkably capable and useful across a wide variety of tasks.
We also discussed the crucial role of human annotators in ensuring output quality. A base LLM wouldn't naturally behave as an assistant, while a model trained solely on internet data wouldn't be able to distinguish between reliable and unreliable sources - for instance, recognising that Wikipedia is generally more trustworthy than an obscure forum.
Human annotators, however, aren't just useful - they can also introduce biases. We'll explore this more in upcoming videos, but I want to mention one interesting example now. There was a study where researchers found something fascinating: if you tell the model you dislike a particular argument, it tends to respond by saying it also finds the argument unconvincing. However, when presented with exactly the same argument but told you like it, the model suddenly describes it as strong and compelling. This shouldn't really surprise us - as we've learned in this video, human annotators play an important role in training these models, and humans tend to rate agreeable responses more favourably. The models simply pick up on these patterns.
The final point I want to make is that we should expect continued advances in this field. Some people suggest we've hit a peak because we're already using massive amounts of data and computational resources, but here's the thing: we don't necessarily need breakthroughs in language models themselves. Simply combining these models with different tools in complex workflows can lead to remarkable improvements.
Let me share an interesting example. There was a paper where researchers tested how well large language models could solve mathematical problems. The state-of-the-art model at the time performed poorly, solving just 8.8% of tasks. But when they modified their approach, allowing the model to generate code to solve the task and then use that code’s output, the performance jumped dramatically to 81.1%.
This reminds me of the early days of chatbots when people would laugh at their inability to solve even the simplest maths problems like 2 plus 3. But think about it - why does a large language model need to solve this directly? We already know how to compute 2 plus 3 very well. We don't need these models to do the calculation themselves; we just need to provide them with tools that can do it for them.
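A heavily simplified sketch of this idea is below; ask_model is a hypothetical stand-in for a call to any LLM API, and a real system would sandbox the generated code rather than calling eval:

```python
# Heavily simplified sketch of tool use: instead of asking the model for the
# final answer, ask it for a small piece of code and run that code ourselves.
# `ask_model` is a hypothetical stand-in for a call to any LLM API.

def ask_model(prompt: str) -> str:
    # Placeholder: a real system would send the prompt to a language model here.
    return "2 + 3"

def solve_with_calculator(question: str) -> str:
    expression = ask_model(
        f"Write a single Python arithmetic expression that answers: {question}"
    )
    # Run the generated expression; real systems use a proper sandbox, not eval().
    result = eval(expression, {"__builtins__": {}}, {})
    return str(result)

print(solve_with_calculator("What is 2 plus 3?"))  # -> 5
```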
As we’ll discuss later in this course, there are many examples where simply providing models with the right tools leads to massive improvements in performance.