
  • So I wanted to make a video about

  • GPT-2

  • Because it's been in the news recently

  • this very powerful language model from OpenAI, and I thought it would make sense to start by just doing a video about

  • transformers and language models in general because

  • GPT-2 is a very large

  • Language model implemented as a transformer, but you have a previous video about generating YouTube comments, which is the same kind of task, right?

  • That's a language modeling task from natural language processing, generating new samples of text.

  • I believe that video was made October 2017 and this paper came out December 2017, which has kind of

  • Revolutionized the way that people carry out that kind of task. That's not GPT-2, that's something before that, right?

  • That's the transformer, which is a new, well, relatively new

  • architecture

  • for neural networks, that can do actually all kinds of tasks, but they're especially good at this kind of

  • language modeling task

  • a language model is a probability distribution over like sequences of

  • tokens or symbols or words or whatever in a language.

  • So for any given like sequence of tokens, it can tell you how likely that is

  • So if you have a good language model of English

  • It can look at a sequence of you know words or characters or whatever and say how likely that is to occur in English

  • How likely that is to be an English phrase or sentence or whatever

  • And when you have that you can use that for a lot of different tasks. So

  • If you want to generate text, then you can you can just sort of sample from that distribution and keep giving it

  • its own output

  • so you you sample a word and then you say

  • And to be clear sampling from a distribution means you're just taking

  • you're sort of rolling the dice on that probability distribution and taking whichever one comes out. So

  • so you can like sample a word and then

  • and then say, okay, conditioning on that, given that the first word of this sentence is 'the',

  • What does the probability distribution look like for the second word?

  • And then you sample from that distribution and then it's you know

  • with, say, 'cat', and you say, given that it's 'the cat', what's likely to come next, and so on. So you can build

  • up a

  • string of text by sampling from

  • your distribution. That's one of the things you could use it for.
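
(As a rough sketch of that sampling loop in Python: assume a hypothetical function `next_word_distribution(words_so_far)` that returns the model's probability for each candidate next word. Nothing here is specific to any particular model.)

```python
import random

def sample_text(next_word_distribution, max_words=20):
    """Generate text by repeatedly sampling from a language model.

    next_word_distribution is a hypothetical function: it takes the words
    generated so far and returns a dict {candidate_word: probability}.
    """
    words = []
    for _ in range(max_words):
        dist = next_word_distribution(words)
        # "Roll the dice" on the distribution: each candidate is picked
        # with probability proportional to the model's estimate.
        candidates, probs = zip(*dist.items())
        next_word = random.choices(candidates, weights=probs, k=1)[0]
        if next_word == "<end>":   # assumed end-of-text token
            break
        words.append(next_word)
    return " ".join(words)
```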

  • Most of us kind of have an example of this sort of thing in our pockets.

  • Oh, absolutely, right, and that's the way that most people interact with a language model.

  • I guess this is how I often start a sentence

  • apparently with 'I': 'I am not sure if you have any questions or concerns, please visit the

  • plugin settings so I can do it for the first time in the future.' That's no good.

  • Here's a different option. Let's just see where this one goes. Maybe the same.

  • 'I am in the morning,

  • but I can't find it on the phone screen from the phone screen on the phone screen on the phone screen on the phone screen

  • on the phone screen.' I don't actually know how this is implemented;

  • it might be a neural network, but my guess is that it's some kind of

  • like a Markov model, Markov chain type setup, where you just,

  • for each word in your language, you look at your data set and you see

  • how often each other word is

  • following that word, and then that's how you build your distribution.

  • So like for the word "I" the most common word to follow that is "am" and there are a few others, you know
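
(A sketch of that counting idea, not a description of how any particular keyboard actually works: a bigram model is really just a table of how often each word follows each other word.)

```python
from collections import Counter, defaultdict

def build_bigram_model(corpus_words):
    """For each word, count how often every other word follows it."""
    follows = defaultdict(Counter)
    for current, nxt in zip(corpus_words, corpus_words[1:]):
        follows[current][nxt] += 1
    return follows

# Toy corpus; a real keyboard would use a much larger data set.
model = build_bigram_model("i am sure i am not sure i think".split())
print(model["i"].most_common(3))   # -> [('am', 2), ('think', 1)]
```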

  • so this is like a very simple model and

  • This sentence on the phone screen on the phone screen on the phone screen on the phone screen on the phone screen

  • is actually very unlikely, right?

  • This is a super low probability sentence; when would somebody type this? And the thing is, it's, like, myopic.

  • I'm not even sure, it's probably only looking at the previous word.

  • It might be looking at, like, the previous two words, but the problem is, looking further back becomes extremely expensive,

  • Computationally expensive right?

  • Like you've got, I don't know, 50,000 words that you might be looking at, and so then you're remembering

  • 50,000 probability distributions, or

  • 50,000 top-three words,

  • but you know, then if you want to go back

  • two, that's 50,000 squared, right, and if you want to go back three words

  • you have to cube it. So you're raising it to the power of the number of words back you want to go, which

  • means that this type of model

  • basically doesn't look back. By the time we're saying 'on the', it's already forgotten the previous time

  • it said 'on the'; it doesn't realize that it's repeating itself. And there are slightly better things you can do in this general area,

  • But like fundamentally if you don't remember you're not going to be able to make good sentences

  • If you can't remember the beginning of the sentence by the time you're at the end of it, right?
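
(The cost of looking further back, in rough numbers: with a vocabulary of about 50,000 words, the number of contexts you would need a distribution for grows as vocabulary size to the power of the context length.)

```python
vocab = 50_000                # rough size of an English vocabulary

print(vocab ** 1)   # 50,000                - look back one word
print(vocab ** 2)   # 2,500,000,000         - look back two words
print(vocab ** 3)   # 125,000,000,000,000   - look back three words
```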

  • and

  • so

  • One of the big areas of progress in language models is handling long term dependencies

  • I mean handling dependencies of any kind but especially long term dependencies

  • You've got a sentence that's like Shawn came to the hack space to record a video and I talked to

  • Blank right in that situation if your model is good

  • you're expecting, like, a pronoun probably, so it's, it's 'she', 'they',

  • you know, 'them', whatever. But the relevant piece of information is the word 'Shawn',

  • Which is like all the way at the beginning of the sentence

  • so your model needs to be able to say oh, okay, you know Shawn that's

  • Usually associated with male pronouns, so we'll put the male pronoun in there. And if your model doesn't have that ability to look back

  • Or to just remember what it's just said then

  • You end up with these sentences that

  • Like go nowhere

  • It's just, like, it might make a guess,

  • just a random guess at a pronoun and might get it wrong or it might just

  • and I talked to and then just be like

  • 'Frank', you know, just, like, introduce a new name, because it's guessing at what's likely to come there and it's completely forgotten that 'Shawn' was

  • ever, like, a thing. So yeah, these kinds of dependencies are a big issue with things that you would want a language model to do.

  • But we've only so far talked about

  • Language models for generating text in this way, but you can also use them for all kinds of different things. So like

  • people use language models for translation

  • Obviously you have some input sequence that's like in English and you want to output a sequence in French or something like that

  • Having a good language model is really important so that you end up with something that makes sense.

  • Summarization is a task that people often want

  • where you read in a long piece of text and then you generate a short piece of text that's like a summary of that.

  • that's the kind of thing that you would use a language model for or

  • reading a piece of text and then answering questions about that text or

  • If you want to write, like, a chatbot that's going to converse with people, having a good language model... basically almost all of

  • natural language processing,

  • right, it's useful to have this. The other thing is

  • you can use it to enhance

  • a lot of other language-related tasks.

  • So if you're doing like speech recognition then having a good language model

  • Like there's a lot of things people can say that sound very similar and to get the right one

  • You need to be like, oh, well, this actually makes sense, you know

  • this word that sounds very similar

  • would be incoherent in this sentence, it's a very low probability;

  • it's much more likely that they said this thing which would, like, flow in the language.
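
(A sketch of that rescoring idea: given several acoustically plausible transcriptions, pick the one the language model thinks is most probable. `sentence_probability` is a hypothetical scoring function standing in for a real language model.)

```python
def pick_transcription(candidates, sentence_probability):
    """Choose between similar-sounding candidates using a language model.

    sentence_probability is a hypothetical function returning the language
    model's probability for a whole sentence.
    """
    return max(candidates, key=sentence_probability)

# e.g. "recognise speech" vs. "wreck a nice beach" sound alike, but a good
# language model should give the coherent reading a much higher score.
```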

  • And human beings do this all the time same thing

  • With recognizing text from images, you know

  • You've got two words that look similar or there's some ambiguity or whatever and to resolve that you need

  • an

  • understanding of what word would make sense there, what word would fit. If you're trying to use a neural network to do the kind of

  • thing we were talking about before, of having a phone, you know autocorrect based on the previous word or two

  • Suppose you've got a sequence of two words going in you've got "so" and then "I" and you put

  • both of these into your network and it will then output, you know

  • like "said" for example as like a sensible next word and then what you do is you throw away or so and you then

  • Bring your set around and you make a new

  • Sequence which is I said and then put that into your network and it will put out

  • like I said - for example would make sense and so on and you keep going around, but the problem is

  • this context length is really short. You try and make this long enough to contain an entire

  • Sentence just an ordinary length sentence and this problem starts to become really really hard

  • And networks have a hard time learning it and you don't get very good performance
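
(A rough sketch of that fixed-window setup, in PyTorch here purely for illustration: the context length is a hard-coded hyperparameter, which is exactly the limitation being described.)

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Predict the next word from exactly `window` previous words."""

    def __init__(self, vocab_size, window=2, embed_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.out = nn.Linear(window * embed_dim, vocab_size)

    def forward(self, context_ids):            # shape: (batch, window)
        vectors = self.embed(context_ids)      # (batch, window, embed_dim)
        flat = vectors.flatten(start_dim=1)    # (batch, window * embed_dim)
        return self.out(flat)                  # scores over the vocabulary
```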

  • and even then

  • you've still got this absolute hard limit on how long a thing it can look at; you have to just pick a number

  • that's, like, how far back am I looking. A better thing to do is a recurrent neural network.

  • You give the thing, let's, like, divide that up:

  • so in this case, then, you have a network, you give it this vector,

  • You just like have a bunch of numbers which is gonna be like the memory

  • for that network. That's the idea; the problem is it's forgotten the beginning of the sentence by the time it gets to the

  • end so we've got to give it some way of remembering and

  • rather than feeding it the entire sentence every time you give it this vector and

  • you give it to just one word at a time of your inputs and

  • This vector, which you initialize I guess with zeros. I want to be clear

  • This is not something that I've studied in a huge amount of detail

  • I'm just like giving the overall like structure of the thing. But the point is you give it this vector and the word and

  • it outputs its guess for the next word and also a

  • modified version of that vector, which you then give it for the next step:

  • the word it spat out, or the sequence that it spat out, and

  • its own modified version of the vector. Every cycle that goes around, it's modifying this memory.
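
(A minimal sketch of that loop, again in PyTorch just for illustration: one word and the memory vector go in, a guess at the next word and a modified memory vector come out, and the memory starts off as zeros.)

```python
import torch
import torch.nn as nn

class TinyRNNLM(nn.Module):
    """One word in, one word out, plus a hidden 'memory' vector that is
    passed along and modified at every step."""

    def __init__(self, vocab_size, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.cell = nn.RNNCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, word_ids):                   # one sequence of word IDs
        h = torch.zeros(1, self.cell.hidden_size)  # the memory, initialised to zeros
        guesses = []
        for word_id in word_ids:                   # inherently serial loop
            x = self.embed(word_id).unsqueeze(0)   # current word
            h = self.cell(x, h)                    # modified memory vector
            guesses.append(self.out(h))            # guess at the next word
        return torch.stack(guesses)
```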

  • Once this system is like trained very well

  • If you give it the first word, 'Shawn', then part of this vector is going to contain some

  • information that's like, the subject of this sentence is the word 'Shawn', and

  • some other part will probably keep track of like

  • We expect to use a male pronoun for this sentence and that kind of thing

  • So you take this and give it to that and these are just two instances of the same network, and then it keeps going

  • every time

  • So it spits out, like, this is 'I', so then the 'I' also comes around to here, you might then output 'said', and so on.

  • But it's got this continuous thread of

  • memory effectively going through, because it keeps passing the thing through. In principle, if it figures out something important at the beginning of

  • You know

  • The complete works of Shakespeare that it's generating. There's nothing

  • Strictly speaking stopping that from persisting from being passed through

  • from iteration to iteration to iteration, every time.

  • In practice, it doesn't work that way because in practice

  • The whole thing is being messed with by the network on every step and so in in the training process it's going to learn

  • That it performs best when it leaves most of it alone and it doesn't just randomly change the whole thing

  • But by the time you're on the fiftieth word of your sentence

  • whatever the network decided to do on the first word of the sentence is a

  • photocopy of a photocopy of a photocopy of a photocopy and so

  • things have a tendency to

  • Fade out to nothing. It has to be successfully remembered at every step of this process

  • and if at any point it gets overwritten with something else, or just,

  • it did its best to remember it but it's actually only keeping 99% of it each time: 0.99

  • to the power of 50 is actually not that big a number.
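
(Putting a number on that 'photocopy of a photocopy' effect:)

```python
print(0.99 ** 50)    # ~0.605: keep 99% per step and after 50 steps
                     # nearly 40% of the original signal is gone
print(0.99 ** 500)   # ~0.007: after 500 steps it has essentially vanished
```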

  • So these things work pretty well, but they still get the performance like really quickly drops off once the sentences start to get long

  • So this is a recurrent neural network

  • an RNN, because all of these boxes

  • Are really the same box because this is the same network at different time steps. It's really a loop like this

  • You're giving the output of the network back as input every time so this works better and then people have tried all kinds of interesting

  • things, things like LSTMs. There's all kinds of variants on this general, like, recurrent network.

  • LSTM is the thing that might be used there, isn't it? Right, right, long short-term memory, which is kind of surreal.

  • But yeah, so the idea of that is it's a lot more complicated inside these networks

  • There's actually kind of sub networks that make specific decisions about gating things. So

  • Rather than having to have this system learn that it ought to pass most things on it's sort of more in the architecture that passes

  • most things on, and then there's, like, part of the learning is

  • Deciding what to forget

  • at each step, and, like, deciding what to change and what to put in which part, and so on. And they perform better;

  • they can hang on to the relevant information for longer.
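
(A rough sketch of the standard LSTM update, just to show the gating being described: the forget, input and output gates decide what to drop, what to write, and what to reveal. The weights `W` are assumed to come from training.)

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM step: x is the current input, h the previous output,
    c the previous cell memory, W a dict of trained weight matrices."""
    z = np.concatenate([x, h])
    f = sigmoid(W["forget"] @ z + W["b_forget"])      # what to forget
    i = sigmoid(W["input"]  @ z + W["b_input"])       # what to write
    o = sigmoid(W["output"] @ z + W["b_output"])      # what to reveal
    candidate = np.tanh(W["cell"] @ z + W["b_cell"])  # proposed new content
    c = f * c + i * candidate   # memory is mostly carried over by default
    h = o * np.tanh(c)          # the output is a gated view of the memory
    return h, c
```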

  • But the other thing that people often build into these kinds of systems is something called attention

  • Which is actually a pretty good metaphor

  • where, in the same way that you would have

  • networks that decide which parts of your hidden state to hang on to, or which parts to forget, or

  • Those kinds of decisions like gating and stuff

  • You have a system which is deciding which parts of the input to pay attention to which parts to use in

  • The in the calculation and which parts to ignore and this turns out to be actually very powerful. So there was this paper

  • When was this?

  • 2000

  • 2017. Yeah, so this is funny because this came out the same year as

  • the video you have about generating YouTube comments. This is in December; I think that video was October. Ancient history now.

  • Alright, we're talking two years ago. The idea of this, as it's called, is 'Attention Is All You Need'. They developed this system whereby

  • it's actually a lot simpler as a

  • network; you can see on the diagram here, if you compare this to the diagram for an LSTM or

  • any of those kinds of variants, it's relatively simple, and it's just kind of using attention to do everything.

  • So when I made that video, the LSTM-type stuff was, like, state of the art, and that was until a couple of months later,

  • I guess, when this paper came out. The idea of this is that attention is all you need; like, this stuff about

  • having gates for forgetting things and

  • all of that kind of stuff, in fact your whole recurrent architecture,

  • you can do away with it and just use attention. Attention is powerful enough to

  • do everything that you need. At its base, attention is about actively deciding, in the same way that

  • the LSTM is actively deciding what to forget and so on, this is deciding which parts of

  • some other part of the data it's going to

  • take into account, which parts it's going to look at. Like, it can be very dangerous in AI to

  • use words for things that are words that people already use

  • for the way that humans do things. It makes it very easy to anthropomorphize and just,

  • you know, get confused, because the abstraction doesn't quite work. But I think 'attention' is a pretty decent name, because it is...

  • It does make sense

  • It sort of draws the relationships between things so you can have attention from the output to the input

  • Which is what that would be you can also have attention from the output to other parts of the output

  • so for example, when I'm generating that sentence, like,

  • 'Shawn came to record a video' or whatever, by the time I get to generating the word 'him',

  • I don't need to be thinking about the entire sentence

  • I can just focus my attention on where I remember

  • the name was, so the attention goes to 'Shawn', and then I can make the decision to use the word 'him' based on

  • that
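
(A minimal sketch of scaled dot-product attention, the mechanism at the core of the 'Attention Is All You Need' paper: each output position mixes together the values, weighted by how relevant each input position looks. The weights themselves are the 'where am I looking' part.)

```python
import numpy as np

def attention(queries, keys, values):
    """Scaled dot-product attention over one sequence."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)        # how relevant is each position
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax
    return weights @ values, weights              # output, plus the weights
                                                  # (the attention 'heat map')
```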

  • So rather than having to hang on to a huge amount of memory, you

  • Can just selectively look at the things that are actually relevant and the system learns

  • where to look, where to pay attention to, and that's really cool.

  • There's attention-based systems for all kinds of things, like, not just text. You can do,

  • Like suppose you have your input is like an image and you want to caption it

  • You can actually look, when it was outputting the sequence, you can say: when you generated the word 'dog',

  • what were you... you can get, like, an attention heat map, and it will highlight the dog,

  • Because that's the part of the image that it was paying attention to when it generated that output

  • It makes your system more interpretable because you can see what it was thinking and sometimes you can catch problems that way as well

  • which is kind of fun like

  • It generates the output that's like a man is lifting a dumbbell or something like that and you look at it

  • And it's not actually correct; it's actually somebody drinking some tea out of a mug, right, and

  • what you find is then when you look at your

  • outputs where it says 'dumbbell', you look at the attention, and the attention is, like, mostly looking at the arms, because it's usually somebody muscular

  • who's lifting the dumbbell in your photos.

  • And so it's overriding the fact that this kind of looks like a mug, because it was looking at the arms.

  • So the idea is this system which is called a transformer is a type of neural network

  • which just relies very heavily on attention to

  • Produce like state-of-the-art performance and if you train them on a large

  • corpus of natural language they can learn

  • they can learn to do very well, right, they can be very powerful language models.

  • We had the example of a language model on your phone

  • that's, like, very, very basic, and then trying to do this with neural networks and the problems with remembering.

  • And so you have, like, recurrent systems that allow you to pass memory along, so that you can remember the beginning

  • of the sentence at least by the end of it, and

  • things like LSTMs, all these different varieties where people try different things

  • that are better at hanging on to memory, so that they can do better; they can have longer-term

  • Dependencies, which allows you to have more coherent

  • outputs

  • and just generally better performance. And then the transformer is...

  • Is a variant on that?

  • Well, it's a different way of doing things, where you really focus on attention. And so these are actually not recurrent, which is an

  • important distinction to make we don't have this thing of like

  • Taking the output and feeding that back as the input and so on every time

  • Because we have attention. We don't need to keep a big memory

  • That we run through every time when the system wants to know something it can use its attention to look back to that part

  • It's not like memorizing the text as it goes. It's

  • paying attention to different bits of the text as

  • it thinks they're relevant to the bit that it's looking at now. And

  • The thing about that is when you have this recurrent thing

  • It's kind of inherently serial

  • most of the calculations for this you can't do them until you have

  • The inputs and the inputs are the output of the previous network. And so

  • You can't do the thing that people like to do now, which is run it on a million computers

  • And get lightning-fast performance because you have to go through them in order right? It's like inherently serial

  • whereas transformers are much more parallelizable, which means you get better computational performance out of them as well,

  • which is another

  • selling point: they work better and they run faster. So they're really a

  • step up. So transformers are this really powerful

  • architecture. They seem to give really good performance on this kind of language modeling task, and

  • what we didn't know really was how far you can push them, or how good they can get.

  • What happens if you take this architecture and you give it a bigger data set than any of them has ever been given, and more

  • compute to train with, you know, a larger model with more parameters and more data?

  • How good can these things get? How good a language model

  • can you actually make? And that's what OpenAI was doing with GPT-2.
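
(If you want to play with it yourself: a sketch assuming the Hugging Face `transformers` library and the small, publicly released GPT-2 checkpoint, which was not the full model OpenAI was describing at the time.)

```python
# pip install transformers torch
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # small public GPT-2
result = generator("So I wanted to make a video about", max_length=40)
print(result[0]["generated_text"])
```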

