  • In the previous video we were talking about

  • transformers this architecture that uses attention to give

  • unprecedentedly good performance on language modeling tasks and some other tasks as well,

  • but we were looking at language modeling, and that was in preparation to make a video about

  • GPT-2, which is this very giant language model that was recently,

  • well, it was recently not released, actually, by OpenAI. The way that they generated the data set for this is pretty cool:

  • to get enough text they went to Reddit and

  • they pulled every website that is linked to from Reddit. Do we have any idea of how many? Lots.

  • Literally everything? Well, everything that had more than three karma,

  • I think, or maybe more than two karma, something like that. Like,

  • anything that somebody had thought to post to Reddit and at least two or three people had thought was good enough to upvote.
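The selection step described here can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's actual pipeline: the `Submission` record, the `select_urls` helper, the example URLs and the threshold of 3 are all assumptions made for the example (the speaker himself is unsure whether the cutoff was two or three).

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """Hypothetical record for one Reddit post that links to an outside page."""
    url: str
    karma: int


def select_urls(submissions, min_karma=3):
    """Return de-duplicated URLs whose post meets the karma threshold.

    URLs are kept in first-seen order. The default threshold is an
    assumption for illustration only.
    """
    seen = set()
    selected = []
    for post in submissions:
        if post.karma >= min_karma and post.url not in seen:
            seen.add(post.url)
            selected.append(post.url)
    return selected


# Made-up example data.
posts = [
    Submission("https://example.com/a", 5),
    Submission("https://example.com/b", 1),
    Submission("https://example.com/a", 7),
    Submission("https://example.com/c", 3),
]
print(select_urls(posts))
```

Only the pages behind the selected URLs would then be scraped for text; the scraping itself is left out of the sketch.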

  • They scraped the text from that. It's pretty much just a transformer. The

  • architecture is not especially novel. They haven't done any, like, amazing

  • new discovery, but

  • what they realized was:

  • Transformers it seems like

  • the more data you give them the better they do and the bigger you make them the better they do and

  • Everything that we built up until this point is clearly not

  • Like we haven't hit the limits of what this can do

  • They thought, we think we're probably

  • bottlenecked on data and maybe network size.

  • So what happens if we, like, turn that up to 11? What happens if we just give this all

  • the data and make a really big one? It makes sense to talk about the acronym, right? So it's a generative pre-trained

  • transformer. So generative, same as generative adversarial network: it generates outputs, it generates samples.

  • Pre-trained is this thing I was talking about: all of the different things

  • you can use a language model for, right? You can do translation. You can try and resolve ambiguities.

  • You can do summarization. You can answer questions. You can use the probabilities for augmenting other systems

  • So yeah, there's a bunch of different benchmarks for these different tasks

  • that you might want your language model to do and

  • This is what we talked about in the grid worlds video of having these like standardized problems with standardized metrics and standardized data sets

  • So that if you're comparing two different methods, you know that you're actually comparing apples to apples

  • And this is, like, very important: it gives you numbers on these things. It's often quite difficult,

  • especially when, like, you're generating samples of text and it's like, how plausible is this text? How realistic does it look?

  • How do you put a number on that? It's kind of difficult. So there's all of these standardized metrics, and

  • the thing that

  • people came to realize, which actually, I mean, I say that as though it's like some amazing discovery,

  • it's fairly obvious: if you train your system in an unsupervised way on a large corpus of just general English text and

  • then you take that and

  • Train that with the data from this benchmark or the data from that benchmark

  • you can, like, fine-tune it. So you start with something which has, like, a decent

  • understanding of how English works, more or less, and then you say, now I'm going to give you these

  • samples for, like, question answering, or I'm going to build a system using that to go for this benchmark.

  • So it's pre-trained: you start with something that's like a general-purpose language model, and then from that you

  • fine-tune it to whichever

  • actual benchmark or problem you're trying to solve,

  • and this

  • can give you better performance than starting from nothing and training on each of the benchmarks from scratch.

  • Makes sense.

  • and so

  • The point of the GPT-2 paper, the thing that makes it cool, is they said, okay, if we make a really huge one,

  • what if we

  • don't

  • fine-tune it at all?

  • What if we just make a giant model and then just try and run it on the benchmarks without messing with it,

  • without showing it any of the specialized data for that benchmark? Just the raw,

  • general-purpose language model: how does that perform? And it turns out,

  • surprisingly well. So this is a

  • very, very large data set for text.

  • It's about 40 gigabytes,

  • which

  • actually doesn't sound like very much, but for text, that's insane, right?

  • Somebody said that this was the size of

  • Google's entire index of the Internet in '98.

  • So, yeah, it's a lot of text,

  • and they trained it on that and they ended up with a

  • 1.5 billion parameter model, which is, like, a previous state-of-the-art system was 345 million.

  • This is 1.5 billion.

  • So they've just made the thing much, much bigger, and it performs really well. Some of the samples that they published quite

  • captured the public imagination,

  • you could say. And now that we've talked a little about the problems that

  • neural networks, or any language model really,

  • have with long-term dependencies,

  • we can now realise just how impressive these samples are. Because when you look at them as a, you know,

  • if you look at them uninitiated, you're like, yeah, that's pretty realistic,

  • it seems to, like, make sense, and it's cool. But when you look at it knowing how language models work, it's like,

  • very impressive: the coherence and the

  • consistency and the long-range dependencies. So we can look at this one that got everybody's attention, the unicorns one,

  • right?

  • So they prompted it with: "In a shocking finding, scientists discovered a herd of unicorns

  • living in a remote, previously unexplored valley, in the Andes Mountains.

  • Even more surprising to the researchers was the fact that the unicorns spoke perfect English."

  • And from there you go to your language model, GPT-2, and you say, given that we started with this,

  • what's the next word, and what's the word after that, and so on?
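The "what's the next word, and the word after that" loop can be sketched with a toy model. This is only an illustration under loud assumptions: it uses a bigram count model that looks at just the single previous word, whereas GPT-2 is a transformer that attends over the whole preceding text. The names `train_bigram` and `generate` and the tiny corpus are made up for the example.

```python
import random
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, max_new_words=10, seed=0):
    """Extend the prompt one word at a time.

    Each new word is sampled in proportion to how often it followed the
    current last word during training. Generation stops early if the last
    word was never seen with a successor.
    """
    rng = random.Random(seed)
    words = prompt.split()
    if not words:
        return ""
    for _ in range(max_new_words):
        options = counts.get(words[-1])
        if not options:
            break
        candidates = list(options.keys())
        weights = list(options.values())
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)


# Made-up toy corpus, standing in for the scraped training text.
corpus = "the unicorns spoke english and the unicorns lived in a valley"
counts = train_bigram(corpus)
print(generate(counts, "the", max_new_words=5))
```

The loop structure is the point here: the model is asked for one more word, that word is appended to the text, and the extended text becomes the input for the next step.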

  • So it goes on: "The scientist named the population, after their distinctive horn, Ovid's Unicorn.

  • These four-horned, silver-white unicorns were previously unknown to science."

  • We do have a clue here, as a human being: unicorns, four-horned, doesn't quite make sense.

  • But nonetheless, we're going, okay.

  • "Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved. Dr.

  • Jorge, Jorge Pérez,"

  • J-O-R-G-E, "an evolutionary biologist from the University of La Paz".

  • This is impressive because we've mentioned the Andes Mountains in our prompt, and so now it's saying, okay,

  • this is clearly, you know, "in a shocking finding", this is a science press release news article.

  • It's seen enough of those, because it has every single one that was ever linked to from Reddit, right?

  • So it knows how these go. It knows, okay, third paragraph,

  • this is when we talk about the scientist, we interview the scientist, right? Okay.

  • First word of the scientist paragraph: "Dr.", obviously, right? Because now we're in the name of the scientist.

  • What name are we going to give?

  • It needs to be a name

  • conditioning on the fact that we have the Andes Mountains,

  • so we need to get that we're in South America.

  • The name probably should be Spanish, or maybe Portuguese.

  • So we get Dr. Pérez here.

  • And then "evolutionary biologist" makes sense because we're talking about animals.

  • "From the University of La Paz": again,

  • this is the first sentence, like, when you have that first clause that introduces the scientist, you always say where they're from.

  • So we say "from the University of", and then university names tend to be the name of a city.

  • What's a city where we have the Andes Mountains? So we're going to Bolivia: La Paz. Perfect.

  • And the thing that's cool about this is it's remembered all of these things that were quite a long time ago, several sentences ago.

  • Well, it hasn't remembered them. It's paid attention to them across that distance, which is impressive.

  • But also, this is encoding a bunch of understanding, a bunch of information about the real world.

  • Right? All that it was given, all it knows, is statistical relationships between words, but the way that it comes out to us

  • is that it knows

  • where the Andes Mountains are, what kind of names people in that area have, what their cities are, what the universities are, all of those

  • facts about the real world. Because in order to have a really good language model, it turns out you have to kind of implicitly encode

  • information about the world, because

  • we use language to talk about the world, and knowing what's likely to come next

  • requires actual real-world understanding. And that's something that we see in some of the other

  • things that they got it to do: you can see the real-world understanding coming through.

  • Let's keep going

  • "...University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez..." See,

  • we're hanging on to him. Yep. We're referring to him again,

  • but now we've changed it to be just the surname, because that's the

  • format that people use in news articles. "Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of

  • rock and silver snow. Pérez and the others then ventured further into the valley." Round about here in our article,

  • we should have a quote from the scientist, right? Quote:

  • "By the time we reached the top of one peak, the water looked blue, with some crystals on top." And we're talking about this fountain,

  • I guess, this natural fountain. We're referring back to the previous... It's like everything is

  • relying on and contingent on earlier parts of the text. "While examining..." there, I've snipped a paragraph. "While examining these bizarre

  • creatures the scientists discovered that the creatures also spoke some fairly regular English." Now, when I read that I'm like, okay,

  • this is now unusually good, because that's the second sentence of the lead, right? We're six paragraphs in, and

  • it knows: by this point, I've covered the first sentence of this

  • initial paragraph,

  • now it's time to talk about the second sentence of the lead, "Even more surprising to the researchers was the fact that they

  • spoke English". And

  • it completely ignored the speaking English part until it got to the part of the news article where that comes in.

  • You've gone six whole paragraphs.

  • The idea of

  • accurately remembering that the unicorns speak perfect

  • English, that's very impressive to me. And then it gets a little bit unhinged,

  • starts talking about "it's likely that the only way of knowing for sure if unicorns are indeed

  • the descendants of a lost alien race is through DNA". That's Reddit, really.

  • Well, it's not actually stuff on Reddit. It's stuff linked to from Reddit. But yeah, this is news articles, man.

  • "They seem to be able to communicate in English quite well,

  • which I believe is a sign of evolution, or at least a change in social organization," said the scientist.

  • That's his evolutionary biology there, right? Right, right. Yeah, we know he's an evolutionary biologist. So the

  • coherence of this text is

  • really dependent on its ability to

  • Condition what it's generating on

  • Things that it's generated a long time ago

  • So yeah

  • So it can generate really nice news articles, and it can generate all kinds of text, anything that is

  • sufficiently well represented in the original data set. So that's GPT-2: it's a really

  • unusually powerful and, like, versatile

  • language model that can do all of these different natural language processing

  • tasks without actually being trained specifically on those tasks.

  • And that's why it's impressive.

  • It's not that it's a brand new architecture or a brand new approach or whatever.

  • It's just, when you make these things really huge and give them tremendously large amounts of data,

  • the results are really impressive.

  • In the original data set. So it will write you the Lord of the Rings fan fiction.