
  • What is going on, everybody? Welcome to part eight of our chatbot with Python and TensorFlow tutorial series.

  • In this tutorial, what I'd like to do is talk about some of the more high-level concepts and parameters of our chatbot and the neural machine translation code that we're using.

  • And I hope to at least give you a better idea of what's actually going on here, because there's a lot going on here.

  • My initial intention was to start with the basic sequence-to-sequence model, the old English-to-French translation model that was done with TensorFlow.

  • But TensorFlow deprecated it, and you really can't run it without running TensorFlow 1.0,

  • whereas at the time of my filming this we're already at 1.4.

  • And so I decided I'll just start with the current NMT model, but it's going to be a lot of information that I'm about to throw at you.

  • So I apologize.

  • Here we go.

  • So, first of all, any time you've got a translation, whether it's neural machine translation from one language to another language or, in our case, English to English, it's still a form of translation that's going on there.

  • It's just an input language to an output language.

  • Any time you have that, obviously, words are not numbers.

  • So the first thing that we need to do is tokenize the inputs, and an easy way you could tokenize inputs is just to split by spaces and probably punctuation.

  • Okay, so "I am a student," tokenized:

  • there's a token for "I,"

  • a token for "am,"

  • a token for "a," and so on.

  • So you tokenize that input,

  • and generally that's what the encoder is going to do first:

  • it's just going to be tokenizing your input.
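
As a rough illustration (just a sketch, not the exact preprocessing the NMT code uses), a naive whitespace-and-punctuation tokenizer might look like this:

```python
import re

def tokenize(text):
    # Split into word tokens, keeping punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(tokenize("I am a student."))  # ['i', 'am', 'a', 'student', '.']
```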

  • And then once it's tokenized, the other thing an encoder is likely to do is take those tokens,

  • and one thing we could do is just assign each one an arbitrary ID.

  • But that's not ideal.

  • We'd like those IDs to be somewhat meaningful, so we're generally going to create word vectors that give similar words similar IDs.

  • And this is going to help in actually translating,

  • but it's also going to help us in evaluating how good our translations were, because a lot of times you might be really close.

  • Like, when translating:

  • say you've got "airplane" in Japanese and you translate it to "car."

  • Or, better yet, "airplane" in Japanese

  • translated to "car," versus "airplane" in Japanese translated to, I don't know, "shoe" or "cat" or something like that, right?

  • At least "car" is closer to "airplane" than "shoe" or "cat" is.

  • So we would know that we're getting closer to the correct translation, and we want to reward for that; we don't want every miss to be equally wrong just because we missed the exact word, right?

  • So generally we're going to have word vectors as well.
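
Here's a minimal sketch of that idea, assuming a tiny made-up vocabulary; the real NMT code builds its vocabulary files and embedding matrix from the training data:

```python
import numpy as np
import tensorflow as tf

# Hypothetical vocabulary: token -> integer ID.
vocab = {"<pad>": 0, "<unk>": 1, "i": 2, "am": 3, "a": 4, "student": 5}
token_ids = [vocab.get(t, vocab["<unk>"]) for t in ["i", "am", "a", "student"]]

# Trainable embedding matrix: one dense vector per vocabulary entry.
embedding_dim = 8
embeddings = tf.Variable(
    np.random.uniform(-1.0, 1.0, (len(vocab), embedding_dim)).astype("float32"))

# Look up the vectors for our token IDs; the idea is that training
# nudges related words toward similar vectors.
vectors = tf.nn.embedding_lookup(embeddings, token_ids)  # shape: (4, 8)
```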

  • Once you have those IDs, though, you're going to feed them through some sort of a neural network along with the language information.

  • This is generally a recurrent neural network; that just makes sense,

  • so we have that sort of non-static, temporal sense going forward.

  • And generally it's a recurrent neural network using long short-term memory, LSTM, cells.

  • Then, once you've done that, you feed it through a decoder, and then you've got your output.

  • And that's your basic sequence-to-sequence translation model on language data.

  • Now, let's start talking about some of the problems that we might have while doing that.

  • So, first of all, your input and your output don't even match, right?

  • "I am a student"

  • is four tokens, not including spaces,

  • and then the output is three tokens, again not including spaces.

  • Okay, so already there's no match.

  • But also, is your input

  • always going to be those four tokens, or however many tokens you have?

  • No, right?

  • Your input is going to vary, right?

  • So at least the way that was initially solved was with padding.

  • So we might say, okay, what's our longest sentence we've ever had?

  • Let's say it's 100 tokens.

  • So the longest sentence is 100 tokens.

  • Now let's say we just got a sentence

  • that's five tokens.

  • Well, what we're going to say is the input layer is always going to be 100 tokens, or 100 nodes, and each node's starting value is just going to be the ID,

  • hopefully a meaningful ID.

  • It's going to be the ID.

  • So let's say we only have five tokens:

  • we put in those first five, and then we have a special token called the pad token, and we just use that pad token for every single node after.

  • So we just do a bunch of padding, and that's one way we can solve it.
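
A minimal sketch of that padding step, assuming a pad ID of 0 and a fixed length of 100:

```python
PAD_ID = 0      # assumed ID for the special pad token
MAX_LEN = 100   # assumed fixed input length

def pad_sequence(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    # Truncate anything too long, then fill the remainder with the pad token.
    token_ids = token_ids[:max_len]
    return token_ids + [pad_id] * (max_len - len(token_ids))

print(pad_sequence([2, 3, 4, 5, 6])[:8])  # [2, 3, 4, 5, 6, 0, 0, 0]
```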

  • But of course, that's most likely not a good idea; it doesn't train well and it doesn't perform well, because we use padding so much that the longer your sentence is, the less impact those later words wind up having, since the neural network is going to learn that these pads don't mean anything,

  • so it's just going to adjust the weights accordingly.

  • And then, when we get a long sentence once in a while, those last few words just don't mean anything to it.

  • It just doesn't train well to do that.

  • So then, again, still on that first sequence-to-sequence translation model.

  • And again, this is what Charles has been running,

  • if you guys have been following either the Twitch stream or you follow Charles the AI on Twitter.

  • So he's been on that kind of v1, initial sequence-to-sequence model from TensorFlow until very recently, basically until this series started.

  • So the other idea was that we could use bucketing. The idea here was, okay, what we could do is have input layers that are buckets, so we'll take our tokens and have, like, four buckets: a bucket for the stuff that's 5 to 10 tokens long,

  • a bucket for the stuff that's 11 to 25 long, a bucket for 25 to 35, and then a bucket for 35 to 50, or something like that; we would do bucketing.

  • Then whichever bucket fit the length of that string, that's the bucket we would use,

  • and that's how we trained. And this did okay, you know, it did all right, but it's not ideal, because we're still going to be using padding.
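
A toy sketch of that bucket selection, with the made-up bucket boundaries from above:

```python
# Hypothetical bucket sizes (max token length handled by each bucket).
BUCKETS = [10, 25, 35, 50]

def pick_bucket(token_ids, buckets=BUCKETS):
    # Use the smallest bucket the sequence fits into; pad up to that size.
    for size in buckets:
        if len(token_ids) <= size:
            return size
    return None  # too long for any bucket; would be dropped or truncated

print(pick_bucket([1] * 7))   # 10
print(pick_bucket([1] * 30))  # 35
```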

  • So that brings us to today with TensorFlow:

  • we now have what are called dynamic recurrent neural networks, and we can have a dynamic input coming in.
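
The key point is that a dynamic RNN is given the true length of each sequence, so anything past that length (the padding) is simply ignored. In TensorFlow 1.x terms, a rough sketch looks like this; the placeholder shapes and cell size are made up for illustration:

```python
import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, None, 128])  # batch, time, embedding
seq_lengths = tf.placeholder(tf.int32, [None])          # true (unpadded) lengths

cell = tf.nn.rnn_cell.LSTMCell(256)

# dynamic_rnn stops stepping each sequence at its real length, so the
# pad positions never influence the final state.
outputs, final_state = tf.nn.dynamic_rnn(
    cell, inputs, sequence_length=seq_lengths, dtype=tf.float32)
```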

  • Now, moving back in time a little bit:

  • what we've talked about up to this point, right before dynamic RNNs, was that first sequence-to-sequence stuff from TensorFlow.

  • And now what I'm going to be talking about is the NMT code, the current-generation NMT, that we're using.

  • So another problem that we have, besides the bucketing

  • and padding, which is basically solved by dynamic recurrent neural networks, is the issue of just language in general.

  • So, for example, let's consider a translation task like English to French.

  • In general, if you want to translate English to French, or English to German, or English to Spanish, there might be slight variance in the syntax of, like, a noun phrase,

  • but in general there's an almost algorithmic mapping for translating English to French, German, Spanish, whatever.

  • That's pretty, pretty simple, I guess,

  • and it goes in a fairly linear order.

  • But then you have a language like Japanese. English and Japanese are very, very different languages; they don't follow even remotely similar rules, and in Japanese, sometimes that last character changes the meaning of all the other characters.

  • And things are just totally different, right?

  • And then also with the chatbot, the same thing is true,

  • and in a lot of translations the same thing is true: an LSTM is generally really only good at remembering about 10 to maybe 20 tokens back in a sequence.

  • Right?

  • So let's say tokens are words;

  • in our case, we are tokenizing to words.

  • Another option is to tokenize by character,

  • so each little character, a, b, c, d, on its own. Or you can tokenize by... I'm trying to think of the word.

  • I can't think of the word.

  • Syllable?

  • Anyway, I'm totally blanking.

  • My brain is done. Anyway, I think it's syllables, just chunks of letters, basically,

  • like "Har-ris-son," right?

  • You could tokenize by those kinds of little bits, and you'll have a much smaller vocabulary.

  • That's BPE (byte-pair encoding) tokenization.

  • We might be talking about that later on,

  • but for now, we're tokenizing by word.

  • So anyway, think of it in terms of words.

  • So think about how hard it would be for you if you could only remember, if you could only respond to, 10 to 20 tokens max at a time.

  • And really more like 10.

  • Okay, so think about if you needed to build your response 10 tokens at a time, building out your response as you slid a window of 10 tokens along.

  • Take the sentence "think about how hard it would be if you could only think in terms of 10 to 20 tokens at a time."

  • If all you see is "think about how hard it would be if you," okay,

  • you start building a response,

  • and you don't even know what's being asked yet.

  • Right.

  • Then the window slides to "how hard it would be if you could only," or something like that, right?

  • You keep sliding it, and you have to keep building a response;

  • the neural network must generate a response.

  • So this can be very challenging if we're only thinking back historically,

  • and only if we can remember 10 to 20 tokens. The other issue is that first chunk, "think about how hard it would be,"

  • followed by "if you could only": that second part is almost like a filler.

  • In order to understand that first phrase, you have to get that last bit too.

  • But you also couldn't take just the last bit: if you could

  • only think in terms of 10 to 20 tokens, right, now you don't know how to respond to that either.

  • "If you could only..." What do you mean?

  • You have to go back to the "think about how hard it would be" part, right?

  • And so this is where we have two new concepts coming in, the first of which, at least in terms of using future data as well, is bidirectional recurrent neural networks.

  • In a bidirectional recurrent neural network, we're going to feed data both sequentially forward and also in reverse order, backwards, through that hidden layer in the model.

  • So that's one thing.

  • And then we're also going to make use of attention models, which, again, are covered in this tutorial;

  • there's also a great paper on them.

  • So the tutorial explains attention models a little bit,

  • and then I also took an image from the paper on attention models, which I think pretty much drives the entire point home.

  • So this is a graph of BLEU score against sentence length,

  • so think of these as your tokens.

  • And this red line here is basically no use of an attention model. BLEU, by the way, is "bilingual evaluation understudy";

  • it's basically just a score of how good a translation

  • was, basically. So the higher, the better.

  • And, as you can see, the model is pretty good at 10; even at just before 20 it does its best BLEU score, but then it very quickly falls off, especially as you get to the longer sentences, which is problematic:

  • we tend to speak in pretty long sentences, and sometimes you need multiple sentences to understand the meaning.

  • So then these other lines are the BLEU scores with various attention models applied, and the real thing to drive home here is that attention helps both on the very short stuff and also on the longer sequences.

  • It still basically flatlines after about 40-ish

  • (there's a slight decline there), but it pretty much holds up all the way out to 70, and probably further out from there.

  • So the attention model is going to help us remember longer sequences at a time, which can help us kind of brute-force our way through this context problem where we need both historical and future information.
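
To make the attention idea concrete, here's a tiny NumPy sketch of simple dot-product attention (a simplification; the NMT code uses learned Luong/Bahdanau-style attention): for each decoder step we score every encoder output, turn the scores into weights with a softmax, and take the weighted sum as a context vector.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

encoder_outputs = np.random.randn(12, 64)  # 12 source tokens, 64-dim states
decoder_state = np.random.randn(64)        # current decoder hidden state

scores = encoder_outputs @ decoder_state   # one score per source token
weights = softmax(scores)                  # attention weights, sum to 1
context = weights @ encoder_outputs        # weighted sum of encoder states, shape (64,)
```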

  • But then we use bidirectional recurrent neural networks to kind of mix it up,

  • because language is not necessarily only in perfect sequence; sometimes we have to use context, and we have to hear the full sentence before we can go back and respond to words that we heard leading up to that point.

  • So, real quick,

  • what I'd like to do is just visualize a bidirectional recurrent neural network, so let's go do that really quick. In a simple recurrent neural network,

  • you have your input layer, your output layer, and then we'll have one hidden layer for simplicity's sake.

  • Your connections go from the input layer to the hidden layer, where each node in the hidden layer also passes data down to the next hidden-layer node, which is how we get our temporal, not-so-static characteristics from recurrent neural networks,

  • as the previous inputs are allowed to carry forward, to some degree, down through that hidden layer. In a bidirectional recurrent neural network,

  • the hidden layer basically has data that goes both down and up, or rather in both directions,

  • through that hidden layer.

  • So you still have your input layer,

  • you're still going to have your output layer,

  • and you're still going to have the connections from the input to the hidden and from the hidden to the output.

  • But then, also, in that hidden layer you've got data that goes, in this drawing, down and then up, or forward and then reverse, depending on which drawing you're looking at.

  • And then, from here,

  • in theory this is actually fully connected, just because, again,

  • it really is just one hidden layer;

  • all those nodes are part of the same hidden layer.

  • But you might also see more fancy types of connections, just to get more than simply forward and reverse,

  • just to get a little bit more complexity out of the network while we're there.

  • But anyway, that's the difference between a simple recurrent neural network and a bidirectional recurrent neural network:

  • in a lot of tasks, it's not just what happened leading up to a certain point; we actually care what happens after that point as well.

  • So it's still important for us to actually go both ways.
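
In TensorFlow 1.x terms, that usually amounts to swapping tf.nn.dynamic_rnn for tf.nn.bidirectional_dynamic_rnn, roughly like this (again, the shapes and cell sizes are made up for illustration):

```python
import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, None, 128])  # batch, time, embedding
seq_lengths = tf.placeholder(tf.int32, [None])

cell_fw = tf.nn.rnn_cell.LSTMCell(256)  # reads the sequence left to right
cell_bw = tf.nn.rnn_cell.LSTMCell(256)  # reads the sequence right to left

(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, inputs, sequence_length=seq_lengths, dtype=tf.float32)

# Concatenate both directions so each time step sees past and future context.
encoder_outputs = tf.concat([out_fw, out_bw], axis=-1)  # (batch, time, 512)
```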

  • All right. So now what I want to do is cover some of the training metrics that you're going to come across, and give you an idea of what, at least, I've found to be useful as far as training the model, tweaking it as time goes on, and knowing when it's done.

  • So the first thing I want to do is show you TensorBoard.

  • Hopefully you're familiar with TensorBoard already, but if not, here's

  • what you can do to

  • run it.

  • Basically, it's just a way you can visualize how training has been going, or how training is going, basically.

  • Right now, you can visualize a lot of scalars;

  • there are some other things we can do too, and I'll show you one of them.

  • So, first of all, to run TensorBoard:

  • you can run this while your algorithm is training.

  • What we want to do is go into the model directory.

  • Don't worry

  • if you don't have all these same directories; I've made lots of copies of lots of things as I'm training.

  • So anyway, head into model, and then train_log is where you have all your logging files.

  • You'll probably

  • just have event files like this;

  • you won't necessarily have these other files.

  • So from model, what I'm going to do is open up a command prompt, just by typing cmd in there, and then just type: tensorboard --logdir=train_log

  • And then you would hit enter.

  • Now, actually, I'm going to cancel that; I actually already have it up.

  • I wanted to bring it up prior because it can take a little bit to load, and I'm loading, like, 100,000 steps.

  • So anyway, it's just tensorboard --logdir=train_log. Now, once you have it up, it's over here.

  • This is my TensorBoard for Charles v2, so he was trained with, I think, about three million pairs.

  • And right now I'm training a model with, like, 70 million pairs, so hopefully that one will be better than this one.

  • But anyway, this is

  • the TensorBoard and the information that I have.

  • So the big things you want to pay attention to are, first, the BLEU score.

  • BLEU is probably the best determining factor in how good a translation was.

  • The problem is, we're not really doing translations.

  • So, basically, when you're translating English to French, for example, in general there's either just one proper translation,

  • or maybe there are three or five.

  • But when you're translating English to English, when you're doing a chatbot, like an input comment to an output response, from any given comment there are really infinite responses.

  • There's no limit to what the response is, and unless it's not coherent,

  • unless it doesn't make any sense at all, it's a valid response.

  • So, for me, the BLEU score is only relatively useful.

  • We'd like to see it go up, but I don't think we're going to see a BLEU score of, like, 20.

  • We're probably going to see BLEU scores around 3 or 4, max.

  • There's really no reason why we would see a super high BLEU score unless maybe we overfit too much.

  • Like, if you trained, I don't know, 5,000 epochs or something like that, right,

  • you'd probably see a pretty good BLEU score, but you'd probably have overfit.

  • So, yeah, I wouldn't put too much weight on that, but in general, we'd like the BLEU score to go up.

  • Okay. Clipped

  • gradient:

  • I'm not really paying much attention to that.

  • And then the grad norm, the other thing. I think in the previous tutorial it was labeled something like "gN"; again,

  • I'm not positive

  • that's grad norm,

  • but I'm going to go ahead and guess, after re-looking at this,

  • that it's probably gradient

  • norm. That's my guess.

  • Anyway, initially you want that grad norm to kind of fall,

  • but honestly, I just haven't paid much attention to it.

  • Then we've got the learning rate.

  • As you can see here, I started at about 2e-3.

  • After about one full epoch, I went ahead and dropped that, and I've basically just halved it every time.

  • So in this bit, you know, there's basically one halving here, another halving, and then one final halving.

  • Hope I got that right for you.

  • Anyway, you just keep halving it,

  • and once you've gotten down to around 1e-4, you're not going to get any better than that;

  • it's just not going to work out for you to keep halving, in my opinion.

  • And that's with the Adam optimizer. Again,

  • if you're going to use, like, stochastic gradient descent, then you're probably better off doing some sort of automatic decay kind of structure,

  • and that makes more sense.

  • But with Adam, the "Ada" literally stands for adaptive, and the "m" is for moment, momentum,

  • I think.

  • Anyway, it's adaptive, so you don't really need to be messing around with your learning rate as much as you would with, say, stochastic gradient descent, where you might decay every step or every 100 steps or whatever. In general, I would really just decay Adam

  • maybe every one to two epochs, and if you want, you can calculate that out.

  • Batch size is 128, so divide how many samples you have by 128; that's how many steps it will take for you to go through a whole epoch.
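
For example, a quick back-of-the-envelope calculation with this model's roughly three million pairs (swap in your own sample count):

```python
samples = 3000000   # approximate number of training pairs for this model
batch_size = 128

steps_per_epoch = samples // batch_size
print(steps_per_epoch)  # 23437 -> roughly 23,000 steps per epoch
```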

  • So that's learning rate. Then train loss:

  • hopefully you

  • know what train loss is,

  • but generally, obviously, we'd like to see that going down.

  • Basically, I stopped this model because, one, I had already decreased the learning rate as much as I felt was necessary, and it looked like train loss was maybe even starting to climb again,

  • so there's really no reason to keep training this model, in my opinion. And then, for whatever reason, when we do a bidirectional recurrent neural network, this graph just doesn't work.

  • So I think that's a bug in the code, because it does work totally fine

  • with a non-bidirectional recurrent neural network.

  • Anyway, that's your TensorBoard stuff.

  • I do want to bring up... I guess I'll bring down the other one.

  • So I have a model that's training right now on Paperspace.

  • Let's see, is that okay?

  • I think you can see that.

  • Yes.

  • As you can see, this is TensorBoard;

  • I've got TensorBoard running right here.

  • I'm currently on step 59,000.

  • Things are looking pretty good.

  • I haven't decayed anything just yet.

  • Train loss keeps going down.

  • The other thing I want to bring up is this: so there are two major metrics, right?

  • BLEU, which we've already discussed;

  • it's basically how good of a translation it is.

  • And then PPL is perplexity, and perplexity is...

  • it's basically a measure of how well the model's probability distribution predicts the actual output.

  • So it's: how far off are you?

  • So with BLEU score,

  • higher is better, whereas with perplexity,

  • we'd like that to be as small as possible.

  • We'd like perplexity to be in, like, the single digits, if possible. Again, on English to English, on a chatbot,

  • that's probably

  • not going to happen. On English to French,

  • we actually could get a perplexity in the single digits, but, yeah, with the chatbot

  • we're probably not going to be able to do that,

  • and if we did, we'd probably just have overfit.

  • That's my guess.
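
For reference, perplexity is just the exponential of the average per-token cross-entropy (negative log-likelihood), so the two move together; a tiny sketch:

```python
import math

def perplexity(avg_cross_entropy):
    # avg_cross_entropy: mean negative log-likelihood per token, in nats.
    return math.exp(avg_cross_entropy)

print(round(perplexity(4.0), 1))  # ~54.6, roughly the range discussed below
```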

  • So, anyway, the main number we're looking at is perplexity. In general, if I recall right, the perplexity on Charles v2 (the one that's live on Twitter, at least at the time I'm recording this, although he's probably going to be replaced with this model when it's done)...

  • I think he's gotten into, like, the fifties;

  • I don't think he's even gotten into the forties for perplexity.

  • But these are just general numbers. Again,

  • the hard part with English to English, with a chatbot, is there's no set,

  • single answer.

  • So you want to get generally, vaguely close, but you're not going to see incredible numbers there.

  • So, moving this aside, the last thing I want to show is, you know, basically the reason for a word vector.

  • And I've brought it up;

  • I've already kind of loaded it,

  • since it can take some time to load.

  • So this is the embedding projector. Okay,

  • for some reason it was looking like it might not work. And these are basically our word vectors; this is for the encoder,

  • but you can also do the same thing with the decoder.

  • But anyway, we can kind of zoom in and see the actual words themselves. My mouse keeps loading, which is annoying. Anyway: "ebay," and, yeah, "ebay" is related to eBay,

  • no surprise. "amazon," you know, things you might do on your computer.

  • "wireless," more eBay, "usb," "ssd," "buyer"?

  • Yes, like computer stuff.

  • Some

  • word I've never seen before;

  • it must have something to do with computers. "setups," "camaro," "router," more computer stuff, "ranging," "waterproof."

  • As you can see...

  • I wonder what

  • that one word was. "walmart," "photography," "iphone."

  • Anyway, you can keep looking;

  • the vocab here was 100,000,

  • so there's obviously a lot of words here. But anyway, this is kind of how it's going to work.

  • Ha, that brings back childhood memories.

  • Sorry.

  • This is kind of fun.

  • It won't be long until we come across a pretty bad word, probably,

  • so I'd better

  • stop.

  • Anyway, this is a family-friendly tutorial.

  • So I think that's enough for now on the various metrics that you should be looking at. Again, in general: training

  • loss,

  • you want to see that go down, and once it flatlines, you might want to consider decaying the learning rate.

  • At least with the Adam optimizer, you should probably stop decaying at around 1e-4.

  • And then, in general, we'd like to see perplexity falling.

  • We'd like to see single-digit perplexity, but that's not going to happen with the chatbot,

  • at least not yet, anyway,

  • and if it did, it would probably be unreliable,

  • is my guess. BLEU score:

  • we'd like to see that go up.

  • It's really as simple as that.

  • And, hopefully, we'd like to see many epochs.

  • So right now, the model I'm training on Paperspace is with 70 million samples, whereas this one was only three million samples.

  • So, I forget how many epochs that would be;

  • it was 100,000 steps times 128, which, unless my math is wrong, would be, like, 12

  • million samples, like four

  • epochs?

  • I don't know.

  • Hopefully I'm not off by an order of magnitude, but I think it's about four epochs.

  • Whereas here, with 70 million samples, one epoch would be on the order of half a million steps, so that is a lot of steps.

  • Whereas this one only had to do 100,000 steps to do a few epochs,

  • this new one is going to have to do around half a million steps just for one epoch.
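
Spelling that arithmetic out (the sample counts are the approximate figures mentioned above, so these are ballpark numbers):

```python
batch_size = 128

# Old model: ~3 million pairs, trained for about 100,000 steps.
print(100000 * batch_size / 3000000)   # ~4.3 epochs

# New model: ~70 million pairs.
print(70000000 // batch_size)          # 546875 -> roughly half a million steps per epoch
```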

  • And my hope is that, because all that data is unique (like, every single input, every single sample is a unique sample),

  • my hope is that it's going to be much more accurate and much more interesting and unique,

  • and his responses won't necessarily be so similar so often, because every single sample he sees, even up to 500,000 steps, will be a new, unique, never-before-seen sample. And ideally I'd hope to be able to do two epochs; that would be, like, a million steps.

  • And that's going to take... even this one is going to take,

  • I forget, I think eight days, or maybe a little less than eight days, about a week, just to do one epoch.

  • So, yeah, I'm not sure I want to do two weeks of training. Okay, but honestly,

  • I'll just keep training until I get the learning rate down to 1e-4 and training loss still won't decline.

  • Until then, I'm going to keep training, unless perplexity starts going up crazily or BLEU score starts falling or something like that. Okay, so that was a lot of information, and I probably missed some stuff.

  • If you have questions or anything about what you're seeing here, do feel free to

  • ask. If you've got a correction,

  • if I said something wrong, which I'm sure I did somewhere in here because this is so much stuff,

  • feel free to correct me.

  • If you've got more information you want to share with everybody, feel free to comment about that.

  • In the next tutorial, what I'd like to do is start talking about, okay, let's say you get to the point... Like, right now, when this model trains, it outputs files, so you can go look at the testing files.

  • So if you go into model (let me just do a refresh here),

  • yeah, okay, so it'll output to output_dev and output_test.

  • You can pull up output_dev and kind of see what his responses have been recently.

  • So, again, this is kind of a newish model, but you can see these are after the decoder, though they haven't been fully... I think the proper word is detokenized, I guess.

  • Anyway, you can see, although it's probably pretty small for you...

  • I think this one...

  • I can't.

  • Sorry.

  • Anyway, you can see the responses, and then

  • I've got a script that basically pairs the two together,

  • because I wanted to be able to see what these responses are responding to, and stuff like that.

  • So we'll talk a little bit about that: how you pair these together and kind of see how good it's looking so far.
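
The pairing itself can be as simple as reading the two files side by side; here's a hypothetical sketch (the input file name is a placeholder for illustration, not the actual script from the video):

```python
# Pair each test input with the model's generated response, side by side.
# "dev_inputs.txt" stands in for whatever file holds the source lines
# that produced output_dev.
with open("dev_inputs.txt", encoding="utf-8") as src, \
     open("output_dev", encoding="utf-8") as out:
    for question, answer in zip(src, out):
        print("IN: ", question.strip())
        print("OUT:", answer.strip())
        print("-" * 40)
```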

  • Obviously, there are a lot of unknown-token (<unk>) outputs;

  • there's a lot of repetition in here.

  • I'm not sure it's

  • settled yet; I'm not happy with this yet.

  • It's got a long way to go, though; as far as I know, it's only about 10% of the way through a full epoch, I believe a little bit more than 10%.

  • So, not too bad;

  • that's pretty much expected. But anyway, you can kind of see how it's doing like this. Eventually, though, you're going to get the urge to speak to your chatbot.

  • So we're going to talk a little bit more in the next tutorial, or maybe the next few tutorials, about how we pair output_dev to the actual testing input that generated it, so you can kind of see: are these responses coherent, or do they make no sense at all?

  • And then also, eventually, we need to talk about how we actually deploy the model.

  • Let's say you've got the model

  • and you're happy with it.

  • How do we deploy it?

  • How do you do something like launch

  • Charles the AI on Twitter?

  • And just in case anybody is not familiar with Charles the AI, I'll pull him up here.

  • What a wonderful little guy.

  • Anyway, this is him.

  • Give him a follow. Anyway,

  • that's all for now.

  • Questions, comments, concerns,

  • whatever: feel free to leave them below.

  • Otherwise, I will see you in the next tutorial.
