
  • What's going on, everybody, and welcome to part six of the reinforcement learning tutorial series, as well as part two of the Deep Q Learning (DQN, deep Q networks) tutorials, where we left off.

  • We're basically ready to add the train method to our agent, and then basically incorporate our agent into the environment and start that iterative training process.

  • So I've switched machines.

  • I'm on a Paperspace Ubuntu machine now.

  • And then I've pulled up the exact versions that I'm using.

  • Also, people ask a lot, like, why am I on Paperspace?

  • It's just nice virtual machines in the cloud, kind of a cloud desktop, but also high-end GPUs.

  • It's great for doing machine learning in the cloud if you want.

  • I'll put a link in the description, a referral link.

  • It's like a $10 credit.

  • You can definitely check them out.

  • It wasn't really meant to be a sponsor spot, and, like, I really do use Paperspace; I use the heck out of Paperspace.

  • So anyway, I've got plenty of high-end GPUs locally; it's just very convenient to use Paperspace.

  • But anyways, I've pulled up the exact versions of things I'm using.

  • TensorFlow 2.0 is on the way.

  • Uh, just not quite ready.

  • So anyway, still using TensorFlow 1.

  • These are the exact versions; you can feel free to follow along on a different version.

  • It's just that if something is going wrong and you want to match my exact versions of things, you can do this.

  • Also, in case you didn't know how to install exact versions: it would be like `pip3 install tensorflow-gpu==1.13.1`, for example.

  • That's how you install an exact version of something.

  • Okay, uh, let's go ahead and continue.

  • So I am in.

  • Actually, I think I've got it up already?

  • Yeah.

  • Cool.

  • Um, thanks, Harrison.

  • Any time, bro.

  • So what we're gonna do is come down to our agent class, the DQNAgent.

  • And I'm just gonna go to the very bottom here.

  • I'm gonna make a bunch of space and come up here.

  • And now what we're gonna do is just add the new train method: so `def train`, and then we're gonna pass here `self`, `terminal_state`, and then `step`.

  • Whatever step we're on.
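
That line, as dictated; a minimal stub sketch (the exact spelling `terminal_state` is an assumption in line with the rest of the snake_case naming):

```python
def train(self, terminal_state, step):
    pass  # filled in over the rest of this part
```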

  • Okay, So the first thing we want to do is do a quick check to see should we actually train?

  • So recall we're gonna have this replay memory, and then from the replay memory, which should be quite a large memory, we're gonna grab a minibatch that's relatively small, but still a batch of decent size.

  • So typically with neural networks we like to train in batches of, like, 32 or 64, something like that.

  • Um, so we're gonna do the same thing here, so we want 32 or 64 to be pretty small compared to the size of our memory.

  • So in this case, I want to say our max memory size is 50,000, and then the least amount that we're willing to train on is 10,000.

  • And the reason why we want to do that is we don't want to wind up overfitting.

  • So we want to use this replay memory principle, I guess.

  • But what we don't want to do is have replay memories so small that every step is trained on the exact same thing.

  • So we're effectively doing, like, you know, hundreds of epochs on the exact same data.

  • That's not what we want.

  • So anyway, what we're going to say is: if `len(self.replay_memory)` is less than the min replay memory size.

  • If that's the case, we'll just return.

  • We don't actually want to do anything here.

  • Otherwise, if it's enough, then let's go ahead and get our minibatch, which would be a `random.sample` of `self.replay_memory`, and the sample size that we want is the minibatch size.

  • We need to do both: we need to import random and set the minibatch size, so I'm gonna go to the tippy top here.

  • I'm going to import random, and then I'm gonna come down here.

  • `MINIBATCH_SIZE`, we'll set that to 64, and then I'll go back to the bottom here.
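
Putting the pieces described so far together, a minimal sketch of the top of the train method. The constant names follow the spoken descriptions (an assumption about exact spelling), and `self.replay_memory` is assumed to hold the transition tuples from the previous part:

```python
import random

REPLAY_MEMORY_SIZE = 50_000      # max transitions kept (set up in the previous part)
MIN_REPLAY_MEMORY_SIZE = 10_000  # least amount we're willing to train on
MINIBATCH_SIZE = 64              # how many transitions we sample per training step


class DQNAgent:
    def train(self, terminal_state, step):
        # only start training once replay memory is big enough, so we don't
        # overfit on a tiny, highly-correlated set of transitions
        if len(self.replay_memory) < MIN_REPLAY_MEMORY_SIZE:
            return

        # grab a random minibatch of transitions from replay memory
        minibatch = random.sample(self.replay_memory, MINIBATCH_SIZE)
```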

  • Awesome.

  • So now that we've done that, we want to get our Q values.

  • So don't forget as well that, um, one of the things... uh, let's go here: pythonprogramming.net.

  • Let me type DQN.

  • We're actually gonna use some stuff from the other one, too, so just pull it up.

  • But recall the following image.

  • So we actually want to get... so this bit is kind of handled elsewhere; both the learning rate and some of this logic are handled by the neural network.

  • But we still need the reward, the discount, and that future value.

  • So we still want to use this little bit.

  • So to do that, we need to know current Q values and future Q values.

  • So, uh, mini batch.

  • Okay, so what we're gonna say here is `current_states` is equal to the numpy array.

  • And it really is a list comprehension here.

  • We're gonna say `transition[0] for transition in minibatch`.

  • Um, okay, cool.

  • And then what we want to do here is normalize it.

  • So again, if you don't know why we're normalizing that check out the basics tutorials.

  • But basically, it's just images.

  • So any time you have RGB images, it's like 0 to 255; and, well, actually, 'normalized' is the wrong word.

  • I've probably been saying that wrong the whole time; anyway, what we're actually trying to do is scale it, because all the images are normalized already, pretty much, but instead we're actually trying to scale it between zero and one.

  • That's just the best way for convolutional neural networks to learn on image data.

  • It's just useful, really.

  • It's the best for any machine learning.

  • You generally want to scale between zero and one, or negative one and one, anyways.

  • Enough on that.

  • So, current states; then what we want to grab is the current Qs list, and that is also gonna be equal to `self.model.predict`; now pay attention, that's `self.model`.

  • So it's that model, the one that changes like crazy, `self.model.predict`, and we want to predict on `current_states`.
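
Continuing inside the train method, a sketch of those two lines (so `minibatch` and `self.model` come from the surrounding code); the division by 255 is the 0-to-1 scaling of RGB pixel values discussed above:

```python
import numpy as np

# current states (index 0 of each transition), scaled from 0-255 down to 0-1
current_states = np.array([transition[0] for transition in minibatch]) / 255

# Q values for the current states, from the frequently-updated model
current_qs_list = self.model.predict(current_states)
```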

  • And then we want to do the same thing with future.

  • So we're going to say `new_current_states`.

  • This is after we take steps; it's going to be equal to the numpy array.

  • And again we're gonna say transition... um, three.

  • Do we want to say... is that right? Yes.

  • `transition[3] for transition in minibatch`, and again we want to divide by 255, and I'll explain these index values in a second if you don't understand those, or you forgot; uh, it should be up here somewhere, um, or... I think we're gonna have to define that in our environment.

  • Anyway, I'll explain that in a moment.

  • Um, yeah.

  • So, coming back to the new current states.

  • Okay, so now what we're gonna say is future Qs.

  • I'll explain it.

  • Basically, bottom line, the future Qs list is equal to `self.target_model`.

  • So now we're using that target model, the one that doesn't change so crazily, `.predict` against the new current states.
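
And the matching lines for the future states, again a fragment inside the train method; these use the more slowly-updated target model:

```python
# new states after the step was taken (index 3 of each transition), scaled the same way
new_current_states = np.array([transition[3] for transition in minibatch]) / 255

# future Q values come from the target model, which doesn't change so crazily
future_qs_list = self.target_model.predict(new_current_states)
```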

  • Now, we're gonna separate things out into our typical Xs and ys.

  • So these will be our feature sets.

  • These will be our labels or targets; so these will be... this will be, like, literally the images from the game.

  • And then this will be the action that we decide to take.

  • Um, which we still... that's gonna be in the environment.

  • We'll get there anyway.

  • It'll be like up, down, left, right, diagonals, and all that stuff.

  • Okay, so now what we're gonna say is `for index`, and then we're gonna have this giant tuple.

  • And this is what this minibatch of things consists of.

  • So you've got the current state.

  • You've got the action.

  • The reward, the new current state (don't put the 's' there), and then whether or not we're done with the environment.

  • So that's exactly where these indexes are coming from.

  • So for current states, we're grabbing the current state from the minibatch.

  • Um, and then down here, we're grabbing that new current state: 0, 1, 2, 3, right.

  • We're grabbing that from the minibatch, and then we're predicting on it based on the state itself.
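
For reference, a sketch of the transition layout those index values refer to (the ordering is as described in the loop below):

```python
# each entry in replay memory is one transition tuple:
#   [0] current_state      - the observation (image) before the action
#   [1] action             - the action we took
#   [2] reward             - the reward received for that action
#   [3] new_current_state  - the observation (image) after the action
#   [4] done               - whether or not we're done with the environment
transition = (current_state, action, reward, new_current_state, done)
```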

  • But then with this information, what we can do is calculate that last step of our formula from before, right?

  • We can calculate this little bit here to create our Qs.

  • So, for index, current state, and so on through done... apparently my mouse has decided to stop working.

  • Ah, what do we want to do?

  • The first thing to say is, if not done, then we still want to actually perform our operations.

  • So what we're gonna say is max future Q is equal to the `np.max` of the future Qs list for whatever index we're actually at right now.

  • And then we're gonna say the new Q is equal to the reward plus the discount (DISCOUNT, which I don't think we have yet, because it didn't try to autocomplete) times max future Q.

  • Else, we're just going to say new Q equals reward.

  • So if we are done, then we need to set the new Q to be whatever the reward is at the time, because there is no future Q.

  • Right?

  • We're done.

  • Okay, so now we need to go up to the top and set our discount.

  • Okay, so going up to the top here, let's set, uh, `DISCOUNT = 0.99`; then we'll go back to the bottom here, and now what we want to do is actually update these Qs.
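
Putting that together, a sketch of the per-transition target calculation, using the DISCOUNT just set; this fragment sits inside the loop over the minibatch:

```python
DISCOUNT = 0.99  # how much we value future reward over immediate reward

# inside the loop over the minibatch:
if not done:
    # non-terminal step: target is reward plus discounted best future Q value
    max_future_q = np.max(future_qs_list[index])
    new_q = reward + DISCOUNT * max_future_q
else:
    # terminal step: there is no future Q, so the target is just the reward
    new_q = reward
```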

  • So recall, this is how our neural network's gonna work.

  • You've got input, then you've got the output, which is your Q values.

  • So let's imagine the scenario where we got this output; the action that we would take, given no epsilon, let's say, would be this one, right.

  • 0, 1: at index one.

  • We take the action here, because the Q value is 9.5.

  • It was the largest Q value.

  • But then let's say we don't like that action.

  • We ended up actually degrading it a little bit.

  • And we actually want that new Q value to be 8.5.

  • Well, to update that for a neural network... like, in our table, we just updated it, right?

  • You just update that one value.

  • But in the neural network, we output these four values.

  • So what we end up having to do is we update this 9.5 to 8.5, like in a list.

  • And then we refit the neural network to instead be 3.2, 8.5, 7.2, 1.3.

  • So that's the next thing that we need to do.
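
A tiny worked example of that update, using the illustrative numbers from above (these are not real network outputs, just the example values mentioned):

```python
current_qs = [3.2, 9.5, 7.2, 1.3]  # network output: one Q value per action
action = 1                         # index of the action we took (the 9.5)
new_q = 8.5                        # the freshly calculated target from above

# overwrite only the Q value for the action we actually took...
current_qs[action] = new_q
print(current_qs)  # [3.2, 8.5, 7.2, 1.3] - this whole vector is what we refit on
```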

  • So coming down here, what we want to say, still in our for loop, is current Qs equals the current Qs list at the index that we're at as we're iterating over... and I miss Sublime Text.

  • Uh, we're iterating over... yes, these two values: index.

  • And then you should be able to guess what I meant to have here.

  • And that was an enumerate, um... whew.

  • Maybe today's not a day for tutorials.

  • For me, jeez... and then minibatch.

  • It's probably going off of the screen here; see if I can help out here.

  • There we go.

  • So: `for index, (current_state, action, reward, new_current_state, done) in enumerate(minibatch)`.

  • I guess I got stuck there because I was explaining these and then pointing out the transitions.

  • Anyway, now it's a valid loop.
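
Pulling this part together, a sketch of the loop as finally dictated. The X/y appends at the end are an assumption about how the features-and-labels separation mentioned earlier gets filled in; the actual refit of the network comes later:

```python
X = []  # feature sets: the images (current states) from the game
y = []  # labels/targets: the updated Q value vectors

for index, (current_state, action, reward, new_current_state, done) in enumerate(minibatch):
    # last step of the formula: reward plus discounted max future Q (if not terminal)
    if not done:
        max_future_q = np.max(future_qs_list[index])
        new_q = reward + DISCOUNT * max_future_q
    else:
        new_q = reward

    # update only the Q value for the action we took, keep the rest as predicted
    current_qs = current_qs_list[index]
    current_qs[action] = new_q

    X.append(current_state)
    y.append(current_qs)
```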