
  • What is up, everybody.

  • Today you're gonna learn how to go from a paper to a fully functional implementation of deep deterministic policy gradients.

  • If you're not familiar with deep deterministic policy gradients, or DDPG for short, it is a type of deep reinforcement learning that is used in environments with continuous action spaces.

  • You see, most environments have discrete action spaces.

  • This is the case with, say, the Atari library, like Breakout or Space Invaders, where the agent can move left, move right, or shoot, but it moves and shoots in fixed, discrete amounts.

  • In other environments, like, say, robotics, the robot can move by a continuous amount, anywhere from minus one to plus one, anything along a continuous interval of real numbers.

  • And this poses a problem for most deep reinforcement learning methods, like, say, Q-learning, which work spectacularly well in discrete environments but cannot tackle continuous action spaces.

  • Now, if you don't know what any of this means, don't worry.

  • I'm gonna give you the rundown here in a second.

  • But for this set of tutorials, you're gonna need to have installed the OpenAI Gym, you'll need Python 3.6, and you'll also need TensorFlow and PyTorch.

  • Other packages you'll need include matplotlib to handle the plotting of the learning curve, which will allow us to see the actual learning of the agent, as well as numpy to handle your typical vector operations.

  • Now, here I'll give you a quick little rundown of reinforcement learning.

  • So the basic idea is that we have an agent that interacts with some environment and receives a reward.

  • The rewards kind of take the place of labels in supervised learning, in that they tell the agent what is good: what it is shooting for in the environment.

  • And so the agent will attempt to maximize the total rewards over time by solving something known as the Bellman equation.

  • We don't have to worry about the actual mathematics of it, but just so you know for your future research, these algorithms are typically concerned with solving the Bellman equation, which tells the agent the expected future returns, assuming it follows something called its policy.

  • So the policy is the probability that the agent will take some action given that it's in some state s; it's basically a probability distribution.

  • Now many types of algorithms, such as Q-learning, will attempt to solve the Bellman equation by finding what's called the value function.

  • The value function, or the action-value function in this case, maps the current state and the set of possible actions to the expected future returns the agent expects to receive.

  • So in other words, the agent says, hey, I'm in some state, meaning some configuration of pixels on the screen in the case of the Atari library, for instance, and asks, okay, if I take one or another action, what is the expected future return, assuming that I follow my policy?

  • Actor-critic methods are slightly different in that they attempt to learn the policy directly, and recall that the policy is a probability distribution that tells the agent what the probability of selecting an action is, given that it's in some state s.

  • So these two algorithms have a number of strengths between them.

  • And deep deterministic policy gradients is a way to marry the strengths of these two algorithms into something that does really well for continuous action spaces.

  • You don't need to know too much more than that.

  • Everything else you need to know, I'll explain in the respective videos.

  • So in the first video, you're going to get to see how I go ahead and read papers and then implement them on the fly.

  • And in the second video, you're going to see the implementation of deep deterministic policy gradients in PyTorch, in a separate environment.

  • Both of these environments are continuous, and so they will demonstrate the power of the algorithm quite nicely.

  • You don't need a particularly powerful GPU, but you do need some kind of GPU to run these, as it does take a considerably long time even on a GPU.

  • So you will need at least, say, a Maxwell-class GPU or above, something from the 700 series on the Nvidia side.

  • Unfortunately, neither of these frameworks really works well with AMD cards.

  • So if you have those, you'd have to figure out some sort of kludge to get the OpenCL implementation to transcompile to CUDA.

  • That's just a technical detail.

  • I don't have any information on that.

  • So you're on your own.

  • Sorry.

  • So this is a few hours of content.

  • Grab a snack and a drink and watch it at your leisure.

  • It's best to watch it in order.

  • I actually did the videos in reverse order on my channel, just so I could get them out.

  • So I did the implementation in PyTorch first and then the video on implementing the paper in TensorFlow second.

  • But it really is best for a new audience to go from the paper video to the PyTorch video.

  • So I hope you like it.

  • Leave any comments, questions, suggestions, issues down below.

  • I'll try to address as many as possible.

  • You can check out the code for this on my GitHub.

  • And you can find many more videos like this on my YouTube channel, Machine Learning with Phil.

  • I hope you all enjoy it.

  • Let's get to it.

  • What is up, everybody? In today's video, we're gonna go from the paper on deep deterministic policy gradients all the way to a functional implementation in TensorFlow.

  • So you're going to see how to go from a paper to a real-world implementation, all in one video.

  • Grab a snack and a drink; this is going to take a while.

  • Let's get started.

  • So the first step in my process really isn't anything special.

  • I just read the entirety of the paper.

  • Of course, starting with the abstract: the abstract tells you what the paper is about at a high level; it's just kind of an executive summary.

  • The introduction is where the authors will pay homage to other work in the field and kind of set the stage for what is going to be presented in the paper, as well as the need for it.

  • The background kind of expands on that, and you can see here it gives us a few mathematical equations, and you will get a lot of useful information here.

  • This section won't give too many useful nuggets on implementation, but it does set the stage for the mathematics we'll be implementing, which is of course critical for any deep learning or, in this case, deep reinforcement learning paper implementation.

  • The algorithm is really where all the meat of the problem is.

  • It is in here that they lay out the exact steps you need to take to implement the algorithm; that's why it's titled that way.

  • So this is the section you want to read most carefully.

  • And then, of course, they will typically give a table where they outline the actual algorithm.

  • And oftentimes, if I'm in a hurry, I will just jump to this, because I've done this enough times that I can read it.

  • This is what is called pseudocode; if you're not familiar with that, pseudocode is just an English representation of computer code, so we will typically use it to outline a problem.

  • It's often used in papers, of course.

  • So typically, I'll start here reading it and then work backward by reading through the paper to see what I missed.

  • But of course it talks about the performance across a whole host of environments.

  • And of course, all of these have in common that they are continuous control.

  • So what that means is that the action space is a vector whose elements can vary along a continuous real number line, instead of having discrete actions of 0, 1, 2, 3, 4, 5.

  • That is the real motivation behind deep deterministic policy gradients: it allows us to use deep reinforcement learning to tackle these types of problems. And in today's video, we're gonna go ahead and tackle the pendulum swing-up, also called the pendulum problem.

  • The reason being that while it would be awesome to start out with something like the bipedal walker, you never want to start out with maximum complexity.

  • You always want to start out with something very, very small and then scale your way up, and the reason is that you're gonna make mistakes.

  • And it's easiest and quickest to debug very simple environments that execute very quickly.

  • So the pendulum problem only has, I think, three elements in its state vector and only a single action.

  • Or maybe it's two actions, I forget.

  • But either way, it's a very small problem relative to something like the bipedal walker or many of the other environments.

  • You could also use the continuous version of the cart pole or something like that.

  • That would be perfectly fine.

  • I've just chosen the pendulum for this because we haven't done it before. So it's in here that they give a bunch of plots of the performance of their algorithm under various sets of constraints placed upon it and different implementations, so you can get an idea.

  • And one thing you notice right away.

  • It's always important to look at plots because they give you a lot of information visually, right?

  • It's much easier to gather information from plots than it is from text.

  • You see right away that they have a scale of one.

  • So that's telling you it's relative performance, and you have to read the paper to find out relative to what.

  • I don't like that particular approach.

  • They have similar data in table form.

  • And here you see a whole bunch of environments they used, and it's a broad, broad variety.

  • They wanted to show that the algorithm has a wide arena of applicability, which is a typical technique in papers.

  • They want to show that this is relevant, right?

  • If they only showed a single environment, people reading it would say, Well, that's all well and good.

  • You can solve one environment.

  • What about these dozen other environments, right?

  • And part of the motivation behind reinforcement learning is generality: can we model real learning in biological systems such that it mimics the generality of biological learning?

  • One thing you notice right away is that these numbers are not actual scores.

  • So that's one thing I kind of take note of, and it causes me to raise an eyebrow.

  • So what is the motivation behind that? Why would the authors express scores as ratios?

  • A couple different reasons.

  • One is that they just want to make all the numbers look uniform.

  • Maybe the people reading the paper wouldn't be familiar with each of these environments, so they don't know what a good score is.

  • And that's a perfectly valid reason.

  • Another possibility is they want to hide poor performance.

  • I don't think that's going on here, but it does make me raise my eyebrow whenever I see it.

  • The one exception is TORCS, which is a totally open race car simulator environment.

  • I don't know if we'll get to that on this channel.

  • That would be a pretty cool project, but it would take me a few weeks to get through. But right away you notice that they have a whole bunch of environments.

  • The scores are all relative to one, and one is the score that the agent gets under a planning algorithm, which they also detail later on.

  • So those are the results. And, although I don't think we saw the heading, they also talk about related work, which covers other algorithms that are similar and their shortcomings, right?

  • They don't ever want to talk up other algorithms.

  • You always wanna talk up your own algorithm, to make yourself sound good.

  • You know, why else would you be writing a paper in the first place? And of course there's a conclusion that ties everything together, and then the references. I don't usually go deep into the references.

  • If there is something that I feel I really, really need to know, I may look at a reference, but I don't typically bother with them.

  • If you were a PhD student, then it would behoove you to go into the references, because you must be an absolute expert on the topic.

  • And for us, we're just, you know, hobbyists on YouTube, so I don't go into too much depth with the background information.

  • And the next most important bit of the paper are the experimental details.

  • And it is in here that it gives us the parameters and architectures for the networks.

  • So if you saw my previous video, where I did the implementation of DDPG in PyTorch in the continuous lunar lander environment, this is where I got most of this stuff.

  • It was almost identical.

  • With a little bit of tweaking.

  • I left out some stuff from this paper, but pretty much all of it came from here.

  • In particular, the hidden layer sizes of 400 and 300 units, as well as the initialization of the parameters from a uniform distribution over the given ranges. So just to recap, this was a really quick overview of the paper, just showing my process of what I look at.
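
As a reference, here is a minimal sketch of how those experimental details might translate into code; the 400 and 300 unit sizes come from the paper, the fan-in based uniform ranges come from its supplementary details, and the helper name and the state/action dimensions are my own illustrative choices.

```python
import numpy as np
import torch.nn as nn

# Hidden layers of 400 and 300 units, initialized from a uniform distribution of
# +/- 1/sqrt(fan_in); the output layer uses +/- 3e-3 per the supplementary details.
def init_linear(in_dims, out_dims, final=False):
    layer = nn.Linear(in_dims, out_dims)
    f = 0.003 if final else 1.0 / np.sqrt(in_dims)
    nn.init.uniform_(layer.weight, -f, f)
    nn.init.uniform_(layer.bias, -f, f)
    return layer

fc1 = init_linear(8, 400)              # state dims -> 400 (state dims illustrative)
fc2 = init_linear(400, 300)            # 400 -> 300
mu = init_linear(300, 2, final=True)   # 300 -> action dims (illustrative)
```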

  • The most important parts are the details of the algorithm as well as the experimental details.

  • So, as you read the paper, like I said, I gloss over the introduction because I kind of already understand the motivation behind it.

  • I get the idea.

  • It basically tells us that you can't really handle continuous action spaces with deep Q-networks.

  • We already know that, and it says, you know, you can discretize the state space.

  • But then you end up with really, really huge actions.

  • Sorry.

  • You can discretize the action space, but then you end up with a whole boatload of actions.

  • You know, what is it, 2187 actions?

  • So it's intractable anyway.

  • And they say: we present, you know, a model-free, off-policy algorithm.

  • And then it comes down to this section where it says the network is trained off-policy with samples from a replay buffer to minimize correlations, very good, and trained with a target Q-network to give consistent targets during temporal difference backups.

  • In this work we make use of the same ideas, along with batch normalization.

  • So this is a key chunk of text, and this is why you want to read the whole paper, because sometimes they'll embed stuff in there that you may not otherwise catch.

  • So, as I'm reading the paper, what I do is I take notes, and you could do this on paper.

  • You can do it in, you know, a text document.

  • In this case, we're going to use the editor.

  • That way I can show you what's going on, and it's a natural place to put this stuff, because that's where you implement the code anyway.

  • Let's hop over to the editor and you'll see how I take notes.

  • So right off the bat, we always want to be thinking in terms of what sort of classes and functions we will need to implement this algorithm.

  • So the paper mentioned a replay buffer as well as a target Q-network.

  • As for the target Q-network, for now we don't really know what it's gonna be, but we can write it down, so we'll say we'll need a replay buffer class and a class for a target Q-network.

  • Now, I would assume that if you were going to be implementing a paper of this advanced difficulty, you'd already be familiar with Q-learning, where you know that the target network is just another instance of the general network.

  • The difference between the target and evaluation networks is, you know, the way in which you update their weights.

  • So we know that we're gonna have at least one network class.

  • If you know something about actor-critic methods, you'll know that you'll probably have two different classes, one for an actor and one for a critic, because those two architectures are, in general, a little bit different.

  • But what do we know about Q-networks?

  • We know that Q-networks are state-action value functions; they're not just value functions.

  • So the critic in actor-critic methods is, in general, just a state value function, whereas here we have a Q-network, which is going to be a function of the state and actions.

  • We know that it's a function of s and a, so we know right off the bat it's not the same as a critic.

  • It's a little bit different.

  • And it also said, well, we will use batch norm.

  • So batch normalization is just a way of normalizing inputs to prevent divergence in a model.

  • I think it was introduced in 2014 or 2015, something like that, so we'll use that.

  • So we need that in our network.

  • So we know, right off the bat, at least a little bit about what the network is gonna look like.

  • So let's go back to the paper and see what other little bits of information we can glean from the text before we take a look at the algorithm. Reading along, we can see, blah, blah, a key feature is simplicity: it requires only a straightforward actor-critic architecture and very few moving parts.

  • And then they talk it up and say it can learn policies that exceed the performance of the planner, you know, the planning algorithm, even learning from pixels, which we won't get to in this particular implementation.

  • So then, okay, no real other nuggets there.

  • The background talks about the mathematical structure of the algorithm, so this is really important if you wanna have a really deep, in-depth knowledge of the topic.

  • If you already know enough about the background, you would know, say, the formula for discounted future rewards.

  • You should know that if you've done a whole bunch of reinforcement learning algorithms; if you haven't, then definitely read through this section to get the full idea of the background and the motivation behind the mathematics.

  • The other thing to note is that it says the action-value function is used in many algorithms.

  • We know that from deep Q-learning.

  • Then it talks about the recursive relationship known as the Bellman equation.

  • That is known as well. The other thing to note, and this is the next nugget, is that if the target policy is deterministic, we can describe it as a function mu.

  • And so you see that in the remainder of the paper, like in the algorithm, they do indeed make use of this parameter mu, so that tells us right off the bat that our policy is going to be deterministic.

  • Now, you could probably guess that from the title, right? Deep deterministic policy gradients.

  • Right.

  • So you would guess from the name that the policy is deterministic.

  • What does that mean exactly?

  • So a stochastic policy is one in which the software maps a given state to probabilities of taking each action, so you put in a state and out comes a probability of selecting each action, and you select an action according to that probability distribution. That right away bakes in a solution to the explore-exploit dilemma, so long as all the probabilities are finite, right?

  • So as long as the probability of taking an action doesn't go to zero for any state, there is some element of exploration involved in that algorithm.

  • Q-learning handles the explore-exploit dilemma by using epsilon-greedy action selection, where you have a random parameter that tells you how often to select a random action, and then you select a greedy action the remainder of the time. Of course, policy gradients don't work that way.

  • They typically use a stochastic policy.

  • But in this case, we have a deterministic policy.

  • So you got to wonder right away, okay?

  • We have a deterministic policy; how are we gonna handle the explore-exploit dilemma?

  • So let's go back to our text editor and make a note of that.

  • So we just want to note that the policy is deterministic: how do we handle explore-exploit?

  • That's a critical question, right?

  • Because if you only take what are perceived as the greedy actions, you never get really good coverage of the parameter space of the problem, and you're going to converge on a suboptimal strategy.

  • So this is a critical question we have to answer in the paper.

  • Let's head back to the paper and see how they handle it.

  • So we're back in the paper, and you can see the reason they introduce the deterministic policy is to avoid an inner expectation.

  • Or maybe that's just a byproduct.

  • I guess it's not accurate to say that's the reason they do it.

  • But what it says is that because the expectation depends only on the environment, it is possible to learn Q of mu, meaning Q as a function of mu, off-policy, using transitions which are generated from a different stochastic policy beta.

  • So right there we have off-policy learning, which it says explicitly uses a stochastic policy.

  • So we are actually gonna have two different policies in this case.

  • So this already answers the question of how we go from a deterministic policy to solving the explore-exploit dilemma.

  • And the reason is that we're using a stochastic policy to learn the greedy, or sorry, purely deterministic policy.

  • And of course, they talk about the parallels with Q-learning, because there are many between the two algorithms, and you get to the loss function, which is of course critical to the algorithm, and this y of t parameter.

  • Then, of course, they talk about what Q-learning has been used for.

  • They make mention of deep neural networks, which is, of course, what we're gonna be using.

  • That's where the "deep" comes from. And it talks about the Atari games, which we've talked about on this channel as well. And importantly, they mention the two changes they introduced in deep Q-learning, which are the concept of the replay buffer and the target network, which, of course, they already mentioned before.

  • They're just reiterating and reinforcing what they said.

  • That's why we want to read the introduction and background material: to get a solid idea of what's gonna happen.

  • So now we get to the algorithmic portion, and this is where all of the magic happens.

  • So they again reiterate that it is not possible to apply Q-learning to continuous action spaces because, you know, reasons, right?

  • It's pretty obvious: you have an infinite number of actions.

  • That's a problem.

  • And then they talk about the deterministic policy gradient algorithm, which we're not gonna go too deep into.

  • Right, for this video, we don't want to do the full thesis.

  • We don't want to do a full doctoral dissertation on the field.

  • We just want to know how to implement it and get moving.

  • So this goes through and gives you an update for the gradient of this parameter J, and gives it in terms of the gradient of Q, which is the state-action value function, and the gradient of the policy, the deterministic policy mu.

  • The other thing to note here is that these gradients are over two different parameters.

  • So the gradient of Q is with respect to the actions, such that the action a equals mu of s t.

  • So what this tells you is that Q is actually a function not just of the state, but is intimately related to that policy mu.

  • So it's not an action chosen according to an argmax, for instance; it's an action chosen according to the output of the other network.

  • And for the update of mu, it's just the gradient with respect to the weights, which you would kind of expect.
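
For reference, this is my reconstruction of that sampled policy gradient in the paper's notation, with theta-Q for the critic weights and theta-mu for the actor weights:

```latex
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_i
  \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s = s_i,\, a = \mu(s_i)}
  \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s = s_i}
```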

  • So they talk about another algorithm, NFQCA; I don't know what that is, honestly, a minibatch version, blah blah.

  • Our contribution here is to provide modifications to DPG, inspired by the success of DQN, which allowed neural network function approximators to learn in large state and action spaces online. We call it DDPG.

  • Very creative.

  • As they say, again, we use a replay buffer to address the issue of correlations between samples generated on subsequent steps within an episode.

  • It's a finite-sized cache of transitions sampled from the environment.

  • So we know all of this.

  • If you don't know all of it, what you need to know here is that each transition holds the state, action, reward, and new state.

  • So what this tells the agent is that it started in some state s, took some action, received some reward, and ended up in some new state.

  • Why is that important?

  • It's important because, in anything that isn't dynamic programming, you're really trying to learn the state probability distributions.

  • You're trying to learn the probability of going from one state to another and receiving some reward along the way.

  • If you knew all of those beforehand, then you could simply solve a set of equations, a very, very large set of equations for that matter, to arrive at the optimal solution, right?

  • If you knew all those transitions, you'd say, okay, if I start in this state and take some action, I'm gonna end up in some other state with certainty.

  • Then you'd say, well, what's the most advantageous state?

  • What state is going to give me the largest reward?

  • And so you could kind of construct some sort of algorithm for traversing the set of equations to maximize the reward over time.

  • Now, of course, you often don't know that, and that's the point of the replay buffer: to learn that through experience and interacting with the environment.

  • And it says when the replay buffer is full, the oldest samples are discarded.

  • Okay, that makes sense.

  • It's finite in size; it doesn't grow indefinitely.

  • At each time step, the actor and critic are updated by sampling a minibatch uniformly from the buffer, so it operates exactly as in Q-learning: it does a uniform random sampling of the buffer and uses that to update the actor and critic networks.

  • What's critical here, combining this statement with the topic of the previous paragraph, is that when we write our replay buffer class, it must sample transitions at random.

  • So what that means is you don't want to sample a sequence of subsequent steps, and the reason is that there are large correlations between those steps, right, as you might imagine.

  • And those correlations can cause you to get trapped in little nooks and crannies of parameter space and really cause your algorithm to go wonky.

  • So you want to sample it uniformly.

  • That way you're sampling across many, many different episodes to get a really good idea of, I guess, the breadth of the parameter space, to use kind of loose language.
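
To make that concrete, here is a minimal sketch of such a replay buffer, assuming numpy arrays and flat state and action vectors; the class and attribute names are my own, not the video's.

```python
import numpy as np

# Fixed-size buffer of (state, action, reward, new_state, done) transitions,
# sampled uniformly at random to break correlations between consecutive steps.
class ReplayBuffer:
    def __init__(self, max_size, state_dims, n_actions):
        self.mem_size = max_size
        self.mem_cntr = 0
        self.states = np.zeros((max_size, state_dims))
        self.actions = np.zeros((max_size, n_actions))
        self.rewards = np.zeros(max_size)
        self.new_states = np.zeros((max_size, state_dims))
        self.dones = np.zeros(max_size, dtype=bool)

    def store(self, state, action, reward, new_state, done):
        idx = self.mem_cntr % self.mem_size      # overwrite the oldest memory when full
        self.states[idx], self.actions[idx] = state, action
        self.rewards[idx], self.new_states[idx], self.dones[idx] = reward, new_state, done
        self.mem_cntr += 1

    def sample(self, batch_size):
        max_mem = min(self.mem_cntr, self.mem_size)
        batch = np.random.choice(max_mem, batch_size, replace=False)  # uniform sampling
        return (self.states[batch], self.actions[batch], self.rewards[batch],
                self.new_states[batch], self.dones[batch])
```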

  • Then it says directly implementing Q-learning with neural networks is pretty unstable in many environments, and they're gonna talk about using the target network, okay, but modified for actor-critic using soft target updates rather than directly copying the weights.

  • So in Q-learning, we directly copy the weights from the evaluation network to the target network. Here, it says, we create a copy of the actor and critic networks, Q prime and mu prime respectively, to use for calculating the target values.

  • The weights of these target networks are then updated by having them slowly track the learned networks: theta prime goes to tau times theta plus one minus tau times theta prime, with tau much less than one.

  • This means that the target values are constrained to change slowly, greatly improving the stability of learning.

  • Okay, so this is our next little nugget.

  • So let's head over to our text editor and make a note of that.

  • What we read was that we have two, not in caps, we don't want to shout, two actor and two critic networks, a target for each.

  • Updates are soft, according to theta prime equals tau times theta plus one minus tau times theta prime.

  • So this is the update rule for the parameters of our target networks.

  • And we have two target networks, one for the actor and one for the critic.

  • So we have a total of four deep neural networks.
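
A minimal sketch of that soft update, assuming the four networks are PyTorch modules; the function name and the default tau here are illustrative (the paper's experimental details give the value they actually used).

```python
# Soft target update: theta' <- tau*theta + (1 - tau)*theta', applied per parameter.
def soft_update(target_net, online_net, tau=0.001):
    for target_param, param in zip(target_net.parameters(), online_net.parameters()):
        target_param.data.copy_(tau * param.data + (1.0 - tau) * target_param.data)
```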

  • And this is why the algorithm runs so slowly: even on my beastly rig, it runs quite slowly, even in the lunar lander and continuous lunar lander environments.

  • I've done the bipedal walker and it took about 20,000 games to get something that approximates a decent score.

  • So this is a very, very slow algorithm; those 20,000 games took, I think, about a day to run, so quite slow, but nonetheless quite powerful.

  • It's the only method we have so far of implementing deep reinforcement learning in continuous control environments.

  • So hey, you know, beggars can't be choosers, right?

  • But just to recap, we know that we're gonna use four networks, two on-policy and two off-policy, and the updates are gonna be soft, with tau much less than one.

  • If you're not familiar with the mathematics, this double less-than or double greater-than sign means much less than or much greater than, respectively.

  • So what that means is that tau is gonna be of order 0.01 or smaller; right, 0.1 isn't much smaller, that's kind of smaller, whereas 0.01 I would consider much smaller.

  • We'll see in the details what value they use, but you should know that it's of order 0.01 or smaller, and the reason they do this is to allow the updates to happen very slowly, to get good convergence, as they said in the paper.

  • So let's head back to the paper and see what other nuggets we can glean before getting to the outline of the algorithm. And in the very next sentence, they say this simple change moves the relatively unstable problem of learning the action-value function closer to the case of supervised learning, a problem for which a robust solution exists.

  • We found that having both the target mu prime and Q prime was required to have stable targets y i, in order to consistently train the critic without divergence. This may slow learning, since the target networks delay the propagation of value estimates; however, in practice we found this was always greatly outweighed by the stability of learning. And I found that as well.

  • You don't get a whole lot of divergence, but it does take a while to train.

  • Then they talk about learning in low-dimensional and higher-dimensional environments, and they do that to talk about the need for feature scaling, so, one approach to the problem.

  • The problem here is the ranges of variation in the parameters, right. So in different environments, like the mountain car, the position can go from minus 1.6 to 0.4, something like that, and the velocity is plus or minus 0.07, so you have a two order of magnitude variation there in the parameters; that's kind of large even within that environment.

  • Then, when you compare that to other environments, where you can have parameters that are much larger, on the order of hundreds, you can see that there's a pretty big issue with the scaling of the inputs to the neural network, and we know from experience that neural networks are highly sensitive to the scaling of their inputs.

  • So it says their solution to that problem is to manually scale the features so they are in similar ranges across environments and units, and they do that by using batch normalization.

  • And it says this technique normalizes each dimension across the samples in a minibatch to have unit mean and variance, and it also maintains a running average of the mean and variance used for normalization during testing, that is, during exploration and evaluation.

  • So in our case, training and testing are a little different than in the case of supervised learning.

  • In supervised learning, you maintain different data sets, or shuffled subsets of a single data set, to do training and evaluation on.

  • Of course, in the evaluation phase, you perform no weight updates of the network.

  • You just see how it does based on the training. In reinforcement learning you do something similar, where you have a set number of games in which you train the agent to achieve some set of results, and then you turn off the learning and allow it to just choose actions based upon whatever policy it learned.

  • And if you're using batch normalization, in PyTorch in particular, there are significant differences in how batch normalization behaves in the two different phases.

  • So you have to be explicit in setting training or evaluation mode; in particular, in PyTorch they don't track statistics in evaluation mode, which is why, when we wrote the DDPG algorithm in PyTorch, we had to call the eval and train functions so often.
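
As a minimal sketch of what that toggling looks like in PyTorch; the function and variable names here are mine, not taken from the video's code.

```python
import torch

# Batch norm layers update running statistics in train() mode and reuse them,
# frozen, in eval() mode, so we switch modes around any pure inference pass.
def evaluate_policy(actor, state):
    actor.eval()                   # use the stored running mean/variance
    with torch.no_grad():
        action = actor(state)
    actor.train()                  # resume updating statistics during learning
    return action
```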

  • Okay, so we've already established we'll need batch normalization, so everything's kind of starting to come together.

  • We need a replay buffer, batch normalization, and we need four networks, right?

  • We need two each: an evaluation and a target for the actor, and an evaluation and a target for the critic.

  • So half of those are gonna be used on-policy, and half of them are going to be used off-policy, for the targets.

  • And then, as we scroll down, it says: a major challenge of learning in continuous action spaces is exploration.

  • An advantage of off-policy algorithms such as DDPG is that we can treat the problem of exploration independently from the learning algorithm.

  • We constructed an exploration policy mu prime by adding noise sampled from a noise process N to our actor policy.

  • Okay, so right here it's telling us what, basically, the target actor function is.

  • Mu prime is basically mu plus some noise, and N can be chosen to suit the environment, as detailed in the supplementary materials.

  • We used an Ornstein-Uhlenbeck process to generate temporally correlated exploration, for exploration efficiency in physical control problems with inertia.

  • If you're not familiar with physics, inertia just means the tendency of stuff to stay in motion.

  • It has to do with, like, environments that move, like the walkers, the cheetahs, stuff like that, the ants.

  • Okay, so we've kind of got another one of the nuggets for our text editor.

  • Let's head back over there and write that down.

  • Okay, so the target actor is just the evaluation actor, we'll call it that for lack of a better word, plus some noise process.

  • They used Ornstein-Uhlenbeck; I don't think I spelled that correctly.

  • We'll need to look that up. Actually, I've already looked it up.

  • My background is in physics, so it made sense to me.

  • It's basically a noise process that models the motion of Brownian particles, which are just particles that move around under the influence of their interactions with the other particles in some type of viscous medium, like a perfect fluid or something like that.

  • And in the Ornstein-Uhlenbeck case, the noise is temporally correlated, meaning each time step is related to the time steps prior to it, and I hadn't thought about it before, but that's probably important for the case of Markov decision processes, right?

  • So in MDPs, the current state is only related to the prior state and the action taken; you don't even need the full history of the environment.

  • So I wonder if that was chosen that way, if there's some underlying physical reason for that; it's just kind of a question off the top of my head.

  • I don't know the answer to that.

  • If someone knows, drop the answer in the comments; I would be very curious to see the answer.
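
For reference, here is a minimal sketch of such an Ornstein-Uhlenbeck action-noise class; mu is the long-run mean (for example a zero vector the size of the action space), and the theta, sigma, and dt defaults are illustrative assumptions rather than values quoted in the video.

```python
import numpy as np

# Temporally correlated noise: each sample decays toward mu while being perturbed,
# so consecutive noise values are related, unlike independent Gaussian noise.
class OUActionNoise:
    def __init__(self, mu, sigma=0.2, theta=0.15, dt=1e-2):
        self.mu, self.sigma, self.theta, self.dt = mu, sigma, theta, dt
        self.reset()

    def reset(self):
        # called at the start of each episode, per the algorithm box
        self.x_prev = np.zeros_like(self.mu)

    def __call__(self):
        x = (self.x_prev
             + self.theta * (self.mu - self.x_prev) * self.dt
             + self.sigma * np.sqrt(self.dt) * np.random.normal(size=self.mu.shape))
        self.x_prev = x
        return x
```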

  • So we have enough nuggets here.

  • So just to summarize, we need a replay buffer class.

  • We'll also need a class for the noise.

  • Right.

  • So we'll need a class for the noise and a class for the replay buffer.

  • We'll need a class for the target Q-network, and we're going to use batch normalization.

  • The policy will be deterministic.

  • So what that means in practice is that the policy will output the actual actions instead of the probability of selecting the actions.

  • So the policy will be limited by whatever the action space of the environment is.

  • So we need some way of taking that into account. So, deterministic policy means it outputs the actual action instead of a probability, and we'll need a way to bound the actions to the environment limits.
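
One common way to do that bounding, and this is my assumption about the implementation rather than something spelled out at this point in the paper, is to squash the actor's output layer with a tanh and scale it by the environment's action bound (for example env.action_space.high):

```python
import torch
import torch.nn as nn

# Final layer of a deterministic actor: tanh keeps the output in [-1, 1],
# and multiplying by the action bound maps it onto the environment's limits.
class BoundedActorHead(nn.Module):
    def __init__(self, in_dims, n_actions, action_bound):
        super().__init__()
        self.mu = nn.Linear(in_dims, n_actions)
        self.action_bound = action_bound

    def forward(self, x):
        return torch.tanh(self.mu(x)) * self.action_bound
```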

  • And, of course, these notes don't make it into the final code.

  • These are just kind of things you think of as you are reading the paper.

  • You would want to put all your questions here.

  • I don't have questions; I've already implemented it, but this is kind of the thought process I went through the first time, as best as I can model it after having finished the problem.

  • And you can also use a sheet of paper; there's some kind of magic about writing stuff down on paper. But we're gonna use the code editor, because I don't want to use an overhead projector to show you guys a friggin' sheet of paper; this isn't grade school here.

  • So let's head back to the paper and take a look at the actual algorithm to get some real sense of what we're gonna be implementing.

  • The results really aren't super important to us yet.

  • We'll use those later on if we want to debug the model performance, but the fact that they express it relative to a planning algorithm makes that difficult, right?

  • So scroll down to the data really quick.

  • So, another thing to note, I didn't talk about this earlier, but I guess now is a good time, is the stipulations on this performance data.

  • It says performance after training across all environments for at most 2.5 million steps.

  • So as I said earlier, I had to train the bipedal walker for around 20,000 games.

  • That's around, I think, about 2.5 million steps or so; I think it was actually 15,000, so maybe around three million steps.

  • Something like that.

  • We report both the average and best observed across five runs. So why would they use five runs?

  • Now, is this a super-duper algorithm? None of them are, and this isn't a slight on their algorithm; it isn't meant to be a slur here or anything.

  • What it tells us is that they had to use five runs because there is some element of chance involved.

  • So, you know, one problem with deep learning is the problem of replicability, right?

  • It's hard to replicate other people's results, particularly if you use system clocks as seeds for random number generators, right?

  • Using the system clock to seed the random number generator guarantees that if you run the simulation even a millisecond later, you're gonna get different results, because you'll be starting with different sets of parameters.

  • Now you will get qualitatively similar results.

  • Right?

  • You'll be able to repeat the general idea of the experiments, but you won't get the exact same results.

  • That's kind of an objection to the whole deep learning phenomenon: it makes it kind of not scientific.

  • But whatever, it works; it has had a lot of success.

  • So we won't quibble about semantics or, you know, philosophical problems.

  • But we just need to know for our purposes that even these people that invented the algorithm had to run it several times to get some idea of what was gonna happen.

  • Because the algorithm is inherently probabilistic.

  • And so they report averages and best case scenarios.

  • So that's another little tidbit.

  • And they included results for both the low-dimensional case, where you receive just a state vector from the environment, as well as the pixel inputs.

  • We won't be doing the pixel inputs for this particular video, but maybe we'll get to them later.

  • I'm trying to work on that as well.

  • So those were the results.

  • And the interesting tidbit here is that it's probabilistic.

  • It's gonna take five runs, so, okay, fine.

  • Other than that, we don't really care about the results for now.

  • We'll take a look later, but that's not really our concern at the moment.

  • So now we have a series of questions.

  • We have answers to all these questions.

  • We know how we're gonna handle the explore-exploit dilemma.

  • We know the purpose of the target networks.

  • We know how we're gonna handle the noise.

  • We know how we're gonna handle the replay buffer.

  • And we know what the policy actually is going to be: it's the actual actions the agent is going to take.

  • So we know a whole bunch of stuff.

  • So it's time to look at the algorithm and see how we fill in all the details. So: randomly initialize a critic network and an actor network with weights theta super Q and theta super mu.

  • So this is handled by whatever library you use.

  • You don't have to manually initialize weights, but we do know from the supplementary materials that they do constrain these updates, sorry, these initializations, to be within some range.

  • So put a note in the back of your mind that you're gonna have to constrain these a little bit. And then it says: initialize the target networks Q prime and mu prime with weights that are equal to the original networks.

  • So theta super Q prime gets initialized with theta super Q, and theta super mu prime gets initialized with theta super mu.

  • So we will be updating the weights of the target networks right off the bat with the evaluation networks, and then initialize the replay buffer R. Now this is an interesting question: how do you initialize that replay buffer?

  • So I've used a couple different methods.

  • You can just initialize it with all zeros.

  • And then, if you do that, when you perform the learning you want to make sure that you have a number of memories greater than or equal to the minibatch size of your training, so that you're not sampling the same states more than once, right?

  • If you have 64 memories in a batch that you want to sample, but you only have 16 memories in your replay buffer, then you're gonna sample those 16 memories, and you're going to sample each of those memories about four times, right?

  • So then that's no good.

  • So the question becomes: if you initialize the replay buffer with zeros, then you have to make sure that you don't learn until you exit the warm-up period, where the warm-up period is just a number of steps equal to your buffer sample size. Or you can initialize it with actual environment play.

  • Now, this takes quite a long time.

  • You know, the replay buffers are of order a million transitions.

  • So if you let the algorithm take a million steps at random, then it's gonna take a long time.

  • I use zeros and then, you know, just wait until the agent fills up a minibatch-size worth of memories; just a minor detail there.
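
A minimal sketch of that warm-up check, reusing the ReplayBuffer sketch from earlier; the agent.learn call is a hypothetical placeholder for whatever update routine you write.

```python
# Skip learning until the buffer holds at least one full minibatch of real transitions.
def maybe_learn(agent, buffer, batch_size=64):
    if buffer.mem_cntr < batch_size:
        return                       # still in the warm-up period
    states, actions, rewards, new_states, dones = buffer.sample(batch_size)
    agent.learn(states, actions, rewards, new_states, dones)
```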

  • Then it says: for some number of episodes, do. So it's a for loop: initialize a random process N for action exploration.

  • So this is something where, now reading it, I realize I actually made a little bit of a mistake.

  • In my previous implementation, I didn't reset the noise process at the top of every episode.

  • It's explicit here; I must have missed that line.

  • And I've looked at other people's code.

  • Some do, some don't. But it worked: within, how many episodes was it, under 1,000 episodes, the agent managed to beat the continuous lunar lander environment.

  • So is that critical?

  • Maybe not.

  • And I think I mentioned that in the video. Receive the initial state observation s 1.

  • So for each step of the episode, t goes from 1 to capital T, do: select the action a sub t equals mu of s t, the policy, plus N sub t, according to the current policy and exploration noise.

  • Okay, so that's straightforward.

  • Just feed the state forward.

  • What does that mean?

  • It means feed the state forward through the network, receive the vector output of the action, and add some noise to it.
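
In code, that step might look something like this minimal sketch, where actor and noise follow the earlier sketches and the tensor handling is an assumption on my part:

```python
import torch

# a_t = mu(s_t | theta_mu) + N_t: one forward pass through the actor plus exploration noise.
def choose_action(actor, noise, state):
    state = torch.tensor(state, dtype=torch.float32).unsqueeze(0)  # add a batch dimension
    with torch.no_grad():
        mu = actor(state).squeeze(0).numpy()
    return mu + noise()
```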

  • Okay: execute the action and observe the reward and new state. Simple. Store the transition, you know, the old state, action, reward, and new state, in your replay buffer R.

  • Okay, that's straightforward. Each time step, sample a random minibatch of N transitions from the replay buffer.

  • And then you want to use that set of transitions to set y sub i.

  • So i is, sorry, having difficulties here, i is each element of that minibatch of transitions.

  • So you wanna basically loop over that set or do a vectorized implementation. Looping is more straightforward; that's what I do.

  • I always opt for the most straightforward, and not necessarily most efficient, way of doing things the first time through, because you want to get it working first and worry about efficiency later.

  • So, set y sub i equals r sub i plus gamma, gamma is your discount factor, times Q prime of the new state s sub i plus one, where the action is chosen according to mu prime, given the weights theta super mu prime and theta super Q prime.

  • So what's important here, and this isn't immediately clear if you're reading this for the first time, is a very important detail: the action must be chosen according to the target actor network.

  • So you actually have Q as a function of the state as well as the output, excuse me, of another network.

  • That's very important.

  • Update the critic by minimizing the loss: basically a weighted average of that y sub i minus the output from the actual Q-network, where the a sub i's are the actions you actually took during the course of the episode.

  • So this a sub i is from the replay buffer, and these other actions are chosen according to the target actor network.

  • So for each learning step, you're gonna have to do a feedforward pass of not just this target Q-network, but also the target actor network, as well as the evaluation critic network.

  • I hope I said that right.

  • So: a feedforward pass of the target critic network, as well as the target actor network, and the evaluation critic network as well.
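
Putting those pieces together, a minimal sketch of the critic update might look like this, assuming the states, actions, rewards, new_states, and dones batches are already torch tensors; the terminal-state masking is my own standard addition, not something spelled out in the algorithm box.

```python
import torch
import torch.nn.functional as F

def critic_update(critic, target_critic, target_actor, critic_optim,
                  states, actions, rewards, new_states, dones, gamma=0.99):
    with torch.no_grad():
        target_actions = target_actor(new_states)               # mu'(s_{i+1} | theta_mu')
        q_next = target_critic(new_states, target_actions).squeeze(-1)
        q_next[dones] = 0.0                                      # no bootstrapping past terminal states
        y = rewards + gamma * q_next                             # y_i
    q = critic(states, actions).squeeze(-1)                      # Q(s_i, a_i | theta_Q)
    loss = F.mse_loss(q, y)                                      # mean over the minibatch
    critic_optim.zero_grad()
    loss.backward()
    critic_optim.step()
    return loss.item()
```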

  • And then it says: update the actor policy using the sampled policy gradient. This is the hardest step in the whole thing.

  • This is the most confusing part.

  • The gradient is equal to one over N times a sum, so a mean, basically; whenever you see one over N times a sum, that's a mean. It's the gradient with respect to the actions of Q, where the actions are chosen according to the policy mu of the current states s, times the gradient with respect to the weights of mu, where you just plug in the set of states. Okay, so that'll be a little bit tricky to implement.

  • And this is part of the reason I chose TensorFlow for this particular video: because TensorFlow allows us to calculate gradients explicitly.

  • In PyTorch, you may have noticed that all I did was set Q to be a function of the current state as well as the actor network, and so I allowed PyTorch to handle the chain rule.

  • This is effectively a chain rule, so let's scroll up a little bit to look at that, because this kind of gave me pause the first ten times that I read it.

  • So this is the hardest part to implement.

  • If you scroll up, you see that this exact same expression appears here, right?

  • And this is in reference to this.

  • So it's a gradient with respect to the weights theta super mu of Q of s and a, such that you're choosing the action a according to the policy mu.

  • So really, what this is is the chain rule.

  • So this gradient is proportional to the gradient of this quantity times the gradient of the other quantity.

  • It's just the chain rule from calculus.

  • So, in the PyTorch implementation, we implemented this version, and these are equivalent.

  • It's perfectly valid to do one or the other.

  • So in PyTorch we did this version; today we're gonna do this particular version, so that's good to know.
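
For comparison, here is a minimal sketch of the autograd-friendly form described above, the one used in the PyTorch video: maximizing Q(s, mu(s)) and letting the framework apply the chain rule is equivalent to the explicit sampled policy gradient.

```python
def actor_update(actor, critic, actor_optim, states):
    actor_optim.zero_grad()
    actions = actor(states)                        # a = mu(s | theta_mu)
    actor_loss = -critic(states, actions).mean()   # ascend Q by descending -Q
    actor_loss.backward()                          # autograd chains grad_a Q with grad_theta mu
    actor_optim.step()
    return actor_loss.item()
```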

  • All right, so the next step: on each time step, you want to update the target networks according to this soft update rule, so theta super Q prime gets updated as tau times theta super Q plus one minus tau times theta super Q prime.
