

  • What is up, everybody.

  • Today you're gonna learn how to go from a paper to a fully functional implementation of deep deterministic policy gradients.

  • If you're not familiar with deep deterministic policy gradients, or DDPG for short, it is a type of deep reinforcement learning that is used in environments with continuous action spaces.

  • You see, most environments have discrete action spaces.

  • This is the case with, say, the Atari library, like Breakout or Space Invaders, where the agent can move left, move right, or shoot, but it can only do so by fixed, discrete intervals.

  • Fixed amounts, right? In other environments like, say, robotics, the robot can move by a continuous amount, anywhere from minus one to plus one.

  • Anything along a continuous number interval.

  • And this poses a problem for most deep reinforcement learning methods like, say, Q-learning, which work spectacularly well in discrete environments but cannot tackle continuous action spaces.

  • Now, if you don't know what any of this means, don't worry.

  • I'm gonna give you the rundown here in a second.

  • But for this set of tutorials, you're gonna need to have installed the OpenAI Gym, you'll need Python 3.6, and you'll also need TensorFlow and PyTorch.

  • Other packages you'll need include matplotlib to handle the plotting of the learning curve, which will allow us to see the actual learning of the agent, as well as numpy to handle your typical vector operations.
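
If you want to sanity-check your setup before following along, a minimal sketch like this will confirm everything imports (package names are the standard ones; exact versions will differ on your machine):

```python
# Quick sanity check that the required packages are installed.
# Exact versions will vary depending on your setup.
import gym
import numpy as np
import matplotlib
import torch
import tensorflow as tf

print("gym:", gym.__version__)
print("numpy:", np.__version__)
print("matplotlib:", matplotlib.__version__)
print("torch:", torch.__version__)
print("tensorflow:", tf.__version__)
```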

  • Now, uh, here, I'll give you a quick little rundown of reinforcement learning.

  • So the basic idea is that we have an agent that interacts with some environment and receives a reward.

  • The rewards kind of take the place of labels in supervised learning, in that they tell the agent what is good, what it is shooting for in the environment.

  • And so the agent will attempt to maximize the total rewards over time by solving something known as the Bellman equation.

  • We don't have to worry about the actual mathematics of it. But just so you know, for your future research, the algorithms are typically concerned with solving the Bellman equation, which tells the agent the expected future returns, assuming it follows something called its policy.

  • So the policy is the probability that the agent will take a set of actions given it's in some state s; it's basically a probability distribution.

  • Now many types of algorithms, such as Q-learning, will attempt to solve the Bellman equation by finding what's called the value function.

  • The value function, or the action-value function in this case, maps the current state and set of possible actions to the expected future returns the agent expects to receive.

  • So in other words, the agent says, Hey, I'm in some state, meaning some configuration of pixels on the screen in the case of the Atari library, for instance, and says, Okay, if I take one or another action, what is the expected future return, assuming that I follow my policy?

  • Actor-critic methods are slightly different in that they attempt to learn the policy directly, and recall the policy is a probability distribution that tells the agent what the probability of selecting an action is, given it's in some state s.
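
For reference, in standard notation (the same form the DDPG paper uses), the policy and the Bellman equation for the action-value function look roughly like this:

```latex
% The policy: probability of selecting action a_t given state s_t
\pi(a_t \mid s_t)

% Bellman equation for the action-value function under policy \pi:
% the value of (s_t, a_t) is the immediate reward plus the discounted
% expected value of the next state-action pair, following \pi thereafter.
Q^{\pi}(s_t, a_t) = \mathbb{E}\big[\, r(s_t, a_t)
    + \gamma \, \mathbb{E}_{a_{t+1} \sim \pi}\big[ Q^{\pi}(s_{t+1}, a_{t+1}) \big] \big]
```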

  • So these two algorithms have a number of strengths between them.

  • Ah, and deep deterministic policy gradients is a way to marry the strengths of these two algorithms into something that does really well for continuous action spaces.
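
To make that actor-critic split concrete, here is a minimal, hypothetical PyTorch-style sketch (class names and layer sizes are mine, purely illustrative): the actor maps a state directly to a continuous action, and the critic scores a state-action pair.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: maps a state to a continuous action in [-1, 1]."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),  # tanh keeps the action bounded
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Action-value function: maps a (state, action) pair to a scalar Q estimate."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```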

  • You don't need to know too much more than that.

  • Everything else you need to know.

  • I'll explain in their respective videos.

  • So in the first video, you're going to get to see how I go ahead and read papers and then implement them on the fly.

  • Um, and in the second video, you're going to see the implementation of deep deterministic policy gradients in PyTorch.

  • In a separate environment. Both of these environments are continuous, and so they will demonstrate the power of the algorithm quite nicely.

  • You don't need a particularly powerful GPU, but you do need some kind of GPU to run these, as it does take a considerably long time, even on a GPU.

  • So you will need at least, say, a Maxwell-class GPU or above.

  • So something from the 700 series on the Nvidia side.
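
If you're not sure whether your GPU is actually visible to either framework, a quick check like this will tell you (tf.test.is_gpu_available is the TF 1.x-era call; newer TensorFlow versions use tf.config.list_physical_devices):

```python
import torch
import tensorflow as tf

print("PyTorch sees CUDA:", torch.cuda.is_available())
# TF 1.x-style check; on TF 2.x use: tf.config.list_physical_devices('GPU')
print("TensorFlow sees a GPU:", tf.test.is_gpu_available())
```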

  • Unfortunately, neither of these frameworks really works well with AMD cards.

  • So if you have those, you'd have to figure out some sort of kludge to get the OpenCL implementation to transcompile to CUDA.

  • That's just a technical detail.

  • Uh, I don't have any information on that.

  • So you're on your own.

  • Sorry.

  • So this is a few hours of content.

  • Grab a snack and a drink and watch it at your leisure.

  • It's best to watch it in order.

  • I actually did the videos in reverse order on my channel, just so I could get them out.

  • So I did the implementation in PyTorch first and then the video on implementing the paper in TensorFlow second.

  • But it really is best for a new audience to go from the paper video to the PyTorch video.

  • So I hope you like it.

  • Leave any comments, questions, suggestions, issues down below.

  • I'll try to address as many as possible.

  • You can check out the code for this on my GitHub.

  • And you can find many more videos like this on my YouTube channel, Machine Learning with Phil.

  • I hope you all enjoy it.

  • Let's get to it.

  • What is up, everybody? In today's video, we're gonna go from the paper on deep deterministic policy gradients all the way to a functional implementation in TensorFlow.

  • So you're going to see how to go from a paper to a real world implementation.

  • All in one video. Grab a snack and a drink; it's going to take a while.

  • Let's get started.

  • So the first step in my process really isn't anything special.

  • I just read the entirety of the paper.

  • Of course, starting with the abstract. The abstract tells you what the paper is about at a high level; it's just kind of an executive summary.

  • The introduction is where the authors will pay homage to other work in the field, kind of set the stage for what is going to be presented in the paper, as well as the need for it.

  • Ah, the background kind of expands on that, and you can see here.

  • It gives us a few mathematical equations, and you will get a lot of useful information here.

  • This won't give too many useful nuggets on implementation, but it does set the stage for the mathematics we'll be implementing, which is of course critical for any deep learning or, in this case, deep reinforcement learning paper implementation.

  • The algorithm is really where all the meat of the problem is.

  • It is in here that they lay out the exact steps you need to take to implement the algorithm, right? That's why it's titled that way.

  • So this is the section you want to read most carefully.

  • And then, of course, they will typically give a table where they outline the actual algorithm.

  • And oftentimes, if I'm in a hurry, I will just jump to this because I've done this enough times that I can read this.

  • What is called pseudocode. If you're not familiar with that, pseudocode is just an English representation of computer code, so we will typically use that to outline a problem.

  • It's often used in papers, of course.

  • So typically, I'll start here reading it and then work backward by reading through the paper to see what I missed.

  • But of course it talks about the performance across a whole host of environments.

  • And of course, all of these have in common that they are continuous control.

  • So, what that means is that the action space is a vector whose elements can vary on a continuous real number line, instead of having discrete actions of 0, 1, 2, 3, 4, 5.

  • That is the real motivation behind deep deterministic policy gradients: it allows us to use deep reinforcement learning to tackle these types of problems. And in today's video, we're gonna go ahead and tackle the pendulum swing-up, also called the pendulum problem.

  • Reason being is that while it would be awesome to start out with something like the bipedal walker, you never want to start out with maximum complexity.

  • You always want to start out with something very, very small and then scale your way up, and the reason is that you're gonna make mistakes.

  • And it's easiest and quickest to debug very simple environments that execute very quickly.

  • So the pendulum problem only has, I think, three elements in its state vector and only a single action.

  • Or maybe it's two actions, I forget; a quick check like the one below will settle it.
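
Rather than guessing, the environment will tell you; for instance (this uses the Gym-era id 'Pendulum-v0', which newer Gym releases renamed to 'Pendulum-v1'):

```python
import gym

env = gym.make('Pendulum-v0')  # 'Pendulum-v1' in newer Gym releases
print(env.observation_space.shape)  # (3,)  -> three elements in the state vector
print(env.action_space.shape)       # (1,)  -> a single continuous action (the torque)
print(env.action_space.low, env.action_space.high)  # [-2.] [2.]
```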

  • But either way, it's a very small problem relative to something like the bipedal walker or many of the other environments.

  • You could also use the continuous version of the cartpole or something like that.

  • That would be perfectly fine.

  • I've just chosen the pendulum because we haven't done it before. So it's in here that they give a bunch of plots of the performance of their algorithm with various sets of constraints placed upon it and different implementations, so you can get an idea.

  • And one thing you notice right away.

  • Um, it's always important to look at plots because they give you a lot of information visually, right?

  • It's much easier to gather information from plots than it is from text.

  • You see that right away they have a scale of one.

  • So that's telling you it's relative performance, and you have to read the paper to see: relative to what?

  • Um, I don't like that particular approach.

  • Ah, they have similar data in a table form.

  • And here you see a whole bunch of environments they used, and it's a broad, broad variety.

  • They wanted to show that the algorithm has a wide arena of applicability, which is a typical technique in papers.

  • They want to show that this is relevant, right?

  • If they only showed a single environment, people reading it would say, Well, that's all well and good.

  • You can solve one environment.

  • What about these dozen other environments, right?

  • And part of the motivation behind reinforcement learning is generality: can we model real learning in biological systems such that it mimics the generality of biological learning?

  • One thing you notice right away is that these numbers are not actual scores.

  • So that's one thing I kind of take note of, and it causes me to raise an eyebrow.

  • So, what is the motivation behind that? Why would the authors express scores as ratios?

  • A couple different reasons.

  • One is that they just want to make all the numbers look uniform.

  • Maybe the people reading the paper wouldn't be familiar with each of these environments, so they don't know what a good score is.

  • And that's a perfectly valid reason.

  • Another possibility is they want to hide poor performance.

  • I don't think that's going on here, but it does make me raise my eyebrow whenever I see it.

  • The one exception is TORCS, which is The Open Racing Car Simulator environment.

  • I don't know if we'll get to that on this channel.

  • That would be a pretty cool project, but that would take me a few weeks to get through, Um, but right away you notice that they have a whole bunch of environments.

  • The scores are all relative to one, and one is the score obtained by a planning algorithm, which they also detail later on.

  • So those were the results. And, I don't think we saw the heading, but they talk about related work, which covers other algorithms that are similar and their shortcomings, right?

  • They don't ever want to talk up other algorithms.

  • You always wanna talk up your own.

  • Right, to make yourself sound good.

  • You know, why else would you be writing a paper in the first place? And of course, the conclusion ties everything together. As for the references, I don't usually go deep into the references.

  • Um, if there is something that I feel I really, really need to know, I may look at a reference, but I don't typically bother with them.

  • If you were a PhD student, then it would behoove you to go into the references because you must be an absolute expert on the topic.

  • And for us, we're just, you know, hobbyists on YouTube, so I don't go into too much depth with the background information.

  • And the next most important bit of the paper are the experimental details.

  • And it is in here that it gives us the parameters and architectures for the networks.

  • So this is where, if you saw my previous video where I did the implementation of DDPG in PyTorch in the continuous lunar lander environment, this is where I got most of this stuff.

  • It was almost identical.

  • With a little bit of tweaking.

  • I left out some stuff from this paper, but, uh, pretty much all of it came from here.

  • In particular the hidden layer sizes of 400 and 300 units, as well as the initialization of the parameters from uniform distributions over the given ranges. So just to recap, this was a really quick overview of the paper, just showing my process of what I look at.
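
As a rough sketch of how those experimental details translate into code (PyTorch here for illustration; this is my reading of the paper's experiment details, not the authors' code): hidden layers of 400 and 300 units, hidden-layer parameters drawn uniformly from ±1/√fan-in, and the final layer from ±3e-3.

```python
import numpy as np
import torch.nn as nn

def fanin_uniform_(layer):
    """Initialize weights and biases from U(-1/sqrt(fan_in), 1/sqrt(fan_in))."""
    bound = 1.0 / np.sqrt(layer.weight.size(1))  # fan-in = number of input features
    nn.init.uniform_(layer.weight, -bound, bound)
    nn.init.uniform_(layer.bias, -bound, bound)
    return layer

# Hidden layers of 400 and 300 units, as given in the paper's experiment details.
fc1 = fanin_uniform_(nn.Linear(3, 400))    # e.g. the pendulum's 3-dim state as input
fc2 = fanin_uniform_(nn.Linear(400, 300))

# The final layer is initialized from U(-3e-3, 3e-3)
# so the initial outputs (and gradients) stay near zero.
out = nn.Linear(300, 1)
nn.init.uniform_(out.weight, -3e-3, 3e-3)
nn.init.uniform_(out.bias, -3e-3, 3e-3)
```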

  • Uh, the most important parts are the details of the algorithm as well as the experimental details.

  • So, as you read the paper, like I said, I gloss over the introduction because I kind of already understand the motivation behind it.

  • I get the idea.

  • It basically tells us that you can't really handle continuous action spaces with deep Q-networks.

  • We already know that, and it says, you know, you can discretize the action space, but then you end up with a really, really huge number of actions, a whole boatload of actions.

  • You know, what is it, 2187 actions?

  • So it's intractable anyway.
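
That 2187 figure is the paper's own example of why discretization blows up: a 7-degree-of-freedom system with each action dimension coarsely discretized to three values already gives 3^7 joint actions.

```python
# Paper's example: 7 action dimensions, each discretized into three values
# (e.g. {-k, 0, +k}) -> 3**7 possible joint actions.
n_levels, n_dims = 3, 7
print(n_levels ** n_dims)  # 2187
```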

  • And they say we present, you know, a model-free, off-policy algorithm.