
  • Neural networks are good for learning lots of different types of patterns. To give an example of how this would work,

  • imagine you had a four-pixel camera. So, not four megapixels, but just four pixels.

  • And it was only black and white, and you wanted to go around and take pictures of things and determine automatically whether these pictures were of a solid all-white or all-dark image, a vertical line, a diagonal line, or a horizontal line.

  • This is tricky because you can't do this with simple rules about the brightness of the pixels.

  • Both of these are horizontal lines.

  • But if you tried to make a rule about which pixel was bright and which was dark, you wouldn't be able to do it.

  • So to do this with the neural network, you start by taking all of your inputs.

  • In this case, our four pixels, and you break them out into input neurons and assign a number to each of these, depending on the brightness or darkness of the pixel.

  • Plus one is all-the-way white, minus one is all-the-way black, and gray is zero, right in the middle.

  • These values, once you have them broken out and listed like this on the input neurons, are also called the input vector, or array.

  • It's just a list of numbers that represents your inputs.

  • Now it's a useful notion to think about the receptive field of a neuron.

  • All this means is what set of inputs makes the value of this neuron as high as it can possibly be.

  • For input neurons, this is pretty easy.

  • Each one is associated with just one pixel.

  • And when that pixel is all the way white, the value of that input neuron is as high

  • as it can go. The black-and-white checkered areas show pixels that an input neuron doesn't care about.

  • If they're all the way white or all the way black, it still doesn't affect the value of that input neuron at all.

  • Now, to build a neural network, we create a neuron.

  • The first thing this does is it adds up all of the values of the input neurons.

  • So in this case, if we add up all of those values,

  • we get a 0.5. Now, to complicate things just a little bit,

  • each of the connections is weighted, meaning it's multiplied by a number that can be one or minus one or anything in between.

  • So, for instance, if something has a weight of minus one, it's multiplied by minus one, you get the negative of it, and that's what gets added in.

  • If something has a weight of zero, then it's effectively ignored.

  • So here's what those weighted connections might look like.

  • You'll notice that after the values of the input neurons are weighted and added, the final value is completely different.

  • Graphically, it's convenient to represent these weights as links, with white links being positive weights, black links being negative weights, and the thickness of the line roughly proportional to the magnitude of the weight.

  • Then, after you add the weighted input neurons, they get squashed, and I'll show you what that means.

  • You have a sigmoid squashing function.

  • Sigmoid just means s shaped.

  • And what this does is, you put a value in, let's say 0.5, and you run a vertical line up to your sigmoid, and then a horizontal line over from where it crosses.

  • And then where that hits the y axis,

  • that's the output of your function.

  • So in this case, slightly less than 0.5; it's pretty close.

  • As your input number gets larger,

  • your output number also gets larger, but more slowly, and eventually, no matter how big the number you put in, the answer is always less than one.

  • Similarly, when you go negative, the answer is always greater than negative one.

  • So this ensures that the neuron's value never gets outside of the range of plus one to minus one, which is helpful for keeping the computations in the neural network bounded and stable.

  • So after you sum the weighted values of the neurons and squash the result, you get the output: in this case, 0.746. That is a neuron.

  • So we can collapse all of that down:

  • this is a neuron that does a weighted sum and squashes the result.
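
The weighted-sum-and-squash neuron just described can be sketched in a few lines. This is a minimal sketch: the transcript only specifies an s-shaped squashing curve that keeps values between minus one and plus one, so tanh is used here as an assumption, and the pixel values and weights are made-up examples.

```python
import math

def neuron(inputs, weights):
    """One neuron: weighted sum of the inputs, then squash.

    The exact s-shaped squashing curve isn't specified in the text,
    only that it keeps values between -1 and +1; tanh is one common
    function with that range, used here as an assumption."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return math.tanh(total)

# Four pixel values (+1 white, -1 black, 0 gray) and example weights.
pixels = [0.5, -0.25, 0.75, -0.5]
weights = [1.0, -1.0, 1.0, 0.0]   # a weight of 0 effectively ignores that pixel
print(neuron(pixels, weights))    # squashed value of the weighted sum, 1.5
```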

  • And now, instead of just one of those, assume you have a whole bunch.

  • There are four shown here, but there could be 400 or four million. Now, to keep our picture clear, we'll assume for now that the weights are either plus one (white lines), minus one (black lines), or zero, in which case they're missing entirely.

  • But in actuality, all of these neurons that we created are each attached to all of the input neurons, and they all have some weight between minus one and plus one.

  • When we create this first layer of our neural network, the receptive fields get more complex.

  • For instance, here each of them ends up combining two of our input neurons.

  • And so the receptive field, the pixel values that make that first-layer neuron as large as it can possibly be,

  • looks now like pairs of pixels, either all white or a mixture of white and black, depending on the weights.

  • So, for instance, this neuron here is attached to this input pixel, upper left, and this input pixel, lower left.

  • And both of those weights are positive, so it combines the two of those.

  • And that's its receptive field.

  • The receptive field of this one plus the receptive field of that one.

  • However, if we look at this neuron, it combines this pixel, upper right, and this pixel, lower right.

  • It has a weight of minus one for the lower right pixel, so that means it's most active when this pixel is black.

  • So here is its receptive field. Now, because we were careful about how we created that first layer,

  • Its values look a lot like input values, and we can turn right around and create another layer on top of it, the exact same way with the output of one layer being the input to the next layer.

  • And we can repeat this three times or seven times or 700 times for additional layers.

  • Each time, the receptive fields get even more complex. You can see here, using the same logic,

  • that now they cover all of the pixels, in more and more special arrangements of which are black and which are white.

  • We can create another layer. Again,

  • all of these neurons in one layer are connected to all of the neurons in the previous layer.

  • But we're assuming here that most of those weights are zero and not shown. That's not generally the case. So, just to mix things up, we'll create a new layer.

  • But if you notice, our squashing function isn't there anymore.

  • We have something new called a rectified linear unit.

  • This is another popular neuron type. You do your weighted sum of all your inputs,

  • and instead of squashing,

  • you rectify it.

  • So if the sum is negative, you make the value zero; if it's positive, you keep the value.

  • This is obviously very easy to compute, and it turns out to have very nice stability properties for neural networks in practice as well.
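
The rectification rule just described is a one-liner:

```python
def relu(x):
    """Rectified linear unit: negative values become zero,
    positive values pass through unchanged."""
    return x if x > 0 else 0.0

print(relu(-0.8))  # 0.0
print(relu(0.3))   # 0.3
```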

  • So after we do this, because some of our weights connecting to those rectified linear units are positive and some are negative, we get receptive fields and their opposites.

  • Look at the patterns there. Then finally, when we've created as many layers with as many neurons as we want, we create an output layer.

  • Here we have the four outputs we're interested in: is the image solid, vertical, diagonal, or horizontal?

  • So to walk through an example of how this would work, let's say we start with the input image shown on the left: dark pixels on top, white on the bottom. As we propagate that to our input layer,

  • This is what those values would look like.

  • The top pixels, the bottom pixels.

  • As we move that to our first layer, we can see that the combination of a dark pixel and a light pixel,

  • summed together, gives a zero: gray.

  • Whereas down here,

  • we have the combination of a dark pixel plus a light pixel with a negative weight,

  • so that gets a value of negative one there.

  • Which makes sense, because if we look at the receptive field here upper left pixel white, lower left, pixel black, it's the exact opposite of the input that we're getting.

  • And so we would expect its value to be as low as possible minus one.

  • As we move to the next layer, we see the same types of things: combining zeros to get zeros, and combining a negative and a negative with a negative weight, which makes a positive, to get a zero.

  • And here we have combining two negatives to get a negative.

  • So again you'll notice the receptive field of this is exactly the inverse of our input.

  • So it makes sense that its value would be negative.

  • As we move to the next layer,

  • all of the zeros, of course, propagate forward.

  • Here, this one has a negative value and a positive weight, so it just moves straight forward.

  • Because we have a rectified linear unit,

  • negative values become zero, so now it is zero again, too.

  • But this one gets rectified and becomes positive.

  • Negative times negative is positive.

  • And so when we finally get to the output, we can see they're all zero, except for this horizontal, which is positive.

  • And that's the answer our neural network gives:

  • this is an image of a horizontal line.

  • Now neural networks usually aren't that good, not that clean.

  • So there's a notion of, with a given input,

  • what is truth? In this case,

  • the truth is that this image has a zero for all of these values, but a one for horizontal.

  • It's not solid.

  • It's not vertical.

  • It's not diagonal.

  • Yes, it is horizontal.

  • An arbitrary neural network will give answers that are not exactly truth.

  • It might be off by a little or a lot.

  • And then the error is the magnitude of the difference between the truth and the answer given, and you can add all of these up to get the total error for the neural network.
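
The error calculation just described can be sketched like this; the truth and output values are hypothetical examples:

```python
def total_error(truth, outputs):
    """Sum of the magnitudes of the differences between the
    truth values and the network's answers."""
    return sum(abs(t - o) for t, o in zip(truth, outputs))

truth   = [0.0, 0.0, 0.0, 1.0]    # solid, vertical, diagonal, horizontal
outputs = [0.1, -0.2, 0.0, 0.7]   # a made-up, imperfect network answer
print(total_error(truth, outputs))  # 0.1 + 0.2 + 0.0 + 0.3 = 0.6
```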

  • So the whole idea with learning and training is to adjust the weights to make the error as low as possible.

  • So the way this is done is we put an image in.

  • We calculate the error at the end, then we look at how adjusting those weights higher or lower makes that error go up or down.

  • And we, of course, adjust the weights in the way that makes the error go down.

  • Now the problem with doing this is each time we go back and calculate the error, we have to multiply all of those weights by all of the neuron values at each layer.

  • And we have to do that again and again, once for each weight.

  • This takes forever, in computing terms.

  • And so it's not a practical way to train a big neural network. You can imagine,

  • instead of just rolling down to the bottom of a simple valley,

  • we have a very high-dimensional valley, and we have to find our way down.

  • And because there are so many dimensions, one for each of these weights, the computation just becomes prohibitively expensive.

  • Luckily, there was an insight that lets us do this in a very reasonable time, and it's this:

  • if we're careful about how we design our neural network, we can calculate the slope directly,

  • the gradient. We can figure out the direction that we need to adjust the weight without going all the way back through our neural network and recalculating.

  • So, just to review:

  • The slope that we're talking about is this: when we make a change in a weight, the error will change a little bit.

  • And that relation of the change in weight to the change in error is this slope.

  • Mathematically, there are several ways to write this.

  • We'll favor the one on the bottom.

  • It's technically most correct.

  • We'll call it de/dw for shorthand. Every time you see it,

  • just think: the change in error when I change a weight, or the change in the thing on the top when I change the thing on the bottom.

  • This does get into a little bit of calculus.

  • We do take derivatives.

  • It's how we calculate slope.

  • If it's new to you, I strongly recommend a good semester of calculus, just because the concepts are so universal, and a lot of them have very nice physical interpretations, which I find very appealing.

  • But don't worry.

  • Otherwise, just gloss over this and pay attention to the rest, and you'll get a general sense for how this works.

  • So in this case, if we change the weight by plus one, the error changes by minus two, which gives us a slope of minus two.

  • That tells us the direction that we should adjust our weight and how much we should adjust it to bring the error down.

  • Now, to do this, you have to know what your error function is.

  • So assume we had an error function

  • that was the square of the weight.

  • And you can see that our weight is right at minus one.

  • So the first thing we do is take the derivative, the change in error divided by the change in weight, de/dw. The derivative of weight squared is two times the weight.

  • And so we plug in our weight of minus one, and we get a slope, de/dw, of minus two.
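
The worked example above, with an error function of weight squared, can be checked directly; the numerical nudge at the end is just a sanity check of the analytic slope:

```python
def error(w):
    return w ** 2            # the assumed error function from the example

def slope(w):
    return 2 * w             # derivative of w squared: de/dw = 2w

w = -1.0
print(slope(w))              # -2.0, matching the example

# Numerical sanity check: nudge w a little and watch the error change.
h = 1e-6
numeric = (error(w + h) - error(w)) / h
print(abs(numeric - slope(w)) < 1e-3)   # True
```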

  • Now, the other trick that lets us do this with deep neural networks is chaining.

  • And to show you how this works, imagine a very simple, trivial neural network with just one hidden layer: one input layer, one output layer, and one weight connecting each of them.

  • So it's obvious to see that the value y is just the value x times the weight connecting them,

  • w1.

  • So if we change w1 a little bit, we just take the derivative of y with respect to w1, and we get x: the slope is x.

  • If I change w1 by a little bit, then y will change by x times the size of that adjustment.

  • Similarly, for the next step, we can see that e is just the value

  • y times the weight

  • w2.

  • And so when we calculate de/dy, it's just w2.

  • Because this network is so simple, we can calculate from one end to the other:

  • x times w1 times w2 is the error e. And so if we want to calculate how much the error will change

  • if I change w1, we just take the derivative of that with respect to w1 and get x times w2.

  • So this illustrates that what we just calculated is actually the product of the first derivative we took, dy/dw1, times the derivative for the next step, de/dy, multiplied together.

  • This is chaining:

  • you can calculate the slope of each tiny step, and then multiply all of those together to get the slope of the full chain, the derivative of the full chain.
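
A sketch of the chaining idea on the trivial x, y, e network; the values of x, w1, and w2 are arbitrary examples:

```python
# Tiny chain: x --w1--> y --w2--> e
x, w1, w2 = 2.0, 0.5, -3.0

y = x * w1                   # y = x * w1
e = y * w2                   # e = y * w2, so e = x * w1 * w2

dy_dw1 = x                   # slope of the first step
de_dy = w2                   # slope of the second step

# Chaining: multiply the small slopes to get de/dw1 = x * w2.
de_dw1 = dy_dw1 * de_dy
print(de_dw1)                # -6.0

# Numerical check: nudge w1 and see how e changes.
h = 1e-6
numeric = (x * (w1 + h) * w2 - e) / h
print(abs(numeric - de_dw1) < 1e-3)   # True
```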

  • So in a deeper neural network, what this looks like is, if I want to know how much the error will change

  • if I adjust a weight that's deep in the network,

  • I just calculate the derivative of each tiny little step all the way back to the weight that I'm trying to calculate and then multiply them all together.

  • Computationally, this is many, many times cheaper than what we had to do before, recalculating the error for the whole neural network

  • for every weight.

  • Now, in the neural network that we've created, there are several types of operations we have to back-propagate through.

  • For each one of those,

  • we have to be able to calculate the slope.

  • So the first one is just a weighted connection between two neurons, a and b.

  • Let's assume we know the change in error with respect to b; we want to know the change in error with respect to a. To get there, we need to know db/da.

  • So to get that, we just write the relationship between b and a: b equals w times a.

  • Take the derivative of b with respect to a, and you get the weight w. And now we know how to make that step.

  • We know how to do that little nugget of back propagation.

  • Another element that we've seen is sums.

  • All of our neurons sum up a lot of inputs. To take this back-propagation step,

  • We do the same thing.

  • We write our expression, and then we take the derivative of our endpoint z with respect to the step we're propagating to, a. And dz/da

  • in this case is just one, which makes sense:

  • if we have a sum of a whole bunch of elements and we increase one of those elements by one,

  • we expect the sum to increase by one.

  • That's the definition of a slope of one, a one-to-one relation. Another element that we need to be able to back-propagate is the sigmoid function.

  • So this one's a little bit more interesting.

  • Mathematically, we'll just write it in shorthand like this:

  • the sigma function.

  • It is entirely feasible

  • to go through and take the derivative of this analytically and calculate it. It just so happens that this function has a nice property: to get its derivative, you just multiply it by one minus itself.

  • So this is very straightforward to calculate.
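
The nice property of the sigmoid can be verified numerically; this assumes the standard logistic form 1 / (1 + e^(-x)):

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_slope(x):
    """The nice property from the text: the derivative is
    the function times one minus itself."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Numerical check at an arbitrary point.
x, h = 0.3, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x)) / h
print(abs(numeric - sigmoid_slope(x)) < 1e-4)   # True
```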

  • Another element that we've used is the rectified linear unit. Again, to figure out how to back-propagate this, we just write out the relation: b is equal to a if a is positive,

  • otherwise it's zero. And piecewise,

  • for each of those, we take the derivative, so db/da

  • is either one, if a is positive, or zero.

  • And so with all of these little back-propagation steps and the ability to chain them together, we can calculate the effect of adjusting any given weight on the error for any given input. And so to train,

  • Then we start with a fully connected network.

  • We don't know what any of these weights should be, and so we assign them all random values.

  • We create a completely arbitrary, random neural network.

  • We put in an input that we know the answer to.

  • We know whether it's solid vertical, diagonal or horizontal.

  • So we know what truth should be and so we can calculate the error.

  • Then we run it through, calculate the error and using back propagation, go through and adjust all of those weights a tiny bit in the right direction.

  • And then we do that again with another input and again with another input.

  • if we can get away with it, many thousands or even millions of times, and eventually all of those weights will gravitate.

  • They'll roll down that many-dimensional valley to a nice low spot in the bottom, where the network performs really well and does pretty close to truth on most of the images.
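
A toy version of this training loop, shrunk to a single weight so it fits in a few lines. It uses numerical slopes rather than back propagation, and squared error rather than the magnitude of the difference, purely to keep the sketch short and smooth; the data and learning rate are made up:

```python
import math

# One-weight network y = tanh(w * x), fit to data generated with a
# "true" weight of 0.8. The loop is the one described: compute the
# error, find the slope, nudge the weight downhill, repeat.
data = [(x, math.tanh(0.8 * x)) for x in (-2.0, -1.0, 0.5, 1.5)]

def total_error(w):
    return sum((t - math.tanh(w * x)) ** 2 for x, t in data)

w = -0.3                     # arbitrary random-ish starting weight
lr, h = 0.1, 1e-6            # learning rate and nudge size
for _ in range(200):
    slope = (total_error(w + h) - total_error(w)) / h
    w -= lr * slope          # adjust the weight to make the error go down

print(round(w, 2))           # settles near the true weight, 0.8
```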

  • If we're really lucky, it'll look like what we started with: intuitively understandable receptive fields for those neurons, and a relatively sparse representation, meaning that most of the weights are small or close to zero.

  • It doesn't always turn out that way, but what we are guaranteed is that

  • it'll find a pretty good representation: the best that it can do, adjusting those weights to get as close as possible to the right answer for all of the inputs.

  • So what we've covered is just a very basic introduction to the principles behind neural networks.

  • I haven't told you quite enough to be able to go out and build one of your own, but if you're feeling motivated to do so, I highly encourage it.

  • Here are a few resources that you'll find useful.

  • You'll want to go and learn about bias neurons.

  • Dropout is a useful training tool.

  • There are several resources available from Andrej Karpathy, who is an expert in neural networks and great at teaching about them.

  • Also, there's a fantastic article called The Black Magic of Deep Learning that has a bunch of practical, from-the-trenches

  • tips on how to get them working well. Neural networks are famously difficult to interpret.

  • It's hard to know what they're actually learning when we train them, so let's take a closer look and see whether we can get a good picture of what's going on inside.

  • Just like every other supervised machine learning model, neural networks learn relationships between input variables and output variables. In fact, we can even see how they're related to the most iconic model of all:

  • linear regression.

  • Simple linear regression assumes a straight-line relationship between an input variable x and an output variable y.

  • x is multiplied by a constant m, which also happens to be the slope of the line, and it's added to another constant b, which happens to be where the line crosses the y axis.

  • We can represent this in a picture.

  • Our input value x is multiplied by m.

  • Our constant b is multiplied by one, and then they get added together to get y. This is a graphical representation of y = mx + b.

  • On the far left, the circular symbols just indicate that the value is passed through.

  • The rectangles labeled m and b indicate that whatever goes in on the left comes out multiplied by m or b on the right, and the box with the capital sigma indicates that whatever goes in on the left gets added together and spit out on the right.
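
The diagram can be written as a tiny function; the pass-through circles become the input list, and the sigma box becomes a sum:

```python
def linear_node(x, m, b):
    """y = m*x + b drawn as a tiny network: x is multiplied by m,
    a constant 1 is multiplied by b, and the sigma box adds them."""
    inputs = [x, 1.0]        # the 1.0 is the always-on bias input
    weights = [m, b]
    return sum(i * w for i, w in zip(inputs, weights))

print(linear_node(2.0, 3.0, -1.0))  # 3*2 + (-1) = 5.0
```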

  • We can change the names of all the symbols for a different representation.

  • This is still a straight line relationship.

  • We've just changed the names of all the variables.

  • The reason we're doing this is to translate our linear regression into the notation we'll use in neural networks.

  • This will help us keep track of things as we move forward.

  • At this point, we have turned a straight line equation into a network.

  • A network is anything that has nodes connected by edges.

  • In this case, x_0 and x_1 are our input nodes,

  • v_0 is an output node, and our weights connecting them are edges.

  • This is not the traditional sense of a graph, meaning a plot or a grid like in a graphing calculator or graph paper.

  • It's just a formal word for a network for nodes connected by edges.

  • Another piece of terminology you might hear is a directed acyclic graph, abbreviated DAG.

  • A directed graph is one where the edges just go in one direction.

  • In our case, input goes to output, but output never goes back to input.

  • Our edges are directed.

  • Acyclic means that you can't ever draw a loop.

  • Once you have visited a node, there's no way to jump from edges to nodes to edges to nodes to get back to where you started.

  • Everything flows in one direction through the graph. We can get a sense of the type of models this network is capable of learning by choosing random values for the weights w_00 and w_10, and then seeing what relationship pops out between x_1 and v_0.

  • Remember that we set x_0 equal to one and are holding it there always.

  • This is a special node called a bias node.

  • It should come as no surprise that the relationships that come out of this linear model are all straight lines.

  • After all, we've taken our equation for the line and rearranged it, but we haven't changed it in any substantial way.

  • There's no reason we have to limit ourselves to just one input variable.

  • We can add an additional one.

  • Now, here we have an x_0, an x_1, and an x_2.

  • We draw an edge between x_2 and our summation, with the weight

  • w_20. x_2 times w_20 is again u_20, and all of our u's get added together to make v_0, and we can add more inputs, as many as we want.

  • This is still a linear equation, but instead of being two dimensional, we can make it three dimensional or higher.

  • Writing this out mathematically could get very tedious, so we'll use a shortcut.

  • We'll substitute the subscript i for the index of the input;

  • it's the number of the input we're talking about.

  • This allows us to write u_i0,

  • where u_i0 equals x_i times w_i0.

  • And again, our output v_0 is just the summation over all values of i of u_i0.
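
The summation notation maps directly to code; the input and weight values here are arbitrary examples, with x_0 pinned to one as the bias node:

```python
# One layer's output node: u_i0 = x_i * w_i0, and v_0 is the sum
# over all i. x_0 is held at 1 (the bias node).
x = [1.0, 0.7, -0.2]             # x_0 (bias), x_1, x_2
w = [0.5, 2.0, 1.0]              # w_00, w_10, w_20 -- example values

u = [x_i * w_i for x_i, w_i in zip(x, w)]
v0 = sum(u)
print(v0)                        # 0.5 + 1.4 - 0.2 = 1.7
```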

  • For this three-dimensional case, we can again look at the models that emerge when we randomly choose our w_i0's, our weights.

  • As we would expect, we still get the three-dimensional equivalent of a line: a plane, in this case.

  • And if we were to extend this to more inputs, we would get the m-dimensional equivalent of a line, which is called an m-dimensional hyperplane.

  • So far, so good.

  • Now we can start to get fancy. Here, our input x_1 looks a lot like our output

  • v_0.

  • In fact, there's nothing to prevent us from taking our output and then using it as an input to another network just like this one.

  • Now we have two separate identical layers.

  • We can add a subscript Roman numeral I or II to our equations, depending on which layer we're referring to.

  • And we just have to remember that our x_1 in layer two is the same as our v_0 in layer one.

  • Because these equations are identical, and each of our layers works just the same,

  • we can reduce this to one set of equations, adding a subscript capital L to represent which layer we're talking about.

  • As we continue here, we'll be assuming that all the layers are identical, and to keep the equations cleaner, we'll leave out the capital L. But just keep in mind that if we were going to be completely correct and verbose, we would add the L subscript onto the end of everything to specify the layer it belongs to.

  • Now that we have two layers, there's no reason we can't connect them in more than one place. Instead of our first layer generating just one output, we can make

  • several outputs. In our diagram, we'll add a second output, v_1, and we'll connect it to a third input into our second layer, x_2.

  • Keep in mind that the x_0 input to every layer will always be equal to one.

  • That bias node shows up again in every layer.

  • Now there are two nodes shared by both layers.

  • We can modify our equations accordingly to specify which of the shared nodes were talking about.

  • They behave exactly the same so we can be efficient and reuse our equation.

  • But we add a subscript j to indicate which output we're talking about.

  • So now, if I'm connecting the i-th input to the j-th output, then i and j will determine which weight is applied and which u's get added together to create the output v_j.

  • And we can do this as many times as we want.

  • We can add as many of these shared nodes as we care to.

  • The model as a whole only knows about the input

  • x_1 into the first layer and the output v_0 of the last layer. From the point of view of someone sitting outside the model, the shared nodes between layer one and layer two are hidden; they're inside the black box.

  • Because of this, they're called hidden nodes.

  • We can take this two-layer linear network, create 100 hidden nodes, set all of the weights randomly, and see what model it produces.

  • Even after adding all of this structure, the resulting models are still straight lines.

  • In fact, it doesn't matter how many layers you have or how many hidden nodes each layer has.

  • Any combination of these linear elements with weights and sums will always produce a straight line result.

  • This is actually one of the traits of linear computation that makes it so easy to work with.
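
This collapse of stacked linear layers into a single straight-line model can be demonstrated with two small weight matrices (arbitrary example values):

```python
# Two stacked linear layers collapse into one linear layer:
# W2 applied to (W1 applied to x) equals (W2 * W1) applied to x
# for every input, so no amount of linear stacking escapes
# straight-line models.
W1 = [[0.5, -1.0],
      [2.0,  0.3]]
W2 = [[1.0,  0.7],
      [-0.4, 0.1]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

x = [3.0, -2.0]
stacked = matvec(W2, matvec(W1, x))      # two layers, one after the other
collapsed = matvec(matmul(W2, W1), x)    # one combined linear layer
print(all(abs(a - b) < 1e-9 for a, b in zip(stacked, collapsed)))  # True
```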

  • But unfortunately for us, it also makes really boring models.

  • Sometimes a straight line is good enough, but that's not why we go to neural networks.

  • We're going to want something a little more sophisticated. In order to get more flexible models,

  • we're going to need to add some nonlinearity. We'll modify our linear equation here: after we calculate our output

  • v_0,

  • we subject it to another function, f, which is not linear, and we'll call the result y_0.

  • One really common nonlinear function to add here is the logistic function.

  • It's shaped like an s, so sometimes it's called a sigmoid function, too.

  • Although that can be confusing, because technically any function shaped like an s is a sigmoid.

  • We can get a sense of what logistic functions look like by choosing random weights

  • for this one-input, one-output, one-layer network and meeting the family.

  • One notable characteristic of logistic functions is that they live between zero and one.

  • For this reason, they're also called squashing functions.

  • You can imagine taking a straight line and then squashing the edges, bending and hammering it down so that the whole thing fits between zero and one,

  • no matter how far out you go. Working with logistic functions brings us to another connection with machine learning models:

  • logistic regression.

  • This is a bit confusing, because regression refers to finding a relationship between an input and an output, usually in the form of a line or a curve or a surface of some type.

  • Logistic regression is actually used as a classifier.

  • Most of the time, it finds a relationship between a continuous input variable and a categorical output variable.

  • It treats observations of one category as zeros, treats observations of the other category as ones, and then finds the logistic function that best fits all those observations.

  • Then, to interpret the model, we add a threshold, often around 0.5, and wherever the curve crosses the threshold, there's a demarcation line.

  • Everything to the left of that line is predicted to fall into one category, and everything to the right of that line is predicted to fall into the other.

  • This is how a regression algorithm gets modified to become a classification algorithm.

  • As with linear functions, there's no reason not to add more inputs.

  • We know that logistic regression can work with many input variables, and we can represent that in our graph as well.

  • Here we just add one, in order to keep the plot three dimensional, but we could add as many as we want. To see what type of functions this network can create,

  • we can choose a bunch of random values for the weights. As you might have expected,

  • the functions we create are still s-shaped, but now they're three dimensional.

  • They look like a tablecloth laid across two tables of unequal height.

  • More importantly, if you look at the contour lines projected down onto the floor of the plot, you can see that they're all perfectly straight.

  • The result of this is that any threshold we choose for doing classification will split our input space up into two halves, with the divider being a straight line.

  • This is why logistic regression is described as a linear classifier.

  • Whatever the number of inputs you have, whatever dimensional space you're working in, logistic regression will always split it into two halves using a line or a plane or a hyperplane of the appropriate dimension.

  • Another popular nonlinear function is the hyperbolic tangent.

  • It's closely related to the logistic function and can be written in a very symmetric way.

  • We can see, when we choose some random weights and look at examples, that hyperbolic tangent curves look just like logistic curves, except that they vary between minus one and plus one.

  • Just like we did before with linear functions, we can use the output of one layer as the input to another layer. We can stack them in this way and can even add hidden nodes the same way we did before.

  • Here we just show two hidden nodes, in order to keep the diagram simple.

  • But you can imagine as many as you want there.

  • When we choose random weights for this network and look at the output, we find that things get interesting.

  • We've left the realm of the linear because the hyperbolic tangent function is nonlinear.

  • When we add them together, we get something that doesn't necessarily look like a hyperbolic tangent.

  • We get curves, wiggles, peaks and valleys, and a much wider variety of behavior than we ever saw with single-layer networks.
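
The wiggly curves described above can be reproduced in a few lines. This is a sketch under my own assumptions (random normal weights, four hidden nodes); it is not the exact network from the video.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def two_layer_tanh(x, n_hidden=4):
    # Layer 1: each hidden node takes a weighted input plus a bias
    # weight, then squashes the sum with the hyperbolic tangent.
    w1 = rng.normal(size=n_hidden)  # input-to-hidden weights
    b1 = rng.normal(size=n_hidden)  # bias weights
    hidden = np.tanh(np.outer(x, w1) + b1)
    # Layer 2: the output is a weighted sum of the hidden nodes.
    # Summing several shifted, scaled tanh curves is what produces
    # the wiggles, peaks, and valleys.
    w2 = rng.normal(size=n_hidden)
    return hidden @ w2

x = np.linspace(-2.0, 2.0, 201)
y = two_layer_tanh(x)
```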

  • We can take the next step and add another layer to our network.

  • Now we have a set of hidden nodes between layer one and layer two and another set of hidden nodes between layer two and layer three.

  • Again, we choose random values for all the weights and look at the types of curves it can produce.

  • Again we see wiggles, peaks, valleys, and a wide selection of shapes.

  • If it's hard to tell the difference between these curves and the curves generated by a two-layer network, that's because they're mathematically identical.

  • We won't try to prove it here, but there's a cool result that shows that any curve you can create using a many-layered network, you can also create using a two-layer network, as long as you have enough hidden nodes.

  • The advantage of having a many-layered network is that it can help you create more complex curves using fewer total nodes.

  • For instance, in our two-layer network, we used 100 hidden nodes.

  • In our three-layer network, we used 11 hidden nodes in the first layer and nine hidden nodes in the second layer.

  • That's only 1/5 of the total number we used in our two-layer network.

  • But the curves it produces show similar richness.

  • We can use these fancy wiggly lines to make a classifier, as we did with logistic regression.

  • Here we use the zero line as the cut off.

  • Everywhere our curve crosses the zero line, there's a divider.

  • Every region where the curve sits above the zero line we'll call category A, and similarly everywhere the curve is below the zero line, we have category B.
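
Sketching the cutoff rule in code (my own illustration; a sine wave stands in for the network's wiggly output curve):

```python
import numpy as np

def categorize(y):
    # Everywhere the curve is above the zero line is category 'A',
    # everywhere at or below it is category 'B'.
    return np.where(y > 0.0, 'A', 'B')

# A curve that crosses zero several times interleaves the two categories.
x = np.linspace(0.0, 2.0 * np.pi, 9)
labels = categorize(np.sin(x))
```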

  • What distinguishes these nonlinear classifiers from linear ones is that they don't just split the space into two halves.

  • In this example, regions of A and B are interleaved.

  • Building a classifier around a multi layer nonlinear network gives it a lot more flexibility.

  • It can learn more complex relations.

  • This particular combination of multi layer network with hyperbolic tangent nonlinear function has its own name.

  • The multilayer perceptron. As you can guess, when you have only one layer, it's just called a perceptron, and in that case you don't even need to add the nonlinear function to make it work.

  • The function will still cross the X axis at all the same places.

  • Here is the full network diagram of a multi layer perceptron.

  • This representation is helpful because it makes every single operation explicit.

  • However, it's also visually cluttered.

  • It's difficult to work with.

  • Because of this, it's most often simplified to look like circles connected by lines.

  • This implies all the operations we saw in the previous diagram.

  • Connecting lines each have a weight associated with them; hidden nodes and output nodes perform summation and nonlinear squashing.

  • But in this diagram, all of that is implied.

  • In fact, our bias nodes, the nodes that always have a value of one in each layer, are omitted for clarity.

  • So our original network reduces to this.

  • The bias nodes are still present, and their operation hasn't changed at all.

  • But we leave them out to make a cleaner picture.

  • We only show two hidden nodes from each layer here, but in practice we used quite a few more.

  • Again, to make the diagram as clean as possible, we often don't show all the hidden nodes.

  • We just show a few, and the rest are implied.

  • Here's a generic diagram, then for a three layer single input single output network.

  • Notice that if we specify the number of inputs, the number of outputs, the number of layers, and the number of hidden nodes in each layer, then we can fully define a neural network.

  • We can also take a look at a two-input, single-output neural network.

  • Because it has two inputs, when we plot its output, it will be a three-dimensional curve.

  • We can once again choose random weights and generate curves to see what types of functions this neural network might be able to represent.

  • This is where it gets really fun.

  • With multiple inputs, multiple layers and nonlinear activation functions, neural networks can make really crazy shapes.

  • It's almost correct to say that they can make any shape you want.

  • It's worth taking a moment, though, to notice what its limitations are.

  • First, notice that all of the functions fall between plus and minus one.

  • The dark red and the dark green regions kiss the floor and the ceiling of this range, but they never cross it.

  • This neural network would not be able to fit a function that extended outside of this range.

  • Also notice that these functions all tend to be smooth.

  • They have hills and dips and valleys and wiggles and even points and wells.

  • But it all happens relatively smoothly.

  • If we hope to fit a function with a lot of jagged jumps and drops, this neural network might not be able to do a very good job of it.

  • However, aside from these two limitations, the variety of functions that this neural network can produce is a little mind boggling.

  • We modified a single-output neural network to be a classifier when we looked at the multilayer perceptron.

  • Now there's another way to do this.

  • We can use a two output neural network instead.

  • If we look at the outputs of a three-layer, one-input, two-output neural network like this, we can see that there are many cases where the two curves cross, and in some instances they cross in several places.

  • We can use this to make a classifier.

  • Wherever one output is greater than the other, it can signify that one category dominates.

  • Graphically, wherever the two output functions cross, we can draw a vertical line.

  • This chops up the input space into regions.

  • In each region, one output is greater than the other.

  • For instance, wherever the blue line is greater, we can assign that to be category A.

  • Then wherever the peach-colored line is greater, those regions are category B.

  • Just like the multilayer perceptron, this lets us chop the space up in more complex ways than a linear classifier could.

  • Regions of category A and category B can be shuffled together arbitrarily.

  • When you only have two outputs, the advantages of doing it this way over a multilayer perceptron with just one output are not at all clear. However, if you move to three or more outputs, the story changes.

  • Now we have three separate outputs and three separate output functions.

  • We can use our same criterion of letting the function with the maximum value determine the category.

  • We start by chopping up the input space according to which function has the highest value.

  • Each function represents one of our categories.

  • We're going to assign our first function to be category A and label every region where it's on top as category A.

  • Then we can do the same with our second function and our third. Using this trick, we're no longer limited to two categories.

  • We can create as many output nodes as we want and chop up the input space into that many categories.
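
The winner-take-all rule described above amounts to an argmax over the output nodes. A minimal sketch (category names and output values are made up for illustration):

```python
import numpy as np

def classify(outputs, categories):
    # The category whose output function has the highest value wins.
    return categories[int(np.argmax(outputs))]

categories = ['A', 'B', 'C']
# Three hypothetical output values at one point of the input space.
# Note the winner only barely edges out the runner-up.
winner = classify(np.array([0.10, 0.72, 0.70]), categories)
```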

  • It's worth pointing out that the winning category may not be the best by very much.

  • In some cases, you can see they could be very close.

  • One category will be declared the winner, but the next runner-up may be almost as good.

  • There's no reason that we can't extend this approach to two or more inputs.

  • Unfortunately, it does get harder to visualize.

  • You have to imagine several of these lumpy landscape plots on top of each other, and in some regions one will be greater than the others.

  • In that region, the category associated with that output will be dominant.

  • To get a qualitative sense for what these regions might look like, you can look at the projected contours on the floor of these plots.

  • In the case of a multi layer perceptron, these plots are all sliced at the y equals zero level.

  • That means if you look at the floor of the plot, everything in any shade of green will be one category, and everything in any shade of red will be the other category.

  • The first thing that jumps out about these category boundaries is how diverse they are.

  • Some of them are nearly straight lines, albeit with a small wiggle.

  • Some of them have wilder bends and curves, and some of them chop the input space up into several disconnected regions of green and red.

  • Sometimes there's a small island of green or island of red in the middle of a sea of the other color.

  • The variety of boundaries is what makes this such a powerful classification tool.

  • The one limitation we can see looking at it this way is that the boundaries are all smoothly curved.

  • Sometimes those curves are quite sharp, but usually they're gentle and rounded.

  • This shows the natural preference that neural networks with hyperbolic tangent activation functions have for smooth functions and smooth boundaries.

  • The goal of this exploration was to get an intuitive sense for what types of functions and category boundaries neural networks can learn when used for regression or classification.

  • We've seen both their power and their distinct preference for smoothness.

  • We've only looked at two nonlinear activation functions, logistic and hyperbolic tangent, both of which are very closely related.

  • There are lots of others, and some of them do a bit better at capturing sharp nonlinearities.

  • Rectified linear units, or ReLUs, for instance, produce surfaces and boundaries that are quite a bit sharper.

  • But my hope was to seed your intuition with some examples of what's actually going on under the hood when you train your neural network.

  • Here are the most important things to walk away with.

  • First, neural networks learn functions and can be used for regression.

  • Some activation functions limit the output range, but as long as that matches the expected range of your outputs, it's not a problem.

  • Second, neural networks are most often used for classification.

  • They've proven pretty good at it.

  • Third, neural networks tend to create smooth functions when used for regression and smooth category boundaries when used for classification.

  • Fourth, for fully connected vanilla neural networks, a two-layer network can learn any function that a deep network can learn.

  • However, a deep network might be able to learn it with fewer nodes.

  • Fifth, make sure that inputs are normalized, that is, that they have a mean near zero and a standard deviation of less than one.

  • This helps neural networks be more sensitive to the relationships between inputs.
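
One common normalization recipe, sketched with made-up data (note this version produces a standard deviation of exactly one, which satisfies the guideline above):

```python
import numpy as np

def normalize(x):
    # Shift to a mean of zero, then scale to unit standard deviation.
    return (x - x.mean()) / x.std()

raw = np.array([200.0, 210.0, 190.0, 205.0, 195.0])
scaled = normalize(raw)
```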

  • I hope this helps you as you jump into your next project.

  • Happy building.

  • Welcome to how convolutional neural networks work.

  • Convolutional neural networks, or convnets or CNNs, can do some pretty cool things.

  • If you feed them a bunch of pictures of faces, for instance, they learn basic things like edges and dots, bright spots and dark spots. Then, because they're multilayer neural networks, that's what gets learned in the first layer; in the second layer are things that are recognizable as eyes, noses, and mouths; and in the third layer are things that look like faces.

  • Similarly, if you feed it a bunch of images of cars, down at the lowest layer you'll get things again that look like edges, and then higher up, things that look like tires, wheel wells, and hoods.

  • And at a level above that, things that are clearly identifiable as cars.

  • CNNs can even learn to play video games by forming patterns of the pixels as they appear on the screen and learning what is the best action to take when it sees a certain pattern. In some cases, a CNN can learn to play video games far better than a human ever could.

  • Not only that, if you take a couple of CNNs and have them set to watching YouTube videos, one can learn objects, again by picking out patterns, and the other one can learn types of grasps.

  • This, then, coupled with some other execution software, can let a robot learn to cook just by watching YouTube, so there's no doubt CNNs are powerful.

  • Usually when we talk about them, we do so in the same way we might talk about magic, but they're not magic.

  • What they do is based on some pretty basic ideas applied in a clever way.

  • So to illustrate these ideas, we'll talk about a very simple toy convolutional neural network.

  • What this one does is take in an image, a two-dimensional array of pixels.

  • You can think of it as a checkerboard, and each square on the checkerboard is either light or dark.

  • And then, by looking at that, the CNN decides whether it's a picture of an X or an O.

  • So, for instance, on top there we see an image with an X drawn in white pixels on a black background; we would like to identify this as an X, and the O we'd like to identify as an O.

  • How a CNN does this has several steps in it.

  • What makes it tricky is that the X's aren't exactly the same every time.

  • The X or the O could be shifted, it could be bigger or smaller, it could be rotated a little bit, thicker or thinner.

  • And in every case, we would still like to identify whether it's an X or an O.

  • Now, the reason that this is challenging is that for us, deciding whether these two things are similar is straightforward.

  • We don't even have to think about it.

  • For a computer, it's very hard.

  • What a computer sees is this checkerboard, this two-dimensional array, as a bunch of numbers: ones and minus ones.

  • One is a bright pixel, minus one is a black pixel, and what it can do is go through pixel by pixel and compare whether they match or not.

  • So, to a computer, it looks like there are a lot of pixels that match, but some that don't, quite a few that don't, actually.

  • And so it might look at this and say, "I'm really not sure whether these are the same."

  • And because a computer is so literal, it would say, "Uncertain. I can't say that they're equal."

  • And one of the tricks that convolutional neural networks use is to match parts of the image rather than the whole thing.

  • So you break it down into its smaller parts or features.

  • Then it becomes much clearer whether these two things are similar.

  • So examples of these little features are little mini-images, in this case just three pixels by three pixels.

  • The one on the left is a diagonal line, slanting downward from left to right.

  • The one on the right is also a diagonal line, slanting in the other direction, and the one in the middle is a little X.

  • These are little pieces of the bigger image, and you can see, as we go through, that if you choose the right feature and put it in the right place, it matches the image exactly.

  • So okay, we have the bits and pieces. Now, to take a step deeper, the math behind matching these is called filtering.

  • The way this is done is that a feature is lined up with a little patch of the image.

  • And then, one by one, the pixels are compared: they're multiplied by each other, then added up and divided by the total number of pixels.

  • So to step through this and see why it makes sense, you can see that starting with the upper left-hand pixel in both the feature and the image patch, multiplying the one by one gives you a one, and we can keep track of that by putting it in the position of the pixel that we're comparing.

  • We step to the next one: minus one times minus one is also one. And we continue to step through pixel by pixel, multiplying them all by each other.

  • And because they're always the same, the answer is always one.

  • When we're done, we take all these ones and add them up and divide by nine.

  • And the answer is one.

  • So now we want to keep track of where that feature was in the image.

  • We put a one there to say: when we put the feature here, we get a match of one.

  • That is a filter.

  • Now we could take that same feature and move it to another position and perform the filtering again.

  • We go through the same pattern: the first pixel matches, the second pixel matches.

  • The third pixel does not match: minus one times one equals minus one.

  • So we record that in our result, and we go through and do that through the rest of the image patch.

  • And when we're done, we notice we have two minus ones this time.

  • So we add up all the pixels; they add up to five. Divide by nine and we get 0.55. This is very different from one, and we can record the 0.55 in the position where it occurred.
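
The filtering arithmetic just described, multiply matching pixels, add them up, divide by the pixel count, fits in one line. This sketch uses a 3x3 diagonal feature of my own construction to reproduce both scores from the walkthrough:

```python
import numpy as np

def match_score(feature, patch):
    # Multiply corresponding pixels, add them all up,
    # and divide by the total number of pixels.
    return (feature * patch).sum() / feature.size

# A 3x3 downward-slanting diagonal feature: 1 = white, -1 = black.
diag = np.array([[ 1, -1, -1],
                 [-1,  1, -1],
                 [-1, -1,  1]])

perfect = match_score(diag, diag)  # every pixel matches

# Flip two pixels: seven of nine products are one, two are minus one,
# so the score is (7 - 2) / 9 = 5 / 9, about 0.55.
patch = diag.copy()
patch[0, 2] = 1
patch[2, 0] = 1
partial = match_score(diag, patch)
```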

  • So by moving our filter around to different places in the image, we actually find different values for how well that filter matches or how well that feature is represented at that position.

  • So this becomes a map of where the feature occurs.

  • By moving it around to every possible position, we do convolution.

  • That's just the repeated application of this feature, this filter, over and over again, and what we get is a nice map across the whole image of where this feature occurs, and if we look at it, it makes sense.
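
The repeated application over every position can be sketched as a pair of loops. This is my own minimal version (a "valid" convolution, covering only positions where the feature fits entirely inside the image), not optimized code:

```python
import numpy as np

def convolve(image, feature):
    # Slide the feature over every position where it fully fits,
    # recording the match score (mean of pixelwise products) at each spot.
    fh, fw = feature.shape
    ih, iw = image.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + fh, c:c + fw]
            out[r, c] = (feature * patch).sum() / feature.size
    return out

# A feature always matches itself perfectly, so filtering an image
# with itself yields a single 1.0 at the only valid position.
feature = np.array([[1, -1], [-1, 1]])
feature_map = convolve(feature, feature)
```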

  • This feature is a diagonal line slanting downward from left to right, which matches the downward left-to-right diagonal of the X.

  • So if we look at our filtered image, we see that all of the high numbers, the ones and 0.77s, are right along that diagonal. That suggests that the feature matches along that diagonal much better than it does elsewhere in the image.

  • To use a shorthand notation here, we'll do a little X with a circle in it to represent convolution, the act of trying every possible match, and we repeat that with other features.

  • We can repeat that with our X filter in the middle and with our upward-slanting diagonal line on the bottom.

