  • [MUSIC PLAYING]

  • SPEAKER 1: All right.

  • Welcome back, everyone, to an introduction

  • to Artificial Intelligence with Python.

  • Now last time, we took a look at machine learning-- a set of techniques

  • that computers can use in order to take a set of data

  • and learn some patterns inside of that data, learn how to perform a task,

  • even if we, the programmers, didn't give the computer explicit instructions

  • for how to perform that task.

  • Today, we transition to one of the most popular techniques and tools

  • within machine learning, that of neural networks.

  • And neural networks were inspired as early as the 1940s

  • by researchers who were thinking about how it is that humans learn,

  • studying neuroscience and the human brain,

  • and trying to see whether or not we can apply those same ideas to computers as

  • well, and model computer learning off of human learning.

  • So how is the brain structured?

  • Well, very simply put, the brain consists of a whole bunch of neurons,

  • and those neurons are connected to one another

  • and communicate with one another in some way.

  • In particular, if you think about the structure of a biological neural

  • network-- something like this--

  • there are a couple of key properties that scientists observed.

  • One was that these neurons are connected to each other

  • and receive electrical signals from one another,

  • that one neuron can propagate electrical signals to another neuron.

  • And another point is that neurons process

  • those input signals, and then can be activated, that a neuron becomes

  • activated at a certain point, and then can propagate further signals

  • onto neurons in the future.

  • And so the question then became, could we take this biological idea of how it

  • is that humans learn-- with brains and with neurons--

  • and apply that to a machine as well, in effect,

  • designing an artificial neural network, or an ANN, which

  • will be a mathematical model for learning that is inspired

  • by these biological neural networks?

  • And what artificial neural networks will allow us to do

  • is they will first be able to model some sort of mathematical function.

  • Every time you look at a neural network, which we'll see more of later today,

  • each one of them is really just some mathematical function

  • that is mapping certain inputs to particular outputs,

  • based on the structure of the network, that depending

  • on where we place particular units inside of this neural network,

  • that's going to determine how it is that the network is going to function.

  • And in particular, artificial neural networks

  • are going to lend themselves to a way that we can learn what

  • the network's parameters should be.

  • We'll see more on that in just a moment.

  • But in effect, we want a model such that it is easy for us

  • to be able to write some code that allows for the network

  • to be able to figure out how to model the right mathematical function,

  • given a particular set of input data.

  • So in order to create our artificial neural network,

  • instead of using biological neurons, we're

  • just going to use what we're going to call units--

  • units inside of a neural network--

  • which we can represent kind of like a node in a graph,

  • which will here be represented just by a blue circle like this.

  • And these artificial units-- these artificial neurons--

  • can be connected to one another.

  • So here, for instance, we have two units that

  • are connected by this edge inside of this graph, effectively.

  • And so what we're going to do now is think

  • of this idea as some sort of mapping from inputs to outputs,

  • that we have one unit that is connected to another unit,

  • that we might think of this side as the input and that side as the output.

  • And what we're trying to do then is to figure out how to solve a problem,

  • how to model some sort of mathematical function.

  • And this might take the form of something

  • we saw last time, which was something like, we

  • have certain inputs like variables x1 and x2, and given those inputs,

  • we want to perform some sort of task--

  • a task like predicting whether or not it's going to rain.

  • And ideally, we'd like some way, given these inputs x1 and x2,

  • which stand for some sort of variables to do with the weather,

  • we would like to be able to predict, in this case,

  • a Boolean classification-- is it going to rain, or is it not going to rain?

  • And we did this last time by way of a mathematical function.

  • We defined some function h for our hypothesis function

  • that took as input x1 and x2--

  • the two inputs that we cared about processing-- in order

  • to determine whether we thought it was going to rain, or whether we thought it

  • was not going to rain.

  • The question then becomes, what does this hypothesis function do in order

  • to make that determination?

  • And we decided last time to use a linear combination of these input variables

  • to determine what the output should be.

  • So our hypothesis function was equal to something

  • like this: weight 0 plus weight 1 times x1 plus weight 2 times x2.

  • So what's going on here is that x1 and x2--

  • those are input variables-- the inputs to this hypothesis function--

  • and each of those input variables is being

  • multiplied by some weight, which is just some number.

  • So x1 is being multiplied by weight 1, x2 is being multiplied by weight 2,

  • and we have this additional weight-- weight 0--

  • that doesn't get multiplied by an input variable

  • at all, that just serves to either move the function up or move the function's

  • value down.

  • You can think of this as either a weight that's

  • just multiplied by some dummy value, like the number 1--

  • when it's multiplied by 1, it's as if it's not multiplied by anything at all.

  • Or sometimes you'll see in the literature,

  • people call this variable weight 0 a "bias,"

  • so that you can think of these variables as slightly different.

  • We have weights that are multiplied by the input

  • and we separately add some bias to the result as well.

  • You'll hear both of those terminologies used

  • when people talk about neural networks and machine learning.

  • So in effect, what we've done here is that in order

  • to define a hypothesis function, we just need

  • to decide and figure out what these weights should be,

  • to determine what values to multiply by our inputs to get some sort of result.

  • Of course, at the end of this, what we need

  • to do is make some sort of classification

  • like raining or not raining, and to do that, we use some sort of function

  • to define some sort of threshold.

  • And so we saw, for instance, the step function, which is defined as 1

  • if the result of multiplying the weights by the inputs is at least 0;

  • otherwise as 0.

  • You can think of this line down the middle-- it's kind

  • of like a dotted line.

  • Effectively, it stays at 0 all the way up to one point,

  • and then the function steps--

  • or jumps up-- to 1.

  • So it's zero before it reaches some threshold,

  • and then it's 1 after it reaches a particular threshold.

  • And so this was one way we could define what

  • we'll come to call an "activation function," a function that

  • determines when it is that this output becomes active--

  • changes to a 1 instead of being a 0.

  • But we also saw that if we didn't just want a purely binary classification,

  • if we didn't want purely 1 or 0, but we wanted

  • to allow for some in-between real number values,

  • we could use a different function.

  • And there are a number of choices, but the one that we looked at was

  • the logistic sigmoid function that has sort of an S-shaped curve,

  • where we could represent this as a probability--

  • that may be somewhere in between the probability of rain of something like

  • 0.5, and maybe a little bit later the probability of rain is 0.8--

  • and so rather than just have a binary classification of 0 or 1,

  • we can allow for numbers that are in between as well.

  • And it turns out there are many other different types

  • of activation functions, where an activation function just

  • takes the output of multiplying the weights together and adding that bias,

  • and then figuring out what the actual output should be.

  • Another popular one is the rectified linear unit, otherwise known as ReLU,

  • and the way that works is that it just takes its input

  • and takes the maximum of that input and 0.

  • So if it's positive, it remains unchanged, but if it's negative,

  • it goes ahead and levels out at 0.

  • And there are other activation functions that we can choose as well.

  • But in short, each of these activation functions,

  • you can just think of as a function that gets applied to the result of all

  • of this computation.

  • We take some function g and apply it to the result of all of that calculation.

  • And this then is what we saw last time-- the way of defining

  • some hypothesis function that takes on inputs,

  • calculates some linear combination of those inputs,

  • and then passes it through some sort of activation function to get our output.
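
To make this concrete, here is a minimal sketch in Python (my own illustration, not code from the course) of that idea: a hypothesis that takes a linear combination of the inputs and passes it through an activation function g, with the step, sigmoid, and ReLU functions as possible choices for g.

```python
# A minimal sketch (not the course's code) of h(x1, x2) = g(w0 + w1*x1 + w2*x2),
# where g is one of the activation functions described above.
import math

def step(z):
    # 1 once the weighted sum reaches the threshold of 0, otherwise 0
    return 1 if z >= 0 else 0

def sigmoid(z):
    # S-shaped curve; output is between 0 and 1, readable as a probability
    return 1 / (1 + math.exp(-z))

def relu(z):
    # unchanged if positive, leveled out at 0 if negative
    return max(0, z)

def hypothesis(x1, x2, w0, w1, w2, g):
    # linear combination of the inputs plus the bias, passed through g
    return g(w0 + w1 * x1 + w2 * x2)
```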

  • And this actually turns out to be the model

  • for the simplest of neural networks, that we're

  • going to instead represent this mathematical idea graphically, by using

  • a structure like this.

  • Here then is a neural network that has two inputs.

  • We can think of this as x1 and this as x2.

  • And then one output, which you can think of classifying whether or not

  • we think it's going to rain or not rain, for example,

  • in this particular instance.

  • And so how exactly does this model work?

  • Well, each of these two inputs represents one of our input variables--

  • x1 and x2.

  • And notice that these inputs are connected

  • to this output via these edges, which are

  • going to be defined by their weights.

  • So these edges each have a weight associated with them--

  • weight 1 and weight 2--

  • and then this output unit, what it's going to do

  • is it is going to calculate an output based on those inputs

  • and based on those weights.

  • This output unit is going to multiply all the inputs by their weights,

  • add in this bias term, which you can think of as an extra w0 term that

  • gets added into it, and then we pass it through an activation function.

  • So this then is just a graphical way of representing the same idea

  • we saw last time, just mathematically.

  • And we're going to call this a very simple neural network.

  • And we'd like for this neural network to be

  • able to learn how to calculate some function,

  • that we want some function for the neural network to learn,

  • and the neural network is going to learn what

  • should the values of w0, w1, and w2 be.

  • What should the activation function be in order

  • to get the result that we would expect?

  • So we can actually take a look at an example of this.

  • What then is a very simple function that we might calculate?

  • Well, if we recall back from when we were looking at propositional logic,

  • one of the simplest functions we looked at

  • was something like the or function, that takes two inputs--

  • x and y-- and outputs 1, otherwise known as true, if either one of the inputs,

  • or both of them, are 1, and outputs a 0 if both of the inputs are 0, or false.

  • So this then is the or function.

  • And this was the truth table for the or function-- that as long

  • as either of the inputs are 1, the output of the function is 1,

  • and the only case where the output is 0 is where both of the inputs are 0.

  • So the question is, how could we take this and train a neural network to be

  • able to learn this particular function?

  • What would those weights look like?

  • Well, we could do something like this.

  • Here's our neural network, and I'll propose

  • that in order to calculate the or function,

  • we're going to use a value of 1 for each of the weights,

  • and we'll use a bias of negative 1, and then

  • we'll just use this step function as our activation function.

  • How then does this work?

  • Well, if I wanted to calculate something like 0 or 0,

  • which we know to be 0, because false or false is false, then

  • what are we going to do?

  • Well, our output unit is going to calculate

  • this input multiplied by the weight.

  • 0 times 1, that's 0.

  • Same thing here.

  • 0 times 1, that's 0.

  • And we'll add to that the bias, minus 1.

  • So that'll give us some result of negative 1.

  • If we plot that on our activation function-- negative 1 is here--

  • it's before the threshold, which means the output is 0;

  • it's only 1 after the threshold.

  • Since negative 1 is before the threshold,

  • the output that this unit provides is going to be 0.

  • And that's what we would expect it to be, that 0 or 0 should be 0.

  • What if instead we had had 1 or 0, where this is the number 1?

  • Well, in this case, in order to calculate

  • what the output is going to be, we again have to do this weighted sum.

  • 1 times 1, that's 1.

  • 0 times 1, that's 0.

  • Sum of that so far is 1.

  • Add negative 1 to that.

  • Well, then I get an output of 0.

  • And if we plot 0 on the step function, 0 ends up being here--

  • it's just at the threshold-- and so the output here

  • is going to be 1, because the output of 1 or 0, that's 1.

  • So that's what we would expect as well.

  • And just for one more example, if I had 1 or 1, what would the result be?

  • Well 1 times 1 is 1.

  • 1 times 1 is 1.

  • The sum of those is 2.

  • I add the bias term to that.

  • I get the number 1.

  • 1 plotted on this graph is way over there.

  • That's well beyond the threshold.

  • And so this output is going to be 1 as well.

  • The output is always 0 or 1, depending on whether or not

  • we're past the threshold.

  • And this neural network then models the or function-- a very simple function,

  • definitely-- but it still is able to model it correctly.

  • If I give it the inputs, it will tell me what x1 or x2 happens to be.

  • And you could imagine trying to do this for other functions

  • as well-- a function like the and function, for instance,

  • that takes two inputs and calculates whether both x and y are true.

  • So if x is 1 and y is 1, then the output of x and y is 1,

  • but in all of the other cases, the output is 0.

  • How could we model that inside of a neural network as well?

  • Well, it turns out we could do it in the same way, except instead of negative 1

  • as the bias, we can use negative 2 as the bias instead.

  • What does that end up looking like?

  • Well, if I had 1 and 1, that should be 1, because true and true

  • is equal to true.

  • Well, I take 1 times 1.

  • That's 1.

  • 1 times 1 is 1.

  • I got a total sum of 2 so far.

  • Now I add the bias of negative 2, and I get the value 0.

  • And 0 when I plotted on the activation function is just past that threshold.

  • And so the output is going to be 1.

  • But if I had any other input, for example, like 1 and 0, well,

  • the weighted sum of these is 1 plus 0.

  • It's going to be 1.

  • Minus 2 is going to give us negative 1, and negative 1

  • is not past that threshold, and so the output is going to be zero.
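
As a quick check of the arithmetic just described, here is a short sketch (my own, not from the lecture) of a single unit with a step activation, verifying that weights of 1 and 1 with a bias of -1 model the or function, and that the same weights with a bias of -2 model the and function.

```python
# Single unit with a step activation: weights of 1 on each input and a bias of
# -1 model OR; the same weights with a bias of -2 model AND.
def unit(x1, x2, w1, w2, bias):
    total = w1 * x1 + w2 * x2 + bias
    return 1 if total >= 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2,
              "OR:", unit(x1, x2, 1, 1, -1),
              "AND:", unit(x1, x2, 1, 1, -2))
```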

  • So those then are some very simple functions

  • that we can model using a neural network, that has two inputs and one

  • output, where our goal is to be able to figure out

  • what those weights should be in order to determine what the output should be.

  • And you could imagine generalizing this to calculate more complex functions as

  • well, that maybe given the humidity and the pressure,

  • we want to calculate what's the probability that it's going to rain,

  • for example.

  • Or you might want to do a regression-style problem, where

  • given some amount of advertising and given what month it is maybe,

  • we want to predict what our expected sales are

  • going to be for that particular month.

  • So you could imagine these inputs and outputs being different as well.

  • And it turns out that in some problems, we're not just going to have two

  • inputs, and the nice thing about these neural networks is that we can compose

  • multiple units together-- make our networks more complex--

  • just by adding more units into this particular neural network.

  • So the network we've been looking at has two inputs and one output.

  • But we could just as easily say, let's go ahead

  • and have three inputs in there, or have even more inputs,

  • where we could arbitrarily decide how many inputs there

  • are to our problem, all going to be calculating some sort of output

  • that we care about figuring out the value of.

  • How then does the math work for figuring out that output?

  • Well, it's going to work in a very similar way.

  • In the case of two inputs, we had two weights indicated by these edges,

  • and we multiplied the weights by the numbers, adding this bias term,

  • and we'll do the same thing in the other cases as well.

  • If I have three inputs, you'll imagine multiplying each of these three inputs

  • by each of these weights.

  • If I had five inputs instead, we're going to do the same thing.

  • Here, I'm saying sum up from 1 to 5.

  • xi multiplied by weight i.

  • So take each of the five input variables,

  • multiply them by their corresponding weight, and then add the bias to that.

  • So this would be a case where there are five inputs into this neural network,

  • for example.

  • But there could be arbitrarily many nodes

  • that we want inside of this neural network,

  • where each time we're just going to sum up

  • all of those input variables multiplied by the weight,

  • and then add the bias term at the very end.

  • And so this allows us to be able to represent

  • problems that have even more inputs, just by growing

  • the size of our neural network.
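
The same calculation, written as a small sketch (my own illustration) for a unit with any number of inputs: each input is multiplied by its corresponding weight, the bias is added at the end, and the result is passed through the activation function.

```python
# Sketch of a unit with arbitrarily many inputs: sum of x_i * w_i plus the
# bias, with the activation function g applied to the result.
def unit_output(inputs, weights, bias, g):
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return g(total)

# Example with five inputs and five made-up weights, using a step activation.
print(unit_output([1, 0, 1, 1, 0], [0.5, -0.2, 0.8, 0.1, 0.3], -1,
                  lambda z: 1 if z >= 0 else 0))
```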

  • Now, the next question we might ask is a question

  • about how it is that we train these neural networks.

  • In the case of the or function and the and function,

  • they were simple enough functions that I could just

  • tell you like here what the weights should be,

  • and you could probably reason through it yourself

  • what the weights should be in order to calculate the output that you want.

  • But in general, with functions like predicting sales or predicting

  • whether or not it's going to rain, these are much trickier

  • functions to be able to figure out.

  • We would like the computer to have some mechanism of calculating what it is

  • that the weights should be-- how it is to set the weights--

  • so that our neural network is able to accurately model the function

  • that we care about trying to estimate.

  • And it turns out that the strategy for doing this,

  • inspired by the domain of calculus, is a technique called gradient descent.

  • And what gradient descent is, it is an algorithm for minimizing loss

  • when you're training a neural network.

  • And recall that loss refers to how bad our hypothesis function happens to be,

  • that we can define certain loss functions,

  • and we saw some examples of loss functions

  • last time that just give us a number for any particular hypothesis,

  • saying how poorly does it model the data?

  • How many examples does it get wrong?

  • How much worse or better is it as compared to other hypothesis functions

  • that we might define?

  • And this loss function is just a mathematical function,

  • and when you have a mathematical function,

  • in calculus, what you could do is calculate

  • something known as the gradient, which you can think of as like a slope.

  • It's the direction the loss function is moving at any particular point.

  • And what it's going to tell us is in which direction

  • should we be moving these weights in order to minimize the amount of loss?

  • And so generally speaking-- we won't get into the calculus of it--

  • but the high-level idea for gradient descent

  • is going to look something like this.

  • If we want to train a neural network, we'll

  • go ahead and start just by choosing the weights randomly.

  • Just pick random weights for all of the weights in the neural network.

  • And then we'll use the input data that we have access to in order

  • to train the network in order to figure out

  • what the weights should actually be.

  • So we'll repeat this process again and again.

  • The first step is we're going to calculate the gradient based

  • on all of the data points.

  • So we'll look at all the data and figure out what the gradient is at the place

  • where we currently are-- for the current setting of the weights--

  • meaning, in which direction should we move the weights in order

  • to minimize the total amount of loss in order to make our solution better?

  • And once we've calculated that gradient--

  • which direction we should move in the loss function--

  • well, then we can just update those weights according to the gradient,

  • take a small step in the direction of those weights

  • in order to try to make our solution a little bit better.

  • And the size of the step that we take, that's going to vary,

  • and you can choose that when you're training a particular neural network.

  • But in short, the idea is going to be take all of the data points,

  • figure out based on those data points in what direction the weights should move,

  • and then move the weights one small step in that direction.

  • And if you repeat that process over and over again,

  • adjusting the weights a little bit at a time based on all the data points,

  • eventually, you should end up with a pretty good solution to trying

  • to solve this sort of problem.

  • At least that's what we would hope to happen.
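
Here is a minimal sketch of that loop for a single linear unit, using NumPy and a squared loss; the data, learning rate, and number of iterations are made up for illustration and are not from the lecture.

```python
# Gradient descent sketch: repeatedly compute the gradient of the loss based on
# all of the data points, then move the weights one small step in that direction.
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])  # inputs
y = np.array([1.0, 1.0, 1.0, 0.0])                              # target outputs

weights = np.random.randn(2)   # start with random weights
bias = np.random.randn()
learning_rate = 0.1            # the size of each step

for _ in range(1000):
    predictions = X @ weights + bias
    errors = predictions - y
    grad_w = 2 * X.T @ errors / len(y)   # gradient with respect to each weight
    grad_b = 2 * errors.mean()           # gradient with respect to the bias
    weights -= learning_rate * grad_w    # small step that reduces the loss
    bias -= learning_rate * grad_b
```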

  • Now as you look at this algorithm, a good question

  • to ask anytime you're analyzing an algorithm

  • is, what is going to be the expensive part of doing the calculation?

  • What's going to take a lot of work to try to figure out?

  • What is going to be expensive to calculate?

  • And in particular, in the case of gradient descent,

  • the really expensive part is this all data points part right here,

  • having to take all of the data points and using all of those data

  • points to figure out what the gradient is at this particular setting of all

  • of the weights, because odds are, in a big machine learning problem

  • where you're trying to solve a big problem with a lot of data,

  • you have a lot of data points in order to calculate,

  • and figuring out the gradient based on all of those data points

  • is going to be expensive.

  • And you'll have to do it many times; you'll likely repeat this process

  • again and again and again, going through all the data points,

  • taking one small step over and over, as you try and figure

  • out what the optimal setting of those weights happens to be.

  • It turns out that we would ideally like to be

  • able to train our neural networks faster to be able to more quickly converge

  • to some sort of solution that is going to be a good solution to the problem.

  • So in that case, there are alternatives to just standard gradient descent,

  • which looks at all of the data points at once.

  • We can employ a method like stochastic gradient descent, which will randomly

  • just choose one data point at a time to calculate the gradient based on,

  • instead of calculating it based on all of the data points.

  • So the idea there is that we have some setting of the weights,

  • we pick a data point, and based on that one data point,

  • we figure out in which direction should we move all of the weights,

  • and move the weights in that small direction, then take another data point

  • and do that again, and repeat this process again and again,

  • maybe looking at each of the data points multiple times,

  • but each time, only using one data point to calculate the gradient

  • to calculate which direction we should move in.

  • Now just using one data point instead of all of the data points

  • probably gives us a less accurate estimate

  • of what the gradient actually is.

  • But on the plus side, it's going to be much faster to be able to calculate,

  • that we can much more quickly calculate what the gradient is, based on one data

  • point, instead of calculating based on all of the data points

  • and having to do all of that computational work again and again.

  • So there are trade-offs here between looking at all of the data points

  • and just looking at one data point.

  • And it turns out that a middle ground-- and this is also quite popular--

  • is a technique called mini-batch gradient descent,

  • where the idea there is, instead of looking at all of the data versus just

  • a single point, we divide our dataset up into small batches--

  • groups of data points-- where you can decide how big a particular batch is,

  • but in short, you're just going to look at a small number of points

  • at any given time, hopefully getting a more accurate estimate of the gradient,

  • but also not requiring all of the computational effort needed

  • to look at every single one of these data points.
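
The only thing that changes across these variants is which data points feed each gradient calculation. A rough sketch of mini-batch gradient descent (my own, with a made-up batch size) might look like this; setting batch_size to 1 would give stochastic gradient descent, and setting it to the full dataset size would give ordinary gradient descent.

```python
# Mini-batch gradient descent sketch: shuffle the data each pass, then compute
# each gradient step from a small batch of points rather than the whole dataset.
import numpy as np

def gradient_step(X_batch, y_batch, weights, bias, learning_rate):
    errors = X_batch @ weights + bias - y_batch
    weights = weights - learning_rate * 2 * X_batch.T @ errors / len(y_batch)
    bias = bias - learning_rate * 2 * errors.mean()
    return weights, bias

def minibatch_gradient_descent(X, y, weights, bias, batch_size=32,
                               epochs=10, learning_rate=0.1):
    for _ in range(epochs):
        indices = np.random.permutation(len(y))          # shuffle the data points
        for start in range(0, len(y), batch_size):
            batch = indices[start:start + batch_size]    # one small batch
            weights, bias = gradient_step(X[batch], y[batch],
                                          weights, bias, learning_rate)
    return weights, bias
```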

  • So gradient descent then is this technique

  • that we can use in order to train these neural networks in order

  • to figure out what the setting of all of these weights

  • should be, if we want some way to try and get an accurate notion of how it is

  • that this function should work, some way of modeling how to transform

  • the inputs into particular outputs.

  • So far, the networks that we've taken a look at

  • have all been structured similar to this.

  • We have some number of inputs-- maybe two or three or five or more--

  • and then we have one output that is just predicting like rain or no rain,

  • or just predicting one particular value.

  • But often in machine learning problems, we don't just care about one output.

  • We might care about an output that has multiple different values associated

  • with it.

  • So in the same way that we could take a neural network

  • and add units to the input layer, we can likewise add outputs

  • to the output layer as well.

  • Instead of just one output, you could imagine we have two outputs,

  • or we could have like four outputs, for example, where in each case,

  • as we add more inputs or add more outputs,

  • if we want to keep this network fully connected between these two layers,

  • we just need to add more weights, that now each of these input nodes

  • have four weights associated with each of the four outputs,

  • and that's true for each of these various different input nodes.

  • So as we add nodes, we add more weights in order

  • to make sure that each of the inputs can somehow

  • be connected to each of the outputs, so that each output

  • value can be calculated based on what the value of the input happens to be.

  • So what might a case be where we want multiple different output values?

  • Well, you might consider that in the case of weather

  • predicting, for example, we might not just care

  • whether it's raining or not raining.

  • There might be multiple different categories of weather

  • that we would like to categorize the weather into.

  • With just a single output variable, we can do a binary classification,

  • like rain or no rain, for instance--

  • 1 or 0-- but it doesn't allow us to do much more than that.

  • With multiple output variables, I might be

  • able to use each one to predict something a little different.

  • Maybe I want to categorize the weather into one

  • of four different categories, something like,

  • is it going to be raining or sunny or cloudy or snowy,

  • and I now have four output variables that

  • can be used to represent maybe the probability that it is raining,

  • as opposed to sunny, as opposed to cloudy, or as opposed to snowy.

  • How then would this neural network work?

  • Well, we have some input variables that represent some data

  • that we have collected about the weather.

  • Each of those inputs gets multiplied by each

  • of these various different weights.

  • We have more multiplications to do, but these

  • are fairly quick mathematical operations to perform.

  • And then what we get is after passing them

  • through some sort of activation function in the outputs,

  • we end up getting some sort of number, where that number, you might imagine,

  • you can interpret as like a probability, like a probability

  • that it is one category, as opposed to another category.

  • So here we're saying that based on the inputs,

  • we think there is a 10% chance that it's raining, a 60% chance that it's sunny,

  • a 20% chance that it's cloudy, and a 10% chance that it's snowy.

  • And given that output, if these represent a probability distribution,

  • well, then you could just pick whichever one has the highest value--

  • in this case, sunny--

  • and say that, well, most likely, we think

  • that this categorization of inputs means that the output should be sunny,

  • and that is what we would expect the weather

  • to be in this particular instance.

  • So this allows us to do these sort of multi-class classifications,

  • where instead of just having a binary classification--

  • 1 or 0-- we can have as many different categories as we

  • want, and we can have our neural network output these probabilities

  • over which categories are more likely than other categories,

  • and using that data, we're able to draw some sort of inference

  • on what it is that we should do.
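
A small sketch (my own; the lecture doesn't specify how the output values are normalized) of turning four raw output values into a probability distribution over the weather categories and picking the most likely one. Softmax is one common way of doing that normalization.

```python
# Turn four raw output scores into a probability distribution and pick the
# category with the highest probability.
import numpy as np

categories = ["rain", "sun", "cloudy", "snow"]

def softmax(scores):
    exps = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exps / exps.sum()

scores = np.array([0.2, 2.0, 0.9, 0.2])      # made-up raw outputs from the network
probabilities = softmax(scores)
prediction = categories[int(np.argmax(probabilities))]
print(dict(zip(categories, probabilities.round(2))), "->", prediction)
```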

  • So this was sort of the idea of supervised machine learning.

  • I can give this neural network a whole bunch of data--

  • whole bunch of input data--

  • corresponding to some label, some output data--

  • like we know that it was raining on this day,

  • we know that it was sunny on that day--

  • and using all of that data, the algorithm

  • can use gradient descent to figure out what all of the weights

  • should be in order to create some sort of model that

  • hopefully allows us a way to predict what

  • we think the weather is going to be.

  • But neural networks have a lot of other applications as well.

  • You can imagine applying the same sort of idea

  • to a reinforcement learning sort of example as well.

  • Well, you'll remember that in reinforcement learning, what we wanted to do

  • is train some sort of agent to learn what action to take depending on what

  • state they currently happen to be in.

  • So depending on the current state of the world,

  • we wanted the agent to pick from one of the actions that

  • are available to them.

  • And you might model that by having each of these input variables

  • represent some information about the state--

  • some data about what state our agent is currently in--

  • and then the output, for example, could be

  • each of the various different actions that our agent could

  • take-- action 1, 2, 3, and 4, and you might

  • imagine that this network would work in the same way,

  • that based on these particular inputs we go ahead

  • and calculate values for each of these outputs,

  • and those outputs could model which action is better than other actions,

  • and we could just choose, based on looking at those outputs, which

  • actions we should take.

  • And so these neural networks are very broadly applicable,

  • that all they're really doing is modeling some mathematical function.

  • So anything that we can frame as a mathematical function, something

  • like classifying inputs into various different categories,

  • or figuring out based on some input state what

  • action we should take-- these are all mathematical functions that we could

  • attempt to model by taking advantage of this neural network structure,

  • and in particular, taking advantage of this technique, gradient descent,

  • that we can use in order to figure out what the weights should be in order

  • to do this sort of calculation.

  • Now how is it that you would go about training a neural network that has

  • multiple outputs instead of just one?

  • Well, with just a single output, we could

  • see what the output for that value should be,

  • and then update all of the weights that correspond to it.

  • And when we have multiple outputs, at least in this particular case,

  • we can really think of this as four separate neural networks,

  • that really we just have one network here

  • that has these three inputs, corresponding with these three weights,

  • corresponding to this one output value.

  • And the same thing is true for this output value.

  • This output value effectively defines yet another neural network

  • that has these same three inputs, but a different set of weights

  • that correspond to this output.

  • And likewise, this output has its own set of weights as well,

  • and the same thing for the fourth output too.

  • And so if you wanted to train a neural network that had four outputs instead

  • of just one, in this case where the inputs are directly connected

  • to the outputs, you could really think of this

  • as just training four independent neural networks.

  • We know what the outputs for each of these four

  • should be based on our input data, and using that data,

  • we can begin to figure out what all of these individual weights should be,

  • and maybe there's an additional step at the end

  • to turn these values into a probability distribution,

  • such that we can interpret which one is better than another

  • or more likely than another as a category or something like that.

  • So this then seems like it does a pretty good job of taking inputs and trying

  • to predict what outputs should be, and we'll

  • see some real examples of this in just a moment as well.

  • But it's important then to think about what

  • the limitations of this sort of approach is,

  • of just taking some linear combination of inputs

  • and passing it into some sort of activation function.

  • And it turns out that when we do this in the case of binary classification--

  • I'm trying to predict like does it belong to one category or another--

  • we can only predict things that are linearly separable, because we're

  • taking a linear combination of inputs and using that to define some decision

  • boundary or threshold.

  • Then what we get is a situation where if we have this set of data,

  • we can predict a line that separates linearly

  • the red points from the blue points.

  • But a single unit that is making a binary classification,

  • otherwise known as a perceptron, can't deal with a situation like this,

  • where-- we've seen this type of situation before--

  • where there is no straight line that just

  • goes straight through the data that will divide the red points away

  • from the blue points.

  • It's a more complex decision boundary.

  • The decision boundary somehow needs to capture the things

  • inside of the circle, and there isn't really a line

  • that will allow us to deal with that.

  • So this is the limitation of the perceptron--

  • these units that just make these binary decisions based on their inputs--

  • that a single perceptron is only capable of learning

  • a linearly separable decision boundary.

  • All it can do is define a line.

  • And sure, it can give us probabilities based

  • on how close to that decision boundary we are,

  • but it can only really decide based on a linear decision boundary.

  • And so this doesn't seem like it's going to generalize well to situations

  • where real-world data is involved, because real-world data often

  • isn't linearly separable.

  • It often isn't the case that we can just draw a line through the data

  • and be able to divide it up into multiple groups.

  • So what then is the solution to this?

  • Well, what was proposed was the idea of a multilayer neural network,

  • that so far, all of the neural networks we've seen have had a set of inputs

  • and a set of outputs, and the inputs are connected to those outputs.

  • But in a multi-layer neural network, this is going to be an artificial

  • neural network that has an input layer still, it has an output layer,

  • but also has one or more hidden layers in between--

  • other layers of artificial neurons, or units, that

  • are going to calculate their own values as well.

  • So instead of a neural network that looks like this,

  • with three inputs and one output, you might imagine, in the middle here,

  • injecting a hidden layer--

  • something like this.

  • This is a hidden layer that has four nodes.

  • You could choose how many nodes or units end up going into the hidden layer,

  • and you could have multiple hidden layers as well.

  • And so now each of these inputs isn't directly connected to the output.

  • Each of the inputs is connected to this hidden layer, and then

  • all of the nodes in the hidden layer, those are connected to the one output.

  • And so this is just another step that we can

  • take towards calculating more complex functions.

  • Each of these hidden units will calculate its output value,

  • otherwise known as its activation, based on a linear combination

  • of all the inputs.

  • And once we have values for all of these nodes,

  • as opposed to this just being the output, we do the same thing again--

  • calculate the output for this node, based

  • on multiplying each of the values for these units by their weights as well.

  • So in effect, the way this works is that we start with inputs.

  • They get multiplied by weights in order to calculate

  • values for the hidden nodes.

  • Those get multiplied by weights in order to figure out what

  • the ultimate output is going to be.

  • And the advantage of layering things like this is it gives us an ability

  • to model more complex functions, that instead of just having a single

  • decision boundary-- a single line dividing the red points from the blue

  • points--

  • each of these hidden nodes can learn a different decision boundary,

  • and we can combine those decision boundaries to figure out what

  • the ultimate output is going to be.

  • And as we begin to imagine more complex situations,

  • you could imagine each of these nodes learning some useful property

  • or learning some useful feature of all of the inputs

  • and somehow learning how to combine those features together in order to get

  • the output that we actually want.

  • Now the natural question, when we begin to look at this now,

  • is to ask the question of, how do we train a neural network

  • that has hidden layers inside of it?

  • And this turns out to initially be a bit of a tricky question,

  • because the input data we are given is we are given values for all

  • of the inputs, and we're given what the value of the output should be--

  • what the category is, for example--

  • but the input data doesn't tell us what the values for all of these nodes

  • should be.

  • So we don't know how far off each of these nodes

  • actually is, because we're only given data for the inputs and the outputs.

  • The reason this is called the hidden layer

  • is because the data that is made available to us

  • doesn't tell us what the values for all of these intermediate nodes

  • should actually be.

  • And so the strategy people came up with was to say that if you know what

  • the error or the losses on the output node, well,

  • then based on what these weights are-- if one of these weights is higher than

  • another--

  • you can calculate an estimate for how much the error from this output node

  • was due to this part of the hidden layer, or that part of the hidden layer,

  • based on the values of these weights,

  • in effect saying, that based on the error from the output,

  • I can backpropagate the error and figure out

  • an estimate for what the error is for each of these nodes in the hidden layer as well.

  • And there's some more calculus here that we won't get into the details of,

  • but the idea of this algorithm is known as backpropagation.

  • It's an algorithm for training a neural network

  • with multiple different hidden layers.

  • And the idea for this-- the pseudocode for it--

  • will again be, if we want to run gradient descent with backpropagation,

  • we'll start with a random choice of weights as we did before,

  • and now we'll go ahead and repeat the training process again and again.

  • But what we're going to do each time is now

  • we're going to calculate the error for the output layer first.

  • We know the output and what it should be, and we know what we calculated,

  • so we figure out what the error there is.

  • But then we're going to repeat, for every layer,

  • starting with the output layer, moving back into the hidden layer,

  • then the hidden layer before that if there are multiple hidden layers,

  • going back all the way to the very first hidden layer,

  • assuming there are multiple, we're going to propagate the error back one layer--

  • whatever the error was from the output--

  • figure out what the error should be a layer before that based on what

  • the values of those weights are.

  • And then we can update those weights.

  • So graphically, the way you might think about this

  • is that we first start with the output.

  • We know what the output should be.

  • We know what output we calculated.

  • And based on that, we can figure out, all right,

  • how do we need to update those weights, backpropagating

  • the error to these nodes.

  • And using that, we can figure out how we should update these weights.

  • And you might imagine if there are multiple layers,

  • we could repeat this process again and again

  • to begin to figure out how all of these weights should be updated.
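
To make the pseudocode above a little more concrete, here is a minimal NumPy sketch of gradient descent with backpropagation for a network with a single hidden layer and sigmoid activations; the data, layer sizes, learning rate, and loss are made-up choices of mine, not the lecture's.

```python
# Backpropagation sketch: forward pass, compute the error at the output layer,
# propagate it back to the hidden layer based on the weights, update all weights.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up data: 4 examples, 2 inputs each, one 0/1 label per example.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # input -> hidden weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output weights
learning_rate = 0.5

for _ in range(5000):
    # Forward pass: inputs -> hidden activations -> output.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Error at the output layer (gradient of a squared loss through the sigmoid).
    output_error = (output - y) * output * (1 - output)
    # Propagate that error back one layer, weighted by the values of W2.
    hidden_error = (output_error @ W2.T) * hidden * (1 - hidden)

    # Update each set of weights one small step in the gradient's direction.
    W2 -= learning_rate * hidden.T @ output_error
    b2 -= learning_rate * output_error.sum(axis=0, keepdims=True)
    W1 -= learning_rate * X.T @ hidden_error
    b1 -= learning_rate * hidden_error.sum(axis=0, keepdims=True)

print(output.round(2))   # network's outputs for the four training examples
```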

  • And this backpropagation algorithm is really

  • the key algorithm that makes neural networks possible,

  • and makes it possible to take these multi-level structures

  • and be able to train those structures, depending

  • on what the values of these weights are in order to figure out

  • how it is that we should go about updating those weights in order

  • to create some function that is able to minimize the total amount of loss,

  • to figure out some good setting of the weights that will take the inputs

  • and translate it into the output that we expect.

  • And this works, as we said, not just for a single hidden layer,

  • but you can imagine multiple hidden layers, where each hidden layer--

  • we just defined however many nodes we want--

  • where each of the nodes in one layer, we can

  • connect to the nodes in the next layer, defining more and more complex

  • networks that are able to model more and more complex types of functions.

  • And so this type of network is what we might call a deep neural network, part

  • of a larger family of deep learning algorithms,

  • if you've ever heard that term.

  • And all deep learning is about is it's using multiple layers to be

  • able to predict and be able to model higher-level features inside

  • of the input, to be able to figure out what the output should be.

  • And so the deep neural network is just a neural network that

  • has multiple of these hidden layers, where we start at the input,

  • calculate values for this layer, then this layer, then this layer,

  • and then ultimately get an output.

  • And this allows us to be able to model more and more sophisticated

  • types of functions, that each of these layers

  • can calculate something a little bit different.

  • And we can combine that information to figure out what the output should be.

  • Of course, as with any situation of machine learning,

  • as we begin to make our models more and more complex,

  • to model more and more complex functions, the risk we run

  • is something like overfitting.

  • And we talked about overfitting last time

  • in the context of overfitting based on when we were training our models to be

  • able to learn some sort of decision boundary, where overfitting happens

  • when we fit too closely to the training data, and as a result,

  • we don't generalize well to other situations as well.

  • And one of the risks we run with a far more complex neural network that

  • has many, many different nodes is that we

  • might overfit based on the input data; we

  • might grow over-reliant on certain nodes to calculate things just purely based

  • on the input data that doesn't allow us to generalize very well to the output.

  • And there are a number of strategies for dealing with overfitting,

  • but one of the most popular in the context of neural networks

  • is a technique known as dropout.

  • And what dropout does is, when we're training the neural network, we'll

  • temporarily remove units,

  • temporarily remove these artificial neurons

  • from our network, chosen at random, and the goal here

  • is to prevent over-reliance on certain units.

  • So what generally happens in overfitting is

  • that we begin to over-rely on certain units inside the neural network

  • to be able to tell us how to interpret the input data.

  • What dropout will do is randomly remove some of these units

  • in order to reduce the chance that we over-rely on certain units,

  • to make our neural network more robust, to be

  • able to handle the situations even when we just drop out particular neurons

  • entirely.

  • So the way that might work is we have a network like this,

  • and as we're training it, when we go about trying

  • to update the weights the first time, we'll

  • just randomly pick some percentage of the nodes to drop out of the network.

  • It's as if those nodes aren't there at all.

  • It's as if the weights associated with those nodes aren't there at all.

  • And we'll train in this way.

  • Then the next time we update the weights, we'll pick a different set

  • and just go ahead and train that way, and then again randomly choose

  • and train with other nodes that have been dropped out as well.

  • And the goal of that is that after the training process,

  • if you train by dropping out random nodes inside of this neural network,

  • you hopefully end up with a network that's a little bit more robust, that

  • doesn't rely too heavily on any one particular node,

  • but more generally learns how to approximate a function in general.
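
Here is a tiny sketch (my own illustration) of what one dropout step does to a layer's activations during training: a random fraction of the units is temporarily zeroed out, so later layers can't rely too heavily on any one of them. The 0.5 rate is just an example choice.

```python
# Dropout sketch: randomly zero out some hidden units for this training update.
import numpy as np

hidden_activations = np.array([0.7, 0.1, 0.9, 0.4, 0.3, 0.8])
dropout_rate = 0.5

mask = np.random.rand(hidden_activations.size) >= dropout_rate
dropped = hidden_activations * mask   # dropped-out units contribute nothing
print(dropped)
```

In TensorFlow, this same idea is usually expressed by placing a tf.keras.layers.Dropout layer between the layers of a model rather than building the mask by hand.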

  • So that then is a look at some of these techniques

  • that we can use in order to implement a neural network, to get

  • at the idea of taking this input, passing it

  • through these various different layers, in order

  • to produce some sort of output.

  • And what we'd like to do now is take those ideas and put them into code.

  • And to do that, there are a number of different machine learning

  • libraries-- neural network libraries-- that we can use that

  • allow us to get access to someone's implementation of backpropagation

  • and all of these hidden layers.

  • And one of the most popular, developed by Google,

  • is known as TensorFlow, a library that we

  • can use for quickly creating neural networks

  • and modeling them and running them on some sample data

  • to see what the output is going to be.

  • And before we actually start writing code,

  • we'll go ahead and take a look at TensorFlow's Playground, which

  • will be an opportunity for us just to play around

  • with this idea of neural networks in different layers,

  • just to get a sense for what it is that we can do by taking advantage

  • of neural networks.

  • So let's go ahead and go into TensorFlow's Playground, which you can

  • go to by visiting that URL from before.

  • And what we're going to do now is we're going to try and learn the decision

  • boundary for this particular output.

  • I want to learn to separate the orange points from the blue points,

  • and I'd like to learn some sort of setting of weights

  • inside of a neural network that will be able to separate those from each other.

  • The features we have access to, our input data,

  • are the x value and the y value, so the two values along each of the two axes.

  • And what I'll do now is I can set particular parameters, like what

  • activation function I would like to use, and I'll just go ahead

  • and press Play and see what happens.

  • And what happens here is that you'll see that just by using these two input

  • features-- the x value and the y value, with no hidden layers--

  • just take the input, x and y values, and figure out what the decision boundary

  • is--

  • our neural network learns pretty quickly that in order

  • to divide these two points, we should just use this line.

  • This line acts as the decision boundary that separates this group of points

  • from that group of points, and it does it very well.

  • You can see up here what the loss is.

  • The training loss is zero, meaning we were

  • able to perfectly model separating these two points from each other inside

  • of our training data.

  • So this was a fairly simple case of trying to apply a neural network,

  • because the data is very clean; it's very nicely linearly separable.

  • We can just draw a line that separates all of those points from each other.

  • Let's now consider a more complex case.

  • So I'll go ahead and pause the simulation,

  • and we'll go ahead and look at this data set here.

  • This data set is a little bit more complex now.

  • In this data set, we still have blue and orange points

  • that we'd like to separate from each other,

  • but there is no single line that we can draw

  • that is going to be able to figure out how to separate

  • the blue from the orange, because the blue is located in these two quadrants

  • and the orange is located here and here.

  • It's a more complex function to be able to learn.

  • So let's see what happens if we just try and predict based on those inputs--

  • the x- and y-coordinates-- what the output should be.

  • Press Play, and what you'll notice is that we're not really able

  • to draw much of a conclusion, that we're not

  • able to very cleanly see how we should divide

  • the orange points from the blue points, and you don't

  • see a very clean separation there.

  • So it seems like we don't have enough sophistication inside of our network

  • to be able to model something that is that complex.

  • We need a better model for this neural network.

  • And I'll do that by adding a hidden layer.

  • So now I have the hidden layer that has two neurons inside of it.

  • So I have two inputs that then go to two neurons inside of a hidden layer

  • that then go to our output, and now I'll press Play, and what you'll notice here

  • is that we're able to do slightly better.

  • We're able to now say, all right, these points are definitely blue.

  • These points are definitely orange.

  • We're still struggling a little bit with these points up here though,

  • and what we can do is we can see for each

  • of these hidden neurons what is it exactly

  • that these hidden neurons are doing.

  • Each hidden neuron is learning its own decision boundary,

  • and we can see what that boundary is.

  • This first neuron is learning, all right,

  • this line that seems to separate some of the blue points

  • from the rest of the points.

  • This other hidden neuron is learning another line

  • that seems to be separating the orange points in the lower

  • right from the rest of the points.

  • So that's why we're able to sort of figure out

  • these two areas in the bottom region, but we're still not

  • able to perfectly classify all of the points.

  • So let's go ahead and add another neuron--

  • now we've got three neurons inside of our hidden layer--

  • and see what we're able to learn now.

  • All right.

  • Well, now we seem to be doing a better job

  • by learning three different decision boundaries, one for

  • each of the three neurons inside of our hidden layer,

  • which together are able to much better figure out how to separate these blue points

  • from the orange points.

  • And you can see what each of these hidden neurons is learning.

  • Each one is learning a slightly different decision boundary,

  • and then we're combining those decision boundaries together

  • to figure out what the overall output should be.

  • And we can try it one more time by adding a fourth neuron there

  • and try learning that.

  • And it seems like now we can do even better

  • at trying to separate the blue points from the orange points,

  • but we were only able to do this by adding a hidden layer,

  • by adding some layer that is learning some other boundaries,

  • and combining those boundaries to determine the output.

  • And the strength-- the size and thickness of these lines--

  • indicate how high these weights are, how important each of these inputs

  • is for making this sort of calculation.

  • And we can do maybe one more simulation.

  • Let's go ahead and try this on a data set that looks like this.

  • Go ahead and get rid of the hidden layer.

  • Here now we're trying to separate the blue points

  • from the orange points, where all the blue points are located, again,

  • inside of a circle, effectively.

  • So we're not going to be able to learn a line.

  • Notice, when I press Play, we're really not

  • able to draw any sort of classification at all,

  • because there is no line that cleanly separates

  • the blue points from the orange points.

  • So let's try to solve this by introducing a hidden layer.

  • I'll go ahead and press Play.

  • And all right.

  • With two neurons in a hidden layer, we're

  • able to do a little better, because we effectively learned

  • two different decision boundaries.

  • We learned this line here, and we learned this line

  • on the right-hand side.

  • And right now, we're just saying, all right, well, if it's in-between,

  • we'll call it blue, and if it's outside, we'll call it orange.

  • So, not great, but certainly better than before.

  • We're learning one decision boundary and another, and based on those,

  • we can figure out what the output should be.

  • But let's now go ahead and add a third neuron and see what happens now.

  • I go ahead and train it.

  • And now, using three different decision boundaries

  • that are learned by each of these hidden neurons,

  • we're able to much more accurately model this distinction

  • between blue points and orange points.

  • We're able to figure out, maybe with these three decision boundaries,

  • combining them together, you can imagine figuring out what the output should be

  • and how to make that sort of classification.

  • And so the goal here is just to get a sense

  • for how having more neurons in these hidden layers

  • allows us to learn more structure in the data,

  • allows us to figure out what the relevant and important decision

  • boundaries are.

  • And then using this backpropagation algorithm,

  • we're able to figure out what the values of these weights

  • should be in order to train this network to be

  • able to classify one category of points away from another category of points

  • instead.

  • And this is ultimately what we're going to be trying to do whenever

  • we're training a neural network.

  • So let's go ahead and actually see an example of this.

  • You'll recall from last time that we had this banknotes file that

  • included information about counterfeit banknotes as opposed

  • to authentic banknotes, where it had four different values for each banknote

  • and then a categorization of whether that banknote is considered

  • to be authentic or a counterfeit note.

  • And what I wanted to do was, based on that input information,

  • figure out some function that could calculate

  • based on the input information what category it belonged to.

  • And what I've written here in banknotes.py

  • is a neural network that will learn just that, a network that learns,

  • based on all of the input, whether or not

  • we should categorize a banknote as authentic or as counterfeit.

  • The first step is the same as what we saw from last time.

  • I'm really just reading the data in and getting it into an appropriate format.

  • And so this is where more of the writing of Python code on your own

  • comes in, in terms of manipulating this data,

  • massaging the data into a format that will

  • be understood by a machine learning library

  • like scikit-learn or like TensorFlow.

  • And so here I separate it into a training and a testing set.

  • And now what I'm doing down below is I'm creating a neural network.

  • Here I'm using tf, which stands for TensorFlow.

  • Up above I said, import TensorFlow as tf.

  • So you have just an abbreviation that we'll often use,

  • so we don't need to write out TensorFlow every time we want

  • to use anything inside of the library.

  • I'm using tf.keras.

  • Keras is an API, a set of functions that we

  • can use in order to manipulate neural networks inside of TensorFlow,

  • and it turns out there are other machine learning

  • libraries that also use the Keras API.

  • But here, I'm saying, all right, go ahead and give me

  • a model that is a sequential model-- a sequential neural network--

  • meaning one layer after another.

  • And now I'm going to add to that model what layers I want inside

  • of my neural network.

  • So here I'm saying, model.add.

  • Go ahead and add a dense layer--

  • and when we say a dense layer, we mean a layer in which

  • each of the nodes inside of the layer

  • is going to be connected to each node from the previous layer,

  • so we have a densely connected layer.

  • This layer is going to have eight units inside of it.

  • So it's going to be a hidden layer inside of a neural network with eight

  • different units, eight artificial neurons, each of which

  • might learn something different.

  • And I just sort of chose eight arbitrarily.

  • You could choose a different number of hidden nodes inside of the layer.

  • And as we saw before, depending on the number of units

  • there are inside of your hidden layer, more units

  • means you can learn more complex functions,

  • so maybe you can more accurately model the training data,

  • but it comes at a cost.

  • More units means more weights that you need to figure out how to update,

  • so it might be more expensive to do that calculation.

  • And you also run the risk of overfitting on the data if you have too many units,

  • and you learn to just overfit on the training data.

  • That's not good either.

  • So there is a balance, and there's often a testing process,

  • where you'll train on some data and maybe validate how well you're

  • doing on a separate set of data--

  • often called a validation set-- to see, all right, which setting of parameters,

  • how many layers should I have, how many units

  • should be in each layer, which one of those

  • performs the best on the validation set?

  • So you can do some testing to figure out what these so-called hyperparameters

  • should be equal to.

  • Next I specify what the input_shape is, meaning what does my input look like?

  • My input has four values, and so the input shape

  • is just 4, because we have four inputs.

  • And then I specify what the activation function is.

  • And the activation function, again, we can choose.

  • There are a number of different activation functions.

  • Here I'm using relu, which you might recall from earlier.

  • And then I'll add an output layer.

  • So I have my hidden layer.

  • Now I'm adding one more layer that will just

  • have one unit, because all I want to do is predict something

  • like counterfeit bill or authentic bill.

  • So I just need a single unit.

  • And the activation function I'm going to use here

  • is that sigmoid activation function, which

  • again was that S-shaped curve that just gave us like a probability of,

  • what is the probability that this is a counterfeit bill as opposed

  • to an authentic bill?
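  • Roughly, those two activation functions compute something like this-- a small illustrative sketch, not code from banknotes.py:

```python
# A small sketch of the two activation functions mentioned here.
import numpy as np

def relu(x):
    # Rectified linear unit: 0 for negative inputs, the input itself otherwise
    return np.maximum(0, x)

def sigmoid(x):
    # S-shaped curve squashing any real number into the range (0, 1),
    # which is why it's handy for outputting a probability
    return 1 / (1 + np.exp(-x))

print(relu(np.array([-2.0, 0.5, 3.0])))     # [0.  0.5 3. ]
print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # roughly [0.12, 0.5, 0.88]
```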

  • So that then is the structure of my neural network-- sequential neural

  • network that has one hidden layer with eight units inside of it,

  • and then one output layer that just has a single unit inside of it.

  • And I can choose how many units there are.

  • I can choose the activation function.

  • Then I'm going to compile this model.

  • TensorFlow gives you a choice of how you would like to optimize the weights--

  • there are various different algorithms for doing that--

  • what type of loss function you want to use-- again,

  • many different options for doing that--

  • and then how I want to evaluate my model.

  • Well, I care about accuracy.

  • I care about how many of my points am I able to classify correctly

  • versus not correctly of counterfeit or not counterfeit,

  • and I would like it to report to me how accurate my model is performing.

  • Then, now that I've defined that model, I

  • call model.fit to say, go ahead and train the model.

  • Train it on all the training data, plus all of the training labels--

  • so labels for each of those pieces of training data--

  • and I'm saying run it for 20 epochs, meaning go ahead

  • and go through each of these training points 20 times effectively,

  • go through the data 20 times and keep trying to update the weights.

  • If I did it for more, I could train for even longer

  • and maybe get a more accurate result. But then

  • after I fit it on all the data, I'll go ahead and just test it.

  • I'll evaluate my model using model.evaluate,

  • built into TensorFlow, that is just going to tell me,

  • how well do I perform on the testing data?

  • So ultimately, this is just going to give me

  • some numbers that tell me how well we did in this particular case.
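  • For reference, a minimal sketch of what a banknotes.py-style script might look like-- the file name, column layout, and CSV parsing here are assumptions, not the original file:

```python
# Read the banknote data, split it, build a sequential network with one
# 8-unit hidden layer and a sigmoid output, compile, fit for 20 epochs,
# and evaluate on the held-out testing data.
import csv
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

with open("banknotes.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    data = [(list(map(float, row[:4])), int(row[4])) for row in reader]

evidence = np.array([row[0] for row in data])
labels = np.array([row[1] for row in data])

# Separate the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(evidence, labels, test_size=0.4)

# A sequential neural network: one hidden layer of 8 units, one output unit
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(8, input_shape=(4,), activation="relu"))
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

# Pick an optimizer and a loss function, and ask it to report accuracy
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train on the training data for 20 epochs, then evaluate on the testing data
model.fit(X_train, y_train, epochs=20)
model.evaluate(X_test, y_test, verbose=2)
```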

  • So now what I'm going to do is go into banknotes

  • and go ahead and run banknotes.py.

  • And what's going to happen now is it's going

  • to read in all of that training data.

  • It's going to generate a neural network with all my inputs,

  • the eight hidden units inside my hidden layer,

  • and then an output unit, and now what it's doing is it's training.

  • It's training 20 times, and each time, you

  • can see how my accuracy is increasing on my training data.

  • It starts off, the very first time, not very accurate,

  • though better than random, something like 79% of the time,

  • it's able to accurately classify one bill from another.

  • But as I keep training, notice this accuracy value improves and improves

  • and improves, until after I've trained through all of the data points

  • 20 times, it looks like my accuracy is above 99% on the training data.

  • And here's where I tested it on a whole bunch of testing data.

  • And it looks like in this case, I was also like 99.8% accurate.

  • So just using that, I was able to generate a neural network that

  • can detect counterfeit bills from authentic bills

  • based on this input data 99.8% of the time, at least

  • based on this particular testing data.

  • And I might want to test it with more data

  • as well, just to be confident about that.

  • But this is really the value of using a machine learning library

  • like TensorFlow, and there are others available for Python

  • and other languages as well, but all I have to do

  • is define the structure of the network and define the data

  • that I'm going to pass into the network, and then

  • TensorFlow runs the backpropagation algorithm

  • for learning what all of those weights should be,

  • for figuring out how to train this neural network to be able to,

  • as accurately as possible, figure out what the output values should

  • be there as well.

  • And so this then was a look at what it is that neural networks can do, just

  • using these sequences of layer after layer after layer,

  • and you can begin to imagine applying these to much more general problems.

  • And one big problem in computing, and artificial intelligence more generally,

  • is the problem of computer vision.

  • Computer vision is all about computational methods

  • for analyzing and understanding images, that you might have pictures

  • that you want the computer to figure out how to deal with,

  • how to process those images, and figure out how to produce

  • some sort of useful result out of this.

  • You've seen this in the context of social media websites

  • that are able to look at a photo that contains a whole bunch of faces,

  • and it's able to figure out what's a picture of whom

  • and label those and tag them with appropriate people.

  • This is becoming increasingly relevant as we

  • begin to discuss self-driving cars.

  • These cars now have cameras, and we would

  • like for the computer to have some sort of algorithm that

  • looks at the images and figures out, what

  • color is the light, what cars are around us and in what direction, for example.

  • And so computer vision is all about taking an image

  • and figuring out what sort of computation--

  • what sort of calculation-- we can do with that image.

  • It's also relevant in the context of something like handwriting recognition.

  • This, what you're looking at, is an example of the MNIST dataset--

  • it's a big dataset just of handwritten digits--

  • that we could use to, ideally, try and figure out how to predict,

  • given someone's handwriting, given a photo of a digit that they have drawn,

  • can you predict whether it's a 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9, for example.

  • So this sort of handwriting recognition is yet another task

  • that we might want to use computer vision tasks and tools to be

  • able to apply it towards.

  • This might be a task that we might care about.

  • So how then can we use neural networks to be

  • able to solve a problem like this?

  • Well, neural networks rely upon some sort of input,

  • where that input is just numerical data.

  • We have a whole bunch of units, where each one of them

  • just represents some sort of number.

  • And so in the context of something like handwriting recognition,

  • or in the context of just an image, you might

  • imagine that an image is really just a grid of pixels, a grid of dots,

  • where each dot has some sort of color, and in the context

  • of something like handwriting recognition,

  • you might imagine that if you just fill in each

  • of these dots in a particular way, you can generate a 2 or an 8,

  • for example, based on which dots happen to be shaded in and which dots are not.

  • And we can represent each of these pixel values just using numbers.

  • So for a particular pixel, for example, 0 might represent entirely black.

  • Depending on how you're representing color,

  • it's often common to represent color values on a 0-to-255 range,

  • so that you can represent a color using eight bits for a particular value,

  • like how much white is in the image?

  • So 0 might represent all black, 255 might represent entirely white

  • as a pixel, and somewhere in between might represent some shade of gray,

  • for example.

  • But you might imagine not just having a single slider that determines how much

  • white is in the image, but if you had a color image,

  • you might imagine three different numerical values-- a red, green,

  • and blue value--

  • where the red value controls how much red is in the image,

  • we have one value for controlling how much green is in the pixel,

  • and one value for how much blue is in the pixel as well.

  • And depending on how it is that you set these values of red, green, and blue,

  • you can get a different color.

  • And so any pixel can really be represented in this case

  • by three numerical values--

  • a red value, a green value, and a blue value.

  • And if you take a whole bunch of these pixels,

  • assemble them together inside of a grid of pixels, then

  • you really just have a whole bunch of numerical values

  • that you can use in order to perform some sort of prediction task.
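  • As a tiny illustration of that idea in code-- the filename here is just a placeholder-- you could read pixel values out of an image with the Pillow library:

```python
# Illustrative only: reading pixel values out of an image as numbers.
from PIL import Image

image = Image.open("photo.png").convert("RGB")   # "photo.png" is a placeholder
r, g, b = image.getpixel((0, 0))                 # one color pixel = three values, each 0-255
print(r, g, b)

grayscale = image.convert("L")                   # "L" = one channel, 0 (black) to 255 (white)
print(grayscale.getpixel((0, 0)))                # one grayscale pixel = one value
```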

  • And so what you might imagine doing is using the same techniques

  • we talked about before.

  • Just design a neural network with a lot of inputs, so that for each of the pixels,

  • we have one input-- or three different inputs in the case of a color image--

  • each of which is just connected to a deep neural network,

  • for example.

  • And this deep neural network might take all

  • of the pixels inside of the image of what digit a person drew,

  • and the output might be like 10 neurons that classify it as a 0 or a 1

  • or 2 or 3, or just tells us in some way what that digit happens to be.
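  • That naive approach might look something like this sketch-- the layer sizes here are arbitrary choices, not numbers from the lecture:

```python
# Flatten every pixel into one big input layer and feed it to a plain
# dense network with 10 softmax outputs, one per digit.
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # 784 inputs, one per pixel
    tf.keras.layers.Dense(128, activation="relu"),    # one hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),  # one output per digit 0-9
])
model.summary()
```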

  • Now there are a couple of drawbacks to this approach.

  • The first drawback to the approach is just the size of this input array,

  • that we have a whole bunch of inputs.

  • If we have a big image, that is a lot of different channels

  • we're looking at-- a lot of inputs, and therefore, a lot of weights

  • that we have to calculate.

  • And a second problem is the fact that by flattening everything

  • into just the structure of all the pixels,

  • we've lost access to a lot of the information about the structure

  • of the image that's relevant, that really,

  • when a person looks at an image, they're looking

  • at particular features of that image.

  • They're looking at curves.

  • They're looking at shapes.

  • They're looking at what things can you identify

  • in different regions of the image, and maybe put those things together

  • in order to get a better picture of what the overall image was about.

  • And by just turning it into pixel values for each of the pixels,

  • sure, you might be able to learn that structure,

  • but it might be challenging in order to do so.

  • It might be helpful to take advantage of the fact that you can use properties

  • of the image itself-- the fact that it's structured in a particular way--

  • to be able to improve the way that we learn based on that image too.

  • So in order to figure out how we can train our neural networks to better

  • be able to deal with images, we'll introduce a couple of ideas--

  • a couple of algorithms-- that we can apply that allow us to take the images

  • and extract some useful information out of that image.

  • And the first idea we'll introduce is the notion of image convolution.

  • And what an image convolution is all about is it's about filtering an image,

  • sort of extracting useful or relevant features out of the image.

  • And the way we do that is by applying a particular filter that basically adds together

  • the value of every pixel with the values of all of its neighboring

  • pixels, according to some sort of kernel matrix, which we'll see in a moment,

  • and which is going to allow us to weight these pixels in various different ways.

  • And the goal of image convolution then is

  • to extract some sort of interesting or useful features out of an image,

  • to be able to take a pixel, and based on its neighboring pixels,

  • maybe predict some sort of valuable information, something

  • like taking a pixel and looking at its neighboring pixels,

  • you might be able to predict whether or not

  • there's some sort of curve inside the image,

  • or whether it's forming the outline of a particular line or a shape,

  • for example, and that might be useful if you're

  • trying to use all of these various different features

  • to combine them to say something meaningful about an image as a whole.

  • So how then does image convolution work?

  • Well, we start with a kernel matrix, and the kernel matrix

  • looks something like this.

  • And the idea of this is that given a pixel--

  • that would be the middle pixel--

  • we're going to multiply each of the neighboring pixels by these values

  • in order to get some sort of result by summing up all of the numbers together.

  • So I'll take this kernel, which you can think of as like a filter

  • that I'm going to apply to the image.

  • And let's say that I take this image.

  • This is a four-by-four image.

  • We'll think of it as just a black and white image, where each one is just

  • a single pixel value, so somewhere between 0 and 255, for example.

  • So we have a whole bunch of individual pixel values like this,

  • and what I'd like to do is apply this kernel--

  • this filter, so to speak--

  • to this image.

  • And the way I'll do that is, all right, the kernel is three-by-three.

  • So you can imagine a five-by-five kernel or a larger kernel too.

  • And I'll take it and just first apply it to the first three-by-three section

  • of the image.

  • And what I'll do is I'll take each of these pixel values

  • and multiply it by its corresponding value in the filter matrix

  • and add all of the results together.

  • So here, for example, I'll say 10 times 0, plus 20 times negative 1, plus 30

  • times 0, so on and so forth, doing all of this calculation.

  • And at the end, if I take all these values,

  • multiply them by their corresponding value in the kernel,

  • add the results together, for this particular set of nine pixels,

  • I get the value of 10 for example.

  • And then what I'll do is I'll slide this three-by-three grid effectively over.

  • Slide the kernel by one to look at the next three-by-three section.

  • And here I'm just sliding it over by one pixel,

  • but you might imagine a different slide length,

  • or maybe I jump by multiple pixels at a time if you really wanted to.

  • You have different options here.

  • But here I'm just sliding over, looking at the next three-by-three section.

  • And I'll do the same math: 20 times 0, plus 30 times negative 1, plus 40

  • times 0, plus 20 times negative 1, so on and so forth, plus 30 times 5.

  • And what I end up getting is the number 20.

  • Then you can imagine shifting over to this one, doing the same thing,

  • calculating like the number 40, for example,

  • and then doing the same thing here and calculating a value there as well.

  • And so what we have now is what we'll call a feature map.

  • We have taken this kernel, applied it to each

  • of these various different regions, and what we get

  • is some representation of a filtered version of that image.
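  • To make that sliding-window arithmetic concrete, here's a small sketch-- the pixel values and kernel are made-up illustrations, not the exact numbers from the slide:

```python
# Slide a 3x3 kernel across a 4x4 image to build a 2x2 feature map.
import numpy as np

image = np.array([
    [10, 20, 30, 40],
    [10, 20, 30, 40],
    [20, 30, 40, 50],
    [20, 30, 40, 50],
])

kernel = np.array([
    [ 0, -1,  0],
    [-1,  5, -1],
    [ 0, -1,  0],
])

h, w = image.shape
kh, kw = kernel.shape
feature_map = np.zeros((h - kh + 1, w - kw + 1), dtype=int)
for i in range(h - kh + 1):
    for j in range(w - kw + 1):
        # Multiply each pixel by its corresponding kernel value and sum the results
        region = image[i:i + kh, j:j + kw]
        feature_map[i, j] = np.sum(region * kernel)

print(feature_map)
```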

  • And so to give a more concrete example of why it is that this kind of thing

  • could be useful, let's take this kernel matrix,

  • for example, which is quite a famous one, that has an 8 in the middle

  • and then all of the neighboring pixels that get a negative 1.

  • And let's imagine we wanted to apply that

  • to a three-by-three part of an image that looks like this,

  • where all the values are the same.

  • They're all 20, for instance.

  • Well, in this case, if you do 20 times 8, and then subtract 20,

  • subtract 20, subtract 20, for each of the eight neighbors,

  • well, the result of that is you just get that expression,

  • which comes out to be 0.

  • You multiply 20 by 8, but then you subtract 20 eight times,

  • according to that particular kernel.

  • The result of all of that is just 0.

  • So the takeaway here is that when a lot of the pixels are the same value,

  • we end up getting a value close to 0.

  • If, though, we had something like this, 20s along this first row,

  • then 50s in the second row, and 50s in the third row, well,

  • then when you do this same kind of math--

  • 20 times negative 1, 20 times negative 1, so on and so forth--

  • then I get a higher value-- a value like 90, in this particular case.

  • And so the more general idea here is that

  • by applying this kernel, negative 1s, 8 in the middle,

  • and then negative 1s, what I get is when this middle value is very

  • different from the neighboring values--

  • like 50 is greater than these 20s--

  • then you'll end up with a value higher than 0.

  • Like if this number is higher than its neighbors,

  • you end up getting a bigger output, but if this value is the same as all

  • of its neighbors, then you get a lower output, something like 0.
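  • A quick check of that arithmetic in code, using the same all-negative-1s-with-an-8-in-the-middle kernel:

```python
# The edge-detection kernel applied to a uniform region versus a region
# whose values differ from row to row.
import numpy as np

kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

uniform = np.full((3, 3), 20)                 # every pixel is 20
edge = np.array([[20, 20, 20],
                 [50, 50, 50],
                 [50, 50, 50]])               # 20s on top, 50s below

print(np.sum(uniform * kernel))  # 0  -- all neighbors are the same as the middle
print(np.sum(edge * kernel))     # 90 -- the middle value differs from some neighbors
```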

  • And it turns out that this sort of filter

  • can therefore be used in something like detecting edges in an image,

  • or detecting the boundaries between various different objects

  • inside of an image.

  • I might use a filter like this, which is able to tell

  • whether the value of this pixel is different from the values

  • of the neighboring pixels-- if it's greater than the values of the pixels

  • that happen to surround it.

  • And so we can use this in terms of image filtering.

  • And so I'll show you an example of that.

  • I have here, in filter.py, a file that uses the Python Imaging Library, or PIL,

  • to do some image filtering.

  • I go ahead and open an image.

  • And then all I'm going to do is apply a kernel to that image.

  • It's going to be a three-by-three kernel, the same kind of kernel

  • we saw before.

  • And here is the kernel.

  • This is just a list representation of the same matrix

  • that I showed you a moment ago, with its

  • first row being negative 1, negative 1, negative 1.

  • The second row is negative 1, 8, negative 1.

  • The third row is all negative 1s.

  • And then at the end, I'm going to go ahead and show the filtered image.
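  • A minimal sketch of what that filter.py might look like-- the command-line handling here is an assumption:

```python
# Open an image named on the command line, apply a 3x3 edge-detection
# kernel, and show the filtered result.
import sys
from PIL import Image, ImageFilter

image = Image.open(sys.argv[1]).convert("RGB")

filtered = image.filter(ImageFilter.Kernel(
    size=(3, 3),
    kernel=[-1, -1, -1,
            -1,  8, -1,
            -1, -1, -1],
    scale=1   # don't rescale by the kernel's sum (which would be 0)
))

filtered.show()
```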

  • So if, for example, I go into the convolution directory

  • and I open up an image like bridge.png, this

  • is what an input image might look like, just an image of a bridge over a river.

  • Now I'm going to go ahead and run this filter program on the bridge.

  • And what I get is this image here.

  • Just by taking the original image and applying that filter

  • to each three-by-three grid, I've extracted

  • all of the boundaries, all of the edges inside the image that separate

  • one part of the image from another.

  • So here I've got a representation of boundaries

  • between particular parts of the image.

  • And you might imagine that if a machine learning algorithm is

  • trying to learn like what an image is of, a filter like this

  • could be pretty useful.

  • Maybe the machine learning algorithm doesn't care about all

  • of the details of the image.

  • It just cares about certain useful features.

  • It cares about particular shapes that are

  • able to help it determine that based on the image,

  • this is going to be a bridge, for example.

  • And so this type of idea of image convolution

  • can allow us to apply filters to images that

  • allow us to extract useful results out of those images-- taking an image

  • and extracting its edges, for example.

  • You might imagine many other filters that

  • could be applied to an image that are able to extract particular values as

  • well.

  • And a filter might have separate kernels for the red values, the green values,

  • and the blue values that are all summed together at the end,

  • such that you could have particular filters looking for,

  • is there red in this part of the image?

  • Is there green in other parts of the image?

  • You can begin to assemble these relevant and useful filters that are

  • able to do these calculations as well.

  • So that then was the idea of image convolution-- applying

  • some sort of filter to an image to be able to extract

  • some useful features out of that image.

  • But all the while, these images are still pretty big.

  • There's a lot of pixels involved in the image.

  • And realistically speaking, if you've got a really big image,

  • that poses a couple of problems.

  • One, it means a lot of input going into the neural network,

  • but two, it also means that we really have

  • to care about what's in each particular pixel, whereas realistically,

  • if you're looking at an image, you often don't care

  • whether something is in one particular pixel

  • versus the pixel immediately to the right of it.

  • They're pretty close together.

  • You really just care about whether there is

  • a particular feature in some region of the image,

  • and maybe you don't care about exactly which pixel it happens to be.

  • And so there's a technique we can use known as pooling.

  • And what pooling is, is it means reducing the size of an input

  • by sampling from regions inside of the input.

  • So we're going to take a big image and turn it into a smaller image

  • by using pooling.

  • And in particular, one of the most popular types of pooling

  • is called max-pooling.

  • And what max-pooling does is it pools just by choosing the maximum value

  • in a particular region.

  • So, for example, let's imagine I had this four-by-four image,

  • but I wanted to reduce its dimensions.

  • I wanted to make it a smaller image, so that I have fewer inputs to work with.

  • Well, what I could do is I could apply a two-by-two max

  • pool, where the idea would be that I'm going

  • to first look at this two-by-two region and say, what

  • is the maximum value in that region?

  • Well, it's the number 50.

  • So we'll go ahead and just use the number 50.

  • And then we'll look at this two-by-two region.

  • What is the maximum value here?

  • 110.

  • So that's going to be my value.

  • Likewise here, the maximum value looks like 20.

  • Go ahead and put that there.

  • Then for this last region, the maximum value

  • was 40, so we'll go ahead and use that.

  • And what I have now is a smaller representation

  • of this same original image that I obtained just

  • by picking the maximum value from each of these regions.
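  • Here's that 2x2 max-pooling step as a small sketch; the grid is made up, chosen so the regional maxima come out to the 50, 110, 20, and 40 mentioned above:

```python
# 2x2 max-pooling over a 4x4 grid: keep only the largest value in each
# non-overlapping 2x2 region.
import numpy as np

image = np.array([
    [10, 50,  20, 110],
    [30, 40,  60,  70],
    [ 5, 20,  10,  40],
    [15, 10,  30,  25],
])

pool = 2
h, w = image.shape
pooled = np.zeros((h // pool, w // pool), dtype=int)
for i in range(0, h, pool):
    for j in range(0, w, pool):
        pooled[i // pool, j // pool] = image[i:i + pool, j:j + pool].max()

print(pooled)  # [[ 50 110]
               #  [ 20  40]]
```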

  • So again, the advantages here are now I only

  • have to deal with a two-by-two input instead of a four-by-four,

  • and you can imagine shrinking the size of an image even more.

  • But in addition to that, I'm now able to make

  • my analysis independent of whether a particular value was

  • in this pixel or this pixel.

  • I don't care if the 50 was here or here.

  • As long as it was generally in this region,

  • I'll still get access to that value.

  • So it makes our algorithms a little bit more robust as well.

  • So that then is pooling--

  • taking the size of the image and reducing it

  • a little bit by just sampling from particular regions inside of the image.

  • And now we can put all of these ideas together-- pooling, image convolution,

  • neural networks-- all together into another type of neural network called

  • a convolutional neural network, or a CNN, which is a neural network that

  • uses this convolution step, usually in the context of analyzing an image,

  • for example.

  • And so the way that a convolutional neural network works is that we

  • start with some sort of input image-- some grid of pixels--

  • but rather than immediately put that into the neural network layers

  • that we've seen before, we'll start by applying a convolution step, where

  • the convolution step involves applying a number of different image filters

  • to our original image in order to get what

  • we call a feature map, the result of applying some filter to an image.

  • And we could do this once, but in general, we'll

  • do this multiple times getting a whole bunch of different feature

  • maps, each of which might extract some different relevant feature out

  • of the image, some different important characteristic of the image

  • that we might care about using in order to calculate what the result should be.

  • And in the same way that we can train neural networks

  • to learn the weights between particular units

  • inside of the network,

  • we can also train neural networks to learn what those filters should be--

  • what the values of the filters should be--

  • in order to get the most useful, most relevant information out

  • of the original image just by figuring out what setting of those filter

  • values-- the values inside of that kernel--

  • results in minimizing the loss function and minimizing how poorly

  • our hypothesis actually performs in figuring out the classification

  • of a particular image, for example.

  • So we first apply this convolution step.

  • Get a whole bunch of these various different feature maps.

  • But these feature maps are quite large.

  • There are a lot of pixel values that happen to be here.

  • And so a logical next step to take is a pooling step,

  • where we reduce the size of these images by using max-pooling,

  • for example, extracting the maximum value from any particular region.

  • There are other pooling methods that exist

  • as well, depending on the situation.

  • You could use something like average-pooling,

  • where instead of taking the maximum value from a region,

  • you take the average value from a region, which has its uses as well.

  • But in effect, what pooling will do is it will take these feature maps

  • and reduce their dimensions, so that we end up

  • with smaller grids with fewer pixels.

  • And this then is going to be easier for us to deal with.

  • It's going to mean fewer inputs that we have to worry about,

  • and it's also going to mean we're more resilient, more robust,

  • against potential movements of particular values just by one pixel,

  • when ultimately, we really don't care about those one pixel differences that

  • might arise in the original image.

  • Now after we've done this pooling step, now we have a whole bunch of values

  • that we can then flatten out and just put

  • into a more traditional neural network.

  • So we go ahead and flatten it, and then we

  • end up with a traditional neural network that

  • has one input for each of these values in each of these resulting feature

  • maps after we do the convolution and after we do the pooling step.

  • And so this then is the general structure of a convolutional network.

  • We begin with the image, apply convolution,

  • apply pooling, flatten the results, and then put that

  • into a more traditional neural network that might itself have hidden layers.

  • You can have deep convolutional networks that

  • have hidden layers in between this flattened layer and the eventual output

  • to be able to calculate various different features of those values.

  • But this then can help us to be able to use convolution and pooling,

  • to use our knowledge about the structure of an image,

  • to be able to get better results, to be able to train our networks faster

  • in order to better capture particular parts of the image.

  • And there's no reason necessarily why you can only use these steps once.

  • In fact, in practice, you'll often use convolution and pooling multiple times

  • in multiple different steps.

  • So what you might imagine doing is starting with an image,

  • first applying convolution to get a whole bunch of maps,

  • then applying pooling, then applying convolution again,

  • because these maps are still pretty big.

  • You can apply convolution to try and extract relevant features

  • out of this result. Then take those results,

  • apply pooling in order to reduce their dimensions, and then take that

  • and feed it into a neural network that maybe has fewer inputs.

  • So here, I have two different convolution and pooling steps.

  • I do convolution and pooling once, and then I

  • do convolution and pooling a second time, each time extracting

  • useful features from the layer before it, each time using

  • pooling to reduce the dimensions of what you're ultimately looking at.
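  • As a sketch, that repeated structure in Keras might look something like this-- the filter counts and sizes here are arbitrary choices:

```python
# Convolution and pooling applied twice before flattening into a dense
# network; the sizes are illustrative, not prescribed.
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),   # second convolution step
    tf.keras.layers.MaxPooling2D((2, 2)),                    # second pooling step
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()
```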

  • And the goal now of this sort of model is that in each of these steps,

  • you can begin to learn different types of features

  • of the original image, that maybe in the first step

  • you learn very low-level features, just learn and look for features like edges

  • and curves and shapes, because based on pixels and their neighboring values,

  • you can figure out, all right, what are the edges?

  • What are the curves?

  • What are the various different shapes that might be present there?

  • But then once you have a mapping that just represents

  • where the edges and curves and shapes happen to be,

  • you can imagine applying the same sort of process

  • again to begin to look for higher-level features-- look for objects,

  • maybe look for people's eyes in facial recognition,

  • for example, maybe look at more complex shapes like the curves

  • on a particular number if you're trying to recognize a digit in a handwriting

  • recognition sort of scenario.

  • And then after all of that, now that you have

  • these results that represent these higher-level features,

  • you can pass them into a neural network, which is really

  • just a deep neural network that looks like this, where you might imagine

  • making a binary classification, or classifying into multiple categories,

  • or performing various different tasks on this sort of model.

  • So convolutional neural networks can be quite powerful and quite popular

  • when it comes to trying to analyze images.

  • We don't strictly need them.

  • We could have just used a vanilla neural network that just operates with layer

  • after layer as we've seen before.

  • But these convolutional neural networks can

  • be quite helpful, in particular, because of the way they

  • model the way a human might look at an image,

  • that instead of a human looking at every single pixel

  • simultaneously and trying to combine all of them by multiplying them together,

  • you might imagine that what convolution is really

  • doing is looking at various different regions of the image

  • and extracting relevant information and features out

  • of those parts of the image the same way that a human might

  • have visual receptors that are looking at particular parts of what they see,

  • and using those, combining them, to figure out

  • what meaning they can draw from all of those various different inputs.

  • And so you might imagine applying this to a situation like handwriting

  • recognition.

  • So we'll go ahead and see an example of that now.

  • I'll go ahead and open up handwriting.py.

  • Again, what we do here is we first import TensorFlow.

  • And then, TensorFlow, it turns out, has a few datasets

  • that are built in-- built into the library

  • that you can just immediately access.

  • And one of the most famous datasets in machine learning

  • is the MNIST dataset, which is just a dataset of a whole bunch of samples

  • of people's handwritten digits.

  • I showed you a slide of that a little while ago.

  • And what we can do is just immediately access that dataset,

  • which is built into the library, so that if I want to do something like train

  • on a whole bunch of digits, I can just use the dataset that is provided to me.

  • Of course, if I had my own dataset of handwritten images,

  • I can apply the same idea.

  • I'd first just need to take those images and turn them into an array of pixels,

  • because that's the way that these are going to be formatted.

  • They're going to be formatted as, effectively,

  • an array of individual pixels.

  • And now there's a bit of reshaping I need to do,

  • just turning the data into a format that I can put

  • into my convolutional neural network.

  • So this is doing things like taking all the values and dividing them by 255.

  • If you remember, these color values tend to range from 0 to 255.

  • So I can divide them by 255, just to put them into a 0-to-1 range,

  • which might be a little bit easier to train on.

  • And then doing various other modifications to the data, just

  • to get it into a nice usable format.

  • But here's the interesting and important part.

  • Here is where I create the convolutional neural network-- the CNN--

  • where here I'm saying, go ahead and use a sequential model.

  • And before I could use model.add to say add a layer, add a layer, add a layer,

  • another way I could define it is just by passing

  • as input to the sequential neural network a list of all of the layers

  • that I want.

  • And so here, the very first layer in my model

  • is a convolutional layer, where I'm first

  • going to apply convolution to my image.

  • I'm going to use 32 different filters, so my model is going to learn

  • 32 different filters that I would

  • like to learn on the input image, where each filter is

  • going to be a three-by-three kernel.

  • So we saw those three-by-three kernels before,

  • where we could multiply each value in a three-by-three grid by its corresponding value

  • in the kernel and add all the results together.

  • So here I'm going to learn 32 different of these three-by-three filters.

  • I can again specify my activation function.

  • And I specify what my input shape is.

  • My input shape in the banknotes case was just 4.

  • I had four inputs.

  • My input shape here is going to be 28, comma, 28, comma, 1, because that's how

  • the MNIST dataset organizes its data for each

  • of these handwritten digits.

  • Each image is a 28-by-28 pixel grid, and each one of those images only

  • has one channel value.

  • These handwritten digits are just black and white,

  • so it's just a single color value representing

  • how much black or how much white.

  • You might imagine that in a color image, if you were doing this sort of thing,

  • you might have three different channels-- a red,

  • a green, and a blue channel, for example.

  • But in the case of just handwriting recognition and recognizing a digit,

  • we're just going to use a single value for shaded-in or not shaded-in,

  • and it might range, but it's just a single color value.

  • And that then is the very first layer of our neural network,

  • a convolutional layer that will take the input

  • and learn a whole bunch of different filters

  • that we can apply to the input to extract meaningful features.

  • The next step is going to be a max-pooling layer, also built

  • right into TensorFlow, where this is going

  • to be a layer that is going to use a pool size of two by two,

  • meaning we're going to look at two-by-two regions inside of the image,

  • and just extract the maximum value.

  • Again, we've seen why this can be helpful.

  • It'll help to reduce the size of our input.

  • Once we've done that, we'll go ahead and flatten all of the units just

  • into a single layer that we can then pass

  • into the rest of the neural network.

  • And now, here's the rest of the whole network.

  • Here, I'm saying, let's add a hidden layer to my neural network with 128

  • units-- so a whole bunch of hidden units inside of the hidden layer--

  • and just to prevent overfitting, I can add a dropout to that-- say,

  • you know what?

  • When you're training, randomly drop out half of the nodes from this hidden layer,

  • just to make sure we don't become over-reliant on any particular node.

  • We begin to really generalize and stop ourselves from overfitting.

  • So TensorFlow allows us, just by adding a single line,

  • to add dropout into our model as well, such that when it's training,

  • it will perform this dropout step in order

  • to help make sure that we don't overfit on this particular data.

  • And then finally, I add an output layer.

  • The output layer is going to have 10 units, one

  • for each category, that I would like to classify digits into,

  • so 0 through 9, 10 different categories.

  • And the activation function I'm going to use here

  • is called the softmax activation function.

  • And in short, what the softmax activation function is going to do

  • is it's going to take the output and turn it

  • into a probability distribution.

  • So ultimately, it's going to tell me, what

  • did we estimate the probability is that this is a 2 versus a 3 versus a 4,

  • and so it will turn it into that probability distribution for me.
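  • A tiny numerical sketch of what softmax does, with made-up output values:

```python
# Softmax: exponentiate, then normalize so the values sum to 1.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

outputs = np.array([2.0, 1.0, 0.1])
print(softmax(outputs))         # roughly [0.66, 0.24, 0.10]
print(softmax(outputs).sum())   # 1.0 -- a probability distribution
```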

  • Next up, I'll go ahead and compile my model

  • and fit it on all of my training data.

  • And then I can evaluate how well the neural network performs.

  • And then I've added to my Python program,

  • if I've provided a command line argument, like the name of a file,

  • I'm going to go ahead and save the model to a file.

  • And so this can be quite useful too.

  • Once you've done the training step, which

  • could take some time--

  • going through the data, running backpropagation with gradient descent,

  • figuring out, all right, how we should adjust

  • the weights of this particular model--

  • you end up calculating values for these weights,

  • calculating values for these filters, and you'd

  • like to remember that information, so you can use it later.

  • And so TensorFlow allows us to just save a model to a file,

  • such that later if we want to use the model we've learned,

  • use the weights that we've learned, to make some sort of new prediction

  • we can just use the model that already exists.

  • So what we're doing here is after we've done all the calculation,

  • we go ahead and save the model to a file, such

  • that we can use it a little bit later.
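  • Putting that all together, a hedged sketch of what a handwriting.py-style script might look like-- the preprocessing details are assumptions, not the original file:

```python
# Load MNIST, normalize and reshape the data, build a CNN (convolution,
# max-pooling, flatten, hidden layer with dropout, softmax output),
# train for 10 epochs, evaluate, and optionally save the model.
import sys
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape((-1, 28, 28, 1)) / 255.0   # scale pixels to 0-1
x_test = x_test.reshape((-1, 28, 28, 1)) / 255.0
y_train = tf.keras.utils.to_categorical(y_train)     # one output category per digit
y_test = tf.keras.utils.to_categorical(y_test)

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10)
model.evaluate(x_test, y_test, verbose=2)

# Save the trained model if a filename was given on the command line
if len(sys.argv) == 2:
    model.save(sys.argv[1])
```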

  • So for example, if I go into digits, I'm going to run handwriting.py.

  • I won't save it this time.

  • We'll just run it and go ahead and see what happens.

  • What will happen is we need to go through the model

  • in order to train on all of these samples of handwritten digits.

  • So the MNIST dataset gives us thousands and thousands

  • of sample handwritten digits in the same format

  • that we can use in order to train.

  • And so now what you're seeing is this training process,

  • and unlike the banknotes case, where there were far,

  • far fewer data points--

  • the data was very, very simple--

  • here, the data is more complex, and this training process takes time.

  • And so this is another one of those cases where

  • when training neural networks, this is why computational power is

  • so important, that oftentimes, you see people wanting

  • to use sophisticated GPUs in order to more efficiently be

  • able to do this sort of neural network training.

  • It also speaks to the reason why more data can be helpful.

  • The more sample data points you have, the better

  • you can begin to do this training.

  • So here we're going through 60,000 different samples

  • of handwritten digits.

  • And I said that we're going to go through them 10 times.

  • So we're going to go through the dataset 10 times, training each time,

  • hopefully improving upon our weights with every time

  • we run through this dataset.

  • And we can see over here on the right what the accuracy is

  • each time we go ahead and run this model, that the first time,

  • it looks like we got an accuracy of about 92% of the digits

  • correct based on this training set.

  • We increased that to 96% or 97%.

  • And every time we run this, we're going to see,

  • hopefully, the accuracy improve, as we continue to try and use

  • that gradient descent, that process of trying to run the algorithm

  • to minimize the loss that we get in order to more accurately predict

  • what the output should be.

  • And what this process is doing is it's learning not only the weights,

  • but it's learning the features to use-- the kernel

  • matrix to use-- when performing that convolution step, because this

  • is a convolutional neural network, where I'm first performing

  • those convolutions, and then doing the more traditional neural network

  • structure.

  • This is going to learn all of those individual steps as well.

  • So here, we see that TensorFlow provides me with some very nice output, telling

  • me about how many seconds are left in each of these training runs,

  • that allows me to see just how well we're doing.

  • So we'll go ahead and see how this network performs.

  • It looks like we've gone through the dataset seven times.

  • We're going through an eighth time now.

  • And at this point, the accuracy is pretty high.

  • We saw we went from 92% up to 97%.

  • Now it looks like 98%.

  • And at this point, it seems like things are starting to level out.

  • There's probably a limit to how accurate we can ultimately

  • be without running the risk of overfitting.

  • Of course, with enough nodes, you could just memorize the inputs and overfit

  • on them.

  • But we'd like to avoid doing that, and dropout will help us with this.

  • But now, we see we're almost done finishing our training step.

  • We're at 55,000.

  • All right.

  • We've finished training, and now it's going

  • to go ahead and test for us on 10,000 samples.

  • And it looks like on the testing set, we were 98.8% accurate.

  • So we ended up doing pretty well, it seems,

  • on this testing set to see how accurately can

  • we predict these handwritten digits.

  • And so what we could do then is actually test it out.

  • I've written a program called recognition.py using PyGame.

  • If you pass it a model that's been trained,

  • and I pre-trained an example model using this input data, what we can do

  • is see whether or not we've been able to train

  • this convolutional neural network to be able to predict handwriting,

  • for example.
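  • Reusing a saved model might look roughly like this; the filename and the drawn-digit array are placeholders, and the PyGame drawing code is omitted:

```python
# Load a previously saved model and classify a single 28x28 drawing.
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("model.h5")       # "model.h5" is a placeholder name
pixels = np.zeros((28, 28))                           # stand-in for the drawn digit, values 0-1
prediction = model.predict(pixels.reshape(1, 28, 28, 1))
print("Classified as:", prediction.argmax())          # index of the highest probability
```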

  • So I can try just like drawing a handwritten digit.

  • I'll go ahead and draw like the number 2, for example.

  • So there's my number 2.

  • Again, this is messy.

  • If you tried to imagine how would you write a program with just like ifs

  • and thens to be able to do this sort of calculation,

  • it would be tricky to do so.

  • But here, I'll press Classify, and all right.

  • It seems it was able to correctly classify that what I drew

  • was the number 2.

  • We'll go ahead and reset it.

  • Try it again.

  • We'll draw like an 8, for example.

  • So here is an 8.

  • I'll press Classify.

  • And all right.

  • It predicts that the digit that I drew was an 8.

  • And the key here is this really begins to show

  • the power of what the neural network is doing, somehow looking

  • at various different features of these different pixels,

  • figuring out what the relevant features are,

  • and figuring out how to combine them to get a classification.

  • And this would be a difficult task to provide explicit instructions

  • to the computer on how to do, like to use a whole bunch of if-thens

  • to process all of these pixel values to figure out

  • what the handwritten digit is, like everyone is going to draw

  • their 8 a little bit differently.

  • If I drew the 8 again, it would look a little bit different.

  • And yet ideally, we want to train a network to be robust

  • enough so that it begins to learn these patterns on its own.

  • All I said was, here is the structure of the network,

  • and here is the data on which to train the network,

  • and the network learning algorithm just tries

  • to figure out what is the optimal set of weights,

  • what is the optimal set of filters to use,

  • in order to be able to accurately classify

  • a digit into one category or another.

  • That's going to show the power of these convolutional neural networks.

  • And so that then was a look at how we can use convolutional neural networks

  • to begin to solve problems with regards to computer vision, the ability to take

  • an image and begin to analyze it.

  • And so this is the type of analysis you might

  • imagine that's happening in self-driving cars that

  • are able to figure out what filters to apply to an image to understand what it

  • is that the computer is looking at, or the same type of idea that

  • might be applied to facial recognition and social media

  • to be able to determine how to recognize faces in an image as well.

  • You can imagine a neural network that, instead of classifying

  • into one of 10 different digits, could instead classify like, is this person A

  • or is this person B, trying to tell those people apart just based

  • on convolution.

  • And so now what we'll take a look at is yet another type of neural network

  • that can be quite popular for certain types of tasks.

  • But to do so, we'll try to generalize and think about our neural network

  • a little bit more abstractly, that here we have a sample deep neural network,

  • where we have this input layer, a whole bunch of different hidden layers

  • that are performing certain types of calculations,

  • and then an output layer here that just generates some sort of output

  • that we care about calculating.

  • But we could imagine representing this a little more simply, like this.

  • Here is just a more abstract representation of our neural network.

  • We have some input.

  • That might be like a vector of a whole bunch of different values as our input.

  • That gets passed into a network to perform

  • some sort of calculation or computation, and that network

  • produces some sort of output.

  • That output might be a single value.

  • It might be a whole bunch of different values.

  • But this is the general structure of the neural network that we've seen.

  • There is some sort of input that gets fed into the network,

  • and using that input, the network calculates what the output should be.

  • And this sort of model for a neural network

  • is what we might call a feed-forward neural network.

  • Feed-forward neural networks have connections only in one direction;

  • they move from one layer to the next layer to the layer

  • after that, such that the inputs pass through various different hidden layers

  • and then ultimately produce some sort of output.

  • So feed-forward neural networks are very helpful for solving

  • these types of classification problems that we saw before.

  • We have a whole bunch of input.

  • We want to learn what setting of weights will allow

  • us to calculate the output effectively.

  • But there are some limitations on feed-forward neural networks

  • that we'll see in a moment.

  • In particular, the input needs to be of a fixed shape,

  • like a fixed number of neurons are in the input layer,

  • and there's a fixed shape for the output,

  • like a fixed number of neurons in the output layer,

  • and that has some limitations of its own.

  • And a possible solution to this--

  • and we'll see examples of the types of problems we

  • can solve with this in just a second--

  • is instead of just a feed-forward neural network where there are only

  • connections in one direction, from left to right effectively,

  • across the network, we can also imagine a recurrent neural network,

  • where a recurrent neural network generates

  • output that gets fed back into itself as input for future runs of that network.

  • So whereas in a traditional neural network,

  • we have inputs that get fed into the network that get fed into the output,

  • and the only thing that determines the output is based on the original input

  • and based on the calculation we do inside of the network itself,

  • this goes in contrast with a recurrent neural network,

  • where in a recurrent neural network, you can imagine output

  • from the network feeding back to itself into the network

  • again as input for the next time that you do the calculations

  • inside of the network.

  • What this allows is it allows the network to maintain some sort of state,

  • to store some sort of information that can

  • be used on future runs of the network.

  • Previously, the network just defined some weights,

  • and we passed inputs through the network, and it generated outputs,

  • but the network wasn't saving any information based on those inputs

  • to be able to remember for future iterations or for future runs.

  • What a recurrent neural network will let us do

  • is let the network store information that

  • gets passed back in as input to the network again the next time we try

  • and perform some sort of action.

  • And this is particularly helpful when dealing with sequences of data.

  • So we'll see a real-world example of this right now actually.

  • Microsoft has developed an AI known as the CaptionBot,

  • and what the CaptionBot does is it says, I

  • can understand the content of any photograph,

  • and I'll try to describe it as well as any human.

  • I'll analyze your photo, but I won't store it or share it.

  • And so what Microsoft CaptionBot seems to be claiming to do

  • is it can take an image and figure out what's in the image

  • and just give us a caption to describe it.

  • So let's try it out.

  • Here, for example, is an image of Harvard Square

  • and some people walking in front of one of the buildings at Harvard Square.

  • I'll go ahead and take the URL for that image,

  • and I'll paste it into CaptionBot, then just press Go.

  • So CaptionBot is analyzing the image, and then it says,

  • I think it's a group of people walking in front

  • of a building, which seems amazing.

  • The AI is able to look at this image and figure out what's in the image.

  • And the important thing to recognize here

  • is that this is no longer just a classification task.

  • We saw being able to classify images with a convolutional neural network,

  • where the job was to take the images and then figure out, is it a 0, or a 1,

  • or a 2; or is that this person's face or that person's face?

  • What seems to be happening here is the input is an image,

  • and we know how to get networks to take input of images,

  • but the output is text.

  • It's a sentence.

  • It's a phrase, like "a group of people walking in front of a building."

  • And this would seem to pose a challenge for our more traditional

  • feed-forward neural networks, for the reason being

  • that in traditional neural networks, we just

  • have a fixed-size input and a fixed-size output.

  • There are a certain number of neurons in the input to our neural network

  • and a certain number of outputs for our neural network,

  • and then some calculation that goes on in between.

  • But the size of the inputs--

  • the number of values in the input and the number of values in the output--

  • those are always going to be fixed based on the structure of the neural network,

  • and that makes it difficult to imagine how a neural network can

  • take an image like this and say, you know,

  • it's a group of people walking in front of the building,

  • because the output is text.

  • It's a sequence of words.

  • Now it might be possible for a neural network to output one word.

  • One word you could represent as a vector of values,

  • and you can imagine ways of doing that.

  • And next time, we'll talk a little bit more about AI

  • as it relates to language and language processing.

  • But a sequence of words is much more challenging,

  • because depending on the image, you might

  • imagine the output is a different number of words.

  • We could have sequences of different lengths,

  • and somehow we still want to be able to generate the appropriate output.

  • And so the strategy here is to use a recurrent neural network,

  • a neural network that can feed its own output back into itself

  • as input for the next time.

  • And this allows us to do what we call a one-to-many relationship for inputs

  • to outputs, that in vanilla, more traditional neural networks--

  • these are what we consider to be one-to-one neural networks--

  • you pass in one set of values as input, you get one vector of values

  • as the output--

  • but in this case, we want to pass in one value as input--

  • the image-- and we want to get a sequence-- many values--

  • as output, where each value is like one of these words that gets produced

  • by this particular algorithm.

  • And so the way we might do this is we might imagine starting

  • by providing input the image into our neural network,

  • and the neural network is going to generate output,

  • but the output is not going to be the whole sequence of words,

  • because we can't represent the whole sequence of words

  • using just a fixed set of neurons.

  • Instead, the output is just going to be the first word.

  • We're going to train the network to output

  • what the first word of the caption should be.

  • And you could imagine that Microsoft has trained

  • it to do this by running a whole bunch of training samples through the AI,

  • giving it a whole bunch of pictures and what the appropriate caption was,

  • and having the AI begin to learn from that.

  • But now, because the network generates output

  • that can be fed back into itself, you can

  • imagine the output of the network being fed back into the same network--

  • this here looks like a separate network, but it's really the same network that's

  • just getting different input--

  • that this network's output gets fed back into itself,

  • but it's going to generate another output,

  • and that other output is going to be like the second word in the caption.

  • And this recurrent neural network then, this network

  • is going to generate other output that can be fed back

  • into itself to generate yet another word, fed back

  • into itself to generate another word.

  • And so recurrent neural networks allow us to represent

  • this sort of one-to-many structure.

  • You provide one image as input, and the neural network

  • can pass data into the next run of the network,

  • and then again and again, such that you could run the network multiple times,

  • each time generating a different output, still based on that original input.

  • And this is where recurrent neural networks

  • become particularly useful when dealing with sequences of inputs or outputs.

  • My output is a sequence of words, and since I can't very easily

  • represent outputting an entire sequence of words,

  • I'll instead output that sequence one word at a time,

  • by allowing my network to pass information

  • about what still needs to be said about the photo

  • into the next stage of running the networks.

  • So you could run the network multiple times--

  • the same network with the same weights--

  • just getting different input each time, first getting input from the image,

  • and then getting input from the network itself,

  • as additional information about what still

  • needs to be said in a particular caption, for example.

  • So this then is a one-to-many relationship

  • inside of a recurrent neural network.
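
As a rough sketch of what that one-to-many loop might look like in code: encode_image and decoder_step below are hypothetical stand-ins for a trained image encoder and a trained recurrent decoder, not real library functions.

```python
# Sketch of the one-to-many pattern: one image in, a sequence of words out.
# encode_image and decoder_step are hypothetical placeholders for a trained
# image encoder and a trained recurrent decoder; they are not real library calls.
def generate_caption(image, max_words=20):
    state = encode_image(image)          # the single input: the image, encoded as an initial state
    word = "<start>"                     # a special token that kicks off generation
    caption = []
    for _ in range(max_words):           # many outputs: one word per pass through the same network
        word, state = decoder_step(word, state)   # previous output fed back in as the next input
        if word == "<end>":              # the network signals when the caption is complete
            break
        caption.append(word)
    return " ".join(caption)
```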

  • But it turns out there are other models that we

  • can use-- other ways we can try and use recurrent neural networks-- to be

  • able to represent data that might be stored in other forms as well.

  • We saw how we could use neural networks in order to analyze images,

  • in the context of convolutional neural networks that take an image,

  • figure out various different properties of the image,

  • and are able to draw some sort of conclusion based on that.

  • But you might imagine that something like YouTube,

  • they need to be able to do a lot of learning based on video.

  • They need to look through videos to detect

  • if there are copyright violations, or they

  • need to be able to look through videos to maybe identify

  • what particular items are inside of the video, for example.

  • And video, you might imagine, is much more difficult

  • to pass as input to a neural network, because whereas an image

  • you can just treat each pixel as a different value, videos are sequences.

  • They're sequences of images, and each sequence might be a different length,

  • and so it might be challenging to represent

  • that entire video as a single vector of values

  • that you could pass in to a neural network.

  • And so here too, recurrent neural networks

  • can be a valuable solution for trying to solve this type of problem.

  • Then instead of just passing in a single input into our neural network,

  • we could pass in the input one frame at a time, you might imagine,

  • first taking the first frame of the video, passing it into the network,

  • and then maybe not having the network output anything at all yet.

  • Let it take in another input, and this time, pass it into the network,

  • but the network gets information from the last time

  • we provided an input into the network.

  • Then we pass in a third input and then a fourth input,

  • where each time, the network gets the most recent input,

  • like each frame of the video, but it also

  • gets information the network processed from all of the previous iterations.

  • So on frame number four, you end up getting

  • the input for frame number four, plus information the network has

  • calculated from the first three frames.

  • And using all of that data combined, this recurrent neural network

  • can begin to learn how to extract patterns from a sequence of data

  • as well.

  • And so you might imagine you want to classify

  • a video into a number of different genres,

  • like an educational video, or a music video, or other types of videos.

  • That's a classification task, where you want

  • to take as input each of the frames of the video,

  • and you want to output something like what genre it is

  • and what category it happens to belong to.

  • And you can imagine doing this sort of thing--

  • this sort of many-to-one learning--

  • anytime your input is a sequence.

  • And so the input is a sequence in the context of a video.

  • It could also be in the context of language: if someone has typed a message,

  • and you want to be able to categorize that message,

  • like if you're trying to take a movie review

  • and trying to classify it as is it a positive review or a negative review.

  • That input is a sequence of words, and the output

  • is a classification-- positive or negative.

  • There too, a recurrent neural network might

  • be helpful for analyzing sequences of words,

  • and they're quite popular when it comes to dealing with language.

  • It could even be used for spoken language

  • as well, that spoken language is an audio waveform that

  • can be segmented into distinct chunks, and each of those

  • can be passed in as an input into a recurrent neural network

  • to be able to classify someone's voice, for instance,

  • if you want to do voice recognition, to say is this one person

  • or is this another?

  • These are also cases where you might want this many-to-one architecture

  • for a recurrent neural network.
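
One plausible way to write this many-to-one pattern, assuming the tf.keras library and an arbitrary vocabulary size, is a small model like this: a sequence of word indices goes in, a recurrent layer reads it one word at a time, and a single positive-or-negative score comes out.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Many-to-one: a sequence of word indices in, one positive/negative score out.
model = models.Sequential([
    layers.Embedding(input_dim=10000, output_dim=64),  # 10,000-word vocabulary, chosen arbitrarily
    layers.SimpleRNN(64),                               # reads the sequence; keeps only its final state
    layers.Dense(1, activation="sigmoid"),              # single output: probability the review is positive
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```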

  • And then as one final problem, just to take a look

  • at, in terms of what we can do with these sorts of networks,

  • imagine what Google Translate is doing.

  • So what Google Translate is doing is it's taking some text written in one

  • language and converting it into text written in some other language,

  • for example, where now this input is a sequence of data--

  • it's a sequence of words--

  • and the output is a sequence of words as well.

  • It's also a sequence.

  • So here, we want effectively like a many-to-many relationship.

  • Our input is a sequence, and our output is a sequence as well.

  • And it's not quite going to work to just say, take each word in the input

  • and translate it into a word in the output,

  • because ultimately, different languages put their words in different orders,

  • and maybe one language uses two words for something,

  • whereas another language only uses one.

  • So we really want some way to take this information-- that's input--

  • encode it somehow, and use that encoding to generate what the output ultimately

  • should be.

  • And this has been one of the big advancements

  • in automated translation technology: the ability

  • to use neural networks to do this, instead of older, more traditional methods,

  • and this has improved accuracy dramatically.

  • And the way you might imagine doing this is, again,

  • using a recurrent neural network with multiple inputs and multiple outputs.

  • We start by passing in all the input.

  • Input goes into the network.

  • Another input, like another word, goes into the network,

  • and we do this multiple times, like once for each word in the input

  • that I'm trying to translate.

  • And only after all of that is done, does the network now

  • start to generate output, like the first word of the translated sentence,

  • and the next word of the translated sentence, so on and so forth,

  • where each time the network passes information

  • to itself by allowing for this model of giving some sort of state

  • from one run in the network to the next run,

  • assembling information about all the inputs,

  • and then passing along information about which part of the output

  • it needs to generate next.

  • And there are a number of different types of these sorts

  • of recurrent neural networks.

  • One of the most popular is known as the long short-term memory neural

  • network, otherwise known as LSTM.
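
To tie the translation example to the LSTM just mentioned, here is a minimal encoder-decoder sketch, again assuming the tf.keras library, with arbitrary vocabulary sizes and dimensions: one LSTM reads the whole source sentence and hands its final state to a second LSTM, which then produces the translated sentence one word at a time.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_in, vocab_out, units = 5000, 5000, 128          # arbitrary sizes, for illustration only

# Encoder: read the whole source sentence and keep only the final state.
enc_inputs = layers.Input(shape=(None,))
enc_embedded = layers.Embedding(vocab_in, units)(enc_inputs)
_, state_h, state_c = layers.LSTM(units, return_state=True)(enc_embedded)

# Decoder: start from the encoder's state and generate the target sentence,
# producing one word of output for each step of its own input.
dec_inputs = layers.Input(shape=(None,))
dec_embedded = layers.Embedding(vocab_out, units)(dec_inputs)
dec_outputs, _, _ = layers.LSTM(units, return_sequences=True, return_state=True)(
    dec_embedded, initial_state=[state_h, state_c])
predictions = layers.Dense(vocab_out, activation="softmax")(dec_outputs)

model = Model([enc_inputs, dec_inputs], predictions)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

During training, a model like this is typically shown the correct previous word at each decoder step; at translation time, the decoder's own predicted word is fed back in, just as described above.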

  • But in general, these types of networks can be very, very powerful

  • whenever we're dealing with sequences, whether those

  • are sequences of images or especially sequences of words when it comes

  • to dealing with natural language.

  • So those then were just some of the different types of neural networks

  • that can be used to do all sorts of different computations,

  • and these are incredibly versatile tools that

  • can be applied to a number of different domains.

  • We only looked at a couple of the most popular types of neural networks--

  • the more traditional feed-forward neural networks,

  • convolutional neural networks, and recurrent neural networks.

  • But there are other types as well.

  • There are adversarial networks, where networks compete with each other

  • to try and be able to generate new types of data,

  • as well as other networks that can solve other tasks based on what they happen

  • to be structured and adapted for.

  • And these are very powerful tools in machine learning,

  • able to very easily learn based on some set of input data

  • and to therefore figure out how to calculate

  • some function from inputs to outputs.

  • Whether it's mapping input to some sort of classification, like analyzing an image

  • and getting a digit, or machine translation where

  • the input is in one language and the output is in another,

  • these tools have a lot of applications for machine learning more generally.

  • Next time, we'll look at machine learning and AI

  • in particular in the context of natural language.

  • We talked a little bit about this today, but looking

  • at how it is that our AI can begin to understand natural language

  • and can begin to be able to analyze and do useful tasks with

  • regards to human language, which turns out

  • to be a challenging and interesting task.

  • So we'll see you next time.

[MUSIC PLAYING]
