
  • [MUSIC PLAYING]

  • MARTIN GORNER: Hello.

  • Hi, everyone.

  • So thank you for coming in such great numbers

  • to this TensorFlow session.

  • Apologies, it's quite late in the afternoon.

  • I will need all your brains for this session because today,

  • I want with you to build a neural network.

  • So no, I don't need your brains to build on, no brain

  • surgery in this session.

  • But it's a crash course to get developers up

  • to speed on machine learning and deep learning and neural

  • networks.

  • So I need all your attention.

  • The dataset we will be using is a very classical one.

  • It's this one here, hand-written digits.

  • Academia has been working on this dataset for the past 20

  • years.

  • So you should go to the website where it's hosted.

  • You will actually see 20 years of research papers

  • and that's what we will do together today.

  • We'll go on this dataset trying to build a network that

  • recognizes these hand-written digits from the simplest

  • possible network all the way to 99% accuracy.

  • So let's start.

  • Just a question, beforehand.

  • Who has done some work with neural networks before?

  • Oh, wow.

  • OK.

  • Quite a few people.

  • So feel free to help me and I hope this will not

  • be too basic for you and I hope it

  • will at least be a good introduction to TensorFlow.

  • But if you have never done anything with neural networks,

  • that's fine and I will explain everything from the start.

  • So this is the simplest possible neural network

  • we can imagine to recognize our hand-written digits.

  • So the digits, they come as 28 by 28 pixel images

  • and the first thing we do is that we flatten

  • all those pixels into one big vector of pixels

  • and these will be our inputs.

  • Now, we will use exactly 10 neurons.

  • The neurons are the white circles.

  • What a neuron does is always the same thing.

  • A neuron does a weighted sum of all of its inputs,

  • here the pixels.

  • It adds another constant that is called a bias.

  • That's just an additional degree of freedom.

  • And then it will feed this sum through an activation function.

  • And that is just a function-- number in, transform,

  • number out.

  • We will see several of those activation functions

  • and the one thing they have in common in neural networks

  • is that they are non-linear.

  • So why 10 neurons?

  • Well, simply because we are classifying those digits

  • in 10 categories.

  • We are trying to recognize a zero, a one, a two,

  • on to the nine.

  • So what we are hoping for here is that one of those neurons

  • will light up and tell us, with a very strong output,

  • that I have recognized here an eight.

  • All right.

  • And for that, since this is a classification problem,

  • we are going to use a very specific activation

  • function, one that, well, researchers

  • tell us works really well on classification problems.

  • It's called softmax and it's simply

  • an exponential normalized.

  • So what you do is that you make all those weighted sums,

  • then you elevate that to the exponential.

  • And once you have your 10 exponentials,

  • you compute the norm of this vector

  • and divide by that norm so that you get

  • values between zero and one.

  • And those values, you will be able to interpret them

  • as probabilities, probabilities of this being an eight, a one,

  • or something else.

  • You will be asking which norm?

  • Any norm, doesn't matter--

  • the length of the vector.

  • You pick your favorite norm.

  • There are several.

  • Usually, for softmax, we use L1, but L2

  • which is the Euclidean norm, would work just as well.

  • So what does softmax do actually?

  • You see, it's an exponential so it's a very steeply increasing

  • function.

  • It will pull the data apart, increase the differences,

  • and when you divide all of that, when you normalize

  • the whole vector, you usually end up with one of the values

  • being very close to one and all the other values

  • being very close to zero.

  • So it's a way of pulling the winner out on top

  • without actually destroying the information.
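
As a rough illustration of what the speaker describes (exponentiate the weighted sums, then normalize), here is a minimal NumPy sketch; the input values are made up:

    import numpy as np

    def softmax(logits):
        # exponentiate the weighted sums, then normalize (L1 norm of the exponentials)
        exps = np.exp(logits)
        return exps / np.sum(exps)

    print(softmax(np.array([2.0, 1.0, 0.1])))  # the largest value is pulled out on top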

  • All right.

  • So now we need to formalize this using a matrix multiply.

  • I will remind you of what a matrix multiply is,

  • but we will do it not for one image, we

  • are going to do this for a batch of 100 images at a time.

  • So what we have here in my matrix

  • is 100 images, one image per line.

  • The images are flattened, all the pixels on one line.

  • So I take my matrix of weights, for the time being,

  • I don't know what these weights are,

  • it's just weights so I'm doing weighted sums.

  • And I start the matrix multiplication.

  • So I do a weighted sum of all the pixels of the first image.

  • Here it is.

  • And then if I continue this matrix multiply

  • using the second column of weights,

  • I get a weighted sum of all the pixels

  • of the first image for the second neuron and then

  • for the third neuron and the fourth and so on.

  • What is left is to add the biases,

  • just an additional constant.

  • Again, we don't know what it is for the time being.

  • And there is one bias per neuron,

  • that's why we have 10 biases.

  • And now if I continue this matrix multiply,

  • I'm going to obtain these weighted sums

  • for the second image, and the third image,

  • and so on, until I have processed all my images.

  • I would like to write this as a simple formula there.

  • You see there is a problem, x times w,

  • you know that's a matrix of 10 columns by 100 images,

  • and I have only 10 biases.

  • I can't simply add them together.

  • Well, never mind.

  • We will redefine addition and it's OK

  • if everybody accepts it.

  • And actually, people have already accepted it.

  • It's called a broadcasting add and that's

  • the way you do additions in NumPy,

  • for instance, which is the numerical library for Python.

  • The way a broadcasting add works is

  • that if you're trying to add two things which don't match, not

  • the same dimensions, you can't do the addition,

  • you try to replicate the small one as much

  • as needed to make the sizes match

  • and then you do the addition.

  • That's exactly what we need to do here.

  • We have only those 10 biases.

  • So it's the same biases on all the lines.

  • We just need to replicate this bias vector on all the lines,

  • and that's exactly what this generalized broadcasting

  • add does.

  • So we will just write it as a plus.

  • And this is where I wanted to get to.

  • I want you to remember this as the formula describing

  • one layer in a neural network.

  • So let's go through this again.

  • In x, we have a batch of images, 100 images,

  • all the pixels on one line.

  • In w, we have all of our weights for the 10 neurons,

  • all the weights in the system.

  • x times w, those are all our weighted sums.

  • We add the biases, and then we feed this

  • through our activation function, in this case softmax, the way

  • it works is line by line.

  • Line by line, we take the 10 values,

  • elevate them to the exponential, normalize the line.

  • Next line, 10 values, elevate them to the exponential,

  • normalize the line, and so on.

  • So what we get in the output is, for each image, 10 values

  • which look like probabilities and which are our predictions.
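
As a hedged NumPy sketch of the shapes involved (the weight values here are random stand-ins, not trained ones):

    import numpy as np

    X = np.random.rand(100, 784)           # batch of 100 images, 784 pixels each, flattened
    W = np.random.randn(784, 10) * 0.01    # one column of weights per neuron
    b = np.zeros(10)                       # one bias per neuron, broadcast over the 100 rows

    logits = X @ W + b                     # broadcasting add: result has shape (100, 10)
    exps = np.exp(logits)
    Y = exps / exps.sum(axis=1, keepdims=True)   # softmax applied line by line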

  • So, of course, we still don't know

  • what those weights and biases are

  • and that's where the trick is in neural networks.

  • We are going to train this neural network

  • to actually figure out the correct weights

  • and biases by itself.

  • Well, this is how we write this in TensorFlow.

  • You see, not very different.

  • OK.

  • TensorFlow has this tf.nn library for neural networks

  • which has all sorts of very useful functions

  • for neural networks, for example, softmax and so on.

  • So let's go train.

  • When you train, you've got images,

  • but you know what those images are.

  • So your network, you initialize your weights and biases

  • at random value and your network will output some probability.

  • Since you know what this image is,

  • you can tell it that it's not this, it should be that.

  • So that is called a one-hot encoded vector.

  • It's a not very fancy way of encoding numbers.

  • Basically, here are our numbers from zero to nine.

  • We encode them as 10 bits, all at zero and just one of them

  • is a one at the index of the number we want to encode.

  • Here, a six.

  • Why?

  • Well, because then, it's in the same shape as our predictions

  • and we can compute a distance between those two.

  • So again, many ways of computing distances.

  • The Euclidean distance, the usual distance, the sum

  • of squared differences, would work, not a problem.

  • But scientists tell us that for classification problems,

  • this distance, the cross entropy, works slightly better.

  • So we'll use this one.

  • How does it work?

  • It's the sum across the vectors of the values

  • on the top multiplied by the logarithms of the values

  • on the bottom, and then we add a minus sign

  • because all the values on the bottom are less than one,

  • so all the logarithms are negative.

  • So that's the distance.
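
A small worked example of that formula, with a made-up prediction vector for an image of a six:

    import numpy as np

    label      = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0])  # one-hot encoding of a six
    prediction = np.array([.02, .01, .01, .02, .05, .02, .80, .03, .02, .02])  # made-up softmax output

    cross_entropy = -np.sum(label * np.log(prediction))  # small when the probability of "six" is high
    print(cross_entropy)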

  • And of course, we will tell the system

  • to minimize the distance between what it thinks is the truth

  • and what we know to be true.

  • So this we will call our error function

  • and the training will be guided by an effort

  • to minimize the error function.

  • So let's see how this works in practice.

  • So in this little visualization, I'm

  • showing you over there, my training images.

  • You see it's training, so you see these batches of 100 training

  • images being fed into the system.

  • On the white background, you have the images

  • that have been already correctly recognized by the system.

  • On a red background, images that are still missed.

  • So then, on the middle graph, you

  • see our error function, computed both on the training dataset

  • and we also kept aside a set of images which we have never seen

  • during training for testing.

  • Of course, if you want to test the real world

  • performance of your neural network,

  • you have to do this on a set of images which you have never

  • seen during training.

  • So here we have 60,000 training images

  • and I set aside 10,000 test images which you see

  • in the bottom graph over there.

  • They are a bit small.

  • You see only 1,000 of them here.

  • So imagine, there are nine more screens of pictures like that.

  • But I sorted all the badly recognized ones at the top.

  • So you see all the ones that have been badly recognized

  • and below are nine screens of correctly recognized images,

  • here after 2,000 rounds of training.

  • So there is a little scale on the side here.

  • It shows you that it's already capable of recognizing

  • 92% of our images with this very simple model, just 10 neurons,

  • nothing else.

  • And that's what you get on the top graph, the accuracy graph,

  • as well.

  • That's simply the percentage of correctly recognized images,

  • both on test and training data.

  • So what else do we have?

  • We have our weights and biases, those two diagrams are simply

  • percentiles, so it shows you the spread

  • of all the weights and biases.

  • And that's just useful to see that they are moving.

  • They both started at zero and they took some values

  • for the weights between one and minus one

  • and for the biases between two and minus two.

  • It's helpful to keep an eye on those diagrams

  • and see that we are not diverging completely.

  • So that's the training algorithm.

  • You give it training images, it gives you a prediction,

  • you compute the distance between the prediction

  • and what you know to be true.

  • You use that distance as an error

  • function to guide a mechanism that will drive the error down

  • by modifying weights and biases.

  • So now let's write this in TensorFlow.

  • And I'll get more explicit about exactly how

  • this training works.

  • So we need to write this in TensorFlow.

  • The first thing you do in TensorFlow

  • is define variables and placeholders.

  • A variable is a degree of freedom

  • of our system, something we are asking TensorFlow to compute

  • for us through training.

  • So in our case, those are our weights and biases.

  • And we will need to feed in training data.

  • So for this data that will be fed in at training time,

  • we define a placeholder.

  • You see here x is a placeholder for our training images.

  • Let's look at the shape in brackets.

  • What you have is the shape of this multidimensional matrix,

  • which we call a tensor.

  • So the first dimension is none.

  • It says I don't know yet so this will be the number of images

  • in a batch.

  • This will be determined at training time.

  • If we give 100 images, this will be 100.

  • Then 28 by 28 is the size of our images

  • and one is the number of values per pixel.

  • So that's not useful at all because we

  • are handling grayscale images.

  • I just put it there.

  • In case you wanted to handle color images, that would

  • be three values per pixel.
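
In TensorFlow 1.x code (the API style used in the talk), the placeholder and the variables described here would look roughly like this:

    import tensorflow as tf  # TensorFlow 1.x API, as used in the talk

    # placeholder for a batch of grayscale images: batch size unknown yet (None),
    # 28 by 28 pixels, 1 value per pixel
    X = tf.placeholder(tf.float32, [None, 28, 28, 1])

    # variables: the degrees of freedom TensorFlow will compute through training
    W = tf.Variable(tf.zeros([784, 10]))   # 784 pixels times 10 neurons
    b = tf.Variable(tf.zeros([10]))        # one bias per neuron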

  • So OK.

  • We have our placeholders, we have our variables,

  • now we are ready to write our model.

  • So that line you see on the top is our model.

  • It's what we have determined to be the line representing

  • one layer of a neural network.

  • The only change is that reshape operation.

  • You remember, our images, they come in as 28

  • by 28 pixel images and we want to flatten them

  • as one big vector of pixels.

  • So that's what reshape does.

  • 784 is 28 by 28.

  • It's all the pixels in one line.
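
Continuing that sketch (X, W and b as defined above), the reshape and the one-layer model come out as:

    XX = tf.reshape(X, [-1, 784])              # flatten each 28x28 image into one vector of 784 pixels
    Y = tf.nn.softmax(tf.matmul(XX, W) + b)    # weighted sums plus bias, fed through softmax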

  • All right.

  • I need a second placeholder for the known answers,

  • the labels of my training images,

  • labels like this is a one, this is a zero, this is a seven,

  • this is a five.

  • And now that I have my predictions and my known

  • labels, I'm ready to compute my error function, which

  • is the cross entropy using the formula we've seen before.

  • So the sum across the vector of the elements of the labels

  • multiplied by elements of the logarithm of the predictions.
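
The second placeholder and the cross entropy, still in the same hedged TensorFlow 1.x sketch:

    Y_ = tf.placeholder(tf.float32, [None, 10])      # known labels, one-hot encoded
    cross_entropy = -tf.reduce_sum(Y_ * tf.log(Y))   # minus the sum of label * log(prediction)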

  • So now I have my error function.

  • What do I do with it?

  • What you have on the bottom, I won't go into that.

  • That is simply the computation of the percentage

  • of correctly recognized images.

  • You can skip that.

  • OK.

  • Now we get to the actual heart of what

  • TensorFlow will do for you.

  • So we have our error function.

  • We pick an optimizer.

  • There is a full library of them.

  • They have different characteristics.

  • And we ask the optimizer to minimize our error function.
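
For example, with plain gradient descent (the optimizer choice and the 0.003 learning rate are illustrative, not values stated in the talk):

    optimizer = tf.train.GradientDescentOptimizer(0.003)   # 0.003 is the learning rate, see below
    train_step = optimizer.minimize(cross_entropy)          # asks TensorFlow to drive the error down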

  • So what is this going to do?

  • When you do this, TensorFlow takes your error function

  • and computes the partial derivatives of that error

  • function relatively to all the weights and all the biases

  • in the system.

  • That's a big vector because there are lots

  • of weights and lots of biases.

  • How many?

  • w, the weights, is a variable of almost 8,000 values (784 pixels times 10 neurons).

  • So this vector we get mathematically

  • is called a gradient.

  • And the gradient has one nice property.

  • Who knows what is the nice property of the gradient?

  • It points-- Yeah.

  • Almost.

  • It points up; we add a minus sign, it points down, exactly.

  • Down in which space?

  • We are in the space of all the weights and all the biases

  • and the function we are computing

  • is our error function.

  • So when we say down in this space,

  • it means it gives us a direction in the space of weights

  • and biases into which to go to modify our weights

  • and biases in order to make our error function smaller.

  • So that is the training.

  • You compute this gradient and it gives you an arrow.

  • You take a little step along this arrow.

  • Well, you are in the space of weights and biases,

  • so taking a little step means you modify your weights

  • and biases by this little delta, and you get into a location

  • where the error is now smaller.

  • Well, that's fantastic.

  • That's exactly what you want.

  • Then you repeat this using a second batch

  • of training images.

  • And again, using a third batch of training images, and so on.

  • So it's called gradient descent because you follow

  • the gradient to head down.
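
A toy, self-contained illustration of that idea, minimizing a one-variable error function by repeatedly stepping against its gradient:

    # minimize f(w) = (w - 3)^2; its gradient is f'(w) = 2 * (w - 3)
    w, learning_rate = 0.0, 0.1
    for step in range(100):
        grad = 2 * (w - 3)
        w -= learning_rate * grad    # a small step "downhill"
    print(w)                         # close to 3, where the error is smallest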

  • And so we are ready to write our training loop.

  • There is one more thing I need to explain

  • to you about TensorFlow.

  • TensorFlow has a deferred execution model.

  • So everything we wrote up to now,

  • all the tf dot something here commands,

  • does not actually-- when that is executed,

  • it doesn't produce values.

  • It builds a graph, a computation graph, in memory.

  • Why is that important?

  • Well, first of all, this derivation trick here,

  • the computation of the gradient, that

  • is actually a formal derivation.

  • TensorFlow takes the formula that you

  • give it to define your error function

  • and does a formal derivation on it.

  • So it needs to know the full graph of how you computed this

  • to do this formal derivation.

  • And the second thing it will use this graph for

  • is that TensorFlow is built for distributed computing.

  • And there, as well, to distribute a graph

  • on multiple machines, it helps to know what the graph is.

  • OK.

  • So this is all very useful, but it means for us

  • that we have to go through an additional loop

  • to actually get values from our computations.

  • The way you do this in TensorFlow

  • is that you define a session and then in the session,

  • you call sess.run on one edge of your computation graph.

  • That will give you actual values,

  • but of course, for this to work, you

  • have to fill in all the placeholders

  • that you have defined now with real values.

  • So for this to work, I will need to fill in the training images

  • and the training labels for which

  • I have defined placeholders.

  • And the syntax is simply the train_data dictionary there.

  • You see the keys of the dictionary, x and y underscore,

  • are the placeholders that I have defined.

  • And then I can sess.run on my training step.

  • I pass in this training data and that is

  • where the actual magic happens.
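
A minimal sketch of that session call, continuing the earlier code (X, Y_ and train_step as defined above); the batch arrays here are zero-filled stand-ins just to show the syntax:

    import numpy as np

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())   # variables must be initialized before use

    # stand-in batch; in the real program these come from the MNIST loader
    batch_X = np.zeros((100, 28, 28, 1), dtype=np.float32)
    batch_Y = np.zeros((100, 10), dtype=np.float32)

    train_data = {X: batch_X, Y_: batch_Y}         # keys are the placeholders defined earlier
    sess.run(train_step, feed_dict=train_data)     # this runs one training step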

  • Just a reminder, what is this training step?

  • Well it's what you got when you asked the optimizer to minimize

  • your error function.

  • So the training step, when executed,

  • is actually what computes this gradient using

  • the current batch of images, training images and labels,

  • and follows it a little to modify the weights and biases

  • and end up with better weights and biases.

  • I said a little.

  • I come back to this.

  • What is that learning rate over there?

  • Well, I can't make a big step along the gradient.

  • Why not?

  • Imagine you're in the mountains, you know where down is.

  • We have senses for that.

  • We don't need to derive anything.

  • We know where down is.

  • And you want to reach the bottom of the valley.

  • Now, if every step you make is a 10 mile step,

  • you will probably be jumping from one side of the valley

  • to the other without ever reaching the bottom.

  • So if you want to reach the bottom,

  • even if you know where a down is,

  • you have to make small steps in that direction,

  • and then you will reach the bottom.

  • So the same here, when we compute this gradient,

  • we multiplied by this very small value so as to take small steps

  • and be sure that we are not jumping from one side of the valley

  • to the other.

  • All right.

  • So let's finish our training.

  • Basically, in a loop, we load a batch of 100 training images

  • and labels.

  • We run this training step which adjusts our weights and biases.

  • And we repeat.

  • All the rest of the stuff on the bottom, it's just for display.

  • I'm computing the accuracy and the cross entropy

  • on my training data and again, on my test data

  • so that I can show you four curves over there.

  • It is just for display.

  • It has nothing to do with the training itself.
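
Putting it together, the loop might look like the sketch below; next_training_batch is a hypothetical helper standing in for whatever loader provides the MNIST batches:

    for i in range(10000):
        batch_X, batch_Y = next_training_batch(100)   # hypothetical loader: 100 images + one-hot labels
        sess.run(train_step, feed_dict={X: batch_X, Y_: batch_Y})
        # accuracy and cross entropy would also be evaluated here, but only for display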

  • All right.

  • So that was it.

  • That's the entire code here on one slide.

  • Let's go through this again.

  • At the beginning, you define variables for everything

  • that you want TensorFlow to compute for you.

  • So here are our weights and biases.

  • You define placeholders for everything

  • that you will be feeding during the training, namely our images

  • and our training labels.

  • Then you define your model.

  • Your model gives you predictions.

  • You can compare those predictions

  • with your known labels, compare the distance

  • between the two, which is the cross entropy here,

  • and use that as an error function.

  • So you pick an optimizer and you ask the optimizer to minimize

  • your error function.

  • That gives all the gradients and all

  • that, it gives you a training step.

  • And now, in a loop, you load a batch of images and labels,

  • you run your training step, and you do this in a loop,

  • hoping this will converge, and usually it does.

  • You see here, it did converge and with this approach,

  • we got 92% accuracy.

  • Small recap of all the ingredients we put in our pot

  • so far.

  • We have a softmax activation function.

  • We have the cross entropy as an error function.

  • And we did this mini batching thing

  • where we train on 100 images at a time, do one step,

  • and then load another batch of images.

  • So is 92% accuracy good?

  • No, it's horrible.

  • Imagine you're actually using this in production.

  • I don't know, in the post office, your decoding zip

  • codes.

  • 92%? Out of 100 digits, you have eight bad values.

  • No, not usable in production.

  • Forget it.

  • So how do we fix it?

  • Well, deep learning.

  • We'll go deep.

  • We can just stack those layers.

  • How do we do that?

  • Well, it's very simple.

  • Look at the top layer of neurons.

  • It does what we just did.

  • It computes weighted sums of pixels.

  • But we can just as easily add a second layer

  • of neurons that will compute weighted sums of all

  • the outputs of the first layer.

  • And that's how you stack layers to produce

  • a deep neural network.

  • Now we are going to change our activation function.

  • We keep softmax for the output layer

  • because softmax has these nice properties

  • of pulling a winner apart and producing

  • numbers between zero and one.

  • But for the rest, we use a very classical activation function.

  • In neural networks, it's called the sigmoid,

  • and it's basically, the simplest possible continuous function

  • that goes from zero to one.

  • OK.

  • All right.

  • Let's write this model.

  • So we have now one set of weights and one set of biases

  • per layer.

  • That's why you see five pairs here.

  • And our model will actually look very familiar to you.

  • Look at the first line.

  • It's exactly what we have seen before for one

  • layer of a neural network.

  • Now what we do with the output, Y1,

  • is that we use it as the input in the second line,

  • and so on, we chain those.

  • It's just that on the last line, the activation

  • function we use is the softmax.
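
A hedged sketch of that five-layer model; the intermediate layer sizes are illustrative assumptions (the talk does not state them), and the random initialization follows the truncated-normal approach mentioned later in the talk:

    import tensorflow as tf  # TensorFlow 1.x API

    L1, L2, L3, L4 = 200, 100, 60, 30   # assumed layer sizes, for illustration only

    W1 = tf.Variable(tf.truncated_normal([784, L1], stddev=0.1)); B1 = tf.Variable(tf.zeros([L1]))
    W2 = tf.Variable(tf.truncated_normal([L1, L2], stddev=0.1));  B2 = tf.Variable(tf.zeros([L2]))
    W3 = tf.Variable(tf.truncated_normal([L2, L3], stddev=0.1));  B3 = tf.Variable(tf.zeros([L3]))
    W4 = tf.Variable(tf.truncated_normal([L3, L4], stddev=0.1));  B4 = tf.Variable(tf.zeros([L4]))
    W5 = tf.Variable(tf.truncated_normal([L4, 10], stddev=0.1));  B5 = tf.Variable(tf.zeros([10]))

    X = tf.placeholder(tf.float32, [None, 28, 28, 1])
    XX = tf.reshape(X, [-1, 784])

    Y1 = tf.nn.sigmoid(tf.matmul(XX, W1) + B1)   # each output becomes the input of the next line
    Y2 = tf.nn.sigmoid(tf.matmul(Y1, W2) + B2)
    Y3 = tf.nn.sigmoid(tf.matmul(Y2, W3) + B3)
    Y4 = tf.nn.sigmoid(tf.matmul(Y3, W4) + B4)
    Y  = tf.nn.softmax(tf.matmul(Y4, W5) + B5)   # softmax kept for the output layer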

  • So that's all the changes we did.

  • And we can try to run this again.

  • So oops.

  • This one.

  • Run.

  • Run.

  • Run.

  • And it's coming.

  • Well, I don't like this slope here.

  • It's not shooting up really sharp.

  • It's a bit slow.

  • Actually, I have a solution for that.

  • I lied to you when I said that the sigmoid was the most widely

  • used activation function.

  • That was true in the past, and today, people

  • invented a new activation function, which is called

  • the Relu, and this is a relu.

  • It's even simpler.

  • It's just zero for all negative values and identity

  • for all positive values.
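
In other words, as a one-line sketch:

    import numpy as np

    def relu(x):
        # zero for all negative values, identity for all positive values
        return np.maximum(0.0, x)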

  • Now this actually works better.

  • It has lots of advantages.

  • Why does it work better?

  • We don't know.

  • People tried it, it worked better.

  • [LAUGHTER]

  • I'm being honest here.

  • If you had a researcher here, he would

  • fill your head with equations and prove it,

  • but he would have done those equations after the fact.

  • People already tried it, it worked better.

  • Actually, they got inspiration from biology.

  • It is said, I don't know if it is true,

  • but I heard that the sigmoid was the preferred

  • model of biologists for our actual biological neurons

  • and that today, biologists think that neurons in our head

  • work more like this.

  • And the guys in computer science got

  • inspiration from that, tried it, works better.

  • How better?

  • Well, this is just the beginning of the training.

  • This is what we get with our sigmoids, just 300 iterations,

  • so just the beginning.

  • And this is what we get from relus.

  • Well, I prefer this.

  • The accuracy shoots up really sharp.

  • The cross entropy goes down really sharp.

  • It's much faster.

  • And actually, here on this very simple problem,

  • the sigmoid would have recovered, it's not an issue,

  • but in very deep networks, sometimes with the sigmoid,

  • you don't converge at all.

  • And the relu solves that problem to some extent.

  • So the relu it is for most of our issues.

  • OK.

  • So now let's train.

  • Let's do this for 10,000 iterations, five layers,

  • look at that.

  • 98% accuracy.

  • First of all, oh, yeah.

  • We went from 92 to 98 just by adding layers.

  • That's fantastic.

  • But look at those curves.

  • They're all messy.

  • What is all this noise?

  • Well, when you see noise like that,

  • it means that you are going too fast.

  • You're actually jumping from one side of the valley

  • to the other, without actually reaching the bottom

  • of your error function.

  • So we have a solution for that, but it's not just

  • to go slower, because then you would spend 10 times more time

  • training.

  • The solution, actually, is to start fast and then

  • slow down as you train.

  • It's called learning rate decay.

  • We usually decay the learning rates on an exponential curve.
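
One way to implement that decay, sketched here with illustrative numbers (the talk does not give the exact values); TensorFlow 1.x also ships tf.train.exponential_decay for the same purpose:

    import math

    lr_max, lr_min, decay_speed = 0.003, 0.0001, 2000.0   # assumed values, for illustration

    def learning_rate(i):
        # start near lr_max and decay exponentially towards lr_min as iteration i grows
        return lr_min + (lr_max - lr_min) * math.exp(-i / decay_speed)

    # the resulting value would be fed to the optimizer at every training step,
    # for example through a placeholder used as the optimizer's learning rate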

  • So yes, I hear you.

  • It sounds very simple, why this little

  • trick, but let me play you the video of what this does.

  • It's actually quite spectacular.

  • So it's almost there.

  • I should have the end of it on a slide.

  • Yeah, that's it.

  • So this is what we had using a fixed learning rate

  • and just by switching to a decaying learning rate, look,

  • it's spectacular.

  • All the noise is gone.

  • And for the first--

  • just with this little trick--

  • really, this is not rocket science,

  • it's just going slightly slower towards the end

  • and all the noise is gone.

  • And look at the blue curve, the training accuracy curve.

  • Towards the end, it's stuck at 100%.

  • So here, for the first time, we built a neural network

  • that was capable of learning all of our training set perfectly.

  • It doesn't make one single mistake in the entire training

  • set which doesn't mean that it's perfect in the real world.

  • As you see on the test dataset, it has a 98% accuracy.

  • But, well, it's something.

  • We got 100% at least on the training.

  • All right.

  • So we still have something that is a bit bizarre.

  • Look at those two curves.

  • This is our error function.

  • So the blue curve, the training error function,

  • that is what we minimize.

  • OK?

  • So as expected, it goes down.

  • And the error function computed on our test data

  • at the beginning, well, it follows.

  • That's quite nice.

  • And then it disconnects.

  • So this is not completely unexpected, you know.

  • We are minimizing the training error function.

  • That's what we are actively minimizing.

  • We are not doing anything at all on the test side.

  • It's just a byproduct of the way neural networks work

  • that the training you do on your training data,

  • actually carries over to your test data to the real world.

  • Well, it carries over or it doesn't.

  • So as you see here, until some point,

  • it does and then, there is a disconnect,

  • it doesn't carry over anymore.

  • You keep optimizing the error on the training data,

  • but it has no positive effect on the test

  • performance, the real-world performance, anymore.

  • So if you see curves like this, you take the textbook,

  • you look it up, it's called overfitting.

  • You look at the solutions, they tell you overfitting,

  • you need regularization.

  • OK.

  • Let's regularize.

  • What regularization options do we have?

  • My preferred one is called dropout.

  • It's quite dramatic.

  • You shoot the neurons.

  • No, really.

  • So this is how it works.

  • You take your neural network, and pick a probability,

  • let's say 50%.

  • So at each training iteration, you

  • will shoot, physically remove from the network,

  • 50% of your neurons.

  • Do the pass, then put them back, next iteration, again, randomly

  • shoot 50% of your neurons.

  • Of course, when you test, you don't test with a half brain

  • dead neural network, you put all the neurons back.

  • But that's what you do for training.

  • So in TensorFlow, there is a very simple function

  • to do that, which is called dropout, that you apply

  • at the outputs of the layer.

  • And what it simply does is it takes the probability

  • and in the output of that layer, it

  • will replace randomly some values by zeros

  • and, as a small technicality, it will actually

  • boost the remaining values proportionally

  • so that the average stays constant,

  • that's a technicality.
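
A hedged TensorFlow 1.x sketch of that call, applied to the output of one layer (Y4 here stands for such an output); note that tf.nn.dropout takes the probability of keeping a neuron, so shooting 50% means keep_prob = 0.5:

    pkeep = tf.placeholder(tf.float32)     # probability of KEEPING a neuron
    Y4d = tf.nn.dropout(Y4, pkeep)         # zeroes random outputs and boosts the survivors

    # at training time you would feed pkeep=0.5 (shoot half the neurons);
    # at test time you feed pkeep=1.0 so that all the neurons are back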

  • So why does shooting neurons help?

  • Well, first of all, let's see if it helps.

  • So let's try to recap all the tricks we tried to play

  • with our neural network.

  • This is what we had initially with our five layers

  • using the sigmoid as an activation function.

  • The accuracy got up to 97.9% using five layers.

  • So first, we replaced the sigmoid by the relu activation

  • function.

  • You see, it's faster to converge at the beginning

  • and we actually gained a couple of fractions

  • of percentage of accuracy.

  • But we have these messy curves.

  • So we train slower using the exponential learning rate decay

  • and we get rid of the noise, and now we are stable or above 98%

  • accuracy.

  • But we have that weird disconnect

  • between the error on our test data

  • and the error on our training data.

  • So let us try to add dropout.

  • This is what you get with dropout.

  • And actually, the cross entropy function,

  • the test cross entropy function, the red one

  • over there on the right, has been largely brought

  • under control.

  • You see, there is still some disconnect,

  • but it's not shooting up as it was before.

  • That's very positive.

  • Let's look at the accuracy.

  • No improvement.

  • Actually, I'm even amazed that it hasn't gone down

  • seeing how brutal this technique is, you shoot neurons

  • while you train.

  • But here, I was very hopeful to get it up.

  • No, nothing.

  • We have to keep digging.

  • So what is really overfitting?

  • Let's go beyond the simple recipe in the textbook.

  • Overfitting, in a neural network,

  • is primarily when you give it too many degrees of freedom.

  • Imagine you have so many neurons and so many

  • weights in a neural network that it's somehow

  • feasible to simply store all the training images

  • in those weights and biases.

  • You have enough room for that.

  • And the neural network could figure out some cheap trick

  • to pattern match the training images in what it has stored

  • and just perfectly recognize your training images

  • because it has stored copies of all of them.

  • Well, if it has enough space to do that,

  • that would not translate to any kind of recognition performance

  • in the real world.

  • And that's the trick about neural networks.

  • You have to constrain their degrees of freedom

  • to force them to generalize.

  • And mostly, when you get overfitting

  • is because you have too many neurons.

  • You need to get that number down to force the network

  • to produce generalizations that will then

  • produce good predictions, even in the real world.

  • So either you get the number of neurons down

  • or you apply some trick, like dropout,

  • that is supposed to mitigate the consequences of too

  • many degrees of freedom.

  • The opposite of too many neurons is a very small dataset.

  • Well, even if you have only a small number of neurons,

  • if the training dataset is very small,

  • the network can still fit it all in.

  • So that's a general truth in neural networks.

  • You need big datasets for training.

  • And then what happened here?

  • We have a big data set, 60,000 digits, that's enough.

  • We know that we don't have too many neurons because we added

  • five layers, that's a bit overkill, but I tried,

  • I promise, with four and three and two.

  • And we tried dropout which is supposed

  • to mitigate the fact that you have too many neurons.

  • And it didn't do anything to the accuracy.

  • So the conclusion here that we come to

  • is that our network, the way it is built, is inadequate.

  • It's not capable by its architecture

  • to extract the necessary information from our data.

  • And maybe someone here can pinpoint something really

  • stupid we did at the beginning.

  • Someone has an idea?

  • Remember, we have images?

  • Images with shapes like curves and lines.

  • And we flattened all the pixels in one big vector.

  • So all that shape information is lost.

  • This is terrible.

  • That's why we are performing so badly.

  • We lost all of the shape information.

  • So what is the solution?

  • Well, people have invented a different type

  • of neural networks to handle specifically

  • images and problems where shape is important.

  • It's called convolutional networks.

  • Here we go back to the general case of an image,

  • of a color image.

  • So that's why it has red, green, and blue components.

  • And in a convolutional network, one neuron

  • will still be doing weighted sums of pixels,

  • but only a small patch of pixels above its head, only

  • a small patch.

  • And the next neuron would, again,

  • be doing weighted sum of the small patch of pixels

  • above itself, but using the same weights.

  • OK?

  • That's the fundamental difference

  • from what we have seen before.

  • The second neuron is using the same weights

  • as the first neuron.

  • So we are actually taking just one set of weights

  • and we are scanning the image in both directions,

  • using that set of weights and producing weighted sums.

  • So we scan it in both directions and we obtain

  • one layer of weighted sums.

  • So how many weights do we have?

  • Well, as many weights as we have input values

  • in that little highlighted cube, that's 4 times 4 times

  • 3, which is 48.

  • What?

  • 48?

  • We had 8,000 degrees of freedom in our simplest network

  • with just 10 neurons.

  • How can it work with such a drastic reduction

  • in the number of weights?

  • Well, it won't work.

  • We need more degrees of freedom.

  • How do we do that?

  • Well, we pick a second set of weights and do this again.

  • And we obtain the second--

  • let's call it a channel of values using different weights.

  • Now since those are multi-dimensional matrices,

  • it's fairly easy to write those two matrices as one

  • by simply adding a dimension of dimension two

  • because we have two sets of values.

  • And this here will be the shape of the weights matrix

  • for one convolutional layer in a neural network.

  • Now, we still have one problem left

  • which is that we need to bring the amount of information down.

  • At the end, we still want only 10 outputs

  • with our 10 probabilities to recognize what this number is.

  • So traditionally, this was achieved by what

  • we call a subsampling layer.

  • I think it's quite useful to understand

  • how this works because it gives you a good feeling for what

  • this network is doing.

  • So basically, we were scanning the image using

  • a set of weights and during training, these weights

  • will actually specialize in some kind of shape recognizer.

  • There will be some weights that will become

  • very sensitive to horizontal lines

  • and some weights that will become

  • very sensitive to vertical lines, and so on.

  • So basically, when you scan the image, if you simplify,

  • you get an output which is mostly I've seen nothing,

  • I've seen nothing, I've seen nothing,

  • oh, I've seen something, I've seen nothing,

  • I've seen nothing, oh, I've seen something.

  • The subsampling basically takes four of those outputs, two

  • by two, and it takes the maximum value.

  • So it retains the biggest signal of I've seen something

  • and passes that down to the layer below.
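
For reference, that traditional subsampling step exists in TensorFlow 1.x as tf.nn.max_pool; a hedged one-liner (the talk does not actually use it, and Y1 here stands for some convolutional layer's output):

    # take the maximum over each 2x2 block of outputs, halving the plane size
    Y_pooled = tf.nn.max_pool(Y1, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')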

  • But actually, there's a much simpler way

  • of condensing information.

  • What if we simply play with the stride of the convolution?

  • Instead of scanning the image pixel by pixel,

  • we scan it every two pixels, we jump by two pixels

  • between each weighted sum.

  • Well, mechanically, instead of obtaining 28

  • by 28 output values, we obtain only 14 by 14 output values.

  • So we have condensed our information.

  • And mostly today, I'm not saying this is better,

  • but it's just simpler.

  • And mostly today, people who build convolutional networks

  • just use convolutional layers and play

  • with the step to condense the information and it's simpler.

  • You don't need, in this way, to have these subsampling layers.

  • So this is the network that I would like to build with you.

  • Let's go through it.

  • There is a first convolutional layer that uses patches of five

  • by five.

  • I'm reading through the W1 tensor.

  • And we have seen that in this shape,

  • the first two numbers are the size of the patch you apply.

  • The third number is the number of channels

  • it's reading from the input.

  • So here I'm back to my real example.

  • This is a grayscale image.

  • It has one value per pixel.

  • So I'm reading one channel of information.

  • And I will be applying four of those patches to my image.

  • So I obtain four channels of output values.

  • OK?

  • Now the second convolutional layer, this time, my stride is two.

  • So here, my outputs become planes of 14 by 14 values.

  • So let's go through it.

  • I'm applying patches of four by four.

  • I'm reading in four channels of values

  • because that's what I output in the first layer.

  • And this time, I will be using eight different patches,

  • so I will actually produce eight different channels

  • of weighted sums.

  • Next layer, again, a stride of two.

  • That's why I'm getting down from 14 by 14 to seven by seven.

  • Patches of four by four, reading

  • in eight channels of values because that's

  • what I had in the previous layer,

  • and outputting 12 channels of values

  • this time because I used 12 different patches.

  • And now I apply a fully connected layer.

  • So the kind of layer we've seen before.

  • OK?

  • This fully connected layer, remember the difference

  • in this one: each neuron does a weighted sum

  • of all the values in the little cube of values above,

  • not just a patch, all the values.

  • And the next neuron in the fully connected network does,

  • again, a weighted sum of all the values using its own weights.

  • It's not sharing weights.

  • That's the normal neural network layer as we have seen before.

  • And finally, I apply my softmax layer with my 10 outputs.

  • All right.

  • So can we write this in TensorFlow?

  • Well, we need one set of weights and biases for each layer.

  • The only difference is that for the convolutional layers,

  • our weights will have this specific shape

  • that we have seen before.

  • So two numbers for the filter size,

  • one number for the number of input channels,

  • and one number for the number of patches

  • which corresponds to the number of output channels

  • that you produce.

  • For our normal layers, we have the weights and biases

  • defined as before.

  • And so you see this truncated normal thingy up there?

  • That's just random.

  • OK?

  • It's a complicated way of saying random.

  • So we initialize those weights to random values, initially.
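
A hedged sketch of those weight and bias variables, matching the shapes described above (5x5x1x4, then 4x4x4x8, then 4x4x8x12); the size of the fully connected layer is an assumption, since the talk does not state it:

    import tensorflow as tf  # TensorFlow 1.x API

    W1 = tf.Variable(tf.truncated_normal([5, 5, 1, 4], stddev=0.1))    # 5x5 patches, 1 input channel, 4 outputs
    B1 = tf.Variable(tf.zeros([4]))
    W2 = tf.Variable(tf.truncated_normal([4, 4, 4, 8], stddev=0.1))    # 4x4 patches, 4 in, 8 out
    B2 = tf.Variable(tf.zeros([8]))
    W3 = tf.Variable(tf.truncated_normal([4, 4, 8, 12], stddev=0.1))   # 4x4 patches, 8 in, 12 out
    B3 = tf.Variable(tf.zeros([12]))

    N = 200                                                            # fully connected layer size: assumed
    W4 = tf.Variable(tf.truncated_normal([7 * 7 * 12, N], stddev=0.1)) # weighted sum of the whole 7x7x12 cube
    B4 = tf.Variable(tf.zeros([N]))
    W5 = tf.Variable(tf.truncated_normal([N, 10], stddev=0.1))         # final softmax layer, 10 outputs
    B5 = tf.Variable(tf.zeros([10]))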

  • And now this is what our model will look like.

  • So TensorFlow has this helpful conv2d function.

  • If you give it the weights matrix and a batch of images,

  • it will scan them in both directions.

  • It's just a double loop to scan the image in both directions

  • and produce the weighted sums.

  • So we do those weighted sums.

  • We add a bias.

  • We feed this through an activation function,

  • in this case, the relu, and that's our outputs.

  • And again, the way of stacking these layers

  • is to feed Y1, the first output,

  • as the input of the next layer.

  • All right.

  • After our three convolutional layers,

  • we need to do a weighted sum this time

  • of all the values in this seven by seven by 12 little cube.

  • So to achieve that, we will flatten this cube

  • as one big vector of values.

  • That's what the Reshape here does.

  • And then, two additional lines that you should recognize,

  • those are normal neural network layers as we have seen before.
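
Continuing that sketch, the model itself; padding='SAME' is assumed so the plane sizes come out as the 28, 14 and 7 mentioned above, and X is the [None, 28, 28, 1] image placeholder from before:

    Y1 = tf.nn.relu(tf.nn.conv2d(X,  W1, strides=[1, 1, 1, 1], padding='SAME') + B1)  # 28x28, 4 channels
    Y2 = tf.nn.relu(tf.nn.conv2d(Y1, W2, strides=[1, 2, 2, 1], padding='SAME') + B2)  # stride 2 -> 14x14, 8 channels
    Y3 = tf.nn.relu(tf.nn.conv2d(Y2, W3, strides=[1, 2, 2, 1], padding='SAME') + B3)  # stride 2 -> 7x7, 12 channels

    YY = tf.reshape(Y3, [-1, 7 * 7 * 12])          # flatten the 7x7x12 cube into one big vector
    Y4 = tf.nn.relu(tf.matmul(YY, W4) + B4)        # fully connected layer
    Y  = tf.nn.softmax(tf.matmul(Y4, W5) + B5)     # softmax output: 10 probabilities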

  • All right.

  • How does this work?

  • So this time, it takes a little bit more time

  • to process so I have a video.

  • You see the accuracy's shooting up really fast.

  • I will have to zoom.

  • And the promised 99% accuracy is actually not too far.

  • We're getting there.

  • We're getting there.

  • Are we getting there?

  • We're not getting there.

  • Oh, damn.

  • I'm so disappointed again.

  • I really wanted to bring this to 99% accuracy.

  • We'll have to do something more, 98.9.

  • Dammit, that was so close.

  • All right.

  • Yes.

  • Exactly.

  • This should be your WTF moment.

  • What is that?

  • On the cross entropy loss curve.

  • OK, let me zoom on it.

  • You see that?

  • That disconnect?

  • Do we have a solution for this?

  • Dropout.

  • Yes.

  • Let's go shooting our neurons.

  • It didn't work last time, maybe this time it will.

  • So actually, what we will do here, it's a little trick.

  • It's almost a methodology for coming up

  • with the ideal neural network for a given situation.

  • And what I like doing is to restrict the degrees of freedom

  • until it's apparent that it's not optimal.

  • It's hurting the performance.

  • Here, I know that I can get about 99%.

  • So I restricted it a little bit too much.

  • And from that point, I give it a little bit more freedom

  • and apply dropout to make sure that this additional freedom

  • will not result in overfitting.

  • And that's basically how you obtain

  • a pretty optimal neural network for a given problem.

  • So that's what I have done here.

  • You see, the patches are slightly bigger: six by six,

  • five by five, four by four, instead of five by five,

  • four by four, and so on.

  • And I've used a lot more patches.

  • So six patches in the first layer, 12 in the second layer,

  • and 24 in the third layer, instead of four, eight, and 12.

  • And, I applied dropout in the fully connected layer.

  • So why not in the other layers?

  • I tried both, it's possible to apply dropout

  • in convolutional layers.

  • But actually, if you count the number of neurons,

  • there is a lot more neurons in the fully connected layer.

  • So it's a lot more efficient to be shooting them there.

  • I mean, it hurts a little bit too much

  • to shoot neurons where you have only a few of them.

  • So with this, let's run this again.

  • So again, the accuracy shoots up very fast.

  • I will have to zoom in.

  • Look where the 99% is and we are above!

  • Yes!

  • [APPLAUSE]

  • Thank you.

  • I promised you we would get above 99 and we are actually

  • quite comfortably above.

  • We get to 99.3%.

  • In this time, let's see what our dropout actually did.

  • So this is what we had with a five layer network

  • and already a few more degrees of freedom.

  • So more patches in each layer.

  • You see, we are already above 99%.

  • But we have this big disconnect between the test

  • and the training cross entropy.

  • Let us apply dropout, boom.

  • The test cross entropy function is brought in under control.

  • It's not shooting up as much.

  • And look, this time, we actually had

  • a problem and this fixed it.

  • With just applying dropout, we got 2/10 of a percent

  • more accuracy.

  • And here, we are fighting for the last percent,

  • between 99 and 100.

  • So getting 2/10 is enormous with just a little trick.

  • All right.

  • So there we have it.

  • We built this network and brought it all the way

  • to 99% accuracy.

  • The Cliff's Notes are just a summary.

  • And to finish, so this was mostly about TensorFlow.

  • We also have a couple of pre-trained APIs, which

  • you can use just as APIs if your problem is standard enough

  • to fit into one of those Cloud Vision, Cloud Speech, Natural

  • Language, or Translate APIs.

  • And if you want to run your TensorFlow jobs in the cloud,

  • we also have this Cloud ML Engine service

  • that allows you to execute your TensorFlow

  • jobs in the cloud for training.

  • And what is even more important, with just the click

  • of a button, you can take a trained model

  • and push it to production behind an API

  • and start serving predictions from the model in the cloud.

  • So I think that's a little technical detail,

  • but from an engineering perspective,

  • it's quite significant that you have a very easy way of pushing

  • something to prod.

  • Thank you.

  • You have the code on GitHub and this slide deck is freely

  • available at that URL.

  • And with that, we have five minutes for questions,

  • if you have any.

  • [APPLAUSE]

  • Thank you.

  • [MUSIC PLAYING]
