[MUSIC PLAYING] MARTIN GORNER: Hello. Hi, everyone. So thank you for coming in such great numbers to this TensorFlow session. Apologies, it's quite late in the afternoon. I will need all your brains for this session, because today I want to build a neural network with you. So no, I don't need your brains to build on; there is no brain surgery in this session. But it is a crash course to get developers up to speed on machine learning, deep learning, and neural networks. So I need all your attention.

The dataset we will be using is a very classical one. It's this one here: hand-written digits. Academia has been working on this dataset for the past 20 years. If you go to the website where it's hosted, you will actually see 20 years of research papers, and that's what we will do together today. We'll work on this dataset, trying to build a network that recognizes these hand-written digits, from the simplest possible network all the way to 99% accuracy.

So let's start. Just a question beforehand: who has done some work with neural networks before? Oh, wow. OK. Quite a few people. So feel free to help me. I hope this will not be too basic for you, and I hope it will at least be a good introduction to TensorFlow. But if you have never done anything with neural networks, that's fine; I will explain everything from the start.

So this is the simplest possible neural network we can imagine to recognize our hand-written digits. The digits come as 28 by 28 pixel images, and the first thing we do is flatten all those pixels into one big vector of pixels. These will be our inputs. Now, we will use exactly 10 neurons. The neurons are the white circles. What a neuron does is always the same thing: it does a weighted sum of all of its inputs, here the pixels; it adds another constant that is called a bias, which is just an additional degree of freedom; and then it feeds this sum through an activation function. And that is just a function: number in, transform, number out. We will see several of those activation functions, and the one thing they have in common in neural networks is that they are non-linear.

So why 10 neurons? Well, simply because we are classifying those digits into 10 categories. We are trying to recognize a zero, a one, a two, on up to nine. So what we are hoping for here is that one of those neurons will light up and tell us, with a very strong output, "I have recognized an eight here."

And for that, since this is a classification problem, we are going to use a very specific activation function, one that researchers tell us works really well on classification problems. It's called softmax, and it's simply a normalized exponential. So what you do is that you take all those weighted sums, then you elevate them to the exponential. And once you have your 10 exponentials, you compute the norm of this vector and divide the vector by its norm, so that you get values between zero and one. And those values you will be able to interpret as probabilities: the probability of this being an eight, a one, or something else. You will be asking, which norm? Any norm, it doesn't matter; it's the length of the vector, so you pick your favorite. There are several. Usually, for softmax, we use L1, but L2, which is the Euclidean norm, would work just as well.

So what does softmax do, actually? You see, it's an exponential, so it's a very steeply increasing function. It will pull the data apart, increase the differences, and when you divide all of that, when you normalize the whole vector, you usually end up with one of the values being very close to one and all the other values being very close to zero. So it's a way of pulling the winner out on top without actually destroying the information.
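(For reference, a minimal sketch of softmax in NumPy, not taken from the talk's slides: exponentiate the 10 weighted sums, then divide by the sum of the exponentials, which is the L1 norm since all the exponentials are positive. The numbers below are illustrative.)

```python
import numpy as np

def softmax(logits):
    # Exponentiate, then normalize by the sum (the L1 norm of the positive
    # exponentials) so the outputs lie between 0 and 1 and sum to 1.
    exps = np.exp(logits)
    return exps / np.sum(exps)

# Example: the 10 weighted sums produced by the 10 neurons for one image.
weighted_sums = np.array([1.2, -0.3, 0.5, 3.1, 0.0, -1.0, 0.8, 2.0, 4.5, 0.1])
probabilities = softmax(weighted_sums)
print(probabilities)        # 10 values between 0 and 1; the largest input wins
print(probabilities.sum())  # 1.0, so they can be read as probabilities
```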
All right. So now we need to formalize this using a matrix multiply. I will remind you what a matrix multiply is, but we will not do it for just one image; we are going to do this for a batch of 100 images at a time. So what we have here in my matrix is 100 images, one image per line. The images are flattened: all the pixels of one image on one line. Then I take my matrix of weights. For the time being, I don't know what these weights are; they are just weights, and I'm doing weighted sums. And I start the matrix multiplication. So I do a weighted sum of all the pixels of the first image. Here it is. Then, if I continue this matrix multiply using the second column of weights, I get a weighted sum of all the pixels of the first image for the second neuron, and then for the third neuron, and the fourth, and so on.

What is left is to add the biases, just an additional constant. Again, we don't know what they are for the time being. And there is one bias per neuron; that's why we have 10 biases. Now, if I continue this matrix multiply, I'm going to obtain these weighted sums for the second image, and the third image, and so on, until I have processed all my images.

I would like to write this as a simple formula there. You see there is a problem: X times W, you know, is a matrix of 10 columns by 100 images, and I have only 10 biases. I can't simply add them together. Well, never mind. We will redefine addition, and it's OK if everybody accepts it. And actually, people have already accepted it. It's called a broadcasting add, and that's the way you do additions in NumPy, for instance, which is the numerical library for Python. The way a broadcasting add works is that if you're trying to add two things whose dimensions don't match, so you can't do the addition directly, you replicate the small one as many times as needed to make the sizes match, and then you do the addition. That's exactly what we need to do here. We have only those 10 biases, and it's the same biases on all the lines. We just need to replicate this bias vector on all the lines, and that's exactly what this generalized, broadcasting add does. So we will just write it as a plus.

And this is where I wanted to get to: Y = softmax(X·W + b). I want you to remember this as the formula describing one layer in a neural network. So let's go through this again. In X, we have a batch of images: 100 images, all the pixels of each image on one line. In W, we have all of our weights for the 10 neurons, all the weights in the system. X times W: all of our weighted sums. We add the biases, and then we feed this through our activation function, in this case softmax, and the way it works is line by line. Line by line, we take the 10 values, elevate them to the exponential, normalize the line; next line, 10 values, elevate them to the exponential, normalize the line, and so on. So what we get in the output is, for each image, 10 values which look like probabilities and which are our predictions.

So, of course, we still don't know what those weights and biases are, and that's where the trick is in neural networks. We are going to train this neural network so that it figures out the correct weights and biases by itself. Well, this is how we write this in TensorFlow.
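(A minimal sketch of that one-layer model, assuming the TensorFlow 1.x placeholder and Variable API; the names X, W, b, Y and the zero initial values are illustrative, not necessarily the exact slide code.)

```python
import tensorflow as tf

# X: a batch of flattened images, 28*28 = 784 pixels per line.
X = tf.placeholder(tf.float32, [None, 784])
# W: one column of 784 weights per neuron, 10 neurons.
W = tf.Variable(tf.zeros([784, 10]))
# b: one bias per neuron.
b = tf.Variable(tf.zeros([10]))

# One layer: weighted sums, broadcast the 10 biases onto every line,
# then apply softmax line by line.
Y = tf.nn.softmax(tf.matmul(X, W) + b)
```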
You see, not very different. OK. TensorFlow has this nn library for neural networks, which has all sorts of very useful functions for neural networks, for example softmax and so on.

So let's go train. When you train, you've got images, but you know what those images are. So you initialize your weights and biases at random values, and your network will output some probabilities. Since you know what this image is, you can tell it: it's not this, it should be that. And "that" is called a one-hot encoded vector. It's a not very fancy way of encoding numbers. Basically, here are our numbers from zero to nine. We encode them as 10 bits, all at zero, and just one of them is a one, at the index of the number we want to encode. Here, a six. Why? Well, because then it's in the same shape as our predictions, and we can compute a distance between the two.

Again, there are many ways of computing distances. The Euclidean distance, the usual distance, the sum of squared differences, would work, no problem. But scientists tell us that for classification problems, this distance, the cross-entropy, works slightly better. So we'll use this one. How does it work? It's the sum, across the vectors, of the values on the top multiplied by the logarithms of the values on the bottom, and then we add a minus sign, because all the values on the bottom are less than one, so all the logarithms are negative. So that's the distance. And of course, we will tell the system to minimize the distance between what it thinks is the truth and what we know to be true. So this we will call our error function, and the training will be guided by an effort to minimize this error function.

So let's see how this works in practice. In this little visualization, I'm showing you, over there, my training images. You see it's training, so you see these batches of 100 training images being fed into the system. On a white background, you have the images that have already been correctly recognized by the system. On a red background, the images that are still missed. Then, on the middle graph, you see our error function, computed both on the training dataset and on a set of images which we kept aside and never use during training, for testing. Of course, if you want to test the real-world performance of your neural network, you have to do this on a set of images which you have never seen during training. So here, we have 60,000 training images, and I set aside 10,000 test images, which you see in the bottom graph over there. They are a bit small. You see only 1,000 of them here, so imagine there are nine more screens of pictures like that. But I sorted all the badly recognized ones to the top. So you see all the ones that have been badly recognized, and below are nine screens of correctly recognized images, here after 2,000 rounds of training.

So there is a little scale on the side here. It shows you that it's already capable of recognizing 92% of our images with this very simple model, just 10 neurons, nothing else. And that's what you get on the top graph, the accuracy graph, as well. That's simply the percentage of correctly recognized images, both on test and training data. So what else do we have? We have our weights and biases; those two diagrams are simply percentiles, so they show you the spread of all the weights and biases.
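(For reference, a minimal sketch of the cross-entropy error function and training step described above, again in TensorFlow 1.x style; the model is repeated so the snippet stands alone, and the optimizer choice and learning rate are illustrative, not the exact slide code.)

```python
import tensorflow as tf

# The one-layer model again (names illustrative).
X  = tf.placeholder(tf.float32, [None, 784])   # batch of flattened images
Y_ = tf.placeholder(tf.float32, [None, 10])    # one-hot "correct answers", e.g. a six is [0,0,0,0,0,0,1,0,0,0]
W  = tf.Variable(tf.zeros([784, 10]))
b  = tf.Variable(tf.zeros([10]))
Y  = tf.nn.softmax(tf.matmul(X, W) + b)        # 10 predicted probabilities per image

# Cross-entropy: minus the sum of (known label * log of predicted probability),
# taken across the 10 values of each line, then averaged over the batch.
cross_entropy = -tf.reduce_mean(tf.reduce_sum(Y_ * tf.log(Y), axis=1))

# Training: ask an optimizer to nudge W and b so as to minimize this error
# function (plain gradient descent here; the learning rate is just a sketch).
train_step = tf.train.GradientDescentOptimizer(0.003).minimize(cross_entropy)
```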