
Subtitles

  • [MUSIC PLAYING]

  • MARTIN GORNER: Hello.

  • Hi, everyone.

  • So thank you for coming in such great numbers

  • to this TensorFlow session.

  • Apologies, it's quite late in the afternoon.

  • I will need all your brains for this session because today,

  • I want to build a neural network with you.

  • So no, I don't need your brains to build on; no brain

  • surgery in this session.

  • But it's a crash course to get developers up

  • to speed on machine learning and deep learning and neural

  • networks.

  • So I need all your attention.

  • The dataset we will be using is a very classical one.

  • It's this one here, hand-written digits.

  • Academia has been working on this dataset for the past 20

  • years.

  • So if you go to the website where it's hosted,

  • you will actually see 20 years of research papers,

  • and that's what we will do together today.

  • We'll go on this dataset trying to build a network that

  • recognizes these hand-written digits, from the simplest

  • possible network all the way to 99% accuracy.

  • So let's start.

  • Just a question, beforehand.

  • Who has done some work with neural networks before?

  • Oh, wow.

  • OK.

  • Quite a few people.

  • So feel free to help me and I hope this will not

  • be too basic for you and I hope it

  • will at least be a good introduction to TensorFlow.

  • But if you have never done anything with neural networks,

  • that's fine and I will explain everything from the start.

  • So this is the simplest possible neural network

  • we can imagine to recognize our hand-written digits.

  • So the digits, they come as 28 by 28 pixel images

  • and the first thing we do is that we flatten

  • all those pixels into one big vector of pixels

  • and these will be our inputs.
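
A minimal sketch of that flattening step (illustrative NumPy, not the speaker's code):

```python
import numpy as np

# One 28x28 pixel digit becomes a single vector of 784 pixel values.
image = np.zeros((28, 28))     # stand-in for one hand-written digit
pixels = image.reshape(-1)     # flatten: shape (784,)
print(pixels.shape)            # (784,)
```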

  • Now, we will use exactly 10 neurons.

  • The neurons are the white circles.

  • What a neuron does is always the same thing.

  • A neuron does a weighted sum of all of its inputs,

  • here the pixels.

  • It adds another constant that is called a bias.

  • That's just an additional degree of freedom.

  • And then it will feed this sum through an activation function.

  • And that is just a function-- number in, transform,

  • number out.

  • We will see several of those activation functions

  • and the one thing they have in common in neural networks

  • is that they are non-linear.

  • So why 10 neurons?

  • Well, simply because we are classifying those digits

  • in 10 categories.

  • We are trying to recognize a zero, a one, a two,

  • and so on, up to a nine.

  • So what we are hoping for here is that one of those neurons

  • will light up and tell us, with a very strong output,

  • that it has recognized an eight here.

  • All right.

  • And for that, since this is a classification problem,

  • we are going to use a very specific activation

  • function, one that, well, researchers

  • tell us works really well on classification problems.

  • It's called softmax and it's simply

  • an exponential normalized.

  • So what you do is that you make all those weighted sums,

  • then you elevate that to the exponential.

  • And once you have your 10 exponentials,

  • you compute the norm of this vector

  • and divide the vector by its norm so that you get

  • values between zero and one.

  • And those values, you will be able to interpret them

  • as probabilities, probabilities of this being an eight, a one,

  • or something else.

  • You will be asking which norm?

  • Any norm, doesn't matter--

  • the length of the vector.

  • You pick your favorite norm.

  • There are several.

  • Usually, for softmax, we use L1, but L2,

  • which is the Euclidean norm, would work just as well.

  • So what does softmax do actually?

  • You see, it's an exponential so it's a very steeply increasing

  • function.

  • It will pull the data apart, increase the differences,

  • and when you divide all of that, when you normalize

  • the whole vector, you usually end up with one of the values

  • being very close to one and all the other values

  • being very close to zero.

  • So it's a way of pulling the winner out on top

  • without actually destroying the information.
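
As a rough illustration of that behavior, here is a minimal NumPy sketch of softmax on 10 made-up weighted sums (illustrative values, not from the talk):

```python
import numpy as np

# 10 made-up weighted sums, one per neuron.
weighted_sums = np.array([1.0, 2.0, 5.0, 0.5, 3.0, 1.5, 0.0, 2.5, 4.0, 1.0])
exps = np.exp(weighted_sums)    # the exponential pulls the values apart
softmax = exps / exps.sum()     # normalize so the 10 values sum to 1
print(softmax.round(3))         # the largest input gets by far the largest share
```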

  • All right.

  • So now we need to formalize this using a matrix multiply.

  • I will remind you of what a matrix multiply is,

  • but we will not do it for just one image; we

  • are going to do this for a batch of 100 images at a time.

  • So what we have here in my matrix

  • is 100 images, one image per line.

  • The images are flattened, all the pixels on one line.

  • So I take my matrix of weights, for the time being,

  • I don't know what these weights are,

  • it's just weights so I'm doing weighted sums.

  • And I start the matrix multiplication.

  • So I do a weighted sum of all the pixels of the first image.

  • Here it is.

  • And then if I continue this matrix multiply

  • using the second column of weights,

  • I get a weighted sum of all the pixels

  • of the first image for the second neuron and then

  • for the third neuron and the fourth and so on.

  • What is left is to add the biases,

  • just an additional constant.

  • Again, we don't know what it is for the time being.

  • And there is one bias per neuron,

  • that's why we have 10 biases.

  • And now if I continue this matrix multiply,

  • I'm going to obtain these weighted sums

  • for the second image, and the third image,

  • and so on, until I have processed all my images.

  • I would like to write this as a simple formula there.

  • You see there is a problem, x times w,

  • you know that's a matrix of 10 columns by 100 images,

  • and I have only 10 biases.

  • I can't simply add them together.

  • Well, never mind.

  • We will redefine addition and it's OK

  • if everybody accepts it.

  • And actually, people have already accepted it.

  • It's called a broadcasting add and that's

  • the way you do additions in NumPy,

  • for instance, which is the numerical library for Python.

  • The way a broadcasting add works is

  • that if you're trying to add two things whose dimensions

  • don't match, so you can't do the addition directly,

  • you replicate the smaller one as much

  • as needed to make the sizes match,

  • and then you do the addition.

  • That's exactly what we need to do here.

  • We have only those 10 biases.

  • So it's the same biases on all the lines.

  • We just need to replicate this bias vector on all the lines,

  • and that's exactly what this generalized broadcasting

  • add does.

  • So we will just write it as a plus.
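
A minimal NumPy sketch of that broadcasting add (shapes as in the talk, values made up):

```python
import numpy as np

# 100 images' worth of weighted sums (100 lines, 10 columns) plus 10 biases.
weighted_sums = np.zeros((100, 10))    # stand-in for x times w
biases = np.arange(10, dtype=float)    # one bias per neuron
out = weighted_sums + biases           # NumPy replicates the bias vector on every line
print(out.shape)                       # (100, 10)
```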

  • And this is where I wanted to get to.

  • I want you to remember this as the formula describing

  • one layer in a neural network.

  • So let's go through this again.

  • In x, we have a batch of images, 100 images,

  • all the pixels on one line.

  • In w, we have all of our weights for the 10 neurons,

  • all the weights in the system.

  • x times w, those are all of our weighted sums.

  • We add the biases, and then we feed this

  • through our activation function, in this case softmax. The way

  • it works is line by line.

  • Line by line, we take the 10 values,

  • elevate them to the exponential, normalize the line.

  • Next line, 10 values, elevate them to the exponential,

  • normalize the line, and so on.

  • So what we get in the output is, for each image, 10 values

  • which look like probabilities and which are our predictions.
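
Restating that walkthrough as a single formula (X is the batch of 100 flattened images, W the 784 by 10 weight matrix, b the 10 biases, and softmax is applied line by line):

    Y = softmax(X · W + b)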

  • So, of course, we still don't know

  • what those weights and biases are

  • and that's where the trick is in neural networks.

  • We are going to train this neural network

  • to actually figure out the correct weights

  • and biases by itself.

  • Well, this is how we write this in TensorFlow.

  • You see, not very different.

  • OK.

  • TensorFlow has this nn library for neural networks

  • which has all sorts of very useful functions

  • for neural networks, for example, softmax and so on.
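
A minimal sketch of that one-layer model in TensorFlow (not the speaker's exact code; the shapes follow the talk, and the input batch is stood in by zeros just so the snippet runs):

```python
import tensorflow as tf

X = tf.zeros([100, 784])                # stand-in for a batch of 100 flattened 28x28 images
W = tf.Variable(tf.zeros([784, 10]))    # one column of 784 weights per neuron
b = tf.Variable(tf.zeros([10]))         # one bias per neuron

# Weighted sums plus broadcast biases, then softmax applied line by line.
Y = tf.nn.softmax(tf.matmul(X, W) + b)  # shape [100, 10]: 10 probabilities per image
```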

  • So let's go train.

  • When you train, you've got images,

  • but you know what those images are.

  • So you initialize your weights and biases

  • at random values, and your network will output some probabilities.

  • Since you know what this image is,

  • you can tell it that it's not this, it should be that.

  • So that is called a one-hot encoded vector.

  • It's a not very fancy way of encoding numbers.

  • Basically, here are our numbers from zero to nine.

  • We encode them as 10 bits, all at zero and just one of them

  • is a one at the index of the number we want to encode.

  • Here is a six.

  • Why?

  • Well, because then, it's in the same shape as our predictions

  • and we can compute a distance between those two.
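
A minimal sketch of one-hot encoding a six, as just described (illustrative NumPy, not from the talk):

```python
import numpy as np

# Encode the label 6 as 10 bits: all zeros except a one at index 6.
label = 6
one_hot = np.zeros(10)
one_hot[label] = 1.0
print(one_hot)    # [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
```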

  • So again, many ways of computing distances.

  • The Euclidean distance, the usual distance, the sum

  • of differences squared, would work, not a problem.

  • But scientists tell us that for classification problems,

  • this distance, the cross entropy, works slightly better.

  • So we'll use this one.

  • How does it work?

  • It's the sum across the vectors of the values

  • on the top multiplied by the logarithms of the values

  • on the bottom, and then we add a minus sign

  • because all the values on the bottom are less than one,

  • so all the logarithms are negative.

  • So that's the distance.
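
A minimal sketch of that cross-entropy distance (made-up prediction values; the one-hot vector plays the role of the values "on the top" and the predictions the values "on the bottom"):

```python
import numpy as np

# Known truth (a one-hot encoded six) and the network's 10 predicted probabilities.
truth = np.array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0.])
predictions = np.array([0.01, 0.01, 0.02, 0.05, 0.01, 0.05, 0.70, 0.05, 0.05, 0.05])

# Sum of the top values times the logarithms of the bottom values, with a minus sign.
cross_entropy = -np.sum(truth * np.log(predictions))
print(cross_entropy)    # small when the network is confident about the right class
```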

  • And of course, we will tell the system

  • to minimize the distance between what it thinks is the truth

  • and what we know to be true.

  • So this we will call our error function

  • and the training will be guided by an effort

  • to minimize the error function.

  • So let's see how this works in practice.

  • So in this little visualization, I'm

  • showing you over there, my training images.

  • You see it's training, so you see these batches of 100 training

  • images being fed into the system.

  • On the white background, you have the images

  • that have been already correctly recognized by the system.

  • On a red background, images that are still missed.

  • So then, on the middle graph, you

  • see our error function, computed both on the training dataset

  • and on a set of images that we kept aside and never

  • used during training, for testing.

  • Of course, if you want to test the real world

  • performance of your neural network,

  • you have to do this on a set of images which you have never

  • seen during training.

  • So here we have 60,000 training images

  • and I set aside 10,000 test images which you see

  • in the bottom graph over there.

  • They are a bit small.

  • You see only 1,000 of them here.

  • So imagine, there are nine more screens of pictures like that.

  • But I sorted all the badly recognized ones at the top.

  • So you see all the ones that have been badly recognized

  • and below are nine screens of correctly recognized images,

  • here after 2,000 rounds of training.

  • So there is a little scale on the side here.

  • It shows you that it's already capable of recognizing

  • 92% of our images with this very simple model, just 10 neurons,

  • nothing else.

  • And that's what you get on the top graph, the accuracy graph,

  • as well.

  • That's simply the percentage of correctly recognized images,

  • both on test and training data.
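
A minimal sketch of how that accuracy number can be computed (made-up values, and only 3 images with 3 classes for brevity instead of the talk's 10):

```python
import numpy as np

# Predicted probabilities for 3 images (one line each) and their known labels.
predictions = np.array([[0.1, 0.7, 0.2],
                        [0.6, 0.3, 0.1],
                        [0.2, 0.2, 0.6]])
labels = np.array([1, 0, 2])

# An image is correctly recognized when its most probable class matches its label.
accuracy = np.mean(np.argmax(predictions, axis=1) == labels)
print(accuracy)    # 1.0 here; the talk reports about 92% for this simple model
```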

  • So what else do we have?

  • We have our weights and biases; those two diagrams are simply

  • percentiles, so they show you the spread

  • of all the weights and biases.