[MUSIC PLAYING]
MARTIN GORNER: Hello. Hi, everyone. So thank you for coming in such great numbers to this TensorFlow session. Apologies, it's quite late in the afternoon. I will need all your brains for this session, because today I want to build a neural network with you. So no, I don't need your brains to build one, no brain surgery in this session. But it is a crash course to get developers up to speed on machine learning, deep learning, and neural networks, so I need all your attention.
The dataset we will be using is a very classical one. It's this one here: hand-written digits. Academia has been working on this dataset for the past 20 years, so if you go to the website where it's hosted, you will actually see 20 years of research papers, and that's what we will do together today. We'll work through this dataset, trying to build a network that recognizes these hand-written digits, from the simplest possible network all the way to 99% accuracy. So let's start.
Just a question beforehand: who has done some work with neural networks before? Oh, wow. OK. Quite a few people. So feel free to help me. I hope this will not be too basic for you, and I hope it will at least be a good introduction to TensorFlow. But if you have never done anything with neural networks, that's fine, and I will explain everything from the start.
So this is the simplest possible neural network we can imagine to recognize our hand-written digits. The digits come as 28 by 28 pixel images, and the first thing we do is flatten all those pixels into one big vector of pixels; these will be our inputs. Now, we will use exactly 10 neurons. The neurons are the white circles. What a neuron does is always the same thing: it does a weighted sum of all of its inputs, here the pixels; it adds another constant that is called a bias, which is just an additional degree of freedom; and then it feeds this sum through an activation function. That is just a function: number in, transform, number out. We will see several of those activation functions, and the one thing they have in common in neural networks is that they are non-linear.
So why 10 neurons? Well, simply because we are classifying those digits into 10 categories: we are trying to recognize a zero, a one, a two, and so on up to nine. So what we are hoping for here is that one of those neurons will light up and tell us, with a very strong output, that it has recognized an eight here. All right. And for that, since this is a classification problem, we are going to use a very specific activation function, one that, well, researchers tell us works really well on classification problems.
It's called softmax, and it's simply a normalized exponential. So what you do is you take all those weighted sums, then you elevate them to the exponential. And once you have your 10 exponentials, you compute the norm of this vector and divide the vector by its norm, so that you get values between zero and one. And those values you will be able to interpret as probabilities: the probability of this being an eight, a one, or something else. You will be asking: which norm? Any norm, it doesn't matter; it's the length of the vector. You pick your favorite norm, there are several. Usually, for softmax, we use L1, but L2, which is the Euclidean norm, would work just as well.
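Here is a minimal sketch of that normalized exponential in NumPy (not the talk's code; the input values are made up, and the L1 normalization shown is the variant described above):

```python
import numpy as np

def softmax(logits):
    """Normalized exponential: exponentiate each value, then divide
    the vector by its norm (here L1, the sum of the exponentials),
    so the 10 outputs land between 0 and 1 and read as probabilities."""
    exps = np.exp(logits)
    return exps / np.sum(exps)

weighted_sums = np.array([1.2, -0.3, 0.5, 3.1, 0.0, -1.0, 0.2, 0.9, 2.2, -0.5])
print(softmax(weighted_sums))  # 10 values between 0 and 1, summing to 1
```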
So what does softmax actually do? You see, it's an exponential, so it's a very steeply increasing function. It will pull the data apart and increase the differences, and when you normalize the whole vector, you usually end up with one of the values being very close to one and all the other values being very close to zero. So it's a way of pulling the winner out on top without actually destroying the information. All right.
So now we need to formalize this using a matrix multiply. I will remind you of what a matrix multiply is, but we will not do it for one image; we are going to do this for a batch of 100 images at a time. So what we have here in my matrix is 100 images, one image per line. The images are flattened, all the pixels on one line. Then I take my matrix of weights. For the time being, I don't know what these weights are; they are just weights, and I am doing weighted sums with them. And I start the matrix multiplication: I do a weighted sum of all the pixels of the first image. Here it is. And then, if I continue this matrix multiply using the second column of weights, I get a weighted sum of all the pixels of the first image for the second neuron, and then for the third neuron, and the fourth, and so on. What is left is to add the biases, just an additional constant. Again, we don't know what it is for the time being, and there is one bias per neuron; that's why we have 10 biases. And now, if I continue this matrix multiply, I'm going to obtain these weighted sums for the second image, and the third image, and so on, until I have processed all my images.
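In NumPy terms, that batch of weighted sums is a single matrix multiplication; a quick sketch with the shapes implied above (28 × 28 = 784 pixels per image, random placeholder values):

```python
import numpy as np

X = np.random.rand(100, 784)    # 100 images, one flattened 28x28 image per line
W = np.random.randn(784, 10)    # one column of weights per neuron
weighted_sums = X @ W           # shape (100, 10): 10 weighted sums per image
```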
I would like to write this as a simple formula there, but you see there is a problem: X times W, you know, is a matrix of 10 columns by 100 images, and I have only 10 biases. I can't simply add them together. Well, never mind, we will redefine addition, and it's OK if everybody accepts it. And actually, people have already accepted it. It's called a broadcasting add, and that's the way you do additions in NumPy, for instance, which is the numerical library for Python. The way a broadcasting add works is that if you are trying to add two things whose dimensions don't match, so you can't do the addition, you replicate the small one as many times as needed to make the sizes match, and then you do the addition. That's exactly what we need to do here. We have only those 10 biases, and it's the same biases on all the lines, so we just need to replicate this bias vector on all the lines, and that's exactly what this generalized, broadcasting add does. So we will just write it as a plus.
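A small illustration of that broadcasting add in NumPy (placeholder values, just to show the replication across lines):

```python
import numpy as np

sums = np.zeros((100, 10))      # X times W: 100 images, 10 weighted sums each
b = np.arange(10, dtype=float)  # one bias per neuron
out = sums + b                  # b is replicated onto each of the 100 lines
print(out.shape)                # (100, 10)
print(out[0], out[99])          # every line got the same 10 biases added
```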
And this is where I wanted to get to. I want you to remember this as the formula describing one layer in a neural network. So let's go through this again. In X, we have a batch of images, 100 images, all the pixels of one image on one line. In W, we have all of our weights for the 10 neurons, all the weights in the system. X times W gives all of our weighted sums. We add the biases, and then we feed this through our activation function, in this case softmax, and the way it works is line by line. Line by line, we take the 10 values, elevate them to the exponential, and normalize the line; next line, 10 values, elevate them to the exponential, normalize the line, and so on. So what we get in the output is, for each image, 10 values which look like probabilities and which are our predictions.
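Putting the whole layer together, Y = softmax(X·W + b) can be sketched in NumPy like this (shapes assumed: 784 pixels in, 10 neurons out; the row-wise softmax mirrors the line-by-line description above):

```python
import numpy as np

def softmax_rows(z):
    # exponentiate, then normalize each line so it sums to 1
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

X = np.random.rand(100, 784)   # batch of 100 flattened images
W = np.zeros((784, 10))        # weights, still unknown at this point
b = np.zeros(10)               # one bias per neuron
Y = softmax_rows(X @ W + b)    # predictions: 100 lines of 10 probabilities each
```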
So, of course, we still don't know what those weights and biases are, and that's where the trick is in neural networks: we are going to train this neural network to actually figure out the correct weights and biases by itself.
Well, this is how we write this in TensorFlow. You see, not very different. OK. TensorFlow has this nn library for neural networks, which has all sorts of very useful functions for neural networks, for example softmax and so on.
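A minimal TensorFlow 1.x-style sketch of that line (the variable names and initial values here are assumptions, not the exact code on the slide):

```python
import tensorflow as tf

X = tf.placeholder(tf.float32, [None, 784])   # batch of flattened 28x28 images
W = tf.Variable(tf.zeros([784, 10]))          # weights, to be learned
b = tf.Variable(tf.zeros([10]))               # one bias per neuron

Y = tf.nn.softmax(tf.matmul(X, W) + b)        # the one-layer formula from above
```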
So let's go train. When you train, you've got images, but you know what those images are. So you initialize your weights and biases at random values, and your network will output some probabilities. Since you know what this image is, you can tell it: it's not this, it should be that. That target is called a one-hot encoded vector. It's a not very fancy way of encoding numbers. Basically, here are our numbers from zero to nine: we encode them as 10 bits, all at zero, and just one of them is a one, at the index of the number we want to encode. Here, a six. Why? Well, because then it's in the same shape as our predictions, and we can compute a distance between those two.
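For instance, the one-hot encoding of the digit 6, sketched quickly in NumPy (an illustration, not the dataset's loader):

```python
import numpy as np

def one_hot(digit, num_classes=10):
    v = np.zeros(num_classes)
    v[digit] = 1.0               # a single 1 at the index of the digit
    return v

print(one_hot(6))  # [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
```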
So again, there are many ways of computing distances. The usual Euclidean distance, the sum of differences squared, would work, not a problem. But scientists tell us that for classification problems, this distance, the cross-entropy, works slightly better, so we'll use this one. How does it work? It's the sum, across the vector, of the values on the top multiplied by the logarithms of the values on the bottom, and then we add a minus sign, because all the values on the bottom are less than one, so all the logarithms are negative. So that's the distance.
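Written out, with the known one-hot labels on top and the predicted probabilities on the bottom, that distance is cross-entropy = -sum_i(label_i * log(prediction_i)). A minimal sketch with made-up prediction values:

```python
import numpy as np

def cross_entropy(labels_one_hot, predictions):
    # minus the sum of (known label) times log(predicted probability)
    return -np.sum(labels_one_hot * np.log(predictions))

labels = np.array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0.])  # one-hot "six"
preds = np.full(10, 0.1)                                      # a uniform guess
print(cross_entropy(labels, preds))                           # about 2.30; smaller is better
```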
And of course, we will tell the system to minimize the distance between what it thinks is the truth and what we know to be true. So this we will call our error function, and the training will be guided by an effort to minimize the error function.
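In TensorFlow 1.x style, minimizing that error function is only a couple of lines. This sketch continues the earlier snippet (Y is the network's output from before; the optimizer and the 0.003 learning rate are assumptions on my part, not values quoted in the transcript):

```python
import tensorflow as tf

Y_ = tf.placeholder(tf.float32, [None, 10])   # the known one-hot labels

# the error function: cross-entropy between the labels Y_ and the predictions Y
cross_entropy = -tf.reduce_sum(Y_ * tf.log(Y))

# ask TensorFlow to nudge the weights and biases to make it smaller
train_step = tf.train.GradientDescentOptimizer(0.003).minimize(cross_entropy)
```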
So let's see how this works in practice. In this little visualization, I'm showing you, over there, my training images. You see it's training, so you see these batches of 100 training images being fed into the system. On a white background, you have the images that have already been correctly recognized by the system; on a red background, images that are still missed.

Then, on the middle graph, you see our error function, computed both on the training dataset and on a set of images which we kept aside and never used during training, for testing. Of course, if you want to test the real-world performance of your neural network, you have to do this on a set of images which you have never seen during training. So here we have 60,000 training images, and I set aside 10,000 test images, which you see in the bottom graph over there. They are a bit small, and you see only 1,000 of them here, so imagine there are nine more screens of pictures like that. But I sorted all the badly recognized ones to the top, so you see all the ones that have been badly recognized, and below are nine screens of correctly recognized images, here after 2,000 rounds of training. There is a little scale on the side here; it shows you that it's already capable of recognizing 92% of our images with this very simple model, just 10 neurons, nothing else. And that's what you get on the top graph, the accuracy graph, as well: that's simply the percentage of correctly recognized images, both on test and training data.
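That accuracy number could be computed, for example, like this (a NumPy sketch, assuming the predictions and one-hot labels from the earlier examples):

```python
import numpy as np

def accuracy(predictions, labels_one_hot):
    # an image counts as correct when its most probable class matches the label
    return np.mean(np.argmax(predictions, axis=1) == np.argmax(labels_one_hot, axis=1))
```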
So what else do we have? We have our weights and biases. Those two diagrams are simply percentiles, so they show you the spread of all the weights and biases.