[MUSIC PLAYING]
MARTIN GORNER: So thank you for filling the house. I'm really impressed that TensorFlow is getting so much attention, but I think the technology deserves it, so I'm happy about that.

Yesterday, we built a neural network for recognizing handwritten digits, and we went through dense neural networks and convolutional neural networks. Today, I want to build another kind of network with you: a recurrent neural network. So let's go.
A couple of reminders from yesterday. You remember, we started with this initial one-layer neural network. And I need you to remember the formula for one layer of a neural network, because we will be using it. We were reading in pixels, but it works for any input vector, of course.
And you remember, we said that the neurons compute weighted sums of all of their inputs, add a bias, and feed that through some activation function; here, softmax. It can be another function. It's just a function, value in, value out, but usually, in neural networks, it's a non-linear function. And we wrote this one-layer neural network, using a matrix multiply as we've seen, as this formula.
And you remember, we did this not for just one image: we actually wrote this formula to process 100 images at a time. So in X, we have a whole batch of 100 images. Then X times W gives all the weighted sums for our neurons. We add the biases, feed that through our activation function, and we obtain a batch of predictions. In our case, since we were classifying handwritten digits, those predictions are 10 numbers, which are the probabilities of a digit being a 0, a 1, a 2, a 3, and so on. And we obtained those probabilities as the outputs of our 10 neurons. OK, so whenever you see this formula (and we will see it a lot today), you think: one layer of a neural network. OK?
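A minimal NumPy sketch of that formula, Y = softmax(X.W + b), for a batch of 100 flattened 28x28 images (the variable names and random initialization are my own, not from the slides):

    import numpy as np

    def softmax(z):
        # subtract the row-wise max for numerical stability
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    X = np.random.rand(100, 784)        # a batch of 100 flattened 28x28 images
    W = np.random.randn(784, 10) * 0.1  # one weight per (pixel, neuron) pair
    b = np.zeros(10)                    # one bias per neuron

    Y = softmax(X @ W + b)              # batch of predictions, shape (100, 10)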
And then also, what I need you to remember is that once we get our output from the neural network, the way we train it is that we give it examples. It produces some prediction, and then we say, no, no, no, that's not what we wanted, this is what you should predict. We give it the correct answer. And to do that, we have to encode this correct answer in a similar format. It's a very basic type of encoding called one-hot encoding. Here, if we have 10 categories, specifying one answer category means encoding it as a vector of 10 numbers, all 0s except for a single 1, and the index of the 1 designates the category. Here, it's at index 6, which means that the correct answer was a 6, OK?
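As a quick sketch (my own illustration, not from the slides), one-hot encoding a label in NumPy looks like this:

    import numpy as np

    def one_hot(label, num_classes=10):
        v = np.zeros(num_classes)
        v[label] = 1.0          # single 1 at the index of the category
        return v

    print(one_hot(6))           # [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]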
So in this shape, it becomes possible to compute a distance between what the network predicts and what we know to be true. And that distance, we call it our error function, or sometimes the loss function. That's what we use to guide our training. So during training, we give the network an example, and it produces an output. We say, no, no, no, that's not what we wanted. We compute the distance between what the network says and what we know to be true, and from that distance, we derive the gradient. Then we follow the gradient, and that modifies the weights and biases. And that's what training is about, OK?
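The distance typically used here, as in yesterday's session, is cross-entropy; a minimal sketch of it, with my own naming:

    import numpy as np

    def cross_entropy(predictions, one_hot_labels):
        # distance between predicted probabilities and the known truth;
        # only the probability at the index of the correct answer contributes
        return -np.sum(one_hot_labels * np.log(predictions + 1e-9))

    p = np.array([0.1, 0.05, 0.05, 0.1, 0.1, 0.1, 0.3, 0.1, 0.05, 0.05])
    y = np.zeros(10); y[6] = 1.0       # correct answer: 6
    print(cross_entropy(p, y))         # smaller when p[6] is closer to 1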
So now, let's look at this neural network. It should look familiar. It has a vector as an input, and it has a middle layer using the hyperbolic tangent as an activation function. We saw the sigmoid last time, which is, let's say, the simplest possible function going from 0 to 1 continuously. The hyperbolic tangent is the simplest possible function going from minus 1 to 1 continuously. It's just a shifted and rescaled sigmoid.
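To make "shifted and rescaled" precise: writing sigma(x) = 1 / (1 + e^(-x)) for the sigmoid, we have

    tanh(x) = 2 * sigma(2x) - 1

so the sigmoid's output range (0, 1) gets stretched and shifted to (-1, 1).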
And then there is a second layer, which is a softmax layer, so that we can read something out. But the specific thing here is that the output of this intermediate green layer is actually fed back into the inputs at the next time step. So the real input into one cell of a recurrent neural network is the input concatenated with the output of the inner layer from the previous step. And we call this the state. So it's actually a state machine. You feed it inputs, it produces outputs; but you also feed it a state, and it produces an output state, which you feed back in at the next time step. And that's why it's called a recurrent neural network: because it is applied to time sequences. At each step in time, you feed in one input vector, concatenated with the previous state, and turn the crank once. That produces a result, as well as some outputs from this middle layer, which you feed back in as the input state for the next x in your sequence.
So it can be represented like this. Here, I'm showing you the neurons inside, but this is basically the API of one recurrent neural network cell. It has an input. It has an output, which you then usually feed into a softmax layer to make sense of it, to produce predictions; I mean, probabilities. And it has an input state and produces an output state, which you loop back in as the input state. That's the state machine part. OK?
And the one adjustable parameter here is the internal size of this middle layer. Usually, your input is whatever your input is, and your output is whatever you're trying to predict, so those are not adjustable parameters.
So here it is written in equations. Again, the input is the real input at time t, Xt, concatenated with the previous state, Ht-1. Then we feed that through one layer of a neural network; you should recognize the formula, here using the hyperbolic tangent as an activation function, so I put it over there. And this produces an output, Ht, which is used both as our new state and as the output that will be fed into the softmax layer to actually produce a vector of probabilities between 0 and 1. OK?
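Written out, the two equations are Ht = tanh([Xt, Ht-1] . W + b) for the cell, and Yt = softmax(Ht . V + c) for the readout. A minimal NumPy sketch of one time step (the weight names W, b, V, c and the state size of 512 are my own choices):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def rnn_cell_step(x_t, h_prev, W, b, V, c):
        xh = np.concatenate([x_t, h_prev])   # input concatenated with previous state
        h_t = np.tanh(xh @ W + b)            # one layer, tanh activation: new state
        y_t = softmax(h_t @ V + c)           # softmax readout layer: probabilities
        return y_t, h_t                      # output, and state to feed back in

    # alphabet of 100 characters, internal state size 512
    W = np.random.randn(100 + 512, 512) * 0.01
    b = np.zeros(512)
    V = np.random.randn(512, 100) * 0.01
    c = np.zeros(100)

    h = np.zeros(512)                        # initial state
    x = np.zeros(100); x[ord('a')] = 1.0     # one-hot encoded character 'a'
    y, h = rnn_cell_step(x, h, W, b, V, c)   # turn the crank once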
So now, how do we train this thing? Typically, this is used for natural language processing, for instance. So a typical input will be a character. And a character will be, again, one-hot encoded, into, let's say, a 100-component vector, since we will be using here an alphabet of 100 possible characters. So one character is encoded into a 100-element vector: 99 0s and a 1 at the ASCII index of that character.
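For instance, a short sketch of that character encoding (restricting codes to the 100-character alphabet by wrapping is my simplification):

    import numpy as np

    def encode_char(ch, alphabet_size=100):
        v = np.zeros(alphabet_size)
        v[ord(ch) % alphabet_size] = 1.0   # 1 at the (wrapped) ASCII index
        return v

    text = "hello"
    X = np.stack([encode_char(c) for c in text])  # one-hot sequence, shape (5, 100)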
So we put a character in, we propagate through the neural network, we propagate through the softmax layer, and we obtain a character as an output. If that is not the character we wanted, well, we compute the difference between what it said and what we know to be true, and use backpropagation to fix the weights and biases inside of the cell to get better results. That is very classical training. But what if the result was wrong, not because the weights and biases inside of the cell were wrong, but because the input, the state input H minus 1, was wrong? That input is a constant in this problem; there's not much you can do about it. So here, we are stuck.
What is the solution? Well, the solution is to replicate the cell. So this is a replica, reusing the exact same weights, OK? Now, let's say the output, Y1, is bad. I say, no, that's not it, this is the correct output. I'm training, so I know what the correct output is supposed to be. From that, I compute the error, then the gradient, and I backpropagate. I can fix the weights and biases in the cells to get a better output. And if needed, I can fix the weights and biases to get a better H0, the state flowing between those two stages of the cell. So now, I have a handle on at least H0. I still have no handle at all on H minus 1; if it is H minus 1 that was wrong, there is nothing I can do.
So that is how you train recurrent neural networks. You have to unroll them across a certain length and give them a sequence of, let's say, characters. It will produce a sequence of output characters. If you are training, you know what the answer was supposed to be, so you use that to compute your error function, do your backpropagation, and adjust the weights and biases. And it will work, to a certain extent. To what extent?
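A sketch of that unrolling with the TensorFlow 1.x RNN API (the sizes and names here are my own choices):

    import tensorflow as tf

    SEQLEN, ALPHASIZE, INTERNALSIZE = 30, 100, 512

    # a batch of sequences of one-hot encoded characters
    X = tf.placeholder(tf.float32, [None, SEQLEN, ALPHASIZE])

    cell = tf.nn.rnn_cell.BasicRNNCell(INTERNALSIZE)
    # dynamic_rnn unrolls the cell across the SEQLEN time steps,
    # reusing the same weights at every step
    H, last_state = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)
    # H: the outputs at every step, shape [batch, SEQLEN, INTERNALSIZE]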
Oh yes, one small detail: if you want to go deep, you can actually stack the cells. Why does that work? Well, with two cells stacked like this, the API remains the same. There is still an input; there is still an output that feeds into a softmax layer; and there is still an input state and an output state that you feed back in. It's just that the output state is now slightly bigger. So that's how you go deep in a recurrent neural network: you stack those cells, and that becomes a new cell, which still has an input, an output, an input state, and an output state. And of course, you unroll it.
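Continuing the previous sketch, that stacking is one line in TensorFlow 1.x (three layers is my arbitrary choice):

    cells = [tf.nn.rnn_cell.BasicRNNCell(INTERNALSIZE) for _ in range(3)]
    # MultiRNNCell combines the stacked cells into one cell with the same API:
    # one input, one output, and a bigger state made of all the layer states
    multicell = tf.nn.rnn_cell.MultiRNNCell(cells)
    H, last_state = tf.nn.dynamic_rnn(multicell, X, dtype=tf.float32)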
So let's take this sentence. Let's say now, we use not characters but words as our inputs. Of course, there are technical problems doing that. A typical alphabet is maybe 100 characters; a typical vocabulary is around 30,000 words. So here, one-hot encoding gives you a vector of 30,000 components for each word, which is a bit heavy. I won't go into the details of how you handle that; it's called embedding. Let's just assume that we solved this problem.
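For reference, a sketch of that trick in TensorFlow 1.x, continuing the snippets above: instead of a 30,000-component one-hot vector, each word ID selects a row of a small, learned embedding matrix (the sizes are my own):

    VOCABSIZE, EMBEDSIZE = 30000, 128

    word_ids = tf.placeholder(tf.int32, [None, SEQLEN])   # integer word IDs
    embeddings = tf.Variable(
        tf.random_uniform([VOCABSIZE, EMBEDSIZE], -1.0, 1.0))
    # each word becomes a dense 128-component vector instead of a
    # 30,000-component one-hot vector
    word_vectors = tf.nn.embedding_lookup(embeddings, word_ids)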
So we have this sentence: "Michael was born in Paris," blah, blah, blah, and at the end, we have "his mother tongue is". If we train this model on English text, it will probably have figured out that "his mother tongue is" is followed by the name of a language, like English, German, or Russian. Here, however, the correct answer is French, because this guy was born in France. So let's imagine that we have unrolled this neural network over, let's say, 10 words here. At the end, we have "his mother tongue is", and we are asking the network to predict what the next word is. And the network says: English. So now, what we want to do is put on the outputs a sentence that says, blah, blah, blah, "his mother tongue is French", and do backpropagation. But for this to work, the beginning of the sentence, the part where the information