
  • [MUSIC PLAYING]

  • MARTIN GORNER: So thank you for filling the house.

  • I'm really impressed that TensorFlow

  • is getting so much attention.

  • But I think the technology deserves it.

  • So I'm happy about that.

  • So today, or yesterday, we built a neural network

  • for recognizing handwritten digits.

  • And we went through dense neural networks

  • and convolutional neural networks.

  • And today, I want to build with you another kind of network,

  • a recurrent neural network.

  • So let's go.

  • A couple of reminders from yesterday.

  • You remember, we started with this initial one layer neural

  • network.

  • And I need you to remember the formula

  • for one layer of a neural network

  • because we will be using it.

  • So we were reading in pixels.

  • But it works for any input vector, of course.

  • And you remember, we said that the neurons

  • do weighted sums of all of their inputs, they add a bias,

  • and they feed that through some activation function.

  • Here, softmax.

  • It can be another function.

  • It's just a function, value in, value out.

  • But usually, in neural networks, it's a non-linear function.

  • And we wrote this one layer neural network

  • using a matrix multiply, blah, blah, blah,

  • we've seen all that, as this formula.

  • And you remember, we did this not for just one image,

  • but we actually wrote this formula

  • processing 100 images at a time.

  • So in x, we have a batch of images,

  • a whole batch, 100 images.

  • And then x times w are all the weighted sums for our neurons.

  • We add the biases.

  • We feed that through our activation function.

  • And we obtain a batch of predictions.

  • So in our case, since we were classifying handwritten digits,

  • those predictions are 10 numbers,

  • which are the probabilities of these digits being a 0, a 1, 2,

  • 3, and so on.

  • And so we obtained those probabilities

  • as the outputs of our 10 neurons.

  • OK, so whenever you see this formula,

  • we will see it again a lot today,

  • you think, one layer of a neural network.

  • OK?
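
A minimal sketch of that one-layer formula, Y = softmax(X·W + b), applied to a batch of 100 flattened images (NumPy, illustrative values only, not the speaker's slide code):

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# A batch of 100 images, each flattened to 28*28 = 784 pixels.
X = np.random.rand(100, 784)

# One layer of 10 neurons: a 784x10 weight matrix plus 10 biases.
W = np.random.randn(784, 10) * 0.01
b = np.zeros(10)

# Weighted sums, plus biases, through the activation function:
Y = softmax(X @ W + b)   # shape (100, 10): one probability vector per image
```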

  • And then also, what I need you to remember

  • is that once we get our output from the neural network,

  • the way we train it is that we give it examples.

  • It produces some prediction.

  • And then we say, no, no, no, no, no, that's not what we wanted.

  • This is what you should predict.

  • We give it the correct answer.

  • And to do that, we have to encode this correct answer

  • in a similar format.

  • So it's called-- it's a very basic type of encoding.

  • It's called one hot encoding.

  • And basically here, if we have 10 categories,

  • specifying one answer category means encoding it

  • as a vector of 10 zeros with a single 1 somewhere,

  • and the index of that 1, here at index 6,

  • means that the correct answer was a 6, OK?

  • So in this shape, it becomes possible to compute

  • a distance between what the network predicts

  • and what we know to be true.

  • And that distance, we call that our error function.

  • Or sometimes, it's called the loss function.

  • That's what we use to guide our training.

  • So during training, we give it an example, it produces an output.

  • We say, no, no, no, that's not what we wanted.

  • Compute the distance between what the network says

  • and what we know to be true.

  • And from that distance, we derive the gradient.

  • And then, we follow the gradient.

  • And that modifies the weights and biases.

  • And that's what training is about, OK?
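
Continuing that sketch (still NumPy, illustrative only): one-hot encoding the known answer and computing the cross-entropy distance whose gradient drives the weight and bias updates:

```python
import numpy as np

# One-hot encoding of the correct answer "6" among 10 categories.
label = np.zeros(10)
label[6] = 1.0

# Cross-entropy "distance" between predictions and known answers,
# averaged over the batch; this is the loss that guides training.
def cross_entropy(Y, labels):
    return -np.mean(np.sum(labels * np.log(Y + 1e-9), axis=1))

# Conceptually, one training step then follows the gradient downhill:
#   W -= learning_rate * dLoss/dW
#   b -= learning_rate * dLoss/db
```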

  • So now, let's look at this neural network.

  • So it should look familiar.

  • It's a vector as an input.

  • It has a middle layer using the hyperbolic tangent

  • as an activation function.

  • So we've seen the sigmoid last time,

  • which is let's say the simplest possible function going

  • from 0 to 1 continuously.

  • The hyperbolic tangent is the simplest possible function

  • going from minus 1 to 1 continuously.

  • It's just a sigmoid, rescaled and shifted.

  • And then, a second layer, which is a softmax

  • layer so that we read something out.

  • But the specific thing here is that the output

  • of this intermediate green layer is actually

  • fed back into the inputs at the next time step.

  • So the real input into one cell of a recurrent neural network

  • is the input concatenated to the output of the inner layer

  • from the previous step.

  • And we call this the state.

  • So it's actually a state machine.

  • You feed it inputs.

  • It produces outputs.

  • But you also feed it a state.

  • It produces an output state, which you feed back in,

  • in the next time step.

  • And that's why it's called a recurrent neural network:

  • it is applied to time sequences.

  • At each step in time, you feed in one input vector,

  • concatenate it to the previous state.

  • Turn the crank once.

  • That produces an output from this middle layer, which is both

  • a result and the new state that you feed back

  • in as the input state for the next x input, which

  • you have in your sequence.

  • So it can be represented.

  • I'm showing you the neurons inside.

  • But here, it's basically the API of one recurrent neural network

  • cell.

  • It has an input.

  • It has an output, which you then usually feed

  • into a softmax layer to make sense of it,

  • to produce predictions.

  • I mean, probabilities.

  • And it has an input state that produces an output

  • state that you loop back in as the input state.

  • That's the state machine part.

  • OK?

  • So now, well yes, and the parameter for this

  • is the internal size of this middle layer.

  • That's what is adjustable.

  • Usually, your input is whatever your input is.

  • And your output is whatever you're trying to predict.

  • So those are not adjustable parameters.

  • So here it is written in equations.

  • Again, the input is the real input

  • at time t concatenated to the previous state.

  • Then, we feed that through.

  • Here, you should recognize one layer of a neural network.

  • You should recognize this formula

  • using the hyperbolic tangent as an activation function.

  • So I put it over there.

  • And this produces an output, Ht, which

  • is both used as our new state and as the output that

  • will be fed into the softmax layer

  • to actually produce a vector of probabilities between 0 and 1.

  • OK?
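
A minimal sketch of one step of this cell (NumPy, illustrative shapes), following the equations just described: concatenate the input with the previous state, apply one tanh layer to get the new state Ht, and read Ht out through a softmax layer:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_cell_step(x_t, h_prev, W, b, W_y, b_y):
    # Real input of the cell: x_t concatenated with the previous state.
    xh = np.concatenate([x_t, h_prev])
    # One tanh layer: the new state H_t, which is also the cell's output.
    h_t = np.tanh(xh @ W + b)
    # The softmax readout layer turns H_t into a vector of probabilities.
    y_t = softmax(h_t @ W_y + b_y)
    return y_t, h_t

# Illustrative sizes: a 100-component one-hot input, internal size 512.
x_t = np.zeros(100); x_t[65] = 1.0             # e.g. the character 'A'
h_prev = np.zeros(512)                         # initial state
W, b = np.random.randn(612, 512) * 0.01, np.zeros(512)      # 612 = 100 + 512
W_y, b_y = np.random.randn(512, 100) * 0.01, np.zeros(100)  # back to 100 chars
y_t, h_t = rnn_cell_step(x_t, h_prev, W, b, W_y, b_y)
```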

  • So now, how do we train this thing?

  • So typically, this is used for natural language processing,

  • for instance.

  • So a typical input will be a character.

  • And a character will be, again, one-hot encoded into, let's say,

  • a 100-component vector, since we will be using here

  • an alphabet of 100 possible characters.

  • So one character is encoded into a 100 element vector, so 99 0s

  • and a 1 at the ASCII index of that character.
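
A tiny sketch of that encoding (the clamp for characters outside the 100-symbol alphabet is an assumption, not from the talk):

```python
import numpy as np

ALPHASIZE = 100  # alphabet of 100 possible characters

def char_to_onehot(c):
    # 99 zeros and a single 1 at the (clamped) ASCII index of the character.
    v = np.zeros(ALPHASIZE)
    v[min(ord(c), ALPHASIZE - 1)] = 1.0
    return v
```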

  • So we put a character in.

  • We propagate through the neural network.

  • We propagate through the softmax layer.

  • We obtain a character as an output.

  • If that is not the character we wanted, well,

  • we compute the difference between what it said

  • and what we know to be true and use backpropagation to fix

  • the weights and biases inside of the cell to get better results.

  • That is very classical training.

  • But what if the result was wrong not

  • because the weights and biases inside of the cell were wrong,

  • but because the input, the state input, H minus 1, was wrong?

  • That input is a constant in this problem.

  • There's not much you can do about it.

  • So here, we are stuck.

  • What is the solution?

  • Well, the solution is to replicate the cell.

  • And now, so this is a replica.

  • It's reusing the exact same weights, OK?

  • Now, let's say the output, Y1, is bad.

  • I say, no, that's not it.

  • This was the correct output.

  • I'm training, so I know what the correct output is supposed to be.

  • So from that, I compute the error, the gradient.

  • I backpropagate.

  • I can fix the weights and biases in the cells

  • to get a better output.

  • And if needed, I can fix the weights and biases

  • to get a better H0, the state flowing between those two

  • stages of the cell.

  • So now, I have a handle on at least H0.

  • I still have no handle at all on H minus 1.

  • If it is H minus 1 that was wrong,

  • there is nothing I can do.

  • So that is how you train recurrent neural networks.

  • You have to unroll them across a certain length

  • and give them a sequence of, let's say, characters.

  • It will produce a sequence of output characters.

  • If you are training, you know what

  • the answer was supposed to be.

  • So you use that to compute your error function,

  • do your backpropagation, adjust the weights and biases.

  • And it will work to a certain extent.
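
A hedged sketch of that unrolling (NumPy, illustrative only): the same weights are reused at every step, the per-step cross-entropies are summed, and backpropagating that summed loss reaches every intermediate state H0..Ht, but never the initial state that was fed in:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def unrolled_forward(xs, targets, h_init, W, b, W_y, b_y):
    """xs, targets: lists of one-hot vectors; the same W, b at every step."""
    h, loss = h_init, 0.0
    for x_t, target in zip(xs, targets):
        h = np.tanh(np.concatenate([x_t, h]) @ W + b)   # new state = output
        y = softmax(h @ W_y + b_y)                      # softmax readout
        loss += -np.sum(target * np.log(y + 1e-9))      # per-step cross-entropy
    # Backpropagating this loss (handled by the framework in practice)
    # adjusts W, b, W_y, b_y and, through them, the states H0..Ht,
    # but h_init stays a constant input: there is no handle on it.
    return loss, h
```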

  • To what extent?

  • Oh, yes, small detail, if you want to go deep,

  • you can actually stack the cells.

  • Why?

  • Well, two cells stacked like this, the API remains the same.

  • It's still an input.

  • It's still an output that feeds into a softmax layer.

  • And there is still an input state and an output state

  • that you feed back in.

  • It's just that the output state now is slightly bigger.

  • So that's how you go deep in the recurrent neural network.

  • You stack those cells, and that becomes

  • a new cell which still has input, output,

  • input state, output state.

  • And of course, you unroll it.
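
One way that stacking and unrolling can look in code, assuming the TensorFlow 1.x API that was current at the time of this talk (sizes are illustrative, not the speaker's exact code):

```python
import tensorflow as tf  # assuming a TensorFlow 1.x API

ALPHASIZE, INTERNALSIZE, NLAYERS = 100, 512, 3   # illustrative sizes

# X: a batch of sequences of one-hot characters,
# shape [batch, sequence_length, ALPHASIZE].
X = tf.placeholder(tf.float32, [None, None, ALPHASIZE])

# Stack basic tanh cells; the stack still behaves like one cell with an
# input, an output, and a (bigger) input/output state.
cells = [tf.nn.rnn_cell.BasicRNNCell(INTERNALSIZE) for _ in range(NLAYERS)]
multicell = tf.nn.rnn_cell.MultiRNNCell(cells)

# Unroll the stacked cell along the sequence dimension.
outputs, state = tf.nn.dynamic_rnn(multicell, X, dtype=tf.float32)
```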

  • So let's take this sentence.

  • Let's say now, we use not characters but words

  • as our inputs.

  • Of course, there are technical problems doing that.

  • A typical alphabet is maybe 100 characters.

  • A typical vocabulary is around 30,000 words.

  • So here, one-hot encoding gives you

  • a vector of 30,000 components for each word.

  • It's a bit heavy.

  • I won't go into the details of how you handle that.

  • It's called embedding.

  • Whatever, let's just assume that we solved this problem.
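
A hedged sketch of that embedding idea (TensorFlow 1.x style, illustrative sizes): instead of 30,000-component one-hot vectors, each word id indexes a learned dense vector:

```python
import tensorflow as tf  # assuming a TensorFlow 1.x API

VOCABSIZE, EMBEDSIZE = 30000, 128   # illustrative sizes

# A learned lookup table: one dense 128-component vector per word.
embeddings = tf.Variable(tf.random_uniform([VOCABSIZE, EMBEDSIZE], -1.0, 1.0))

word_ids = tf.placeholder(tf.int32, [None, None])            # [batch, sequence]
word_vectors = tf.nn.embedding_lookup(embeddings, word_ids)  # [batch, seq, 128]
```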

  • So we have this sentence.

  • Michael was born in Paris, blah, blah, blah, blah, blah.

  • And at the end, we have "his mother tongue is."

  • So if we train this model on English,

  • probably it will have figured out

  • that "his mother tongue is" is followed

  • by the name of a language, like English,

  • German, or Russian, something.

  • Here, however, the correct answer

  • is French because this guy was born in France.

  • So let's imagine that we feed in,

  • we have unrolled this neural network over,

  • let's say, 30 words.

  • Or let's say 10 words here, 10 words.

  • And at the end, we have "his mother tongue is,"

  • and we are asking the network to predict what is the next word.

  • And the network says, English.

  • So now, what we want to do is put on the outputs

  • a sentence that says, blah, blah, blah,

  • his mother tongue is French.

  • And do backpropagation.

  • But for this to work, the beginning

  • of the sentence, the part where the information