
  • MARTIN GORNER: So thank you for filling the house.

  • I'm really impressed that TensorFlow

  • is getting so much attention.

  • But I think the technology deserves it.

  • So I'm happy about that.

  • So today-- or rather, yesterday-- we built a neural network

  • for recognizing handwritten digits.

  • And we went through dense neural networks

  • and convolutional neural networks.

  • And today, I want to build with you another kind of network,

  • a recurrent neural network.

  • So let's go.

  • A couple of reminders from yesterday.

  • You remember, we started with this initial one layer neural

  • network.

  • And I need you to remember the formula

  • for one layer of a neural network

  • because we will be using it.

  • So we were reading in pixels.

  • But it works for any input vector, of course.

  • And you remember, we said that the neurons

  • do weighted sums of all of their inputs, they add a bias,

  • and they feed that through some activation function.

  • Here, softmax.

  • It can be another function.

  • It's just a function, value in, value out.

  • But usually, in neural networks, it's a non-linear function.

  • And we wrote this one layer neural network

  • using a matrix multiply, blah, blah, blah,

  • we've seen all that, as this formula.

  • And you remember, we did this not for just one image,

  • but we actually wrote this formula

  • processing 100 images at a time.

  • So in x, we have a batch of images,

  • a whole batch, 100 images.

  • And then x times w are all the weighted sums for our neurons.

  • We add the biases.

  • We feed that through our activation function.

  • And we obtain a batch of predictions.

  • So in our case, since we were classifying handwritten digits,

  • those predictions are 10 numbers,

  • which are the probabilities of these digits being a 0, a 1, 2,

  • 3, and so on.

  • And so we obtained those probabilities

  • as the outputs of our 10 neurons.
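
As an illustration of this batched formula (a NumPy sketch, not the speaker's actual TensorFlow code; the 28x28 = 784 pixel image size is an assumption, MNIST-style):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax, applied row by row (one row per image).
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Assumed shapes: a batch of 100 images, each flattened to 784 pixels (28x28).
X = np.random.rand(100, 784)   # stands in for x, a whole batch of 100 images
W = np.zeros((784, 10))        # one column of weights per output neuron
b = np.zeros(10)               # one bias per neuron

# One layer of a neural network: weighted sums, plus biases, through softmax.
Y = softmax(X @ W + b)         # shape (100, 10): a batch of 10-number predictions
```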

  • OK, so whenever you see this formula,

  • we will see it again a lot today,

  • you think, one layer of a neural network.

  • OK?

  • And then also, what I need you to remember

  • is that once we get our output from the neural network,

  • the way we train it is that we give it examples.

  • It produces some prediction.

  • And then we say, no, no, no, no, no, that's not what we wanted.

  • This is what you should predict.

  • We give it the correct answer.

  • And to do that, we have to encode this correct answer

  • in a similar format.

  • So it's called-- it's a very basic type of encoding.

  • It's called one hot encoding.

  • And basically here, if we have 10 categories,

  • specifying one answer category means encoding it

  • as a vector of ten elements, all 0s except for a single 1,

  • and the index of that 1-- here, it's at index 6--

  • means that the correct answer was a 6, OK?
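
A quick sketch of that one-hot encoding (illustrative only):

```python
import numpy as np

def one_hot(label, num_classes=10):
    # A vector of all 0s with a single 1 at the index of the correct answer.
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

print(one_hot(6))   # [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.] -> "the correct answer was a 6"
```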

  • So in this shape, it becomes possible to compute

  • a distance between what the network predicts

  • and what we know to be true.

  • And that distance, we call that our error function.

  • Or sometimes, it's called the loss function.

  • That's what we use to guide our training.

  • So during training, we give it an example, and it produces an output.

  • We say, no, no, no, that's not what we wanted.

  • Compute the distance between what the network says

  • and what we know to be true.

  • And from that distance, we derive the gradient.

  • And then, we follow the gradient.

  • And that modifies the weights and biases.

  • And that's what training is about, OK?
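
To make the distance-then-gradient loop concrete, here is a minimal sketch for the one-layer softmax case. It assumes cross-entropy as the error (loss) function, which the talk only calls a distance; the (Y_pred - Y_true) expression below is the standard gradient of cross-entropy through a softmax layer.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(Y_pred, Y_true):
    # The "distance" between the predictions and the one-hot correct answers.
    return -np.mean(np.sum(Y_true * np.log(Y_pred + 1e-9), axis=1))

def training_step(X, Y_true, W, b, lr=0.003):
    Y_pred = softmax(X @ W + b)          # forward pass: one layer, as before
    # For softmax + cross-entropy, the gradient with respect to the weighted
    # sums is (Y_pred - Y_true); from it we get gradients for W and b.
    delta = (Y_pred - Y_true) / len(X)
    grad_W = X.T @ delta
    grad_b = delta.sum(axis=0)
    # Follow the gradient: this is what modifies the weights and biases.
    return W - lr * grad_W, b - lr * grad_b
```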

  • So now, let's look at this neural network.

  • So it should look familiar.

  • It's a vector as an input.

  • It has a middle layer using the hyperbolic tangent

  • as an activation function.

  • So we've seen the sigmoid last time,

  • which is let's say the simplest possible function going

  • from 0 to 1 continuously.

  • The hyperbolic tangent is the simplest possible function

  • going from minus 1 to 1 continuously.

  • It's just a sigmoid, rescaled and shifted: tanh(x) = 2*sigmoid(2x) - 1.

  • And then, a second layer, which is a softmax

  • layer so that we read something out.

  • But the special thing here is that the output

  • of this intermediate green layer is actually

  • fed back into the inputs at the next time step.

  • So the real input into one cell of a recurrent neural network

  • is the input concatenated to the output of the inner layer

  • from the previous step.

  • And we call this the state.

  • So it's actually a state machine.

  • You feed it inputs.

  • It produces outputs.

  • But you also feed it a state.

  • It produces an output state, which you feed back in,

  • in the next time step.

  • And that's why it's called a recurrent neural network:

  • because it is applied to time sequences.

  • At each step in time, you feed in one input vector,

  • concatenate it to the previous state.

  • Turn the crank once.

  • That produces some outputs from this middle layer, as well as

  • a result, and you feed that back as the new input

  • state for the next x input, which

  • you have in your sequence.

  • So it can be represented.

  • I'm showing you the neurons inside.

  • But here, it's basically the API of one recurrent neural network

  • cell.

  • It has an input.

  • It has an output, which you then usually feed

  • into a softmax layer to make sense of it,

  • to produce predictions.

  • I mean, probabilities.

  • And it has an input state that produces an output

  • state that you loop back in as the input state.

  • That's the state machine part.

  • OK?
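
In code, that state-machine API could look roughly like this; `rnn_cell` here is any hypothetical callable taking (input, state) and returning (output, new state), such as the one sketched after the equations below:

```python
def run_sequence(rnn_cell, inputs, initial_state):
    # At each time step, feed in one input vector plus the previous state;
    # the cell returns an output (to go through a softmax layer) and a new
    # state, which is looped back in as the input state of the next step.
    state = initial_state
    outputs = []
    for x_t in inputs:
        y_t, state = rnn_cell(x_t, state)
        outputs.append(y_t)
    return outputs, state
```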

  • So now, well yes, and the parameter for this

  • is the internal size of this middle layer.

  • That's what is adjustable.

  • Usually, your input is whatever your input is.

  • And your output is whatever you're trying to predict.

  • So those are not adjustable parameters.

  • So here it is written in equations.

  • Again, the input is the real input

  • at time t concatenated to the previous state.

  • Then, we feed that through.

  • Here, you should recognize one layer of a neural network.

  • You should recognize this formula

  • using the hyperbolic tangent as an activation function.

  • So I put it over there.

  • And this produces an output, Ht, which

  • is both used as our new state and as the output that

  • will be fed into the softmax layer

  • to actually produce a vector of probabilities between 0 and 1.

  • OK?
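
A minimal NumPy sketch of those equations, i.e. Ht = tanh([Xt, Ht-1] . W + b) followed by a softmax layer (the names W, b, Ws, bs are mine, not from the slides):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_cell(x_t, h_prev, W, b, Ws, bs):
    # The real input at time t, concatenated to the previous state.
    xh = np.concatenate([x_t, h_prev])
    # One layer of a neural network, with tanh as the activation function.
    h_t = np.tanh(xh @ W + b)
    # h_t is both the new state and the output fed into the softmax layer,
    # which produces a vector of probabilities between 0 and 1.
    y_t = softmax(h_t @ Ws + bs)
    return y_t, h_t
```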

  • So now, how do we train this thing?

  • So typically, this is used for natural language processing,

  • for instance.

  • So a typical input will be a character.

  • And a character will be, again, one-hot encoded into, let's say,

  • a 100-component vector if we are using--

  • we will be using here an alphabet

  • of 100 possible characters.

  • So one character is encoded into a 100-element vector, so 99 0s

  • and a 1 at the ASCII index of that character.
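
A sketch of that character encoding, with the 100-character alphabet from the talk (the modulo is only a safety net I added for characters outside that range):

```python
import numpy as np

ALPHABET_SIZE = 100   # an alphabet of 100 possible characters

def encode_char(c):
    # 99 0s and a single 1 at the ASCII index of the character.
    v = np.zeros(ALPHABET_SIZE)
    v[ord(c) % ALPHABET_SIZE] = 1.0
    return v

print(encode_char('a').argmax())   # 97, the ASCII code of 'a'
```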

  • So we put a character in.

  • We propagate through the neural network.

  • We propagate through the softmax layer.

  • We obtain a character as an output.

  • If that is not the character we wanted, well,

  • we compute the difference between what it said

  • and what we know to be true and use backpropagation to fix

  • the weights and biases inside of the cell to get better results.

  • That is very classical training.

  • But what if the result was wrong not

  • because the weights and biases inside of the cell were wrong,

  • but because the input, the state input, H t minus 1, was wrong?

  • That input is a constant in this problem.

  • There's not much you can do about it.

  • So here, we are stuck.

  • What is the solution?

  • Well, the solution is to replicate the cell.

  • And now, so this is a replica.
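
To hint at where this replication goes (a hedged sketch, not the speaker's slide): the same cell, with the same weights, is applied at every step of the sequence, and keeping the intermediate outputs and states is what lets the error be propagated back through the replicas to the steps that produced a faulty state.

```python
def unroll(rnn_cell, xs, h0):
    # "Replicate the cell": apply the same cell (same weights) at each time
    # step, threading the state through. The stored outputs and states allow
    # training to push errors back through earlier steps, not just the last one.
    outputs, states = [], [h0]
    for x_t in xs:
        y_t, h_t = rnn_cell(x_t, states[-1])
        outputs.append(y_t)
        states.append(h_t)
    return outputs, states
```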