[MUSIC PLAYING]

MARTIN GORNER: So thank you for filling the house. I'm really impressed that TensorFlow is getting so much attention. But I think the technology deserves it. So I'm happy about that. Yesterday, we built a neural network for recognizing handwritten digits, and we went through dense neural networks and convolutional neural networks. Today, I want to build with you another kind of network, a recurrent neural network. So let's go.

A couple of reminders from yesterday. You remember, we started with this initial one-layer neural network. And I need you to remember the formula for one layer of a neural network, because we will be using it. We were reading in pixels, but it works for any input vector, of course. And you remember, we said that the neurons do weighted sums of all of their inputs, they add a bias, and they feed that through some activation function. Here, softmax. It can be another function. It's just a function, value in, value out. But usually, in neural networks, it's a non-linear function. And we wrote this one-layer neural network using a matrix multiply, blah, blah, blah, we've seen all that, as this formula.

And you remember, we did this not for just one image. We actually wrote this formula processing 100 images at a time. So in X, we have a whole batch of 100 images. And then X times W are all the weighted sums for our neurons. We add the biases, we feed that through our activation function, and we obtain a batch of predictions. In our case, since we were classifying handwritten digits, those predictions are 10 numbers, which are the probabilities of this digit being a 0, a 1, a 2, a 3, and so on. And we obtained those probabilities as the outputs of our 10 neurons. OK, so whenever you see this formula, and we will see it again a lot today, you think: one layer of a neural network. OK?
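The batched one-layer formula described above, Y = softmax(X·W + b), can be sketched in NumPy. This is an illustrative sketch, not the talk's actual demo code; the shapes (100 flattened 28x28 images, 10 classes) are the ones mentioned in the talk, and the zero-initialized weights are just a placeholder:

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# A batch of 100 flattened 28x28 images (784 pixels each).
X = np.random.rand(100, 784)
W = np.zeros((784, 10))   # one column of weights per neuron
b = np.zeros(10)          # one bias per neuron

# X @ W computes all the weighted sums for the whole batch at once;
# each row of Y is 10 probabilities, one per digit class.
Y = softmax(X @ W + b)
```

With the placeholder zero weights, every row of Y is the uniform distribution; training is what turns these into useful probabilities.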
And then also, what I need you to remember is that once we get our output from the neural network, the way we train it is that we give it examples. It produces some prediction, and then we say, no, no, no, that's not what we wanted, this is what you should predict. We give it the correct answer. And to do that, we have to encode this correct answer in a similar format. It's a very basic type of encoding called one-hot encoding. If we have 10 categories, specifying one answer category means encoding it as ten 0s with just one 1 somewhere. Here, the 1 is at index 6, which means that the correct answer was a 6, OK?

In this shape, it becomes possible to compute a distance between what the network predicts and what we know to be true. And that distance, we call that our error function, or sometimes it's called the loss function. That's what we use to guide our training. So during training, we give it an example, it produces an output, and we say, no, no, no, that's not what we wanted. We compute the distance between what the network says and what we know to be true, and from that distance, we derive the gradient. And then we follow the gradient, and that modifies the weights and biases. That's what training is about, OK?

So now, let's look at this neural network. It should look familiar. It has a vector as an input. It has a middle layer using the hyperbolic tangent as an activation function. We've seen the sigmoid last time, which is, let's say, the simplest possible function going from 0 to 1 continuously. The hyperbolic tangent is the simplest possible function going from minus 1 to 1 continuously. It's just a shifted sigmoid. And then a second layer, which is a softmax layer, so that we can read something out. But the specificity here is that the output of this intermediate green layer is actually fed back into the inputs at the next time step.
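The one-hot encoding and the "distance" used to guide training can be sketched as follows. The cross-entropy loss is the standard choice paired with softmax outputs (the talk just says "distance"), so treat this as one reasonable instantiation, not a transcription of the demo code:

```python
import numpy as np

def one_hot(label, num_classes=10):
    # All zeros, with a single 1 at the index of the correct class.
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

def cross_entropy(predicted, target):
    # The "distance" between the predicted probabilities and the
    # one-hot truth; the small epsilon avoids log(0).
    return -np.sum(target * np.log(predicted + 1e-12))

target = one_hot(6)           # the correct answer was a 6
predicted = np.full(10, 0.1)  # a clueless, uniform prediction
loss = cross_entropy(predicted, target)
```

The gradient of this loss with respect to the weights and biases is what the optimizer follows during training.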
So the real input into one cell of a recurrent neural network is the input concatenated to the output of the inner layer from the previous step. And we call this the state. So it's actually a state machine. You feed it inputs, it produces outputs. But you also feed it a state, and it produces an output state, which you feed back in at the next time step. And that's why it's called a recurrent neural network: because it is applied to time sequences. At each step in time, you feed in one input vector, concatenated to the previous state. You turn the crank once. That produces some outputs from this middle layer, as well as a result. And you feed that back as the new input state for the next x input, which you have in your sequence.

So it can be represented like this. I'm showing you the neurons inside, but here, it's basically the API of one recurrent neural network cell. It has an input. It has an output, which you then usually feed into a softmax layer to make sense of it, to produce predictions. I mean, probabilities. And it has an input state that produces an output state that you loop back in as the input state. That's the state machine part. OK?

And the parameter for this is the internal size of this middle layer. That's what is adjustable. Usually, your input is whatever your input is, and your output is whatever you're trying to predict, so those are not adjustable parameters.

So here it is written in equations. Again, the input is the real input at time t concatenated to the previous state. Then we feed that through, and here you should recognize one layer of a neural network, this formula using the hyperbolic tangent as an activation function. So I put it over there. And this produces an output, Ht, which is both used as our new state and as the output that will be fed into the softmax layer to actually produce a vector of probabilities between 0 and 1. OK? So now, how do we train this thing?
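The cell equation just described, H_t = tanh([X_t, H_{t-1}]·W + b), can be sketched as a single step of a state machine. The sizes here (input 100, internal size 128) are illustrative choices, and the random weight initialization stands in for trained parameters:

```python
import numpy as np

def rnn_cell_step(x_t, h_prev, W, b):
    # The real input is x_t concatenated to the previous state.
    xh = np.concatenate([x_t, h_prev])
    # One layer of a neural network, with tanh as the activation.
    h_t = np.tanh(xh @ W + b)
    return h_t  # serves both as the output and as the new state

input_size, state_size = 100, 128
W = np.random.randn(input_size + state_size, state_size) * 0.01
b = np.zeros(state_size)

h = np.zeros(state_size)            # initial state, all zeros
for x in np.eye(input_size)[:5]:    # a short sequence of one-hot inputs
    h = rnn_cell_step(x, h, W, b)   # turn the crank once per time step
```

In a real model, each h would additionally be fed through a softmax layer to produce the step's probability vector; that layer is omitted here to keep the state-machine loop in focus.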
So typically, this is used for natural language processing, for instance. So a typical input will be a character. And a character will be, again, one-hot encoded, into, let's say, a 100-component vector, if we are using, as we will here, an alphabet of 100 possible characters. So one character is encoded into a 100-element vector: 99 0s and a 1 at the ASCII index of that character.

So we put a character in. We propagate through the neural network, we propagate through the softmax layer, and we obtain a character as an output. If that is not the character we wanted, well, we compute the difference between what it said and what we know to be true, and we use backpropagation to fix the weights and biases inside of the cell to get better results. That is very classical training.

But what if the result was wrong, not because the weights and biases inside of the cell were wrong, but because the input, the state input Ht minus 1, was wrong? That input is a constant in this problem. There's not much you can do about it. So here, we are stuck. What is the solution? Well, the solution is to replicate the cell. And now, so this is a replica.
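Replicating the cell over the sequence can be sketched like this: the same weights are applied at every character, and the state is threaded from each step into the next. The 100-character alphabet matches the talk; the modulo-based character-to-index mapping is a simplification I'm assuming for the sketch (a real implementation would use a proper alphabet table):

```python
import numpy as np

ALPHASIZE = 100  # the 100-character alphabet from the talk

def encode_char(c):
    # 99 zeros and a single 1 at the character's index.
    # Assumed simplification: fold the ASCII code into 0..99.
    v = np.zeros(ALPHASIZE)
    v[ord(c) % ALPHASIZE] = 1.0
    return v

def unroll(chars, W, b, state_size):
    # Replicate the SAME cell (shared W and b) once per character,
    # feeding each step's output state into the next step.
    h = np.zeros(state_size)
    states = []
    for c in chars:
        xh = np.concatenate([encode_char(c), h])
        h = np.tanh(xh @ W + b)
        states.append(h)
    return states

state_size = 64
W = np.random.randn(ALPHASIZE + state_size, state_size) * 0.01
b = np.zeros(state_size)
states = unroll("hello", W, b, state_size)
```

Because every replica shares the same weights, the error at the last character can now be backpropagated through the earlier replicas, so a bad state is traced back to the weights that produced it instead of being a constant input.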