MARTIN GORNER: So thank you for filling the house. I'm really impressed that TensorFlow is getting so much attention. But I think the technology deserves it. So I'm happy about that. So yesterday, we built a neural network for recognizing handwritten digits. And we went through dense neural networks and convolutional neural networks. And today, I want to build with you another kind of network, a recurrent neural network. So let's go.

A couple of reminders from yesterday. You remember, we started with this initial one-layer neural network. And I need you to remember the formula for one layer of a neural network, because we will be using it. So we were reading in pixels, but it works for any input vector, of course. And you remember, we said that the neurons do weighted sums of all of their inputs, they add a bias, and they feed that through some activation function. Here, softmax. It can be another function. It's just a function, value in, value out. But usually, in neural networks, it's a non-linear function.

And we wrote this one-layer neural network using a matrix multiply, we've seen all that, as this formula. And you remember, we did this not for just one image, we actually wrote this formula processing 100 images at a time. So in X, we have a whole batch of 100 images. And then X times W gives all the weighted sums for our neurons. We add the biases, we feed that through our activation function, and we obtain a batch of predictions. So in our case, since we were classifying handwritten digits, those predictions are 10 numbers, which are the probabilities of the digit being a 0, a 1, a 2, a 3, and so on. And so we obtained those probabilities as the outputs of our 10 neurons. OK, so whenever you see this formula, and we will see it a lot today, you think: one layer of a neural network. OK?
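The one-layer formula the speaker is recalling, Y = softmax(X·W + b) over a batch of 100 flattened images, can be sketched in NumPy. The sizes (784 pixels, 10 classes) come from the digits example; the zero-initialized weights are just a placeholder.

```python
import numpy as np

def softmax(z):
    # subtract the row-wise max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# a batch of 100 flattened 28x28 images, 10 output classes
X = np.random.rand(100, 784)   # batch of input vectors
W = np.zeros((784, 10))        # weights: one column per neuron
b = np.zeros(10)               # biases
Y = softmax(X @ W + b)         # batch of 100 predictions
# each row of Y holds 10 probabilities that sum to 1
```

With zero weights every row of Y is uniform; training would move W and b so the right class gets the highest probability.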
And then also, what I need you to remember is that once we get our output from the neural network, the way we train it is that we give it examples. It produces some prediction, and then we say, no, no, no, that's not what we wanted, this is what you should predict. We give it the correct answer. And to do that, we have to encode this correct answer in a similar format. It's a very basic type of encoding called one-hot encoding. And basically here, if we have 10 categories, specifying one answer category means encoding it as 10 numbers: nine 0s and just one 1 somewhere in the middle. And the index of the 1, here at index 6, means that the correct answer was a 6, OK? So in this shape, it becomes possible to compute a distance between what the network predicts and what we know to be true. And that distance, we call that our error function. Or sometimes, it's called the loss function. That's what we use to guide our training.

So during training, we give it an example, it produces an output. We say, no, no, no, that's not what we wanted. We compute the distance between what the network says and what we know to be true. And from that distance, we derive the gradient. And then, we follow the gradient, and that modifies the weights and biases. And that's what training is about, OK?

So now, let's look at this neural network. It should look familiar. It has a vector as an input. It has a middle layer using the hyperbolic tangent as an activation function. We've seen the sigmoid last time, which is, let's say, the simplest possible function going from 0 to 1 continuously. The hyperbolic tangent is the simplest possible function going from minus 1 to 1 continuously. It's just a shifted sigmoid. And then, a second layer, which is a softmax layer, so that we can read something out. But the specificity here is that the output of this intermediate green layer is actually fed back into the inputs at the next time step.
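The one-hot encoding and the "distance" between prediction and truth described above can be sketched as follows. I'm using cross-entropy as the distance, which is the standard choice for a softmax output; the uniform prediction is a made-up example.

```python
import numpy as np

def one_hot(index, n_classes=10):
    # nine 0s and a single 1 at the answer's index
    v = np.zeros(n_classes)
    v[index] = 1.0
    return v

def cross_entropy(predicted, target):
    # the "distance" used as the error (loss) function;
    # target is one-hot, predicted is a probability vector
    return -np.sum(target * np.log(predicted + 1e-12))

target = one_hot(6)           # the correct answer was a 6
predicted = np.full(10, 0.1)  # a uniform, clueless prediction
loss = cross_entropy(predicted, target)
```

The gradient of this loss with respect to the weights and biases is what training follows at each step.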
So the real input into one cell of a recurrent neural network is the input concatenated to the output of the inner layer from the previous step. And we call this the state. So it's actually a state machine. You feed it inputs, it produces outputs. But you also feed it a state, and it produces an output state, which you feed back in at the next time step. And that's why it's called a recurrent neural network: it is applied on time sequences. At each step in time, you feed in one input vector, concatenated to the previous state. You turn the crank once. That produces some outputs from this middle layer, as well as a result. And you feed that back as the new input state for the next x input, which you have in your sequence.

So it can be represented like this. I'm showing you the neurons inside, but here, it's basically the API of one recurrent neural network cell. It has an input. It has an output, which you then usually feed into a softmax layer to make sense of it, to produce predictions, I mean, probabilities. And it has an input state that produces an output state that you loop back in as the input state. That's the state machine part. OK? And the parameter for this is the internal size of this middle layer. That's what is adjustable. Usually, your input is whatever your input is, and your output is whatever you're trying to predict. So those are not adjustable parameters.

So here it is written in equations. Again, the input is the real input at time t concatenated to the previous state. Then, we feed that through, and here, you should recognize one layer of a neural network. You should recognize this formula, using the hyperbolic tangent as an activation function. So I put it over there. And this produces an output, Ht, which is both used as our new state and as the output that will be fed into the softmax layer to actually produce a vector of probabilities between 0 and 1. OK? So now, how do we train this thing?
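The equation described here, Ht = tanh([Xt, Ht-1]·W + b), is just the one-layer formula applied to the concatenated input and previous state. A minimal sketch, with made-up sizes (the internal state size is the one adjustable parameter the speaker mentions):

```python
import numpy as np

def rnn_cell_step(x_t, h_prev, W, b):
    # concatenate the real input with the previous state,
    # then apply one ordinary neural-network layer with tanh
    xh = np.concatenate([x_t, h_prev])
    h_t = np.tanh(xh @ W + b)
    return h_t  # used both as output and as the next input state

input_size, state_size = 100, 512  # example sizes
W = np.random.randn(input_size + state_size, state_size) * 0.01
b = np.zeros(state_size)

h = np.zeros(state_size)   # initial state H_-1
x = np.zeros(input_size)   # one one-hot input vector would go here
h = rnn_cell_step(x, h, W, b)
```

The output h would then go through a softmax layer to become a vector of probabilities, exactly as in the dense network from yesterday.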
So typically, this is used for natural language processing, for instance. So a typical input will be a character. And a character will be, again, one-hot encoded into, let's say, a 100-component vector, since we will be using here an alphabet of 100 possible characters. So one character is encoded into a 100-element vector: 99 0s and a 1 at the ASCII index of that character. So we put a character in. We propagate through the neural network. We propagate through the softmax layer. We obtain a character as an output. If that is not the character we wanted, well, we compute the difference between what it said and what we know to be true, and use backpropagation to fix the weights and biases inside of the cell to get better results. That is very classical training.

But what if the result was wrong not because the weights and biases inside of the cell were wrong, but because the input, the state input H minus 1, was wrong? That input is a constant in this problem. There's not much you can do about it. So here, we are stuck. What is the solution? Well, the solution is to replicate the cell. So this is a replica. It's reusing the exact same weights, OK? Now, let's say the output, Y1, is bad. I say, no, that's not it. This was the correct output. I'm training, so I know what the correct output is supposed to be. So from that, I compute the error, the gradient. I backpropagate. I can fix the weights and biases in the cells to get a better output. And if needed, I can fix the weights and biases to get a better H0, the state flowing between those two stages of the cell. So now, I have a handle on at least H0. I still have no handle at all on H minus 1. If it is H minus 1 that was wrong, there is nothing I can do.

So that is how you train recurrent neural networks. You have to unroll them across a certain length and give them a sequence of, let's say, characters. It will produce a sequence of output characters.
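The unrolling just described, running the same cell (same shared weights) over every step of a sequence while threading the state through, can be sketched like this. Sequence length and sizes are illustrative placeholders.

```python
import numpy as np

def unroll(xs, h0, W, b):
    # apply the SAME cell (same W, b) at every time step,
    # feeding each output state back in as the next input state
    h, states = h0, []
    for x_t in xs:
        h = np.tanh(np.concatenate([x_t, h]) @ W + b)
        states.append(h)
    return states  # one state/output per step of the sequence

seq_len, input_size, state_size = 8, 100, 64
W = np.random.randn(input_size + state_size, state_size) * 0.01
b = np.zeros(state_size)

xs = [np.zeros(input_size) for _ in range(seq_len)]  # one-hot chars would go here
states = unroll(xs, np.zeros(state_size), W, b)
```

Backpropagating through this unrolled chain is what gives training a handle on the intermediate states like H0, though still not on the initial H minus 1.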
If you are training, you know what the answer was supposed to be. So you use that to compute your error function, do your backpropagation, and adjust the weights and biases. And it will work, to a certain extent. To what extent?

Oh, yes, small detail: if you want to go deep, you can actually stack the cells. Why? Well, with two cells stacked like this, the API remains the same. There is still an input. There is still an output that feeds into a softmax layer. And there is still an input state and an output state that you feed back in. It's just that the output state now is slightly bigger. So that's how you go deep in a recurrent neural network. You stack those cells, and that becomes a new cell which still has an input, an output, an input state, and an output state. And of course, you unroll it.

So let's take this sentence. Let's say now, we use not characters but words as our inputs. Of course, there are technical problems doing that. A typical alphabet is maybe 100 characters. A typical vocabulary is around 30,000 words. So here, one-hot encoding gives you a vector of 30,000 components for each word. It's a bit heavy. I won't go into the details of how you handle that. It's called embedding. Let's just assume that we solved this problem. So we have this sentence: Michael was born in Paris, blah, blah, blah. And at the end, we have "his mother tongue is". So if we train this model on English, probably it will have figured out that "his mother tongue is" is followed by the name of a language, like English, German, or Russian, something. Here, however, the correct answer is French, because this guy was born in France. So let's imagine that we have unrolled this neural network over, let's say, 10 words. And at the end, we have "his mother tongue is", and we are asking the network to predict what the next word is. And the network says, English.
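The stacking described above, where the output of one cell becomes the input of the next and the two states together form the slightly bigger state of the stack, can be sketched as follows. All sizes here are illustrative assumptions.

```python
import numpy as np

def cell(x, h, W, b):
    # one recurrent cell: tanh of one layer over [input, state]
    return np.tanh(np.concatenate([x, h]) @ W + b)

input_size, size1, size2 = 100, 64, 64
W1 = np.random.randn(input_size + size1, size1) * 0.01
W2 = np.random.randn(size1 + size2, size2) * 0.01
b1, b2 = np.zeros(size1), np.zeros(size2)

x = np.zeros(input_size)
h1, h2 = np.zeros(size1), np.zeros(size2)  # the stack's combined state
h1 = cell(x, h1, W1, b1)    # first cell sees the real input
h2 = cell(h1, h2, W2, b2)   # second cell sees the first cell's output
# (h1, h2) together form the slightly bigger output state;
# h2 is what feeds the softmax layer
```

From the outside, the pair still looks like one cell: input in, output out, state in, state out, so it unrolls exactly the same way.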
So now, what we want to do is put on the outputs a sentence that says, blah, blah, blah, his mother tongue is French, and do backpropagation. But for this to work, the beginning of the sentence, the part where the information