
  • [MUSIC PLAYING]

  • MARTIN GORNER: So thank you for filling the house.

  • I'm really impressed that TensorFlow

  • is getting so much attention.

  • But I think the technology deserves it.

  • So I'm happy about that.

  • So today, or yesterday, we built a neural network

  • for recognizing handwritten digits.

  • And we went through dense neural networks

  • and convolutional neural networks.

  • And today, I want to build with you another kind of network,

  • a recurrent neural network.

  • So let's go.

  • A couple of reminders from yesterday.

  • You remember, we started with this initial one layer neural

  • network.

  • And I need you to remember the formula

  • for one layer of a neural network

  • because we will be using it.

  • So we were reading in pixels.

  • But it works for any input vector, of course.

  • And you remember, we said that the neurons

  • do weighted sums of all of their inputs, they add in bias,

  • and they feed that through some activation function.

  • Here, softmax.

  • It can be another function.

  • It's just a function, value in, value out.

  • But usually, in neural networks, it's a non-linear function.

  • And we wrote this one layer neural network

  • using a matrix multiply, blah, blah, blah,

  • we've seen all that, as this formula.

  • And you remember, we did this not for just one image,

  • but we actually wrote this formula

  • processing 100 images at a time.

  • So in x, we have a batch of images,

  • a whole batch, 100 images.

  • And then x times w are all the weighted sums for our neurons.

  • We add the biases.

  • We feed that through our activation function.

  • And we obtain a batch of predictions.

  • So in our case, since we were classifying handwritten digits,

  • those predictions are 10 numbers,

  • which are the probabilities of these digits being a 0, a 1, 2,

  • 3, and so on.

  • And so we obtained those probabilities

  • as the outputs of our 10 neurons.

  • OK, so whenever you see this formula,

  • we will see it again a lot today,

  • you think, one layer of a neural network.

  • OK?
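
A minimal NumPy sketch of that one-layer formula, Y = softmax(X·W + b), applied to a batch of 100 flattened 28x28 images (the shapes and random values are illustrative assumptions, not the talk's actual code):

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax, applied row by row (one row per image)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

X = np.random.rand(100, 784)        # a batch of 100 flattened 28x28 images
W = np.random.randn(784, 10) * 0.1  # one weight per (pixel, neuron) pair
b = np.zeros(10)                    # one bias per neuron

Y = softmax(X @ W + b)              # shape (100, 10): one probability vector per image
```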

  • And then also, what I need you to remember

  • is that once we get our output from the neural network,

  • the way we train it is that we give it examples.

  • It produces some prediction.

  • And then we say, no, no, no, no, no, that's not what we wanted.

  • This is what you should predict.

  • We give it the correct answer.

  • And to do that, we have to encode this correct answer

  • in a similar format.

  • So it's called-- it's a very basic type of encoding.

  • It's called one hot encoding.

  • And basically here, if we have 10 categories,

  • specifying one answer category means encoding it as a vector of 10 components,

  • all 0s except for a single 1, and the index of that 1--

  • here, it's at index 6--

  • means that the correct answer was a 6, OK?

  • So in this shape, it becomes possible to compute

  • a distance between what the network predicts

  • and what we know to be true.

  • And that distance, we call that our error function.

  • Or sometimes, it's called the loss function.

  • That's what we use to guide our training.

  • So during training, we give it an example, produces an output.

  • We say, no, no, no, that's not what we wanted.

  • Compute the distance between what the network says

  • and what we know to be true.

  • And from that distance, we derive the gradient.

  • And then, we follow the gradient.

  • And that modifies the weights and biases.

  • And that's what training is about, OK?
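
As a concrete sketch of that one-hot encoding and the distance (cross-entropy) computed from it, in NumPy, with made-up prediction numbers:

```python
import numpy as np

label = 6
one_hot = np.zeros(10)
one_hot[label] = 1.0                  # [0 0 0 0 0 0 1 0 0 0]: "the correct answer is a 6"

predictions = np.full(10, 0.05)
predictions[3] = 0.55                 # the network wrongly favors "3"

# cross-entropy distance between the prediction and the known truth
loss = -np.sum(one_hot * np.log(predictions))
print(loss)                           # large when the 1 lands on a low predicted probability
```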

  • So now, let's look at this neural network.

  • So it should look familiar.

  • It's a vector as an input.

  • It has a middle layer using the hyperbolic tangent

  • as an activation function.

  • So we've seen the sigmoid last time,

  • which is let's say the simplest possible function going

  • from 0 to 1 continuously.

  • The hyperbolic tangent is the simplest possible function

  • going from minus 1 to 1 continuously.

  • It's just a sigmoid shifted.

  • And then, a second layer, which is a softmax

  • layer so that we read something out.

  • But the specificity is here that the output

  • of this intermediate green layer is actually

  • fed back in the next time step in the inputs.

  • So the real input into one cell of a recurrent neural network

  • is the input concatenated to the output of the inner layer

  • from the previous step.

  • And we call this the state.

  • So it's actually a state machine.

  • You feed it inputs.

  • It produces outputs.

  • But you also feed it a state.

  • It produces an output state, which you feed back in,

  • in the next time step.

  • And that's why it's called a recurrent neural network,

  • it's because it is applied on time sequences.

  • At each step in time, you feed in one input vector,

  • concatenate it to the previous state.

  • Turn the crank once.

  • That produces some outputs from this middle layer, as well as

  • a result. And you feed that back as the new input

  • state for the next x input, which

  • you have in your sequence.

  • So it can be represented.

  • I'm showing you the neurons inside.

  • But here, it's basically the API of one recurrent neural network

  • cell.

  • It has an input.

  • It has an output, which you then usually feed

  • into a softmax layer to make sense of it,

  • to produce predictions.

  • I mean, probabilities.

  • And it has an input state that produces an output

  • state that you loop back in as the input state.

  • That's the state machine part.

  • OK?

  • So now, well yes, and the parameter for this

  • is the internal size of this middle layer.

  • That's what is adjustable.

  • Usually, your input is whatever your input is.

  • And your output is whatever you're trying to predict.

  • So those are not adjustable parameters.

  • So here it is written in equations.

  • Again, the input is the real input

  • at time t concatenated to the previous state.

  • Then, we feed that through.

  • Here, you should recognize one layer of a neural network.

  • You should recognize this formula

  • using the hyperbolic tangent as an activation function.

  • So I put it over there.

  • And this produces an output, Ht, which

  • is both used as our new state and as the output that

  • will be fed into the softmax layer

  • to actually produce a vector of probabilities between 0 and 1.

  • OK?
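
A NumPy sketch of those equations for a single cell, Ht = tanh([Xt, Ht-1]·W + b) and Yt = softmax(Ht·V + bv) (the parameter names W, b, V, bv and the toy sizes are assumptions for illustration):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def rnn_step(x_t, h_prev, W, b, V, bv):
    xh = np.concatenate([x_t, h_prev])   # real input concatenated to the previous state
    h_t = np.tanh(xh @ W + b)            # one tanh layer: the new state, also the output
    y_t = softmax(h_t @ V + bv)          # softmax readout layer on top
    return y_t, h_t

alpha, n = 100, 128                      # 100-component one-hot inputs, internal size n
W, b = np.random.randn(alpha + n, n) * 0.1, np.zeros(n)
V, bv = np.random.randn(n, alpha) * 0.1, np.zeros(alpha)

h = np.zeros(n)                          # initial state
for code in [72, 69, 76, 76, 79]:        # a toy character sequence
    y, h = rnn_step(np.eye(alpha)[code], h, W, b, V, bv)   # the state loops back in
```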

  • So now, how do we train this thing?

  • So typically, this is used for natural language processing,

  • for instance.

  • So a typical input will be a character.

  • And a character will be, again, one-hot encoded into, let's say,

  • a 100-component vector if we are using--

  • we will be using here an alphabet

  • of 100 possible characters.

  • So one character is encoded into a 100-element vector, so 99 0s

  • and a 1 at the ASCII index of that character.

  • So we put a character in.

  • We propagate through the neural networks.

  • We propagate through the softmax layer.

  • We obtain a character as an output.

  • If that is not the character we wanted, well,

  • we compute the difference between what it said

  • and what we know to be true and use backpropagation to fix

  • the weights and biases inside of the cell to get better results.

  • That is very classical training.

  • But what if the result was wrong not

  • because the weights and biases inside of the cell were wrong,

  • but because the input, the state input, H minus 1, was wrong?

  • That input is a constant in this problem.

  • There's not much you can do about it.

  • So here, we are stuck.

  • What is the solution?

  • Well, the solution is to replicate the cell.

  • And now, so this is a replica.

  • It's reusing the exact same weights, OK?

  • Now, let's say the output, Y1, is bad.

  • I say, no, that's not it.

  • This was the correct output I'm training.

  • So I know what the correct output is supposed to be.

  • So from that, I compute the error, the gradient.

  • I backpropagate.

  • I can fix the weights and biases in the cells

  • to get a better output.

  • And if needed, I can fix the weights and biases

  • to get a better H0, the state flowing between those two

  • stages of the cell.

  • So now, I have a handle on at least H0.

  • I still have no handle at all on H minus 1.

  • If it is H minus 1 that was wrong,

  • there is nothing I can do.

  • So that is how you train recurrent neural networks.

  • You have to unroll them across a certain length

  • and give them a sequence of, let's say, characters.

  • It will produce a sequence of output characters.

  • If you are training, you know what

  • the answer was supposed to be.

  • So you use that to compute your error function,

  • do your backpropagation, adjust the weights and biases.

  • And it will work to a certain extent.

  • To what extent?

  • Oh, yes, small detail, if you want to go deep,

  • you can actually stack the cells.

  • Why?

  • Well, two cells stacked like this, the API remains the same.

  • It's still an input.

  • It's still an output that feeds into a softmax layer.

  • And there is still an input state and an output state

  • that you feedback in.

  • It's just that the output state now is slightly bigger.

  • So that's how you go deep in the recurrent neural network.

  • You stack those cells, and that becomes

  • a new cell which still has input, output,

  • input state, output state.

  • And of course, you unroll it.

  • So let's take this sentence.

  • Let's say now, we use not characters but words

  • as our inputs.

  • Of course, there are technical problems doing that.

  • A typical alphabet is maybe 100 characters.

  • A typical vocabulary is around 30,000 words.

  • So here, one-hot encoding gives you

  • a vector of 30,000 components for each word.

  • It's a bit heavy.

  • I won't go into the details of how you handle that.

  • It's called embedding.

  • Whatever, let's just assume that we solved this problem.

  • So we have this sentence.

  • Michael was born in Paris, blah, blah, blah, blah, blah.

  • And at the end, we have his mother tongue is.

  • So if we train this model on English,

  • probably it will have figured out

  • that "his mother tongue is" is followed

  • by the name of a language, like English,

  • German, or Russian, something.

  • Here, however, the correct answer

  • is French because this guy was born in France.

  • So let's imagine that we feed in,

  • we have unrolled this neural network over,

  • let's say, 30 words.

  • Or let's say 10 words here, 10 words.

  • And at the end, we have his mother tongue is,

  • and we are asking the network to predict what is the next word.

  • And the network says, English.

  • So now, what we want to do is put on the outputs

  • a sentence that says, blah, blah, blah,

  • his mother tongue is French.

  • And do retro propagation.

  • But for this to work, the beginning

  • of the sentence, the part where the information

  • about Paris and where he's born is,

  • has to be part of that example.

  • And that example is longer than 10 words,

  • which is our unroll size.

  • There is simply no way whatsoever

  • of putting that correct example plus correct output

  • into a network that we unrolled over only 10 words

  • because the distance is more than 10.

  • And that's a fundamental limitation.

  • If you want to capture this information, this behavior

  • that, if he was born in France, probably his mother language

  • is French, you will have to unroll

  • this network over a long enough sequence

  • to be able to input this full example into it.

  • And if you do that, you will probably unroll it here over,

  • how many, 50 words?

  • Something like that.

  • If you do that, the problem is that you end up

  • with a very deep neural network.

  • Yesterday, we've seen neural networks of five layers.

  • The big ones, like Inception and so on,

  • are 40, 50, 60, 70 layers.

  • You see here, we have a toy example.

  • And we already see that we should

  • be going to 50 or 100 layers just to solve this.

  • So in recurrent neural networks, you always

  • end up using very deep neural networks.

  • And when I say deep, it's because the state signal has

  • to go through all those cells.

  • And remember, in each cell, the state signal

  • is concatenated to the input which

  • goes through a neural network layer,

  • produces a new state, which goes to the next cell, that

  • is concatenated to the input.

  • Goes to another neural network layer.

  • So from here to the end, we traverse at least one

  • neural network layer per cell.

  • That's how wide and deep.

  • Deep neural networks have a technical problem.

  • They tend not to converge when you train them.

  • I won't go into the mathematical details.

  • It's called the vanishing gradient problem.

  • Basically, your gradient becomes 0.

  • And since you use your gradient to go forward,

  • that's a bit of a problem.

  • So a solution was invented.

  • I won't go into the mathematical explanations

  • of why this solution works.

  • I just want you to understand how it works.

  • So would you prefer an explanation using the arrow

  • soup of a diagram on the left?

  • Or the incomprehensible equations on the right?

  • Which one do you prefer?

  • AUDIENCE: Arrows.

  • MARTIN GORNER: Arrows?

  • I'm a developer.

  • And those equations look a little bit like code.

  • And I do code.

  • Sorry.

  • But on the arrows, you see at least one thing.

  • So I'll do some hand-waving mathematics again.

  • You see that the state is actually split into two.

  • You have the H state and the C state.

  • And the C line there is actually configured in such a way

  • that the network can decide to persist information

  • on it, to leave it unchanged from iteration to iteration.

  • And that is somehow what explains

  • that even if you line up many of those,

  • since it has the possibility of leaving some part of the state

  • unchanged, it goes around those vanishing gradient problems.

  • End of hand-waving mathematics.

  • So let's do it.

  • Let's see how it works in practice.

  • And actually, it's based on a concept of gates.

  • So again, we concatenate the real input to the state

  • from the previous step.

  • And we compute three, you recognize the formulas,

  • neural network layers.

  • The sigma is for the sigmoid activation function.

  • So the sigma outputs values between 0 and 1.

  • And we call those gates because we will actually

  • be multiplying these numbers to another vector to gate it.

  • You know, if you multiply something

  • by a very small value, there's not much that goes through.

  • If you multiply something by something that is close to 1,

  • almost all of the information goes through.

  • So that's how we will be using them.

  • Now, our input becomes, well, we have to size adapt our input.

  • I put on the side the sizes of all the vectors

  • we are working with.

  • That's just to tell you that there is nothing to see there.

  • Inside of the cell, everything is of size n.

  • That's the parameter that you decide

  • as the size of your cell, OK?

  • But our inputs, they are what they are.

  • So we first need one neural network layer

  • to adapt the size of our inputs to size n.

  • So that becomes our new input.

  • And now, the C line, the way you read this,

  • this is a kind of memory.

  • So the new state of the memory is

  • the old state of the memory without what

  • we chose to forget.

  • We multiply by this forget gate.

  • This is a series of numbers between 0 and 1.

  • Plus what we chose to remember from our new input.

  • That's the way to read it.

  • So we multiply our new input by the update gate.

  • Again, numbers between 0 and 1 that

  • shows which part of the information

  • we want to retain from this input into our internal memory.

  • And then, our new state is simply the memory--

  • the hyperbolic tangent here, that's

  • not a neural network layer.

  • That's just a size adaptation to put it between minus 1 and 1.

  • So it's basically the memory cell

  • multiplied by the result gate.

  • So here, we choose what part of our internal memory

  • we want to expose to the outside as a result.

  • So that's the physical interpretation

  • of these equations.

  • We have those three gates.

  • We size adapt our input.

  • And then, the new memory is the old memory minus

  • what we want to forget plus what we want

  • to remember from the input.

  • And the result is this memory cell modulo

  • what we want to actually expose as an output at that step.

  • OK?
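
A NumPy sketch of one LSTM step as just described, with the three sigmoid gates and the C memory line (the parameter names are assumptions; the weights would normally be learned):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h, c, Wf, bf, Wu, bu, Wr, br, Wc, bc):
    xh = np.concatenate([x, h])      # input concatenated to the previous H state
    f = sigmoid(xh @ Wf + bf)        # forget gate: values between 0 and 1
    u = sigmoid(xh @ Wu + bu)        # update gate
    r = sigmoid(xh @ Wr + br)        # result gate
    x_hat = np.tanh(xh @ Wc + bc)    # size-adapted input
    c = f * c + u * x_hat            # memory: old memory without what we forget, plus new input
    h = r * np.tanh(c)               # expose part of the memory as the output / new H state
    return h, c
```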

  • And now, this HT will actually become part of the new state,

  • and also drive the softmax layer if we

  • add a softmax layer, which is represented here

  • by this yellow circle.

  • We usually represent the softmax layer as external to a cell.

  • So this is called an LSTM.

  • And this was invented specifically

  • to make recurrent neural networks work

  • and to solve this depth problem that, if you

  • are unrolling over a large sequence,

  • they tended not to converge.

  • You will have to believe me on the mathematics with this.

  • They converge.

  • But you will have to--

  • I'm sure someone noticed that this choice of equations

  • and this choice of arrows was somehow arbitrary.

  • I mean, why point them here and not there?

  • Many combinations exist.

  • Lots of different variations of those RNN cells

  • have been devised.

  • And someone published a paper, a recap paper,

  • where he tested all of them and found them to do

  • all exactly the same thing.

  • So in the end, the one we use is called the GRU.

  • And I won't go into the details.

  • It's basically a cheaper LSTM.

  • Here are the equations.

  • Not very different.

  • Same API.

  • But only two gates instead of three gates.

  • And each gate has weights and biases.

  • So we save a part of our computation cycles

  • by not computing that third set of weights and biases.

  • OK, so we will use the GRU.
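
And the GRU equations in the same sketch style (two gates, so one less set of weights and biases; the parameter names are again assumptions):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x, h, Wz, bz, Wr, br, Wc, bc):
    xh = np.concatenate([x, h])
    z = sigmoid(xh @ Wz + bz)                               # update gate
    r = sigmoid(xh @ Wr + br)                               # reset gate
    h_hat = np.tanh(np.concatenate([x, r * h]) @ Wc + bc)   # candidate state
    return (1.0 - z) * h + z * h_hat                        # blend old state and candidate
```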

  • And now, let's implement a neural network

  • that does a language model.

  • So we will be training on sequences of characters.

  • And when I say language model, it's actually a network

  • that we will use to predict--

  • we will train it to predict what the next character is.

  • Like here, St. Joh, I will teach it

  • to produce the same sequence shifted by one.

  • So actually, I will teach it to understand

  • that the next character should be an "n" because this

  • is St. John.

  • So how do we do that?

  • In TensorFlow, now I'm using a higher level API of TensorFlow

  • than what I had been using yesterday.

  • I just call GRUCell.

  • That creates a GRU cell.

  • And I call this higher level, because you've

  • seen this GRU cell has actually a couple of neural network

  • layers inside.

  • It has two gates.

  • That's at least two layers.

  • So it has a host of weights and biases

  • which are actually defined in the background when

  • I call this.

  • That's why it's a higher level API.

  • It does its own weights and bias declarations in the background.

  • Now, I said we want to go deep.

  • So let's stack this cell three high.

  • That's how we do deep recurrent neural networks.

  • There is a TensorFlow call for that.

  • It's called MultiRNNCell.

  • Give it a cell.

  • You say how many times you want to stack it.

  • And that gives you another cell.

  • Because we have seen already that these three stacked

  • cells actually have the same API as one cell.

  • So you can use it as a new cell.

  • And now, we need to unroll this.

  • For that, we call in TensorFlow

  • this dynamic RNN function, which is a bit of magic.

  • And that's what will unroll this sequence.

  • So how many times?

  • You don't see it in the parameters

  • because it's actually specified in the shape of the input

  • tensor, x.

  • If this input tensor has eight or let's

  • say 30 characters in it, it will be

  • unrolled over a sequence of 30 characters.

  • And actually, the little part of magic,

  • really it's magic, we will not be using it here,

  • but this dynamic RNN, what it can do also,

  • remember that we will be training this on batches,

  • as always.

  • We always train on batches.

  • So in this case, all my batches will

  • be sequences of the same size.

  • That's the case in my model.

  • In other models, I might not have

  • sequences of the same size.

  • Dynamic RNN can handle that.

  • If you pass it a batch of sequences,

  • even if they are not of the same size,

  • alongside that you pass the actual sizes,

  • and it will, for each sentence in the batch,

  • unroll your network the correct number of times.

  • And then, also pass the output from the correct stage.

  • It's super helpful.

  • Will not be using it here because all of our sequences

  • have the same size.

  • But that is super helpful.
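
Roughly what those calls look like in TensorFlow 1.x-style code (the placeholder shapes and sizes here are assumptions chosen to match the rest of the talk, not a verbatim copy of the slide):

```python
import tensorflow as tf  # TensorFlow 1.x-style API, as used in the talk

ALPHASIZE, CELLSIZE, NLAYERS = 98, 512, 3
X = tf.placeholder(tf.uint8, [None, None])                    # a batch of character sequences
Xo = tf.one_hot(X, ALPHASIZE, 1.0, 0.0)                       # one-hot: adds the alphabet dimension
Hin = tf.placeholder(tf.float32, [None, CELLSIZE * NLAYERS])  # input state: three times the cell size

cells = [tf.nn.rnn_cell.GRUCell(CELLSIZE) for _ in range(NLAYERS)]   # each declares its own weights
mcell = tf.nn.rnn_cell.MultiRNNCell(cells, state_is_tuple=False)     # stack three high: still one cell
Hr, H = tf.nn.dynamic_rnn(mcell, Xo, initial_state=Hin)
# Hr: the outputs at every unrolled step [batch, seq_len, CELLSIZE]; H: the output state
# (with variable-length batches you would also pass sequence_length=... here)
```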

  • All right, so now, we need to implement our softmax layer

  • from those outputs H''0 to H''8.

  • Well, basically the outputs at the bottom.

  • We know how to do a softmax layer, OK?

  • But here, since we have unrolled,

  • remember each stack here is a copy of the previous one.

  • We are sharing the weights.

  • So on the softmax side, we have to share the weights, as well.

  • So we could do this using the TensorFlow APIs.

  • You know, define one softmax layer.

  • And then, for the next one, call an API

  • that retrieves the weights of the previous one

  • and reuses them.

  • That's too complicated here.

  • Actually, there is a little hack that you can use.

  • Remember, we are always training on batches, OK?

  • So this will be taking a batch of sequences,

  • outputting a batch of sequences.

  • Each sequence is a sequence of characters.

  • So what is the difference between having,

  • let's say, 8 softmax cells that each process a batch of 100

  • characters, or having just one that processes 800 of them?

  • That's the same thing.

  • Let's just do one.

  • And we will put all of those outputs in the same bag

  • and just use that one cell.

  • Anyway, we were supposed to be sharing the weights,

  • so defining just one cell is a very good way of doing that.

  • So that's what I do with my reshape operation there.

  • I take all of those outputs.

  • and you have to remember that there is a batch of outputs

  • on each of those arrows.

  • And I put them in the same bag.

  • Feed them through just one softmax layer.

  • And then, I will reshape them back into the correct shape

  • to finish.

  • Again, using higher level APIs in TensorFlow,

  • so when I call linear, that just does the weighted sums.

  • One layer, it computes simply the weighted sums.

  • No activation function.

  • And then, I called softmax.

  • And that applies the softmax activation function.

  • And linear, again, defines the weights and biases

  • in the background.

  • That's why I call it a higher level function.
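
A sketch of that reshape trick in the same TF 1.x style, with tf.layers.dense standing in for the `linear` helper mentioned in the talk (the stand-in placeholder for Hr and the batch size are assumptions):

```python
import tensorflow as tf

ALPHASIZE, CELLSIZE, BATCHSIZE = 98, 512, 100
# stand-in for the unrolled outputs Hr produced by dynamic_rnn above
Hr = tf.placeholder(tf.float32, [BATCHSIZE, None, CELLSIZE])

Hf = tf.reshape(Hr, [-1, CELLSIZE])        # all outputs from all steps and all batch rows, one bag
Ylogits = tf.layers.dense(Hf, ALPHASIZE)   # one shared linear layer: weighted sums, no activation
Yo = tf.nn.softmax(Ylogits)                # character probabilities for every slot
Y = tf.argmax(Yo, 1)                       # predicted character codes
Y = tf.reshape(Y, [BATCHSIZE, -1])         # back to a batch of predicted sequences
```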

  • And now, I'm ready to compute my loss function

  • and derive it and actually train the network.

  • It's just as complicated to understand how

  • recurrent neural networks work.

  • And it's just as complicated to actually feed

  • them data correctly.

  • You see lots of arrows.

  • So we will have to do quite a bit of plumbing

  • to make this happen.

  • Let's try to get our inputs and outputs right, OK?

  • So we will be inputting sequences of characters

  • by batches.

  • So my inputs are a batch of sequences.

  • The sequence length that I have chosen is 30.

  • I will be unrolling over 30 characters.

  • Usually, on the diagrams, I only represent 8 of them

  • because 30 would not fit on my slide.

  • But in the code, it was 30.

  • Now, I need to one-hot encode them.

  • So I'm adding a new size.

  • Each character becomes a vector of 100 components

  • because I am working with an alphabet of 100

  • possible characters.

  • So now, it's batch size, sequence length,

  • and alpha size.

  • Those are my actual inputs.

  • My state, again, I have a batch of states.

  • Since I'm feeding in a batch of inputs,

  • I will produce a batch of output states.

  • And the states, each of those state vectors

  • is, of course, of size n, cell size, whatever

  • cell size I have chosen to use.

  • Remember, each cell has this one configuration parameter,

  • which is its internal size.

  • But since I have stacked those cells three high,

  • it will actually-- the actual output state here will

  • be three times the cell size.

  • OK, we are ready to write this model.

  • So I define a placeholder for my input sequences,

  • a batch of sequences of size, sequence, length.

  • I one-hot encode them, which is why

  • I'm adding a new size to this tensor, which

  • is the size of my alphabet.

  • Again, each character becomes a vector of 100 components.

  • To be really precise, my alpha size is 98, so 98 components.

  • I'm working with an alphabet of 98 characters here.

  • I need to define a placeholder for my correct answers.

  • And actually, the correct answers

  • are very easy to obtain here.

  • I'm just teaching it to output the same sequence shifted

  • by one.

  • So basically, to predict what the last character will be.

  • So again, the correct answers will

  • be a batch of sequences of 30 characters, which

  • I one-hot encode.

  • I need a placeholder also for my input state.

  • And we have seen that the batch of input states,

  • we have seen that the input state is made of three

  • of those internal vectors.

  • So that's three times cell size.

  • And now, I'm ready to write my model.

  • So the model is what was here, OK?

  • That's the model.

  • This model, with this little trick that we have seen before,

  • this model at the output of its softmax layer

  • actually produces an output that is batch size multiplied

  • by sequence length.

  • You remember, we put all the characters from the batches

  • and from the different stages of the unrolled sequence

  • in the same bag.

  • And now, to determine characters from those probabilities,

  • I use argmax.

  • Why?

  • Because each of those vectors is 100 components

  • with probabilities.

  • Argmax is a function that gives me the index of the biggest

  • number in this vector.

  • So the index in this vector is actually

  • the ASCII code of the character that has been predicted.

  • So these are my predictions now in ASCII

  • encoding in characters.

  • And I just need to reshape them back to have, again,

  • a batch of sequences of 30 predicted characters.

  • And now, I'm ready to input-- to give my loss to an optimizer

  • and ask TensorFlow to optimize to actually train my network.

  • So this is the step, as yesterday, with this loss.

  • TensorFlow computes a gradient.

  • From this gradient, it can--

  • sorry.

  • And this loss is, of course, the difference

  • between the sequence of characters that was predicted

  • and the sequence of characters that I wanted to predict.

  • This difference becomes a loss.

  • That loss is derived, becomes a gradient.

  • We take a small step along this gradient,

  • which is actually in the space of weights and biases.

  • So taking a small step means we modify slightly our weights

  • and biases and continue.

  • That's the training.
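
A sketch of that loss and training step (TF 1.x style, continuing the names used in the sketches above; the stand-in placeholder and the learning rate are illustrative assumptions):

```python
import tensorflow as tf

ALPHASIZE, SEQLEN = 98, 30
Y_ = tf.placeholder(tf.uint8, [None, SEQLEN])              # correct answers: same sequences shifted by one
Yflat_ = tf.reshape(tf.one_hot(Y_, ALPHASIZE, 1.0, 0.0), [-1, ALPHASIZE])

Hf = tf.placeholder(tf.float32, [None, 512])               # stand-in for the flattened cell outputs
Ylogits = tf.layers.dense(Hf, ALPHASIZE)                   # the shared readout layer sketched earlier

# distance between what the network says and what we know to be true
loss = tf.nn.softmax_cross_entropy_with_logits(logits=Ylogits, labels=Yflat_)
train_step = tf.train.AdamOptimizer(1e-3).minimize(tf.reduce_mean(loss))
```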

  • One last little gotcha.

  • So we have to take our input text

  • and actually cut it up in those sequences of 30 characters.

  • So initially, I thought, well, that's easy, you know?

  • You take a piece of text.

  • How do you cut it up in sequences of characters?

  • Well, you cut, and cut, and cut, and cut.

  • And then, if you need a batch of them,

  • you take the first 100 sequences you have.

  • And you put that in a batch.

  • That did not work.

  • Why?

  • Let's see here.

  • That's my first batch.

  • Let's see the first sequence in the batch?

  • The quick-- you know what that is going to be.

  • The quick brown fox something.

  • Well, when my neural network processes the quick,

  • it also outputs an output state.

  • And in the next iteration, that output state

  • will become the input state for the next sequence.

  • If I want this to be correct, that input state

  • must correspond to the continuation

  • of the quick brown fox, and so on,

  • which means that the sentence has

  • to continue over all of the first slots of all

  • of my batches.

  • So the batching here is not completely trivial.

  • You cut up your text in batches, in sequences.

  • But the way to batch them together,

  • since you have to pass the correct state at each stage,

  • is that the beginning of the text

  • has to be split across the first item in batches.

  • And then, from some point far, far, far later in the text,

  • you can start filling the second line of the batches.

  • It's just plumbing.

  • I wrote for you the five lines of code that does this.

  • It's five lines.

  • I spent four hours doing it, including tests.

  • I don't do arithmetic.

  • It's full of modulos and divides.

  • And I wrote unit tests and hacked it

  • until the unit tests passed.

  • It's called test-driven debugging.

  • Sorry, test-driven development.

  • That's what developers do.

  • All right, so yeah, small gotcha on the batching.

  • But whatever.

  • Just use the code on--

  • this is not actually important.

  • Just use the function that will cut up the text correctly

  • for you.
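
For reference, here is a possible (hypothetical) shape of such a batching helper in NumPy: each row of every batch continues the same long stripe of text, so the output state of one batch is the correct input state for the next.

```python
import numpy as np

def rnn_minibatch_sequencer(raw_text, batch_size, seq_len):
    # A sketch, not the talk's exact five lines: row i of every batch continues
    # the same contiguous stripe of text, so states carry over batch to batch.
    data = np.frombuffer(raw_text.encode("latin-1"), dtype=np.uint8)
    nb_batches = (len(data) - 1) // (batch_size * seq_len)
    rounded = nb_batches * batch_size * seq_len
    x = np.reshape(data[:rounded], [batch_size, nb_batches * seq_len])
    y = np.reshape(data[1:rounded + 1], [batch_size, nb_batches * seq_len])  # shifted by one
    for b in range(nb_batches):
        yield x[:, b * seq_len:(b + 1) * seq_len], y[:, b * seq_len:(b + 1) * seq_len]
```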

  • And you're ready to train.

  • And this is actually the full code of this neural network

  • on one slide.

  • So let's go through this again.

  • A placeholder for my input sequences.

  • I one-hot encode them.

  • I'm actually inputting sequences of characters, OK?

  • And all the people with cameras, this is on GitHub.

  • And the GitHub link is on the last slide.

  • So please take pictures.

  • My Twitter handle is over there.

  • Tweet them.

  • But then, you will be able to go and GitHub and actually

  • retrieve this.

  • Then, my expected outputs, Y_.

  • Again, I define a placeholder for them.

  • I will need to feed them during training.

  • And the first thing I do is that I one-hot encode them.

  • I will also need, and this is different

  • from normal neural networks, I will also

  • need a placeholder for my input state.

  • Remember?

  • RNNs have an input and an input state.

  • Two inputs.

  • Now, I'm ready to write my model.

  • So I chose the GRU cell.

  • I stack it three high.

  • And I unroll it as many times as x has components in it.

  • So here, my unroll size is sequence length.

  • And that's 30.

  • I chose 30 characters as the unroll size

  • of my recurrent neural network.

  • I do my little trick with the softmax

  • so that I can implement just one softmax node.

  • I feed the output through my softmax node.

  • Here, I apply argmax to retrieve from the softmax probabilities

  • the highest probability.

  • And that's the character I'm predicting.

  • I reshape this back to have a batch of predicted sequences.

  • Also, somewhere in the middle in there,

  • I had those probabilities.

  • I take those probabilities.

  • And I compute the distance between what it says

  • and what I wanted.

  • That's my loss.

  • I give my loss to the optimizer.

  • I obtain a training step.

  • And this training step is actually

  • that gradient, which is computed on this batch of training

  • characters.

  • And which, if I follow it by a little step,

  • will modify my weights and biases

  • and bring me to somewhere where this network works better,

  • where it has a smaller error function.

  • And now, my training loop.

  • You will see, this is very familiar to what

  • we had previously.

  • We use this magic plumbing function

  • that I gave you to load sequences of characters

  • in the correct way.

  • And once I have a sequence of characters,

  • I run session.run of my training step.

  • I have to give it the input characters.

  • I have to give it the expected output.

  • And, since this is a recurrent neural network,

  • I have to give it the input state.

  • And this will give me an output state.

  • And you see the magic line, why this

  • is a recurrent neural network.

  • That's the last line there in the red.

  • Input state becomes-- sorry, the output state

  • becomes an input state.

  • That's why it's recurrent.

  • They're passing the state around.
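
A sketch of that loop in TF 1.x session style (X, Y_, Hin, H and train_step are the graph nodes from the sketches above, rnn_minibatch_sequencer the batching helper, and shakespeare_text a string loaded beforehand; all of these names are assumptions):

```python
import numpy as np
import tensorflow as tf

BATCHSIZE, SEQLEN, CELLSIZE, NLAYERS = 100, 30, 512, 3

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    istate = np.zeros([BATCHSIZE, CELLSIZE * NLAYERS])         # initial zero input state
    for x, y_ in rnn_minibatch_sequencer(shakespeare_text, BATCHSIZE, SEQLEN):
        # one training step: feed the inputs, the expected outputs, and the input state
        _, ostate = sess.run([train_step, H],
                             feed_dict={X: x, Y_: y_, Hin: istate})
        istate = ostate                                        # the output state becomes the next input state
```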

  • All right.

  • So we are done.

  • We've built a recurrent neural network.

  • Now, we want to actually train it.

  • So let's go to a demo.

  • I will be training this on the complete works of William

  • Shakespeare.

  • That's not quite big data.

  • The complete works of William Shakespeare are five megabytes.

  • Yes, that puts things in perspective.

  • But it's good stuff.

  • So here, we see it's training on sequences.

  • So here are those sequences of 30 characters.

  • And here's a batch of them.

  • It's actually training on those sequences.

  • Here, predicting not much at all.

  • It's just the beginning.

  • And from time to time, I stop the training.

  • And I take just my one cell.

  • Remember, I have just one cell.

  • It's replicated for the purpose of training,

  • but it's just one cell.

  • And this one cell, it has become--

  • well, once it will be trained, it will have become--

  • a language model.

  • So what I can do with it is generate a new Shakespeare

  • play.

  • How do I do that?

  • Well, I take the cell.

  • I put in garbage, a random character.

  • That gives me an output character.

  • Probability of an output character,

  • which is the next character, and an output state.

  • I feed back the output state into the input state,

  • and I feed back the output character as the new input.

  • And I continue.

  • And this is a state machine that will start generating text.

  • You see here, it's--

  • yeah.

  • That's not quite Shakespeare yet.

  • It's training.

  • It's a bit slow on my machine.

  • I usually have a GPU connected here

  • to-- it brings me a nice 10x speed, or 6x roughly.

  • But still, well, it has done 50 more batches.

  • I will leave it running.

  • Let's go and see, sorry, here.

  • On this slide, what it is.

  • So at the beginning, it gives you this.

  • As I said, not quite Shakespeare.

  • But after only an epoch, what we call an epoch,

  • it's when you have seen the entire training data set once.

  • So after having seen only a tenth

  • of what Shakespeare produced in his life, this is what we have.

  • Still not quite Shakespeare, but you see

  • there is some structure to it.

  • After two tenths, hey, this looks better!

  • It's starting to actually spell English almost correctly.

  • And there are those things in capital letters

  • at the beginning that are starting

  • to look like characters, like character names.

  • Even slightly later, oh, look!

  • And you have to remember that this

  • is a neural network that is predicting character

  • by character.

  • It first has to learn to spell English before going

  • to higher orders of structure.

  • So it's still not completely exact English.

  • But it's starting to look like English.

  • At least, Shakespearean English.

  • And you see it has character names.

  • And it's actually inventing new character names.

  • Here, Pordia and Henry Blutius--

  • who can tell me, no, seriously, who

  • can tell me if Shakespeare actually

  • used Henry Blutius in his work?

  • What do you call it?

  • I'm giving you the answer.

  • He didn't.

  • But it's a very credible Shakespearean character name.

  • And this is what you get after 30 epochs.

  • So it actually has a title.

  • There is an act.

  • There's a scene.

  • After the scene, it tells you where this is happening.

  • And look, it knows how to put scenic indications

  • in brackets, who enters, with whom, and so on.

  • It has even picked up stuff like, character names are

  • all caps, and when the character is a function, like Lord

  • or Chamberlain, it's only the first character

  • that is a capital.

  • It has picked up completely correctly as well.

  • And it's actually English.

  • So now that we have this, let's try to--

  • so this was on slides.

  • I will stop this.

  • What I have done previously is that I trained

  • this for actually 30 epochs.

  • And I saved my weights and biases.

  • So I'm ready to just replay it and generate a new Shakespeare

  • play.

  • Let's generate a new one live in front of you.

  • Here it is.

  • Let me stop it.

  • Whoops, sorry about that.

  • [APPLAUSE]

  • MARTIN GORNER: Is someone brave enough

  • to come and play hallucinated Shakespeare on the stage

  • with me?

  • Come on.

  • Yes!

  • Thank you.

  • Big applause.

  • SPEAKER: Come on up?

  • MARTIN GORNER: Thank you.

  • Please, come up.

  • You will have to speak loudly.

  • But that's how it is in a theater.

  • You don't have a microphone.

  • You speak.

  • So you can read off the screen here.

  • We will alternate.

  • So maybe I start, and then you do the next one.

  • So let's say enter Bardolph and Boult.

  • The manner off with my bestowers that you shall not see him,

  • and we are now to be the brother's wife and force,

  • to be so many and most grave.

  • SPEAKER: What art thou again?

  • What needs thy life?

  • Then, what they do not dote on thee.

  • The word will be at thee.

  • And take my heart to thee.

  • And they distemper.

  • Will thou beat me well to say god save my son?

  • [APPLAUSE]

  • MARTIN GORNER: Thank you so much.

  • SPEAKER: Thank you.

  • MARTIN GORNER: Thank you.

  • That was fantastic.

  • Thank you.

  • Actually, I tried to do this also on the Python code

  • of TensorFlow itself.

  • That was fun.

  • So in the beginning, you had this.

  • Looks like Python?

  • Maybe.

  • But very, very quickly, it actually

  • picks up Pythonic structures, like those keywords,

  • and it's generating something that looks like function calls.

  • Slightly later, it actually correctly uses the keywords

  • with function names-- a hallucinated function name.

  • It's actually quite inventive in the function names.

  • And a colon at the end.

  • It's still getting the nested parenthesis wrong.

  • And after a longer while, it can recite the Apache license

  • in full.

  • Yes.

  • It's open source compliant, open source compliant.

  • And more interestingly for us, designers

  • of recurrent neural networks, it can actually close and open

  • the nested parentheses right to a depth of three,

  • which is quite impressive.

  • And what I find fantastic, it's that it has figured out

  • how to do Python comments.

  • And it's giving me TensorFlow advice in those comments.

  • But look, it makes sense!

  • Check that we have both scalar tensors

  • for being invalid to a vector of one

  • indicating the total loss of the same shape

  • as the shape of the tensor.

  • I'm sure this makes just as much sense as everything

  • that I've been saying since the beginning here.

  • All right, and small credits to a gentleman

  • called Andrej Karpathy who actually

  • wrote this neural network for the first time.

  • He published a blog about it.

  • He tried it on many different things.

  • He generated a business book for startups.

  • And he tried to generate an algebra book in LaTeX.

  • Actually, after training, this produced almost valid LaTeX.

  • So he had to hack it a little bit to make it compile.

  • But then, this looks like an algebra book.

  • That is even an attempt at a diagram.

  • And the line I prefer is how the neural network

  • solved how to write a proof.

  • Look at the very top.

  • "Proof omitted."

  • That's so clever.

  • All right.

  • So that's basically all I wanted to show you.

  • Well, this is how we generate it.

  • So we, I take just one cell.

  • And basically, in a loop, I feed in a character.

  • I take the output, feed it back as the input,

  • and feed the output state as the input state,

  • and just do this in a loop.

  • A couple of applications of this.

  • Oh, yes.

  • Actually, we still have a little bit of time.

  • This time, I've been using TensorBoard to visualize

  • my inputs and outputs.

  • Where is my TensorBoard?

  • Somewhere.

  • Sorry, I'll find it.

  • Here.

  • In the last session, I was just throwing the outputs

  • into matplotlib, which is the very standard Python plotting

  • library.

  • But there is a tool dedicated to visualizing training

  • in TensorFlow.

  • It's called TensorBoard.

  • And I advise you to use it, especially

  • if you do a distributed training, or training

  • on remote servers.

  • It can connect to a bucket and get the information from there

  • and visualize it.

  • So here again, I have configured,

  • when I was training this network,

  • I configured it to actually do training and validation.

  • I put one Shakespeare play aside for validation

  • to test my network.

  • And if you'll remember the session from yesterday,

  • I find it very important to follow my loss curves, both

  • the training and the test loss curve, on the screen.

  • This is what I got.

  • And actually, first of all, who sees something wrong?

  • Overfit, yeah.

  • And so now, the question is, why is it overfitting here?

  • I will give you the answer because you can't guess it.

  • But here, I was actually training on a small subset

  • of the Shakespeare corpus.

  • So here, it was overfitting because of lack of data.

  • And since I had this on the curves,

  • I wanted to show it to you because you certainly remember

  • that somewhere, where is it?

  • This one.

  • Somewhere here, I had this helpful engineering chart.

  • Which allows you to interpret what overfitting is.

  • And we went yesterday through the bad network.

  • We went through too many neurons.

  • We never had not enough data.

  • So I tried with not enough data.

  • And yes, it also gives you this very recognizable pattern

  • in the curves.

  • And as soon as I train with more data, this is what I have.

  • So here, the two curves follow each other closely.

  • And I know that I have solved the problem.

  • So actually, I was doing this because I

  • was trying to add dropout into my network

  • to make it work better.

  • No.

  • It was misbehaving just because of lack of data.

  • Dropout would not have solved that.

  • All right, and so a couple of applications,

  • practical applications to finish.

  • We've seen how to produce a character by character model.

  • We can also use this not character by character,

  • but word by word.

  • So as I said previously, with a word,

  • it's a bit more complicated because to one-hot encode

  • a word, you need to encode it as a vector of, this time,

  • 30,000 components.

  • Because that's the typical size of a vocabulary

  • in a typical language.

  • So those are big.

  • So on the inputs, there is actually

  • a very simple solution.

  • How do you reduce the size of a big vector?

  • Well, you use one layer of a neural network

  • and produce less outputs.

  • It's called embeddings.

  • And that layer can either be part of your training--

  • then, your embeddings are learned

  • as part of the training.

  • Or, you can use some neural network

  • that has been already trained, typically trained

  • on the English language generically,

  • and that just encodes words into smaller vectors.

  • There is a very famous, what is it,

  • neural network that has been built for that.

  • It's called Word2Vec.

  • Already trained, available on GitHub.

  • You can use that to encode your English words

  • if your problem deals with English words

  • as smaller vectors.
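
A minimal sketch of such an embedding layer in the same TF 1.x style (VOCABSIZE and EMBEDSIZE are illustrative; the table is either learned with the rest of the network or initialized from something like word2vec vectors):

```python
import tensorflow as tf

VOCABSIZE, EMBEDSIZE = 30000, 128
word_ids = tf.placeholder(tf.int32, [None, None])                   # a batch of word-id sequences
embeddings = tf.get_variable("embeddings", [VOCABSIZE, EMBEDSIZE])  # the lookup table
word_vecs = tf.nn.embedding_lookup(embeddings, word_ids)            # [batch, seq_len, EMBEDSIZE]
# word_vecs then feeds the recurrent cell instead of 30,000-component one-hot vectors
```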

  • And so once we have solved this problem of how

  • to input words instead of characters,

  • you can, for example, use a recurrent neural network

  • like this to predict not what the next word is,

  • but a categorization of a sequence.

  • And this is used in newspapers to automatically categorize

  • articles as geopolitics, science, sports, and so on.

  • Works very well.

  • How do you do translation?

  • Well, to do translation, that's how Google Translate works,

  • you tack two of those recurrent networks end to end.

  • To the first one, you apply an English sentence

  • plus a stop symbol.

  • And then, you continue.

  • And you ask it to output the French sentence.

  • And what you have on the input, there is a choice.

  • Normally, you should be inputting

  • what your network outputs.

  • But people have also tried to input

  • what the network should output.

  • So both options exist.

  • And they give you different results.

  • You can read about this in literature.

  • So this is how translation works.

  • Of course, you have a big problem at the end.

  • I won't go into that.

  • Because to do the softmax layers there,

  • you actually want to produce a vector of 30,000 probabilities.

  • That's a bit heavy.

  • So there are ways of mitigating that.

  • But that's an active area of research.

  • One that is implemented in TensorFlow

  • is called sampled softmax.

  • But there are many others because this

  • is an active area of research.

  • How to do this softmax layer to produce 30,000 probabilities

  • each time, which is a bit heavy.

  • And one more is image labeling.

  • So here, it's a very simplified version of image labeling.

  • Image labeling is you take an image,

  • and you want to produce a sentence.

  • Like, this a little girl holding a teddy bear.

  • This is a truck in the desert.

  • So this is actually also a translation problem.

  • You take vectors from an image.

  • And you apply a recurrent neural network

  • to produce a sequence of words which

  • you want to be the description of this image.

  • How do you encode an image as a vector?

  • Well, there are plenty of solutions.

  • One of them is to take an off the shelf

  • image recognition neural network, like Inception,

  • and just chop off the last couple of layers.

  • Normally, what Inception gives is categories.

  • This is a truck.

  • This is a beach.

  • This is a lizard.

  • That's not what you want.

  • But all the top layers are actually

  • encoding an image in some meaningful way into a vector.

  • You can use that as a fixed encoding function.

  • And input the vector corresponding to the image

  • here.

  • Produce this output sequence.

  • And sometimes, it works really well.

  • This is what was generated.

  • A herd of elephants walking across a dry grass field,

  • and so on.

  • And then sometimes, yeah, not quite.

  • Thank you.

  • [APPLAUSE]

  • [MUSIC PLAYING]
