[MUSIC PLAYING]
MARTIN GORNER: So thank you for filling the house.
I'm really impressed that TensorFlow
is getting so much attention.
But I think the technology deserves it.
So I'm happy about that.
So today, or yesterday, we built a neural network
for recognizing handwritten digits.
And we went through dense neural networks
and convolutional neural networks.
And today, I want to build with you another kind of network,
a recurrent neural network.
So let's go.
A couple of reminders from yesterday.
You remember, we started with this initial one layer neural
network.
And I need you to remember the formula
for one layer of a neural network
because we will be using it.
So we were reading in pixels.
But it works for any input vector, of course.
And you remember, we said that the neurons
do weighted sums of all of their inputs, they add a bias,
and they feed that through some activation function.
Here, softmax.
It can be another function.
It's just a function, value in, value out.
But usually, in neural networks, it's a non-linear function.
And we wrote this one layer neural network
using a matrix multiply, blah, blah, blah,
we've seen all that, as this formula.
And you remember, we did this not for just one image,
but we actually wrote this formula
processing 100 images at a time.
So in x, we have a batch of images,
a whole batch, 100 images.
And then x times w are all the weighted sums for our neurons.
We add the biases.
We feed that through our activation function.
And we obtain a batch of predictions.
So in our case, since we were classifying handwritten digits,
those predictions are 10 numbers,
which are the probabilities of these digits being a 0, a 1, 2,
3, and so on.
And so we obtained those probabilities
as the outputs of our 10 neurons.
OK, so whenever you see this formula,
we will see it again a lot today,
you think, one layer of a neural network.
OK?
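As a minimal sketch of that one-layer formula, softmax(X.W + b) applied to a whole batch, here is a plain Python version. The function names and the toy sizes are assumptions made for illustration; this is not the code shown in the talk.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability; the outputs sum to 1.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dense_softmax(X, W, b):
    """One layer of a neural network, softmax(X.W + b), for a batch X."""
    out = []
    for x in X:  # one row per example in the batch
        # Weighted sum of all inputs plus the bias, for each neuron j.
        sums = [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
                for j in range(len(b))]
        out.append(softmax(sums))
    return out

# Toy batch: 2 inputs of 3 values each, 4 neurons (hypothetical sizes).
X = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]
W = [[0.1, 0.2, 0.3, 0.4],
     [0.0, 0.1, 0.0, 0.1],
     [0.2, 0.0, 0.1, 0.0]]
b = [0.0, 0.0, 0.0, 0.0]
Y = dense_softmax(X, W, b)  # one vector of probabilities per input
```

Each row of Y is a vector of probabilities, one per neuron.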
And then also, what I need you to remember
is that once we get our output from the neural network,
the way we train it is that we give it examples.
It produces some prediction.
And then we say, no, no, no, no, no, that's not what we wanted.
This is what you should predict.
We give it the correct answer.
And to do that, we have to encode this correct answer
in a similar format.
So it's called-- it's a very basic type of encoding.
It's called one hot encoding.
And basically here, if we have 10 categories,
specifying one answer category means encoding it as 10 0s,
but just one 1 somewhere in the middle, and the index of the 1,
here it's at index 6,
means that the correct answer was a 6, OK?
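A minimal sketch of one-hot encoding in plain Python, together with one common choice of distance, the cross-entropy. The talk only speaks of "a distance" at this point, so the cross-entropy here is an assumption, and the function names are hypothetical.

```python
import math

def one_hot(index, size):
    """Encode a category as `size` zeros with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

def cross_entropy(target, predicted):
    """Distance between a one-hot target and predicted probabilities."""
    # Only the position holding the 1 contributes to the sum.
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t)

label = one_hot(6, 10)             # the correct answer was a 6
uniform = [0.1] * 10               # a network that has learned nothing yet
loss = cross_entropy(label, uniform)
```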
So in this shape, it becomes possible to compute
a distance between what the network predicts
and what we know to be true.
And that distance, we call that our error function.
Or sometimes, it's called the loss function.
That's what we use to guide our training.
So during training, we give it an example, produces an output.
We say, no, no, no, that's not what we wanted.
Compute the distance between what the network says
and what we know to be true.
And from that distance, we derive the gradient.
And then, we follow the gradient.
And that modifies the weights and biases.
And that's what training is about, OK?
So now, let's look at this neural network.
So it should look familiar.
It's a vector as an input.
It has a middle layer using the hyperbolic tangent
as an activation function.
So we've seen the sigmoid last time,
which is let's say the simplest possible function going
from 0 to 1 continuously.
The hyperbolic tangent is the simplest possible function
going from minus 1 to 1 continuously.
It's just a sigmoid shifted.
And then, a second layer, which is a softmax
layer so that we read something out.
But the specificity here is that the output
of this intermediate green layer is actually
fed back in the next time step in the inputs.
So the real input into one cell of a recurrent neural network
is the input concatenated to the output of the inner layer
from the previous step.
And we call this the state.
So it's actually a state machine.
You feed it inputs.
It produces outputs.
But you also feed it a state.
It produces an output state, which you feed back in,
in the next time step.
And that's why it's called a recurrent neural network,
it's because it is applied on time sequences.
At each step in time, you feed in one input vector,
concatenate it to the previous state.
Turn the crank once.
That produces some outputs from this middle layer, as well as
a result. And you feed that back as the new input
state for the next x input, which
you have in your sequence.
So it can be represented.
I'm showing you the neurons inside.
But here, it's basically the API of one recurrent neural network
cell.
It has an input.
It has an output, which you then usually feed
into a softmax layer to make sense of it,
to produce predictions.
I mean, probabilities.
And it has an input state that produces an output
state that you loop back in as the input state.
That's the state machine part.
OK?
So now, well yes, and the parameter for this
is the internal size of this middle layer.
That's what is adjustable.
Usually, your input is whatever your input is.
And your output is whatever you're trying to predict.
So those are not adjustable parameters.
So here it is written in equations.
Again, the input is the real input
at time t concatenated to the previous state.
Then, we feed that through.
Here, you should recognize one layer of a neural network.
You should recognize this formula
using the hyperbolic tangent as an activation function.
So I put it over there.
And this produces an output, Ht, which
is both used as our new state and as the output that
will be fed into the softmax layer
to actually produce a vector of probabilities between 0 and 1.
OK?
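A minimal sketch of these equations in plain Python: concatenate the input with the previous state, apply one tanh layer to get the new state, then read probabilities out through a softmax layer. All names, the random weights, and the toy sizes are assumptions for illustration.

```python
import math
import random

def layer(x, W, b):
    # Weighted sums for one input vector: x.W + b
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x_t, h_prev, W_h, b_h, W_y, b_y):
    X_t = x_t + h_prev                                   # input concatenated to previous state
    h_t = [math.tanh(a) for a in layer(X_t, W_h, b_h)]   # new state, values in (-1, 1)
    y_t = softmax(layer(h_t, W_y, b_y))                  # probabilities read out from the state
    return y_t, h_t

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
IN, N, OUT = 4, 3, 4                  # hypothetical input, state and output sizes
W_h, b_h = random_matrix(IN + N, N), [0.0] * N
W_y, b_y = random_matrix(N, OUT), [0.0] * OUT

h = [0.0] * N                         # initial state
for x in ([1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]):
    y, h = rnn_step(x, h, W_h, b_h, W_y, b_y)   # the output state is fed back in
```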
So now, how do we train this thing?
So typically, this is used for natural language processing,
for instance.
So a typical input will be a character.
And a character will be, again, one-hot encoded into let's say
100-component vectors if we are using--
we will be using here an alphabet
of 100 possible characters.
So one character is encoded into a 100 element vector, so 99 0s
and a 1 at the ASCII index of that character.
So we put a character in.
We propagate through the neural networks.
We propagate through the softmax layer.
We obtain a character as an output.
If that is not the character we wanted, well,
we compute the difference between what it said
and what we know to be true and use retro propagation to fix
the weights and biases inside of the cell to get better results.
That is very classical training.
But what if the result was wrong not
because the weights and biases inside of the cell were wrong,
but because the input, the state input, H minus 1, was wrong?
That input is a constant in this problem.
There's not much you can do about it.
So here, we are stuck.
What is the solution?
Well, the solution is to replicate the cell.
And now, so this is a replica.
It's reusing the exact same weights, OK?
Now, let's say the output, Y1, is bad.
I say, no, that's not it.
This was the correct output. I'm training,
so I know what the correct output is supposed to be.
So from that, I compute the error, the gradient.
I retro propagate.
I can fix the weights and biases in the cells
to get a better output.
And if needed, I can fix the weights and biases
to get a better H0, the state flowing between those two
stages of the cell.
So now, I have a handle on at least H0.
I still have no handle at all on H minus 1.
If it is H minus 1 that was wrong,
there is nothing I can do.
So that is how you train recurrent neural networks.
You have to unroll them across a certain length
and give them a sequence of, let's say, characters.
It will produce a sequence of output characters.
If you are training, you know what
the answer was supposed to be.
So you use that to compute your error function,
do your retro propagation, adjust the weights and biases.
And it will work to a certain extent.
To what extent?
Oh, yes, small detail, if you want to go deep,
you can actually stack the cells.
Why?
Well, two cells stacked like this, the API remains the same.
It's still an input.
It's still an output that feeds into a softmax layer.
And there is still an input state and an output state
that you feed back in.
It's just that the output state now is slightly bigger.
So that's how you go deep in the recurrent neural network.
You stack those cells, and that becomes
a new cell which still has input, output,
input state, output state.
And of course, you unroll it.
So let's take this sentence.
Let's say now, we use not characters but words
as our inputs.
Of course, there are technical problems doing that.
A typical alphabet is maybe 100 characters.
A typical vocabulary is around 30,000 words.
So here, one-hot encoding gives you
a vector of 30,000 components for each word.
It's a bit heavy.
I won't go into the details of how you handle that.
It's called embedding.
Whatever, let's just assume that we solved this problem.
So we have this sentence.
Michael was born in Paris, blah, blah, blah, blah, blah.
And at the end, we have his mother tongue is.
So if we train this model on English,
probably it will have figured out
that "his mother tongue is" is followed
by the name of a language, like English,
German, or Russian, something.
Here, however, the correct answer
is French because this guy was born in France.
So let's imagine that we feed in,
we have unrolled this neural network over,
let's say, 30 words.
Or let's say 10 words here, 10 words.
And at the end, we have his mother tongue is,
and we are asking the network to predict what is the next word.
And the network says, English.
So now, what we want to do is put on the outputs
a sentence that says, blah, blah, blah,
his mother tongue is French.
And do retro propagation.
But for this to work, the beginning
of the sentence, the part where the information
about Paris and where he's born is,
has to be part of that example.
And that example is longer than 10 words,
which is our unroll size.
There is simply no way whatsoever
of putting that correct example plus correct output
into a network that we unrolled over only 10 words
because the distance is more than 10.
And that's a fundamental limitation.
If you want to capture this information, this behavior
that, if he was born in France, probably his mother language
is French, you will have to unroll
this network over a long enough sequence
to be able to input this full example into it.
And if you do that, you will probably unroll it here over,
how many, 50 words?
Something like that.
If you do that, the problem is that you end up
with a very deep neural network.
Yesterday, we've seen neural networks of five layers.
The big ones, like Inception and so on,
are 40, 50, 60, 70 layers.
You see here, we have a toy example.
And we already see that we should
be going to 50 or 100 layers just to solve this.
So in recurrent neural networks, you always
end up using very deep neural networks.
And when I say deep, it's because the state signal has
to go through all those cells.
And remember, in each cell, the state signal
is concatenated to the input which
goes through a neural network layer,
produces a new state, which goes to the next cell, that
is concatenated to the input.
Goes to another neural network layer.
So from here to the end, we traverse at least one
neural network layer per cell.
That's why I say deep.
Deep neural networks have a technical problem.
They tend not to converge when you train them.
I won't go into the mathematical details.
It's called the vanishing gradient problem.
Basically, your gradient becomes 0.
And since you use your gradient to go forward,
that's a bit of a problem.
So a solution was invented.
I won't go into the mathematical explanations
of why this solution works.
I just want you to understand how it works.
So would you [INAUDIBLE] an explanation using the arrow
soup of a diagram on the left?
Or the incomprehensible equations on the right?
Which one do you prefer?
AUDIENCE: Arrows.
MARTIN GORNER: Arrows?
I'm a developer.
And those equations look a little bit like code.
And I do code.
Sorry.
But on the arrows, you see at least one thing.
So I'll do some hand-waving mathematics again.
You see that the state is actually split into two.
You have the H state and the C state.
And the C line there is actually configured in such a way
that the network can decide to persist information
on it, to leave it unchanged from iteration to iteration.
And that is somehow what explains
that even if you line up many of those,
since it has the possibility of leaving some part of the state
unchanged, it goes around those vanishing gradient problems.
End of hand-waving mathematics.
So let's do it.
Let's see how it works in practice.
And actually, it's based on a concept of gates.
So again, we concatenate the real input to the state
from the previous step.
And we compute three, you recognize the formulas,
neural network layers.
The sigma is for the sigmoid activation function.
So the sigma outputs values between 0 and 1.
And we call those gates because we will actually
be multiplying these numbers to another vector to gate it.
You know, if you multiply something
by a very small value, there's not much that goes through.
If you multiply something by something that is close to 1,
almost all of the information goes through.
So that's how we will be using them.
Now, our input becomes, well, we have to size adapt our input.
I put on the side the sizes of all the vectors
we are working with.
That's just to tell you that there is nothing to see there.
Inside of the cell, everything is of size n.
That's the parameter that you decide
as the size of your cell, OK?
But our inputs, they are what they are.
So we first need one neural network layer
to adapt the size of our inputs to size n.
So that becomes our new input.
And now, the C line, the way you read this,
this is a kind of memory.
So the new state of the memory is
the old state of the memory without what
we chose to forget.
We multiply by this forget gate.
This is a series of numbers between 0 and 1.
Plus what we chose to remember from our new input.
That's the way to read it.
So we multiply our new input by the update gate.
Again, numbers between 0 and 1 that
show which part of the information
we want to retain from this input into our internal memory.
And then, our new state is simply the memory--
the hyperbolic tangent here, that's
not a neural network layer.
That's just a size adaptation to put it between minus 1 and 1.
So it's basically the memory cell
multiplied by the result gate.
So here, we choose what part of our internal memory
we want to expose to the outside as a result.
So that's the physical interpretation
of these equations.
We have those three gates.
We size adapt our input.
And then, the new memory is the old memory minus
what we want to forget plus what we want
to remember from the input.
And the result is this memory cell modulo
what we want to actually expose as an output at that step.
OK?
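Written out, the description above corresponds to the following equations. This is a reconstruction from the spoken explanation: the symbol names are assumptions, and the tanh on the size-adapted input is the standard LSTM choice, which the talk does not name explicitly. The symbol | stands for concatenation and \odot for element-wise multiplication.

```latex
\begin{aligned}
X   &= x_t \,|\, H_{t-1}                 && \text{(input concatenated to previous state)} \\
f   &= \sigma(X W_f + b_f)               && \text{(forget gate)} \\
u   &= \sigma(X W_u + b_u)               && \text{(update gate)} \\
r   &= \sigma(X W_r + b_r)               && \text{(result gate)} \\
X'  &= \tanh(X W_c + b_c)                && \text{(input adapted to size } n\text{)} \\
C_t &= f \odot C_{t-1} + u \odot X'      && \text{(new memory)} \\
H_t &= r \odot \tanh(C_t)                && \text{(new state and output)} \\
Y_t &= \mathrm{softmax}(H_t W + b)       && \text{(external softmax layer)}
\end{aligned}
```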
And now, this Ht will actually become part of the new state,
and also drive the softmax layer if we
add a softmax layer, which is represented here
by this yellow circle.
We usually represent the softmax layer as external to a cell.
So this is called an LSTM.
And this was invented specifically
to make recurrent neural networks work
and to solve this depth problem that, if you
are unrolling over a large sequence,
they tended not to converge.
You will have to believe me on the mathematics with this.
They converge.
But you will have to--
I'm sure someone noticed that this choice of equations
and this choice of arrows was somehow arbitrary.
I mean, why point them here and not there?
Many combinations exist.
Lots of different variations of those RNN cells
have been devised.
And someone published a paper, a recap paper,
where he tested all of them and found them to do
all exactly the same thing.
So in the end, the one we use is called the GRU.
And I won't go into the details.
It's basically a cheaper LSTM.
Here are the equations.
Not very different.
Same API.
But only two gates instead of three gates.
And each gate has weights and biases.
So we save a part of our computational cycle
not computing those third weights and biases.
OK, so we will use the GRU.
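For reference, one common way of writing the GRU equations is given here. The talk does not read them out, so this is the standard formulation rather than a transcription of the slide, and conventions differ on whether z or 1 - z multiplies the previous state.

```latex
\begin{aligned}
X   &= x_t \,|\, H_{t-1} \\
z   &= \sigma(X W_z + b_z)                    && \text{(update gate)} \\
r   &= \sigma(X W_r + b_r)                    && \text{(reset gate)} \\
X'  &= x_t \,|\, (r \odot H_{t-1}) \\
X'' &= \tanh(X' W_c + b_c) \\
H_t &= (1 - z) \odot H_{t-1} + z \odot X''
\end{aligned}
```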
And now, let's implement a neural network
that does a language model.
So we will be training on sequences of characters.
And when I say language model, it's actually a network
that we will use to predict--
we will train it to predict what the next character is.
Like here, St. Joh, I will teach it
to produce the same sequence shifted by one.
So actually, I will teach it to understand
that the next character should be an "n" because this
is St. John.
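A minimal sketch of what "the same sequence shifted by one" means, in plain Python. The variable names are assumptions, and character codes stand in for the encoding used in the talk.

```python
text = "St. John"
codes = [ord(c) for c in text]   # characters as integer codes

x = codes[:-1]    # input sequence:    "St. Joh"
y_ = codes[1:]    # expected sequence: "t. John", the input shifted by one

# At every position, the expected output is the character that follows the input.
pairs = [(chr(a), chr(b)) for a, b in zip(x, y_)]
```

The last pair is ("h", "n"): after "St. Joh", the network should predict an "n".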
So how do we do that?
In TensorFlow, now I'm using a higher level API of TensorFlow
than what I had been using yesterday.
I just call GRUCell.
That creates a GRU cell.
And I call this higher level, because you've
seen this GRU cell has actually a couple of neural network
layers inside.
It has two gates.
That's at least two layers.
So it has a host of weights and biases
which are actually defined in the background when
I call this.
That's why it's a higher level API.
It does its own weights and bias declarations in the background.
Now, I said we want to go deep.
So let's stack this cell three high.
That's how we do deep recurrent neural networks.
There is a TensorFlow call for that.
It's called MultiRNNCell.
Give it a cell.
You say how many times you want to stack it.
And that gives you another cell.
Because we have seen already that these three stacked
cells actually have the same API as one cell.
So you can use it as a new cell.
And now, we need to unroll this.
For that, we call in TensorFlow
this dynamic RNN function, which is a bit of magic.
And that's what will unroll this sequence.
So how many times?
You don't see it in the parameters
because it's actually specified in the shape of the input
tensor, x.
If this input tensor has eight or let's
say 30 characters in it, it will be
unrolled over a sequence of 30 characters.
And actually, the little part of magic,
really it's magic, we will not be using it here,
but this dynamic RNN, what it can do also,
remember that we will be training this on batches,
as always.
We always train on batches.
So in this case, all my batches will
be sequences of the same size.
That's the case in my model.
In other models, I might not have
sequences of the same size.
Dynamic RNN can handle that.
If you pass it a batch of sequences,
even if they are not of the same size,
alongside that you pass the actual sizes,
and it will, for each sentence in the batch,
unroll your network the correct number of times.
And then, also pass the output from the correct stage.
It's super helpful.
We will not be using it here because all of our sequences
have the same size.
But that is super helpful.
All right, so now, we need to implement our softmax layer
from those H double prime 0 to H double prime 8.
Well, basically the outputs at the bottom.
We know how to do a softmax layer, OK?
But here, since we have unrolled,
remember each stack here is a copy of the previous one.
We are sharing the weights.
So on the softmax side, we have to share the weights, as well.
So we could do this using the TensorFlow APIs.
You know, define one softmax layer.
And then, for the next one, call an API
that retrieves the weights of the previous one
and reuses them.
That's too complicated here.
Actually, there is a little hack that you can use.
Remember, we are always training on batches, OK?
So this will be taking a batch of sequences,
outputting a batch of sequences.
Each sequence is a sequence of characters.
So what is the difference between having,
let's say, 8 softmax cells that each process a batch of 100
characters, or having just one that processes 800 of them?
That's the same thing.
Let's just do one.
And we will put all of those outputs in the same bag
and just use that one cell.
Anyway, we were supposed to be sharing the weights,
so defining just one cell is a very good way of doing that.
So that's what I do with my reshape operation there.
I take all of those outputs.
And you have to remember that there is a batch of outputs
on each of those arrows.
And I put them in the same bag.
Feed them through just one softmax layer.
And then, I will reshape them back into the correct shape
to finish.
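A minimal sketch of this reshape trick with plain Python lists instead of tensors. The helper names and the sizes (a batch of 100 sequences unrolled over 8 steps) are assumptions for illustration.

```python
def flatten_steps(outputs):
    """[batch][seq][cell] -> [batch * seq][cell]: all outputs in the same bag."""
    return [h for sequence in outputs for h in sequence]

def unflatten_steps(flat, seq_len):
    """[batch * seq][...] -> [batch][seq][...]: back to a batch of sequences."""
    return [flat[i:i + seq_len] for i in range(0, len(flat), seq_len)]

BATCH, SEQ, CELL = 100, 8, 4   # hypothetical sizes
# Dummy cell outputs: one vector of size CELL per step of every sequence.
outputs = [[[float(b + s)] * CELL for s in range(SEQ)] for b in range(BATCH)]

flat = flatten_steps(outputs)            # 800 vectors for a single softmax layer
restored = unflatten_steps(flat, SEQ)    # same structure as `outputs` again
```

Because one layer sees all 800 vectors, the weights are shared across time steps without any extra bookkeeping.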
Again, using higher level APIs in TensorFlow,
so when I call linear, that just does the weighted sums.
One layer, it computes simply the weighted sums.
No activation function.
And then, I called softmax.
And that applies the softmax activation function.
And linear, again, defines the weights and biases
in the background.
That's why I call it a higher level function.
And now, I'm ready to compute my loss function
and derive it and actually train the network.
It's just as complicated to understand how
recurrent neural networks work.
And it's just as complicated to actually feed
them data correctly.
You see lots of arrows.
So we will have to do quite a bit of plumbing
to make this happen.
Let's try to get our inputs and outputs right, OK?
So we will be inputting sequences of characters
by batches.
So my inputs are a batch of sequences.
The sequence length that I have chosen is 30.
I will be unrolling over 30 characters.
Usually, on the diagrams, I only represent 8 of them
because 30 would not fit on my slide.
But in the code, it was 30.
Now, I need to one-hot encode them.
So I'm adding a new size.
Each character becomes a vector of 100 components
because I am working with an alphabet of 100
possible characters.
So now, it's batch size, sequence length,
and alpha size.
Those are my actual inputs.
My state, again, I have a batch of states.
Since I'm feeding in a batch of inputs,
I will produce a batch of output states.
And the states, each of those state vectors
is, of course, of size n, cell size, whatever
cell size I have chosen to use.
Remember, each cell has this one configuration parameter,
which is its internal size.
But since I have stacked those cells three high,
it will actually-- the actual output state here will
be three times the cell size.
OK, we are ready to write this model.
So I define a placeholder for my input sequences,
a batch of sequences of size sequence length.
I one-hot encode them, which is why
I'm adding a new size to this tensor, which
is the size of my alphabet.
Again, each character becomes a vector of 100 components.
To be really precise, my alpha size is 98, so 98 components.
I'm working with an alphabet of 98 characters here.
I need to define a placeholder for my correct answers.
And actually, the correct answers
are very easy to obtain here.
I'm just teaching it to output the same sequence shifted
by one.
So basically, to predict what the last character will be.
So again, the correct answers will
be a batch of sequences of 30 characters, which
I one-hot encode.
I need a placeholder also for my input state.
And we have seen that the batch of input states,
we have seen that the input state is made of three
of those internal vectors.
So that's three times cell size.
And now, I'm ready to write my model.
So the model is what was here, OK?
That's the model.
This model, with this little trick that we have seen before,
this model at the output of its softmax layer
actually produces an output that is batch size multiplied
by sequence length.
You remember, we put all the characters from the batches
and from the different stages of the unrolled sequence
in the same bag.
And now, to determine characters from those probabilities,
I use argmax.
Why?
Because each of those vectors is 100 components
with probabilities.
Argmax is a function that gives me the index of the biggest
number in this vector.
So the index in this vector is actually
the ASCII code of the character that has been predicted.
So these are my predictions now in ASCII
encoding in characters.
And I just need to reshape them back to have, again,
a batch of sequences of 30 predicted characters.
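A minimal sketch of this step in plain Python: take the argmax of each probability vector, then reshape the flat list of predictions back into a batch of sequences. The toy probabilities, the alphabet of 3 characters, and the sequence length of 2 are assumptions.

```python
def argmax(v):
    """Index of the biggest number in the vector."""
    return max(range(len(v)), key=lambda i: v[i])

# Flat softmax output: batch size * sequence length rows, one row per character.
probs = [[0.1, 0.7, 0.2],
         [0.6, 0.3, 0.1],
         [0.2, 0.2, 0.6],
         [0.05, 0.05, 0.9]]

flat_pred = [argmax(p) for p in probs]   # predicted character index per row

SEQ = 2   # hypothetical sequence length
pred = [flat_pred[i:i + SEQ] for i in range(0, len(flat_pred), SEQ)]
```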
And now, I'm ready to input-- to give my loss to an optimizer
and ask TensorFlow to optimize to actually train my network.
So this is the step, as yesterday, with this loss.
TensorFlow computes a gradient.
From this gradient, it can--
sorry.
And this loss is, of course, the difference
between the sequence of characters that was predicted
and the sequence of characters that I wanted to predict.
This difference becomes a loss.
That loss is derived, becomes a gradient.
We take a small step along this gradient,
which is actually in the space of weights and biases.
So taking a small step means we modify slightly our weights
and biases and continue.
That's the training.
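A minimal sketch of "taking a small step along the gradient", on a toy loss with a single weight. The loss function, the learning rate, and the number of steps are assumptions chosen so the example is easy to follow; in the talk, TensorFlow computes the gradient automatically.

```python
def loss(w):
    return (w - 3.0) ** 2        # toy loss, smallest at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)       # derivative of the toy loss

w = 0.0                          # initial weight
learning_rate = 0.1              # size of each small step
for _ in range(100):
    w -= learning_rate * gradient(w)   # step against the gradient to lower the loss
```

After these steps, w is very close to 3, where the toy loss is smallest.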
One last little gotcha.
So we have to take our input text
and actually cut it up in those sequences of 30 characters.
So initially, I thought, well, that's easy, you know?
You take a piece of text.
How do you cut it up in sequences of characters?
Well, you cut, and cut, and cut, and cut.
And then, if you need a batch of them,
you take the first 100 sequences you have.
And you put that in a batch.
That did not work.
Why?
Let's see here.
That's my first batch.
Let's see the first sequence in the batch?
The quick-- you know what that is going to be.
The quick brown fox something.
Well, when my neural network processes the quick,
it also outputs an output state.
And in the next iteration, that output state
will become the input state for the next sequence.
If I want this to be correct, that input state
must correspond to the continuation
of the quick brown fox, and so on,
which means that the sentence has
to continue over all of the first slots of all
of my batches.
It's a not completely trivial way of batching here.
You cut up your text in batches, in sequences.
But the way to batch them together,
since you have to pass the correct state at each stage,
is that the beginning of the text
has to be split across the first item in batches.
And then, from some point far, far, far later in the text,
you can start filling the second line of the batches.
It's just plumbing.
I wrote for you the five lines of code that do this.
It's five lines.
I spent four hours doing it, including tests.
I don't do arithmetic.
It's full of modulos and divides.
And I wrote unit tests and hacked it
until the unit tests passed.
It's called test-driven debugging.
Sorry, test-driven development.
That's what developers do.
All right, so yeah, small gotcha on the batching.
But whatever.
Just use the code on--
this is not actually important.
Just use the function that will cut up the text correctly
for you.
And you're ready to train.
And this is actually the full code of this neural network
on one slide.
So let's go through this again.
A placeholder for my input sequences.
I one-hot encode them.
I'm actually inputting sequences of characters, OK?
And all the people with cameras, this is on GitHub.
And the GitHub link is on the last slide.
So please take pictures.
My Twitter handle is over there.
Tweet them.
But then, you will be able to go on GitHub and actually
retrieve this.
Then, my expected outputs, Y underscore.
Again, I define a placeholder for them.
I will need to feed them during training.
And the first thing I do is that I one-hot encode them.
I will also need, and this is different
from normal neural networks, I will also
need a placeholder for my input state.
Remember?
RNNs have an input and an input state.
Two inputs.
Now, I'm ready to write my model.
So I chose the GRU cell.
I stack it three high.
And I unroll it as many times as x has components in it.
So here, my unroll size is sequence length.
And that's 30.
I chose 30 characters as the unroll size
of my recurrent neural network.
I do my little trick with the softmax
so that I can implement just one softmax node.
I feed the output through my softmax node.
Here, I apply argmax to retrieve from the softmax probabilities
the highest probability.
And that's the character I'm predicting.
I reshape this back to have a batch of predicted sequences.
Also, somewhere in the middle in there,
I had those probabilities.
I take those probabilities.
And I compute the distance between what it says
and what I wanted.
That's my loss.
I give my loss to the optimizer.
I obtain a training step.
And this training step is actually
that gradient, which is computed on this batch of training
characters.
And which, if I follow it by a little step,
will modify my weights and biases
and bring me to somewhere where this network works better,
where it has a smaller error function.
And now, my training loop.
You will see, this is very similar to what
we had previously.
We use this magic plumbing function
that I gave you to load sequences of characters
in the correct way.
And once I have a sequence of characters,
I run session.run of my training step.
I have to give it the input characters.
I have to give it the expected output.
And, since this is a recurrent neural network,
I have to give it the input state.
And this will give me an output state.
And you see the magic line, why this
is a recurrent neural network.
That's the last line there in the red.
Input state becomes-- sorry, the output state
becomes an input state.
That's why it's recurrent.
They're passing the state around.
All right.
So we are done.
We've built a recurrent neural network.
Now, we want to actually train it.
So let's go to a demo.
I will be training this on the complete works of William
Shakespeare.
That's not quite big data.
The complete works of William Shakespeare are five megabytes.
Yes, that puts things in perspective.
But it's good stuff.
So here, we see it's training on sequences.
So here are those sequences of 30 characters.
And here's a batch of them.
It's actually training on those sequences.
Here, predicting not much at all.
It's just the beginning.
And from time to time, I stop the training.
And I take just my one cell.
Remember, I have just one cell.
It's replicated for the purpose of training,
but it's just one cell.
And this one cell, it has become--
well, once it will be trained, it will have become--
a language model.
So what I can do with it is generate a new Shakespeare
play.
How do I do that?
Well, I take the cell.
I put in garbage, a random character.
That gives me an output character.
Probability of an output character,
which is the next character, and an output state.
You feed back the output state in the input,
and I feed back the output character as the new input.
And I continue.
And this is a state machine that will start generating text.
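A minimal sketch of this generation loop in plain Python. The `toy_cell` function is a hypothetical stand-in for the trained cell, and picking the most likely character at each step is a simplification; sampling from the probabilities is another option.

```python
ALPHABET = "abc "

def toy_cell(char_index, state):
    """Hypothetical stand-in for a trained cell: returns (probabilities, new state)."""
    nxt = (char_index + 1) % len(ALPHABET)   # this toy always favors the next letter
    probs = [0.7 if i == nxt else 0.1 for i in range(len(ALPHABET))]
    return probs, state + 1                  # toy state: number of steps taken so far

def generate(cell, first_char, length):
    index, state, out = first_char, 0, []
    for _ in range(length):
        probs, state = cell(index, state)    # the output state is fed back as input state
        index = max(range(len(probs)), key=lambda i: probs[i])   # most likely character
        out.append(ALPHABET[index])          # the output character becomes the next input
    return "".join(out), state

text, final_state = generate(toy_cell, first_char=3, length=8)
```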
You see here, it's--
yeah.
That's not quite Shakespeare yet.
It's training.
It's a bit slow on my machine.
I usually have a GPU connected here
to-- it brings me a nice 10x speed, or 6x roughly.
But still, well, it has done 50 more batches.
I will leave it running.
Let's go and see, sorry, here.
On this slide, what it is.
So at the beginning, it gives you this.
As I said, not quite Shakespeare.
But after only a tenth of an epoch, what we call an epoch,
it's when you have seen the entire training data set once.
So after having seen only a tenth
of what Shakespeare produced in his life, this is what we have.
Still not quite Shakespeare, but you see
there is some structure to it.
After two tenths, hey, this looks better!
It's starting to actually spell English almost correctly.
And there are those things in capital letters
at the beginning that are starting
to look like characters, like character names.
Even slightly later, oh, look!
And you have to remember that this
is a neural network that is predicting character
by character.
It first has to learn to spell English before going
to higher orders of structure.
So it's still not completely exact English.
But it's starting to look like English.
At least, Shakespearean English.
And you see it has character names.
And it's actually inventing new character names.
Here, Pordia and Henry Blutius--
who can tell me, no, seriously, who
can tell me if Shakespeare actually
used Henry Blutius in his work?
What do you call it?
I'm giving you the answer.
He didn't.
But it's a very credible Shakespearean character name.
And this is what you get after 30 epochs.
So it actually has a title.
There is an act.
There's a scene.
After the scene, it tells you where this is happening.
And look, it knows how to put scenic indications
in brackets, who enters, with whom, and so on.
It has even picked up stuff like, character names are
all caps, and when the character is a function, like Lord
or Chamberlain, it's only the first character
that is a capital.
It has picked up completely correctly as well.
And it's actually English.
So now that we have this, let's try to--
so this was on slides.
I will stop this.
What I have done previously is that I trained
this for actually 30 epochs.
And I saved my weights and biases.
So I'm ready to just replay it and generate a new Shakespeare
play.
Let's generate a new one live in front of you.
Here it is.
Let me stop it.
Whoops, sorry about that.
[APPLAUSE]
MARTIN GORNER: Is someone brave enough
to come and play hallucinated Shakespeare on the stage
with me?
Come on.
Yes!
Thank you.
Big applause.
SPEAKER: Come on up?
MARTIN GORNER: Thank you.
Please, come up.
You will have to speak loudly.
But that's how it is in a theater.
You don't have a microphone.
You speak.
So you can read off the screen here.
We will alternate.
So maybe I start, and then you do the next one.
So let's say enter Bardolph and Boult.
The manner off with my bestowers that you shall not see him,
and we are now to be the brother's wife and force,
to be so many and most grave.
SPEAKER: What art thou again?
What needs thy life?
Then, what they do not dote on thee.
The word will be at thee.
And take my heart to thee.
And they distemper.
Will thou beat me well to say god save my son?
[APPLAUSE]
MARTIN GORNER: Thank you so much.
SPEAKER: Thank you.
MARTIN GORNER: Thank you.
That was fantastic.
Thank you.
Actually, I tried to do this also on the Python code
of TensorFlow itself.
That was fun.
So in the beginning, you had this.
Looks like Python?
Maybe.
But very, very quickly, it actually
picks up Pythonic structures, like those keywords,
and it's generating something that looks like function calls.
Slightly later, it actually correctly uses the keywords
with function names-- a hallucinated function name.
It's actually quite inventive in the function names.
And a colon at the end.
It's still getting the nested parentheses wrong.
And after a longer while, it can recite the Apache license
in full.
Yes.
It's open source compliant.
And more interestingly for us, designers
of recurrent neural networks, it can actually close and open
the nested parentheses right to a depth of three,
which is quite impressive.
And what I find fantastic, it's that it has figured out
how to do Python comments.
And it's giving me TensorFlow advice in those comments.
But look, it makes sense!
Check that we have both scalar tensors
for being invalid to a vector of one
indicating the total loss of the same shape
as the shape of the tensor.
I'm sure this makes just as much sense as everything
that I've been saying since the beginning here.
All right, and small credits to a gentleman
called Andrej Karpathy who actually
wrote this neural network for the first time.
He published a blog about it.
He tried it on many different things.
He generated a business book for startups.
And he tried to generate an algebra book in LaTeX.
Actually, after training, this produced almost valid LaTeX.
So he had to hack it a little bit to make it compile.
But then, this looks like an algebra book.
That is even an attempt at a diagram.
And the line I prefer is how the neural network
solved how to write a proof.
Look at the very top.
"Proof omitted."
That's so clever.
All right.
So that's basically all I wanted to show you.
Well, this is how we generate it.
So I take just one cell.
And basically, in a loop, I feed in a character.
I take the output, feed it back as the input,
and feed the output state as the input state,
and just do this in a loop.
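As a rough sketch of that generation loop (not the speaker's actual code): the cell below is a stand-in with random, untrained weights, and the alphabet, the state size, and the choice to sample from the output probabilities are all assumptions made for illustration. Only the loop structure, output character and output state fed back as the next inputs, comes from the talk.

```python
import math
import random

ALPHABET = "abcdefgh "   # hypothetical toy alphabet
STATE_SIZE = 4           # hypothetical toy state size


def make_toy_cell(seed=0):
    """Return a stand-in cell with random, untrained weights."""
    rng = random.Random(seed)
    n = len(ALPHABET)
    # Weights from [one-hot input ; state] to the new state, and from state to logits.
    w_state = [[rng.uniform(-1, 1) for _ in range(n + STATE_SIZE)]
               for _ in range(STATE_SIZE)]
    w_out = [[rng.uniform(-1, 1) for _ in range(STATE_SIZE)] for _ in range(n)]

    def cell(char_index, state):
        x = [1.0 if i == char_index else 0.0 for i in range(n)] + state
        new_state = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w_state]
        logits = [sum(w * v for w, v in zip(row, new_state)) for row in w_out]
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        return [e / total for e in exps], new_state   # softmax probabilities, new state

    return cell


def generate(cell, length, seed=0):
    rng = random.Random(seed)
    char_index = rng.randrange(len(ALPHABET))   # "garbage": a random first character
    state = [0.0] * STATE_SIZE
    out = []
    for _ in range(length):
        # The output state becomes the next input state.
        probs, state = cell(char_index, state)
        # Assumption: sample the next character from the predicted probabilities.
        char_index = rng.choices(range(len(ALPHABET)), weights=probs)[0]
        out.append(ALPHABET[char_index])        # the output character is the next input
    return "".join(out)


text = generate(make_toy_cell(), 20)
```

With untrained weights the result is noise, which matches what the early training samples look like; a trained cell plugged into the same loop is what would produce the play-like text.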
A couple of applications of this.
Oh, yes.
Actually, we still have a little bit of time.
This time, I've been using TensorBoard to visualize
my inputs and outputs.
Where is my TensorBoard?
Somewhere.
Sorry, I'll find it.
Here.
In the last session, I was just throwing the outputs
into matplotlib, which is the very standard Python plotting
library.
But there is a tool dedicated to visualizing training
in TensorFlow.
It's called TensorBoard.
And I advise you to use it, especially
if you do distributed training, or training
on remote servers.
It can connect to a bucket and get the information from there
and visualize it.
So here again, I have configured,
when I was training this network,
I configured it to actually do training and validation.
I put one Shakespeare play aside for validation
to test my network.
And if you'll remember the session from yesterday,
I find it very important to follow my loss curves, both
the training and the test loss curve, on the screen.
This is what I got.
And actually, first of all, who sees something wrong?
Overfit, yeah.
And so now, the question is, why is it overfitting here?
I will give you the answer because you can't guess it.
But here, I was actually training on a small subset
of the Shakespeare corpus.
So here, it was overfitting because of lack of data.
And since I had this on the curves,
I wanted to show it to you because you certainly remember
that somewhere, where is it?
This one.
Somewhere here, I had this helpful engineering chart.
Which allows you to interpret what overfitting is.
And we went yesterday through the bad network.
We went through too many neurons.
We never had not enough data.
So I tried with not enough data.
And yes, it also gives you this very recognizable pattern
in the curves.
And as soon as I train with more data, this is what I have.
So here, the two curves follow each other closely.
And I know that I have solved the problem.
So actually, I was doing this because I
was trying to add dropout into my network
to make it work better.
No.
It was misbehaving just because of lack of data.
Dropout would not have solved that.
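The pattern in the curves can also be checked numerically. This is a minimal sketch under assumptions: the function name, the window size, and the loss values are made up for illustration and are not taken from the talk's training runs.

```python
def diverging(train_losses, valid_losses, window=3):
    """True if, over the last `window` steps, training loss fell while validation loss rose."""
    if len(train_losses) <= window or len(valid_losses) <= window:
        return False
    train_fell = train_losses[-1] < train_losses[-1 - window]
    valid_rose = valid_losses[-1] > valid_losses[-1 - window]
    return train_fell and valid_rose


# Made-up numbers: training loss keeps falling in both runs.
train = [3.0, 2.5, 2.0, 1.6, 1.3, 1.1]
valid_small_data = [3.1, 2.7, 2.5, 2.6, 2.8, 3.0]   # validation loss turns back up
valid_more_data = [3.1, 2.6, 2.1, 1.7, 1.4, 1.2]    # validation loss follows training

overfit_small = diverging(train, valid_small_data)
overfit_more = diverging(train, valid_more_data)
```

A check like this only flags the symptom. As the talk points out, it does not say whether the cause is too many neurons or too little data.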
All right, and so a couple of applications,
practical applications to finish.
We've seen how to produce a character by character model.
We can also use this not character by character,
but word by word.
So as I said previously, with a word,
it's a bit more complicated because to one-hot encode
a word, you need to encode it on a vector of, this time,
30,000 components.
Because that's the typical size of a vocabulary
for a typical language.
So those are big.
So on the inputs, there is actually
a very simple solution.
How do you reduce the size of a big vector?
Well, you use one layer of a neural network
and produce fewer outputs.
It's called embeddings.
And that layer can either be part of your training--
then, your embeddings are learned
as part of the training.
Or, you can use some neural network
that has been already trained, typically trained
on the English language generically,
and that just encodes words into smaller vectors.
There is a very famous, what is it,
neural network that has been built for that.
It's called Word2Vec.
Already trained, available on GitHub.
You can use that to encode your English words
if your problem deals with English words
as smaller vectors.
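To see why one layer of a neural network applied to a one-hot vector is just a table lookup, here is a small sketch. The embedding size of 8 and the random weights are assumptions for the demo; the 30,000 vocabulary size is the figure from the talk. The layer has no bias and no activation function.

```python
import random

VOCAB_SIZE = 30000   # vocabulary size quoted in the talk
EMBED_SIZE = 8       # hypothetical, kept small for the demo

rng = random.Random(0)
# One bias-free layer: a VOCAB_SIZE x EMBED_SIZE weight matrix.
weights = [[rng.uniform(-1, 1) for _ in range(EMBED_SIZE)] for _ in range(VOCAB_SIZE)]


def one_hot(index, size):
    v = [0.0] * size
    v[index] = 1.0
    return v


def dense_layer(x, w):
    """x times w, computed the long way over all 30,000 inputs."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def embed(index, w):
    """The same result, read directly as one row of the weight matrix."""
    return w[index]


word_index = 1234   # hypothetical position of some word in the vocabulary
via_matmul = dense_layer(one_hot(word_index, VOCAB_SIZE), weights)
via_lookup = embed(word_index, weights)
```

Both paths give the same 8 numbers, which is why embedding layers are usually implemented as a row lookup rather than as an actual matrix multiply.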
And so once we have solved this problem of how
to input words instead of characters,
you can, for example, use a recurrent neural network
like this to predict not what the next word is,
but a categorization of a sequence.
And this is used in newspapers to automatically categorize
articles as geopolitics, science, sports, and so on.
Works very well.
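One common way to arrange this, sketched here under assumptions: run the cell over the whole sequence of word vectors, keep only the final state, and feed that state to a softmax layer over the categories. The talk does not spell out this wiring; the sizes, the random weights, and the three-word "article" are hypothetical.

```python
import math
import random

EMBED_SIZE = 3    # hypothetical sizes for the demo
STATE_SIZE = 4
CATEGORIES = ["geopolitics", "science", "sports"]

rng = random.Random(0)
w_cell = [[rng.uniform(-1, 1) for _ in range(EMBED_SIZE + STATE_SIZE)]
          for _ in range(STATE_SIZE)]
w_class = [[rng.uniform(-1, 1) for _ in range(STATE_SIZE)]
           for _ in range(len(CATEGORIES))]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(word_vectors):
    state = [0.0] * STATE_SIZE
    for vec in word_vectors:
        # Same cell at every step: weighted sums of [word vector ; state], then tanh.
        x = list(vec) + state
        state = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w_cell]
    # Only the final state is used to predict the category.
    logits = [sum(w * v for w, v in zip(row, state)) for row in w_class]
    return softmax(logits)


article = [[0.1, -0.3, 0.5], [0.7, 0.2, -0.1], [0.0, 0.4, 0.9]]   # made-up word vectors
probs = classify(article)
```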
How do you do translation?
Well, to do translation, that's how Google Translate works,
you tack two of those recurrent networks end to end.
To the first one, you apply an English sentence
plus a stop symbol.
And then, you continue.
And you ask it to output the French sentence.
And what you have on the input, there is a choice.
Normally, you should be inputting
what your network outputs.
But people have also tried to input
what the network should output.
So both options exist.
And they give you different results.
You can read about this in the literature.
So this is how translation works.
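The two input choices can be shown side by side. In this sketch `toy_step` is a made-up stand-in for one step of a trained decoder cell, and the token numbers are arbitrary. The only point is what gets fed in at each step: the network's own previous output, or the token it should have produced.

```python
VOCAB = 10   # hypothetical tiny vocabulary of token ids 0..9


def toy_step(prev_token, state):
    """Stand-in for one decoder step; a real one would be a trained cell."""
    new_state = (state * 31 + prev_token) % 97
    output_token = new_state % VOCAB
    return output_token, new_state


def decode(target, start_token, state, feed_targets):
    inputs, outputs = [], []
    prev = start_token
    for t in range(len(target)):
        inputs.append(prev)
        out, state = toy_step(prev, state)
        outputs.append(out)
        # Option 1: feed back what the network produced.
        # Option 2: feed in what the network should have produced.
        prev = target[t] if feed_targets else out
    return inputs, outputs


target = [3, 1, 4, 1, 5]   # made-up "French sentence" as token ids
own_inputs, own_outputs = decode(target, start_token=0, state=7, feed_targets=False)
forced_inputs, forced_outputs = decode(target, start_token=0, state=7, feed_targets=True)
```

The second option is only available during training, since at translation time the correct output is not known.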
Of course, you have a big problem at the end.
I won't go into that.
Because to do the softmax layers there,
you actually want to produce a vector of 30,000 probabilities.
That's a bit heavy.
So there are ways of mitigating that.
But that's an active area of research.
One that is implemented in TensorFlow
is called sampled softmax.
But there are many others because this
is an active area of research.
How to do this softmax layer to produce 30,000 probabilities
each time, which is a bit heavy.
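The idea behind sampled softmax can be sketched in a few lines: during training, compute the loss over the correct word plus a small random sample of wrong words instead of all 30,000. This is a simplified illustration, not TensorFlow's implementation (which is exposed as `tf.nn.sampled_softmax_loss`): it samples negatives uniformly and leaves out the sampling-probability correction that fuller versions apply. The hidden size, sample count, and weights are assumptions.

```python
import math
import random

VOCAB_SIZE = 30000   # vocabulary size quoted in the talk
HIDDEN_SIZE = 8      # hypothetical
NUM_SAMPLED = 64     # hypothetical number of sampled wrong words

rng = random.Random(0)
out_weights = [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN_SIZE)]
               for _ in range(VOCAB_SIZE)]
hidden = [rng.uniform(-1, 1) for _ in range(HIDDEN_SIZE)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy(logits, true_position):
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[true_position]


def full_loss(true_index):
    """Softmax cross-entropy over the whole vocabulary: 30,000 weighted sums."""
    logits = [dot(row, hidden) for row in out_weights]
    return cross_entropy(logits, true_index), len(logits)


def sampled_loss(true_index, num_sampled, rng):
    """Same loss, but only over the true word plus a few sampled wrong words."""
    negatives = set()
    while len(negatives) < num_sampled:
        c = rng.randrange(VOCAB_SIZE)
        if c != true_index:
            negatives.add(c)
    candidates = [true_index] + sorted(negatives)
    logits = [dot(out_weights[c], hidden) for c in candidates]
    return cross_entropy(logits, 0), len(logits)


full, full_count = full_loss(true_index=1234)
sampled, sampled_count = sampled_loss(1234, NUM_SAMPLED, rng)
```

Here the sampled version does 65 weighted sums per step instead of 30,000. It is an approximation used for training; producing actual probabilities for every word still needs the full softmax.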
And one more is image labeling.
So here, it's a very simplified version of image labeling.
Image labeling is you take an image,
and you want to produce a sentence.
Like, this is a little girl holding a teddy bear.
This is a truck in the desert.
So this is actually also a translation problem.
You take vectors from an image.
And you apply a recurrent neural network
to produce a sequence of words which
you want to be the description of this image.
How do you encode an image as a vector?
Well, there are plenty of solutions.
One of them is to take an off the shelf
image recognition neural network, like Inception,
and just chop off the last couple of layers.
Normally, what Inception gives is categories.
This is a truck.
This is a beach.
This is a lizard.
That's not what you want.
But all the top layers are actually
encoding an image in some meaningful way into a vector.
You can use that as a fixed encoding function.
And input the vector corresponding to the image
here.
Produce this output sequence.
And sometimes, it works really well.
This is what was generated.
A herd of elephants walking across a dry grass field,
and so on.
And then sometimes, yeah, not quite.
Thank you.
[APPLAUSE]
[MUSIC PLAYING]
TensorFlow and Deep Learning without a PhD, Part 2 (Google Cloud Next '17)

jwlee published on April 23, 2017