LUCASZ KAISER: Hi, my name is Lucasz Kaiser,
and I want to tell you in this final session
about Tensor2Tensor, which is a library we've
built on top of TensorFlow to organize the world's models
and data sets.
So I want to tell you about the motivation,
and how it came together, and what you can do with it.
But also if you have any questions
in the meantime anytime, just ask.
And if you've already used Tensor2Tensor, in that case,
you might have even more questions.
But the motivation behind this library
is-- so I am a researcher in machine learning.
I also worked on production [INAUDIBLE] models,
and research can be very annoying.
It can be very annoying to researchers,
and it's even more annoying to people
who put it into production, because the research works
like this.
You have an idea.
You want to try it out.
It's machine learning, and you think,
well, I will change something in the model, it will be great.
It will solve physics problems, or translation, or whatever.
So you have this idea, and you're like, it's so simple.
I just need to change one tweak, but then, OK, I
need to get the data.
Where was it?
So you search for it online, you find it,
and it's like, well, so I need to preprocess it.
You implement some data reading.
You download the model that someone else did.
And it doesn't give the result at all
that someone else wrote in the paper.
It's worse.
It works 10 times slower.
It doesn't train at all.
So then you start tweaking it.
Turns out, someone else had this script
that preprocessed the data in a certain way that
improved the model 10 times.
So you add that.
Then it turns out your input pipeline is not performing,
because it doesn't put data on GPU or CPU or whatever.
So you tweak that.
Before you start with your research idea,
you've spent half a year on just reproducing
what's been done before.
So then great.
Then you do your idea.
It works.
You write the paper.
You submit it.
You put it in the repo on GitHub,
which has a README file that says,
well, I downloaded the data from there,
but this link was already gone two days
after you made the repo.
And then I applied.
And you describe all these 17 tweaks,
but maybe you forgot one option that was crucial.
Well, and then there is the next paper and the next research,
and the next person comes and does the same.
So it's all great except the production team, at some point,
they get like, well, we should put it into production.
It's a great result. And then they
need to track this whole path, redo all of it,
and try to get the same.
So it's a very difficult state of the world.
And it's even worse because there are different hardware
configurations.
So maybe something that trained well on a CPU
does not train on a GPU, or maybe you need an 8 GPU setup,
and so on and so forth.
So the idea behind Tensor2Tensor was,
let's make a library that has at least a bunch
of standard models for standard tasks that includes
the data and the preprocessing.
So you really can, on a command line, just say,
please get me this data set and this model, and train it,
and make it so that we can have regression tests and actually
know that it will train, and that it will not break with
TensorFlow 1.10.
And that it will train both on the GPU and on a TPU,
and on a CPU--
to have it in a more organized fashion.
And the thing that prompted Tensor2Tensor,
the thing why I started it, was machine translation.
So I worked with the Google Translate team
on launching neural networks for translation.
And this was two years ago, and this was amazing work.
Because before that, machine translation
was done in this way like--
it was called phrase-based machine translation.
So if you find some alignments of phrases,
then you translate the phrases, and then you
try to realign the sentences to make them work.
And the results in machine translation
are normally measured in terms of something
called the BLEU score.
I will not go into the details of what it is.
It's like the higher the better.
So for example, for English-German translation,
the BLEU score that human translators get is about 30.
And the best phrase-based-- so non-neural network,
non-deep-learning-- systems were about 20, 21.
And it's been, really, a decade of research at least,
maybe more.
So when I was doing a PhD, if you got one BLEU score up,
you would be a star.
It was a good PhD.
If you went from 21 to 22, it would be amazing.
So then the neural networks came.
And the early LSTMs in 2015, they were like 19.5, 20.
And we talked to the Translate team,
and they were like, you know, guys, it's fun.
It's interesting, because it's simpler in a way.
You just train the network on the data.
You don't have all the--
no language-specific stuff.
It's a simpler system.
But it gets worse results, and who knows
if it will ever get better.
But then the neural network research moved on,
and people started getting 21, 22.
So the Translate team, together with Brain, where I work,
made the big effort to try to make a really large LSTM
model, which is called the GNMT, the Google Neural Machine
Translation.
And indeed it was a huge improvement.
It got to 25 BLEU.
Later, we added mixtures of experts, and it even got to 26.
So they were amazed.
It launched in production, and well, it
was like a two-year effort to take the papers,
scale them up, launch it.
And to get these really good results,
you really needed a large network.
So as an example why this is important,
or why this was important for Google is--
so you have a sentence in German here,
which is like, "problems can never
be solved with the same way of thinking that caused them."
And this neural translator translates the sentence kind
of the way it should--
I doubt there is a much better translation--
while the phrase-based translators, you can see,
"no problem can be solved from the same consciousness
that they have arisen."
It kind of shows how the phrase-based method works.
Every word or phrase is translated correctly,
but the whole thing does not exactly add up.
You can see it's a very machiney way,
and it's not so clear what it is supposed to say.
So the big advantage of neural networks
is they train on whole sentences.
They can even train on paragraphs.
They can be very fluent.
Since they take into account the whole context at once,
it's a really big improvement.
And if you ask people to score translations,
this really starts coming close--
or at least 80% of the distance to what human translators do,
at least on newspaper language-- not poetry.
[CHUCKLING]
We're nowhere near that.
So it was great.
We got the high BLEU scores.
We reduced the distance to human translators.
It turned out the one system can handle
different languages, and sometimes even
multilingual translations.
But there were problems.
So one problem is the training time.
It took about a week on a setup of 64 to 128 GPUs.
And all the code for that was done specifically
for this hardware setup.
So it was distributed training where
everything in the machine learning pipeline
was tuned for the hardware.
Well, because we knew we would train in this data
center on this hardware.
So why not?
Well, the problem is batch sizes and learning rates,
they come together.
You cannot tune them separately.
And then you add tricks.
Then you tweak some things in the model
that are really good for this specific setup,
for this specific learning rate or batch size.
This distributed setup was training asynchronously.
So there were delayed gradients.
It's a regularizer, so you decrease dropout.
You start doing parts of the model
specifically for a hardware setup.
And then you write the paper.
We did write a paper.
It was cited.
But nobody ever outside of Google
managed to reproduce this, get the same result
with the same network, because we can give you
our hyperparameters, but you're running on a different hardware
setup.
You will not get the same result.
And then, in addition to the machine learning setup,
there is the whole tokenization pipeline, data
preparation pipeline.
And even though these results are on the public data,
the whole pre-processing is also partially Google.
It doesn't matter much.
But it really did not allow other people
to build on top of this work.
So it launched, it was a success for us,
but in the research sense, we felt that it
came short a little bit.
Because for one, I mean, you'd need a huge hardware setup
to train it.
And on the other hand, even if you had the hardware setup,
or if you got it on cloud and wanted to invest in it,
there would still be no way for you to just do it.
And that was the prompt, why I thought,
OK, we need to make a library for the next time
we build a model.
So the LSTMs were like the first wave of sequence models
with the first great results.
But I thought, OK, the next time we come to build a model,
we need to have a library that will ensure it works at Google
and outside, that will make sure when you train on one GPU,
you get a worse result, but we know what it is.
We can tell you, yes, you're on the same setup.
Just scale up.
And it should work on cloud so you can just,
if you want a better result, get some money,
pay for larger hardware.
But it should be tested, done, and reproducible outside.
And the need-- so the Tensor2Tensor library started
with the model called Transformer,
which is the next generation of sequence models.
It's based on self-attention layers.
And we designed this model.
It got even better results.
It got 28.4 BLEU.
Now we are on par with BLEU with human translators.
So this metric is not good anymore.
It just means that we need better metrics.
But this thing, it can train in one day on an 8 GPU machine.
So you can just get it.
Get an 8 GPU machine.
It can be your machine, it can be in the cloud.
Train, get the results.
And it's not just reproducible in principle.
There's been a number of groups that reproduced it, got
the same results, wrote follow-up papers,
changed the architecture.
It went up to 29, it went up to 30.
There are companies that use this code.
They launched competitors to Google Translate.
Well, that happens.
And Google Translate improved again.
But in a sense, I feel like it's been a larger success in terms
of community and research, and it raised the bar for everyone.
It raised our quality as well.
So that's how it came to be that we
feel that it's really important to make things reproducible,
open, and test them on different configurations
and different hardware.
Because then we can isolate what parts are really
good fundamentally from the parts that are just
tweaks that work in one configuration
and fail in the other.
So that's our solution to this annoying research problem.
It's a solution that requires a lot of work,
and it's based on many layers.
So the bottom layer is TensorFlow.
And TensorFlow, in the meantime, has also evolved a lot.
So we have TF Data, which is the TensorFlow data input pipeline.
It was also not there a year ago.
It's in the newer releases.
It helps to build input pipelines that
are performant across different hardware.
There is TF Layers and Keras, which
are higher-level libraries.
So you don't need to write, in small TensorFlow
Ops, everything.
You can write things on a higher level of abstraction.
There is the new distribution strategy,
which allows you to have an estimator
and say, OK, train on eight GPUs,
train on one GPU, train on a distributed setup,
train on TPU.
You don't need to rewrite handlers for everything on your own.
But that's just the basics.
And then comes the Tensor2Tensor part, which is like, OK,
I want a good translation model, where do I get the data from?
It's somewhere on the internet, but where?
How do I download it?
How do I pre-process it?
Which model should I use?
Which hyperparameters of the model?
What if I want to change a model?
I just want to try my own, but on the same data.
What do I need to change?
How can it be done?
What if I want to use the same model, but on my own data?
I have a translation company.
I have some data.
I want to check how that works.
What if I want to share?
What if I want to share a part?
What if I want to share everything?
That's what Tensor2Tensor does.
So it's a library.
It's a library that has a lot of data sets--
I think it's more than 100 by now--
all the standard ones, images, ImageNet, CIFAR, MNIST,
image captioning, COCO, translations
for a number of languages, just pure language
modeling data sets, speech to text, music, video data sets.
It's recently very active.
If you're into research, you can either probably find it here
or there is a very easy tutorial on how to add it.
And then with the data sets come the models.
There is the transformer, as I said you--
told you-- that's how it started.
But then the standard things, ResNet, then more fancy image
models like RevNet, ShakeShake, Xception; sequence models,
also a bunch of them, SliceNet, ByteNet,
that's a version of WaveNet;
LSTMs; then algorithmic models like Neural GPUs.
There was a bunch of recent papers.
So it's a selection of models and data sets,
but also the framework.
So if you want to train a model, there is one way to do it.
There are many models.
You need to specify which one.
And there are many datasets.
You need to specify which one.
But there is one training binary.
So it's always the same.
No two-page README saying, please run these commands,
and for another, run different commands.
Same for decoding.
You want to get the outputs of your model
on the new data set?
One command, t2t-decoder.
You want to export it to make a server or a website?
One command.
And then you want to train, train locally,
you just run the binary.
If you want to train on Google Cloud,
just give your cloud project ID.
You want to train on Cloud TPU, just say --use_tpu
and give the ID.
You need to tune hyperparameters.
There is support for it on Google Cloud.
We have ranges.
Just specify the hyperparameter range and tune.
You want to train distributed on multiple machines,
there is a script for that.
So Tensor2Tensor is data sets, models, and everything
around that's needed to train them.
Now, this project, due to our experience with translation,
we decided it's open by default. And open
by default, in a similar way as TensorFlow,
means every internal code change we push gets
immediately pushed to GitHub.
And every PR from GitHub, we import internally and merge.
So there is just one code base.
And since this project is pure Python, there's no magic.
It's the same code at Google and outside.
And it's like internally we have dozens of code changes a day.
They get pushed out to GitHub immediately.
And since a lot of Brain researchers use this daily,
there are things like this.
So there was a tweet about research and optimizers.
And it was like, there are optimizers like AMSGrad,
adaptive learning rate methods.
And then James Bradbury at Facebook
at that time tweeted, well, it's not the latest optimizer.
The latest optimizer is in Tensor2Tensor, in code, with a
TODO to write a paper.
The paper is written now.
It's a very good optimizer, Adafactor.
But yeah, the code, it just appears there.
The papers come later.
But it makes no sense to wait.
I mean, it's an open research community.
These ideas, sometimes they work, sometimes they don't.
But that's how we work.
We push things out.
And then we train and see.
Actually, by the time the paper appeared,
some people in the open source community
had already trained models with it.
So we added the results.
They were happy to.
It's a very good optimizer, saves a lot of memory.
It's a big collaboration.
So as I said, this is just one list of names.
It should probably be longer by now.
It's a collaboration between Google Brain, DeepMind.
Currently there are researchers from the Czech Republic
on GitHub and Germany, so over 100 contributors by now,
over 100,000 downloads.
I was surprised, because Ryan got this number for this talk.
And I was like, how come there are 100,000 people using
this thing?
It's for ML researchers.
But whatever, they are.
And there are a lot of papers that use it.
So these are just the papers that have already
been published and accepted.
There is a long pipeline of other papers
and possibly some we don't know about.
So as I told you, it's a unified framework for models.
So how does it work?
Well, the main script of the whole library is t2t-trainer.
It's the one binary where you tell what model, what data set,
what hyperparameters, go train.
So that's the basic command line-- install tensor2tensor
and then call t2t-trainer.
The problem is the name of the dataset.
And it also includes all details like how
to pre-process, how to resize images, and so on and so forth.
Model is the name of the model and hyperparameter set
is which configuration, which hyperparameters
of the model, which learning rates, and so on, to use.
And then, of course, you need to specify
where to store the data, where to store
the model checkpoints, for how many steps to train, and so on.
But that's the full command.
And for example, you want a summarization model.
There is a summarization data set
that's been used in academia.
It's from CNN and Daily Mail.
You say you want the transformer,
and there is a hyperparameter set that
does well on summarization.
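As a rough sketch of the command just described, the snippet below
assembles a t2t-trainer invocation in Python and prints it.
The flag names follow the ones named in the talk (problem, model,
hyperparameter set, data and output locations, training steps);
the specific registry names, paths, and step count are assumptions
chosen for illustration and may not match your installed version.

```python
import shlex


def t2t_trainer_command(problem, model, hparams_set,
                        data_dir, output_dir, train_steps):
    """Build the argument list for a single t2t-trainer run."""
    return [
        "t2t-trainer",
        "--generate_data",               # download and preprocess if needed
        f"--data_dir={data_dir}",        # where to store the data
        f"--output_dir={output_dir}",    # where to store model checkpoints
        f"--problem={problem}",          # data set plus its preprocessing
        f"--model={model}",              # which model to train
        f"--hparams_set={hparams_set}",  # which configuration to use
        f"--train_steps={train_steps}",  # how long to train
    ]


# Assumed names for the summarization example; check them against
# the registry of the version you have installed.
argv = t2t_trainer_command(
    problem="summarize_cnn_dailymail32k",
    model="transformer",
    hparams_set="transformer_prepend",
    data_dir="/tmp/t2t_data",
    output_dir="/tmp/t2t_train/summarize",
    train_steps=1000,
)
command = " ".join(shlex.quote(arg) for arg in argv)
print(command)
```

Switching to another task only changes the three names, which is
the point of having a single training binary.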
You want image classification,
like CIFAR10 is quite a standard benchmark for papers.
You say, I want image CIFAR10.
ShakeShake model, this was state of the art about
a year ago.
This changes quickly.
You want the big model, you go train it.
And the important thing is we know this result.
This gives less than 3% error on CIFAR, which is, as I said,
was state of the art a year ago.
Now it's down to two.
But we can be certain that when you
run this command for the specified number of training
steps, you will actually get this state of the art,
because internally we run regression tests that
start this every day and tell us if it fails.
So the usefulness of this framework is not just in--
well, we have it grouped into one command.
But because it's automated, we can start testing it.
If there is a new change in TensorFlow that
will break some kernel and it doesn't come out
in the unit test, it often comes out
in the regression tests of these models.
And we found at least three bugs in the recent two versions
of TensorFlow, because some things in machine learning only
appear--
like, things still run, things still train,
but they give you 2% less.
These are very tricky bugs to find,
but if you know which day it started failing,
it's much easier.
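A minimal sketch of that kind of check, assuming a nightly job that
records one metric per day for a fixed training command.
The function name, the scores, and the tolerance are hypothetical;
this is not the internal test harness, only the idea behind it.

```python
def first_regression(daily_scores, baseline, tolerance):
    """Return the first day whose score drops below baseline - tolerance."""
    for day, score in daily_scores:
        if score < baseline - tolerance:
            return day
    return None


# Hypothetical accuracies from nightly runs of the same command.
runs = [
    ("day-1", 0.971),
    ("day-2", 0.972),
    ("day-3", 0.951),  # still runs and trains, but about 2% worse
    ("day-4", 0.950),
]
print(first_regression(runs, baseline=0.970, tolerance=0.005))  # day-3
```

Knowing the first failing day narrows the search to the changes
that landed between two runs.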
Translation, as I said, it started with transformer.
We added more changes.
Nowadays, it trains to over 29 BLEU.
It's a very good translation model.
Just run this command on an 8 GPU machine.
Wait.
You will get a really good translator.
Speech recognition, there is the open librispeech data set.
Transformer model without any language model
gets a really good word error rate.
Some more fancy things, like if you want to generate images,
it's recently popular, have a model that just generates for you
either faces or landscapes; there are different datasets.
So this is a model that you train just
on CIFAR-10 reversed.
To every data set in Tensor2Tensor you
can add this _rev. It reverses inputs and targets.
And generative models, they can take it and generate it.
For translation, it's very useful
if you want, instead of English-German, German-English:
you just add _rev. It reverses
the ordering of the dataset.
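The effect of the _rev suffix can be sketched in a few lines.
This is an illustration of the idea only, with a hypothetical
load_examples helper and made-up problem names; the library's own
implementation is not shown here.

```python
def load_examples(problem_name, base_examples):
    """Yield examples; a trailing "_rev" swaps inputs and targets."""
    reverse = problem_name.endswith("_rev")
    for example in base_examples:
        if reverse:
            yield {"inputs": example["targets"],
                   "targets": example["inputs"]}
        else:
            yield dict(example)


# One toy English-German pair.
en_de = [{"inputs": "Thank you", "targets": "Danke"}]

forward = list(load_examples("translate_ende", en_de))
backward = list(load_examples("translate_ende_rev", en_de))
print(forward)
print(backward)
```

The same swap turns an image classification data set (image in,
label out) into a conditional generation task (label in, image out).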
So yeah, so they're the commands.
But so for example, on an image transformer,
if you try to train this on a single GPU
to get to this 2.9 bits per dimension,
you'd probably have to wait half a year.
So that's not very practical.
But that's the point.
Currently it's a very hard task to do a very good image
generative model.
One GPU might not be enough for state of the art.
So if you want to really push it, you need to train at scale.
You need to train multi GPU.
You need to go to TPUs.
Well, this is the command you've seen before.
To make it multi-GPU, you just say --worker_gpu=8.
This will use eight GPUs on your machine.
Just make batches eight times larger.
Run the eight GPUs in parallel, and there it trains.
Want to train on a cloud TPU?
Say --use_tpu, and you need to specify the master of the TPU instance
that you booked on cloud.
It trains the same.
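A small sketch of how the same base command might be extended for
different hardware. The flag spellings are an assumption based on
the options named in the talk, and the TPU address is a placeholder,
not a real endpoint.

```python
# Base command; the problem and hparams names are assumed examples.
base = [
    "t2t-trainer",
    "--problem=translate_ende_wmt32k",
    "--model=transformer",
    "--hparams_set=transformer_base",
]


def on_gpus(argv, num_gpus):
    """Same command, spread over several GPUs on one machine."""
    return argv + [f"--worker_gpu={num_gpus}"]


def on_tpu(argv, master):
    """Same command, pointed at a booked Cloud TPU instance."""
    return argv + ["--use_tpu", f"--master={master}"]


print(" ".join(on_gpus(base, 8)))
print(" ".join(on_tpu(base, "grpc://TPU_ADDRESS:8470")))  # placeholder
```

The model, problem, and hyperparameter names stay untouched; only
the hardware flags change.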
Want to train on a cloud TPU pod?
I don't know, I guess you've heard today,
Google is opening up to public the pods which go up to 256,
I think, TPU cores.
Just say, oh, maybe up to 512, what I see from this command.
Just say do it.
Train.
It will train much faster.
How much faster?
Well, we've observed nearly linear scaling up
to half a pod, and I think, like, 10% loss on a full pod.
So these models, the translation models,
they can train on a pod for an hour,
and you get state of the art performance.
So this can really make you train very fast.
Same for ImageNet.
Well, I say an hour, there's now a competition.
Can we get down to half an hour, 18 minutes.
I'm not sure how important that is, but it's really fast.
Now, maybe you don't just care about training
one set of hyperparameters.
Maybe you have your own data set and you
need to tune hyperparameters, find a really good model
for your application.
Say Cloud ML Engine autotune.
You need to say what metric to optimize--
so accuracy, perplexity, these are the standard metrics
that people tune the models for.
Say how many trials, how many of them to run in parallel.
And the final line is a range.
So a range says, well, try learning rates
from 0.1 to 0.3, logarithmically or uniformly.
These are the things you specify.
So you can specify continuous things in an interval
and you can specify discrete things.
Just try two, three, four, five layers.
And the tuner, it starts the number of parallel trials,
so 20 in this command.
The first one is random, and then the next one,
it has a quite sophisticated non-differentiable optimization
model, which is Bayesian mixed with CMA-ES, to decide
what to try next. It will try another 20 trials.
Usually after, like, 60 or so it starts getting
to a good parameter space.
So if you need to optimize, that's how you do it.
And, like, if you're wondering what range to optimize,
we have a few ranges in code that we usually
optimize for when we start with new data.
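To make the notion of a range concrete, here is a sketch of the
random first round only: a continuous interval sampled uniformly or
logarithmically, plus a discrete choice. The helper names are
hypothetical, and the later Bayesian and CMA-ES rounds of the tuner
are deliberately not modeled.

```python
import math
import random


def sample_float(low, high, scale, rng):
    """Draw from [low, high], uniformly or uniformly in log space."""
    if scale == "log":
        return math.exp(rng.uniform(math.log(low), math.log(high)))
    return rng.uniform(low, high)


def sample_trial(rng):
    """One trial: a continuous range and a discrete range."""
    return {
        "learning_rate": sample_float(0.1, 0.3, "log", rng),
        "num_layers": rng.choice([2, 3, 4, 5]),
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
first_round = [sample_trial(rng) for _ in range(20)]  # 20 parallel trials
for trial in first_round[:3]:
    print(trial)
```

A real tuner would feed the measured metric of each trial back in
to choose the next batch of 20.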
On a TPU pod, if you want a model that doesn't just
do training on large batches, data
parallel, but model parallel.
If you want to have a model with a huge number of parameters,
more than one billion, you can use
something we call Mesh TensorFlow that we also
have started developing in tensor2tensor,
which allows you to do model parallelism in an easy way.
You just say, split my tensor across the cores,
how many cores you have.
Or split it eight-wise on this dimension and four-wise
on this dimension.
I'll tell a bit more later about that.
It allows you to train really large models if you want this.
And that gives really good results.
So that's how the library works.
You can go and use it with the models and data
sets that are there.
But what if you want to just get the data from the data set
or to add your own data set?
Well, it's still a Python library.
You can just import it.
And there is this problem class, which
you can use without any other part of the library.
So you can just--
you get an instance of the problem class
either by-- so we have this registry
to call things by name.
So you can say registry.problem and the name.
You can say problems.available to get
all the available names.
Or you can instantiate it directly.
If you look into the code where the class is, you can say,
give me this class.
And then generate data.
The problem class knows where on the internet to find the data
and how to pre-process it.
So generate_data will go to this place,
download it from the internet, and pre-process it into TF Example
files in the same way that we use it
or that the authors of this data set
decide it is good for their models.
And then you call problem.dataset, which reads it
from disk and gives you this queue of tensors
in the form of a data set.
So that's for data sets.
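The pattern described here (look a problem up by name, generate its
data, then read it back as a data set) can be sketched without the
library. Everything below is a toy stand-in with hypothetical names,
assuming only the three steps named in the talk; it is not the
library's own registry or Problem class.

```python
_PROBLEMS = {}


def register_problem(name):
    """Class decorator that files a problem class under a name."""
    def decorator(cls):
        _PROBLEMS[name] = cls
        return cls
    return decorator


def available():
    """All registered problem names."""
    return sorted(_PROBLEMS)


def problem(name):
    """Instantiate a registered problem by name."""
    return _PROBLEMS[name]()


@register_problem("toy_copy")
class ToyCopy:
    """Hypothetical problem whose targets are a copy of its inputs."""

    def __init__(self):
        self._examples = None

    def generate_data(self):
        # A real problem would download and preprocess files here.
        self._examples = [{"inputs": [i, i + 1], "targets": [i, i + 1]}
                          for i in range(3)]

    def dataset(self):
        if self._examples is None:
            raise RuntimeError("call generate_data() first")
        return iter(self._examples)


print(available())
toy = problem("toy_copy")
toy.generate_data()
examples = list(toy.dataset())
print(examples)
```

Because lookup goes through a name, a command-line flag can select
any registered problem without the trainer importing it explicitly.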
For a model, all our models are a subclass
of this T2TModel class, which itself is a Keras layer.
So if you want to take one model,
plug it together with another one, same as with layers.
You get a model.
You can get it again either by registry or by class name.
Call the model on a dictionary of tensors.
And you get the outputs and the losses if you need.
So you can add your own.
You can subclass the base problem class,
or for text to text or image to class problems,
there are subclasses that are easier to subclass.
You just basically point where your images are and get them
from any format to this.
And for your own model, you can subclass T2TModel.
If you want to share it, it's on GitHub.
Make a PR.
Under models, there is a research subdirectory
where there are models that we don't consider,
that we don't regression test.
We allow them to be free.
If you have an idea, want to share, put it there.
People might come, run it, tell you it's great.
So yeah, Tensor2Tensor, it's a set of data sets,
models, and scripts to run it everywhere.
And yeah, looking ahead, it's growing.
So we are happy to have more data sets.
We are happy to have more models.
We are ramping up on regression testing.
We're moving models out of research
to the more official part to have them
tested and stabilized.
On the technical side, we are on to simplifying
the infrastructure.
So TensorFlow 2 is coming.
The code base-- well, it's started more than a year ago.
It's based on estimators.
We are moving it to Keras.
We had our own scripts and binaries
for running on TPUs and multi-GPUs;
we're moving to distribution strategy.
We are allowing exports to TF Hub.
So this is a library for training your own models.
The main thing is the trainer.
Once it's trained and you want to share a pre-trained model,
TF Hub is the right place.
You can export it with one line.
And Mesh TensorFlow allows you to train huge models
on cloud pods.
I will tell you a little bit more about it in a moment.
On the research side, there's been a lot of research
in video models recently.
We have a ton of them in Tensor2Tensor.
And they're getting better and better.
And it's a fun thing to generate your own videos.
There is-- the new thing in machine translation
is using back translation.
So it's unsupervised-- you have a corpus of English
and a corpus of German, but no matching.
And you use a model you have to generate data, and then back
translate, and it shows improvements.
And in general, well, hyperparameter tuning
is an important thing in research, too.
So it's integrated now, and we're
doing more and more of it.
Reinforcement learning, GANs, well, as I said,
there are a lot of researchers using it.
So there's a lot going on.
One of the things, Mesh TensorFlow,
it's a tool for training huge models, meaning really huge.
Like, you can have one model that uses a whole TPU
pod, 4 terabytes of RAM.
That's how many parameters you can do.
It's by Noam, Youlong, Niki, Ashish, and many people.
So what if you want to train image generation
models on high definition videos or process data that's
huge even at batch size 1?
So you cannot just say, oh, I'll do one thing on one core,
another on one core, just split it by data.
One data example has to go on the whole machine.
And then there needs to be a convolution that applies to it,
or a matrix multiplication.
So how can we do this and not drown
in writing manually, OK, on this core,
do this, and then slice back?
So the idea is, for every tensor, every dimension
it has needs to be named.
For example, you say the first dimension is batch.
The second is length.
And the third is just the hidden vector.
And for every dimension, you specify how it
will be laid out on a device.
So you say, OK, batches--
for example, modern devices, they have 2D--
they're like a 2D mesh of chips.
So the communication is fast to nearby chips,
but not so fast across.
So you can say if it's a grid of chips in hardware, you can say,
OK, the batch dimension will be on the horizontal chips
and the length will be on the vertical ones.
So we define how to split the tensor on the hardware mesh.
And then the operations are already
optimized to use the processing of the hardware
to do fast communication and operate on these tensors
as if they were single tensors.
So you specify the dimensions by name.
You specify their layout.
And then you write your model as if it was a single GPU model.
And so everything stays simple except for this layout thing,
which you need to think a little bit about.
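The bookkeeping behind named dimensions and layouts can be sketched
in plain Python. Given dimension names, a hardware mesh, and a
layout mapping tensor dimensions to mesh dimensions, the function
below computes the slice each core would hold. The names and sizes
are hypothetical, and this illustrates the idea only, not the
Mesh TensorFlow API.

```python
def shard_shape(tensor_dims, mesh_dims, layout):
    """Per-core shape when named tensor dims are split over mesh dims."""
    shape = {}
    for name, size in tensor_dims.items():
        mesh_axis = layout.get(name)  # None means replicated on every core
        parts = mesh_dims[mesh_axis] if mesh_axis else 1
        if size % parts != 0:
            raise ValueError(f"{name}={size} is not divisible by {parts}")
        shape[name] = size // parts
    return shape


# Every tensor dimension has a name.
tensor_dims = {"batch": 64, "length": 1024, "hidden": 512}
# An 8 x 4 grid of chips.
mesh_dims = {"rows": 8, "cols": 4}
# Batch goes along one mesh axis, length along the other.
layout = {"batch": "rows", "length": "cols"}

per_core = shard_shape(tensor_dims, mesh_dims, layout)
print(per_core)  # {'batch': 8, 'length': 256, 'hidden': 512}
```

Changing only the layout dictionary moves the same model between
data-parallel and model-parallel splits, which is why the model
code itself can stay written as if for a single device.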
We did a transformer on it.
We did an image transformer.
We can train models with 5 billion parameters on TPU pods
with over 50% utilization.
So this paper, it's also a to do paper,
it should be coming out in a few weeks.
Not yet there, but it's new state-of-the-art on translation
and language modeling.
It's the next step in really good models.
It also generates nice images.
So big models are good.
They give great results.
And this is a way of writing them simply.
So yeah, that's the Mesh TensorFlow.
And we try to make it-- so it runs on TPU pods,
but it also runs on clusters of GPUs,
because we tried to not make the mistake again to do something
that just runs on one hardware.
And with the Tensor2Tensor library,
you're welcome to be part of it.
Give it a try.
Use it.
We are on GitHub.
There is a Gitter chat.
There is an active lobby for Tensor2Tensor, where we also
try to be every day to help.
And yep, that's it.
Thank you very much.
[APPLAUSE]