
  • FRANK CHEN: So hi everyone.

  • I'm Frank.

  • And I work on the Google Brain team working on TensorFlow.

  • And today for the first part of this talk,

  • I'm going to talk to you about accelerating machine learning

  • with Google Cloud TPUs.

  • So the motivation question here is, why is Google

  • building accelerators?

  • I'm always hesitant to predict this,

  • but if you look at the data,

  • the end of Moore's law has been playing out

  • for the past 10 or 15 years.

  • We don't really see

  • the 52% year-on-year growth in single-threaded performance

  • that we saw from the late 1980s through the early 2000s.

  • Now single-threaded performance

  • for CPUs is really growing at a rate of maybe 3% to 5%

  • per year.

  • So what this means is that I can't just

  • wait 18 months for my machine learning models

  • to train twice as fast.

  • This doesn't work anymore.

  • At the same time, organizations are

  • dealing with more data than ever before.

  • You have people uploading hundreds and hundreds

  • of hours of video every minute to YouTube.

  • People are leaving product reviews on Amazon.

  • People are using chat systems, such as WhatsApp.

  • People are talking to personal assistants

  • and so on and so forth.

  • So more data is generated than ever before.

  • And organizations are just not really

  • equipped to make sense of this data or to use it properly.

  • And the third trend is that at the same time,

  • we have this sort of exponential increase

  • in the amount of compute needed by these machine learning

  • models.

  • There is a very interesting blog post by OpenAI on this.

  • In late 2012, when deep learning

  • was first becoming useful,

  • we had AlexNet, and we had

  • Dropout, which used a fair amount of computing power,

  • but not that much compared to late 2017, when

  • DeepMind published AlphaGo Zero and AlphaZero.

  • Between those two points, in about six or seven years,

  • we see the compute demand increase by 300,000 times.

  • So this puts a huge strain on companies'

  • compute infrastructure.
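
A rough back-of-the-envelope check of what that growth rate implies. The 6.5-year span below is an assumed midpoint of the "six or seven years" mentioned above, not a figure from the talk:

```python
import math

# Rough check: what doubling time does a ~300,000x increase in compute
# over roughly 6.5 years imply?  (6.5 years is an assumption, not a
# figure from the talk.)
growth = 300_000
years = 6.5
doublings = math.log2(growth)                 # ~18.2 doublings
months_per_doubling = years * 12 / doublings  # ~4.3 months
print(f"{doublings:.1f} doublings, one every {months_per_doubling:.1f} months")
```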

  • So what does this all mean?

  • The end of Moore's law plus this sort of exponential increase

  • in compute requirements means that we need a new approach

  • for doing machine learning.

  • At the same time, of course, everyone still

  • wants to do machine learning training

  • faster and cheaper.

  • So that's why Google is building specialized hardware.

  • Now, the second question you might be asking

  • is, what sort of accelerators is Google building?

  • So from the title of my talk, you

  • know that Google is building a type of accelerator

  • that we call Tensor Processing Units, which are really

  • specialized ASICs designed for machine learning.

  • This is the first generation of our TPUs

  • we introduced back in 2015 at Google

  • I/O. The second generation of TPUs,

  • now called Cloud TPU version 2, was introduced

  • at Google I/O last year.

  • And then these Cloud TPU version 2's

  • can be combined into pods called Cloud TPU v2 Pods.

  • And of course, at Google I/O this year,

  • we introduced the third generation of Cloud TPUs.

  • These have gone from air cooled

  • to liquid cooled.

  • And of course, you can link a bunch of them

  • up into a pod configuration as well.

  • So what are the differences between these generations

  • of TPUs?

  • So the first version of the TPU was really

  • designed for inference only.

  • It did about 92 teraops of int8 arithmetic.

  • The second generation of TPUs does both training

  • and inference.

  • It operates on floating point numbers.

  • It does about 180 teraflops.

  • And it has about 64 gigs of HBM.

  • And the third generation of TPUs

  • is a big leap in performance.

  • So now we are doing 420 teraflops.

  • And we doubled the amount of memory.

  • So now it's 128 gigs of HBM.

  • And again, it does training and inference.

  • And of course, we see the same sort of progress

  • with Cloud TPU Pods as well.

  • Our 2017 pods did about 11.5 petaflops.

  • That is 11,500 teraflops of compute

  • with 4 terabytes of HBM.

  • And our new generation of pods does over 100 petaflops

  • with 32 terabytes of HBM.

  • And of course, the new generation of pods

  • is also liquid cooled.

  • We have a new chip architecture.
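
Those pod numbers line up with the per-device numbers quoted earlier. Here is a quick sanity check, assuming a v2 pod is built from 64 of the Cloud TPU v2 devices described above (the device count is an assumption, not something stated in the talk):

```python
# Sanity check: Cloud TPU v2 Pod figures versus the per-device figures above.
# Assumes 64 TPU v2 devices per pod -- a device count not stated in the talk.
devices_per_pod = 64
tflops_per_device = 180      # Cloud TPU v2
hbm_gb_per_device = 64       # Cloud TPU v2

pod_tflops = devices_per_pod * tflops_per_device         # 11,520 ~= 11.5 petaflops
pod_hbm_tb = devices_per_pod * hbm_gb_per_device / 1024  # 4 terabytes of HBM
print(pod_tflops, "teraflops,", pod_hbm_tb, "TB of HBM")
```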

  • So that's all well and good, but really,

  • what we are looking for here is not just peak performance,

  • but cost effective performance.

  • So take this very commonly used image recognition model,

  • called ResNet 50.

  • If you train it on, again, a very common dataset

  • called ImageNet, we achieve about 4,100 images

  • per second on real data.

  • We also achieve that while getting state of the art

  • final accuracy numbers.

  • So in this case, it's 93% top 5 accuracy

  • on the ImageNet dataset.

  • And we can train this ResNet model

  • in about 7 hours and 47 minutes.

  • And this is actually a huge improvement.

  • If you look at the original paper by Kaiming He

  • and others where they introduced the ResNet architecture,

  • they took weeks and weeks to train one of these models.

  • And now with one TPU, we can train it

  • in 7 hours and 47 minutes.

  • And of course, these things are available on Google Cloud.

  • So for that training run,

  • if you pay for the resource on demand, it's about $36.

  • And if you pay for it using Google Cloud's

  • preemptible instances, it is about $11.

  • So it's getting pretty cheap to train.
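
The throughput, time, and cost figures hang together. A rough check, where the image count, epoch count, and per-hour prices are illustrative assumptions rather than figures from the talk:

```python
# Rough check of the single Cloud TPU v2 ResNet-50 numbers quoted above.
# Assumptions (not from the talk): ~1.28M ImageNet training images, 90 epochs,
# and hourly prices of roughly $4.50 on demand and $1.35 preemptible.
images, epochs, images_per_sec = 1_281_167, 90, 4_100

hours = images * epochs / images_per_sec / 3600
print(f"~{hours:.1f} hours of training")        # ~7.8 hours, close to 7h47m
print(f"on demand:   ~${hours * 4.50:.0f}")     # ~$35
print(f"preemptible: ~${hours * 1.35:.0f}")     # ~$11
```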

  • And of course, we want to do the cost effective performance

  • at scale.

  • So if you train the same model, ResNet 50,

  • on a Cloud TPU version 2 Pod, you

  • are getting something like 219,000 images per second

  • of training performance.

  • You get the same final accuracy.

  • And training time goes from about eight hours

  • to about eight minutes.

  • So again, that's a huge improvement.
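
The jump from about eight hours to about eight minutes follows from the throughput numbers, assuming training time simply scales with throughput:

```python
# Quick check: how "about eight hours to about eight minutes" follows from
# the throughput numbers, assuming training time scales with throughput.
single_tpu_imgs_per_sec = 4_100
pod_imgs_per_sec = 219_000

speedup = pod_imgs_per_sec / single_tpu_imgs_per_sec    # ~53x
single_tpu_minutes = 7 * 60 + 47                        # 7h47m on one TPU
print(f"~{speedup:.0f}x faster -> ~{single_tpu_minutes / speedup:.0f} minutes")
# -> roughly 9 minutes, in line with "about eight minutes"
```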

  • And this gets us into a regime where we can just

  • iterate: you can just go train a model,

  • go get a cup of coffee, come back,

  • and then you can see the results.

  • So it gets into almost interactive levels

  • of machine learning research and development.

  • So that's great.

  • Then the next question will be, how do these accelerators work?

  • So today we are going to zoom in on the second generation

  • of Cloud TPUs.

  • So again, this is what it looks like.

  • This is one entire Cloud TPU board that you see here.

  • And the first thing that you want to know

  • is that Cloud TPUs are really network-attached devices.

  • So if I want to use a Cloud TPU on Google Cloud,

  • I go to the Google Cloud Console,

  • and I create a Cloud TPU.

  • And then I create a Google Compute Engine VM.

  • And then on that VM, I just have to install TensorFlow.

  • So literally, I have to do pip install tensorflow.

  • And then I can start writing code.

  • I don't have drivers to install.

  • You can use a clean Ubuntu image.

  • You can use the machine learning images that we provide.

  • So it's really very simple to get started with.
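
As a concrete sketch of that workflow, here is roughly what connecting to a Cloud TPU looks like with current TensorFlow APIs. The TPU name is a placeholder, and the exact APIs may differ from what was demonstrated in the talk:

```python
# On the Compute Engine VM, after `pip install tensorflow`, connect to a
# Cloud TPU created in the Cloud Console.  "my-tpu" is a placeholder name.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# Variables and training steps created under this scope run on the TPU.
strategy = tf.distribute.TPUStrategy(resolver)
with strategy.scope():
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(10, activation="softmax")])
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
```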

  • So each TPU is connected to a host server

  • with 32 lanes of PCI Express.

  • So the thing here to note

  • is that the TPU itself is an accelerator.

  • You can think of it like a GPU.

  • It doesn't run an operating system.

  • You can't run Linux on it by itself.

  • So it's connected to the host server

  • by 32 lanes of PCI Express so that we

  • can transfer training data in

  • and get our results back out quickly.
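
For a sense of what 32 lanes of PCI Express provides, here is a rough bandwidth estimate, assuming PCIe 3.0 (the PCIe generation is not stated in the talk):

```python
# Rough host <-> TPU transfer bandwidth from 32 lanes of PCI Express.
# Assumes PCIe 3.0 at roughly 0.985 GB/s of usable bandwidth per lane,
# per direction; the PCIe generation is an assumption here.
lanes = 32
gb_per_sec_per_lane = 0.985
print(f"~{lanes * gb_per_sec_per_lane:.0f} GB/s in each direction")  # ~32 GB/s
```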

  • And of course, you can see on this board clearly