
  • [MUSIC PLAYING]

  • MARTIN GORNER: Hi, everyone, and thank you for being here

  • at 8:30 in the morning, and welcome to this session

  • about TPUs and TPU pods.

  • So those are custom made accelerators

  • that Google has designed to accelerate machine learning

  • workloads.

  • And before Kaz and I tell you everything about them,

  • I would like to do something.

  • Of course, this is live, so you want to see a live demo.

  • And I would like to train with you here, onstage, using

  • a TPU pod, one of those big models

  • that used to take days to train.

  • And we'll see if we can finish the training

  • within this session.

  • So let me start the training.

  • I will come back to explaining exactly what I'm doing here.

  • I'm just starting it.

  • Run all cells.

  • Seems to be running.

  • OK, I'm just checking.

  • I'm running this on a 128 core TPU pod.

  • So one of the things you see in the logs here

  • is that I have all my TPUs appearing.

  • 0, 1, 2, 3, and all the way down to 128.

  • All right, so this is running.

  • I'm happy with it.

  • Let's hear more about TPUs.

  • So first of all, what is this piece of silicon?

  • And this is the demo that I've just launched.

  • It's an object detection demo that

  • is training on a wildlife data set of 300,000 images.

  • Why wildlife?

  • Because I can show you cute pandas.

  • And I can show you cute electronics as well.

  • So this is a TPU v2.

  • And we have a newer version now, a TPU v3.

  • Those are fairly large boards.

  • It's large like this, roughly.

  • And as you can see, they have four chips on them.

  • Each chip is dual core, so each of these boards

  • has 8 TPU cores on it.

  • And each core has two units.

  • The first is a vector processing unit.

  • That is a fairly standard, general-purpose,

  • data-oriented processor.

  • What makes this special for machine learning

  • is the matrix multiply unit.

  • TPUs have a built-in hardware-based matrix

  • multiplier that can multiply two 128 by 128 matrices in one go.

  • So what is special about this architecture?

  • There are two tricks that we used

  • to make it fast and efficient.

  • The first one is, I would say, semi-standard.

  • It's reduced precision.

  • When you train neural networks, reducing the precision

  • from 32-bit floating points to 16-bit

  • is something that people quite frequently do,

  • because neural networks are quite resistant to the loss

  • of precision.

  • Actually, it even happens sometimes

  • that the noise that is introduced by reduced precision

  • acts as a kind of regularizer and helps with convergence.

  • So sometimes you're even lucky when you reduce precision.

  • But then, as you see on this chart, float16 and float32,

  • the floating point formats, they don't have the same number

  • of exponent bits, which means that they

  • don't cover the same range.

  • So when you take a model and downgrade all your float32s

  • into float16s, you might get into underflow or overflow

  • problems.

  • And if it is your model, it's usually not

  • so hard to go in and fix.

  • But if you're using code from GitHub

  • and you don't know where to fix stuff,

  • this might be very problematic.

  • So that's why on TPUs, we chose a different--

  • actually we designed a different floating point

  • format called bfloat16.

  • And as you can see, it's actually

  • exactly the same as float32 with just the fractional bits

  • cut off.

  • So the point is it has exactly the same number

  • of exponent bits, exactly the same range.

  • And therefore, usually, it's a drop-in replacement

  • for float32, at reduced precision.

  • So typically for you, there is nothing

  • to do on your model to benefit from the speed of reduced

  • precision.

  • The TPU will do this automatically, on chip,

  • in hardware.
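
To make the range point concrete, here is a small, hedged sketch (any recent TensorFlow with the bfloat16 dtype should behave this way): a value far outside float16's range survives a cast to bfloat16, because bfloat16 keeps float32's eight exponent bits.

```python
import tensorflow as tf

# Illustrative sketch: bfloat16 keeps float32's exponent range, float16 does not.
x = tf.constant([1e12, 3.14159], dtype=tf.float32)

print(tf.cast(x, tf.bfloat16))  # ~[1e12, 3.14]: range preserved, mantissa truncated
print(tf.cast(x, tf.float16))   # [inf, 3.14]: 1e12 overflows float16 (max ~65504)
```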

  • And the second trick is architectural.

  • It's the design of this matrix multiply unit.

  • So that you understand how this works,

  • try to picture, in your head, how to perform a matrix

  • multiplication.

  • And one result, one point of the resulting

  • matrix, try to remember calculus from school, is a dot product.

  • A dot product of one line of one matrix and one column

  • of the second matrix.

  • Now what is a dot product?

  • A dot product is a series of multiply-accumulate operations,

  • which means that the only operation you

  • need to perform a matrix multiplication

  • is multiply and accumulate.

  • And multiply-accumulate in 16 bits,

  • because we're using bfloat16 reduced precision.

  • That is a tiny, tiny piece of silicon.

  • A 16-bit multiply-accumulator is a tiny piece of silicon.

  • And you can wire them together as an array, as you see here.

  • So this in real life would be a 128 by 128 array.

  • It's called a systolic array.

  • Systolic refers to a rhythmic, pulsing flow,

  • because you will flow the data through it.

  • So the way it works is that you load one matrix into the array,

  • and then you flow the second matrix through the array.

  • And you'll have to believe me, or maybe

  • spend a little bit more time with the animation,

  • by the time the gray dots have finished

  • flowing through those multiply-accumulators,

  • out of the right side come all the dot products that

  • make the resulting matrix.

  • So it's a one-shot operation.

  • There are no intermediate values to store anywhere, in memory,

  • in registers.

  • All the intermediate values flow on the wires

  • from one compute unit to the next.

  • It's very efficient.

  • And what is more, it's only made of

  • those tiny 16-bit multiply-accumulators,

  • which means that we can cram a lot of those into one chip.

  • 128 by 128 is 16,384 multiply-accumulators.

  • And that's how many you get in one TPU core, twice that

  • in two TPU cores.
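
As a purely illustrative sketch (plain Python/NumPy, not the TPU's actual microarchitecture), here is a matrix multiplication written as nothing but multiply-accumulate operations, which is the only primitive the systolic array needs:

```python
import numpy as np

def matmul_mac(A, B):
    """Matrix product computed using only multiply-accumulate steps."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=np.float32)
    for i in range(n):
        for j in range(m):
            acc = np.float32(0.0)
            for p in range(k):          # one dot product = a chain of MACs
                acc += A[i, p] * B[p, j]
            C[i, j] = acc
    return C

A = np.random.rand(4, 3).astype(np.float32)
B = np.random.rand(3, 5).astype(np.float32)
assert np.allclose(matmul_mac(A, B), A @ B, atol=1e-5)
```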

  • So this is what makes it dense.

  • Density means power efficiency.

  • And power efficiency in the data center means cost.

  • And of course, you want to know how cheap

  • or how fast these things are.

  • Some people might remember from last year

  • I did a talk about how I built this planespotting model,

  • so I'm using this as a benchmark today.

  • And on Google Cloud's AI platform,

  • it's very easy to get different configurations,

  • so I can test how fast this trains.

  • My baseline is-- on a fast GPU, this model

  • trains in four and a half hours.

  • But I can also get 5 machines with powerful GPUs

  • in a cluster.

  • And on those five machines, five GPUs,

  • this model will train in one hour.

  • And I've chosen this number because one hour is exactly

  • the time it takes for this model to train on one TPU v2.

  • So the rule of thumb I want you to remember

  • is that one TPU v2, with its four chips,

  • is roughly as fast as five powerful GPUs.

  • That's in terms of speed.

  • But as you can see, it's almost three times cheaper.

  • And that's the point of optimizing the architecture

  • specifically for neural network workloads.

  • You might want to know how this works in software as well.

  • So when you're using TensorFlow, or Keras in TensorFlow,

  • your TensorFlow Python code

  • generates a computational graph.

  • That is how TensorFlow works.

  • So your entire neural network is represented as a graph.

  • Now, this graph is what is sent to the TPU.

  • Your TPU does not execute Python code.

  • This graph is processed through XLA, the Accelerated Linear

  • Algebra compiler, and that is how

  • it becomes TPU microcode to be executed on the TPU.

  • And one nice side-effect of this architecture

  • is that if, in your TensorFlow code,

  • you load your data through the standard tf.data.Dataset

  • API, as you should, and as is required with TPUs, then

  • even the data loading part, or image resizing, or whatever

  • is in your data pipeline, ends up in the graph,

  • ends up executed on the TPU.

  • And the TPU will be pulling data from Google Cloud Storage

  • directly during training.

  • So that is very efficient.
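
Here is a hedged sketch of what such an input pipeline might look like, written against a recent TensorFlow 2.x-style API (the gs:// path, the TFRecord schema, and the image size are made-up placeholders, not the ones used in the demo); because it is all tf.data and tf.image ops, it ends up in the graph and runs as part of the TPU training loop:

```python
import tensorflow as tf

def make_dataset(batch_size):
    # Hypothetical bucket and record schema, for illustration only.
    files = tf.data.Dataset.list_files("gs://my-bucket/train-*.tfrecord")
    ds = tf.data.TFRecordDataset(
        files, num_parallel_reads=tf.data.experimental.AUTOTUNE)

    def parse(example):
        feats = tf.io.parse_single_example(example, {
            "image": tf.io.FixedLenFeature([], tf.string),
            "label": tf.io.FixedLenFeature([], tf.int64),
        })
        img = tf.image.decode_jpeg(feats["image"], channels=3)
        img = tf.image.resize(img, [512, 512]) / 255.0
        return img, feats["label"]

    return (ds.map(parse, num_parallel_calls=tf.data.experimental.AUTOTUNE)
              .shuffle(2048)
              .batch(batch_size, drop_remainder=True)  # fixed shapes suit XLA/TPU
              .prefetch(tf.data.experimental.AUTOTUNE))
```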

  • How do you actually write this with code?

  • So let me show you in Keras.

  • And one caveat, this is Keras in TensorFlow 1.14,

  • which should be out in the next few days.

  • The API is slightly different in TensorFlow 1.13

  • today, but I'd rather show you the one that will be--

  • the new one, as of tomorrow or next week.

  • So it's only a couple of lines of code.

  • There is the first line, TPUClusterResolver.

  • You can call it without parameters on most platforms,

  • and that finds the connected TPU.

  • The TPU is a remotely-connected accelerator.

  • This finds it.

  • You initialize the TPU and then you use the new distribution

  • API in TensorFlow to define a TPU strategy based on this TPU.

  • And then you say with strategy.scope,

  • and everything that follows is perfectly normal Keras code.

  • Then you define your model, you compile it,

  • you do model.fit, model.evaluate, model.predict,

  • anything you're used to doing in Keras.

  • So in Keras, it's literally these four lines of code

  • to add to make it work on a TPU.

  • And I would like to point out that these four lines of code

  • also transform your model into a distributed model.

  • Remember a TPU, even a single TPU,

  • is a board with eight cores.

  • So from the get go it's distributed computing.

  • And these four lines of code put in place

  • all the machinery of distributed computing for you.

  • One parameter to notice.

  • You see in the TPU strategy, there

  • is the steps_per_run equals 100.

  • So that's an optimization.

  • This tells the TPU, please run 100 batches worth of training

  • and don't report back until you're finished.

  • Because it's a network attached accelerator,

  • you don't want the TPU to be reporting back

  • after each batch for performance reasons.
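
For reference, here is a minimal sketch of those lines as they look in the TF 1.14-era API described in the talk (exact module paths have moved between versions, so treat the names as approximate), followed by ordinary Keras code inside the strategy scope:

```python
import tensorflow as tf

# Find and initialize the network-attached TPU, then build a TPUStrategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver, steps_per_run=100)

with strategy.scope():
    # From here on, perfectly normal Keras code.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit(...), model.evaluate(...), model.predict(...) work as usual.
```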

  • So this is the software.

  • I encourage you to write your own code.

  • But if you don't want to, we have a whole library

  • of TPU-optimized models.

  • So you will find them on the TensorFlow/tpu GitHub

  • repository.

  • And there is everything in the image--

  • in the vision space, in the machine translation,

  • and language, and NLP space, in speech recognition.

  • You can even play with GAN models.

  • The one that we are demoing on stage,

  • remember we are training the model right now, is RetinaNet.

  • So this one is an object detection model.

  • And I like this model, so let me say a few words

  • about how this works.

  • In object detection, you put an image, and what you get

  • is not just the label--

  • this is a dog, this is a panda--

  • but you actually get boxes around where those objects are.

  • In object detection models, you have two kinds.

  • There are one-shot detectors that

  • are usually fast but kind of inaccurate,

  • and then two-stage detectors that are much more

  • accurate but much slower.

  • And I like RetinaNet because they actually

  • found a trick to make this both the fastest

  • and the most accurate model that you can

  • find in object detection today.

  • And it's a very simple trick.

  • I'm not going to explain all the math behind it,

  • but basically in these detection models,

  • you start with candidate detections.

  • And then you prune them to find only the detections--

  • the boxes that have actual objects in them.

  • And the thing is that all those blue boxes that you see,

  • there is nothing in them.

  • So even during training, they will very easily

  • be classified as "nothing to see, move along" boxes,

  • with a fairly small error.

  • But you've got loads of them, which

  • means that when you compute the loss of this model, in the loss

  • you have a huge sum of very small errors.

  • And that huge sum of very small errors might in the end

  • be very big and overwhelm the useful signal.

  • So the two-stage detectors resolve that

  • by being much more careful about those candidate boxes.

  • In one-stage detectors, you start

  • with a host of candidate boxes.

  • And the trick they found in RetinaNet

  • is a little mathematical trick on the loss

  • to make sure that the contribution of all

  • those easy boxes stays small.

  • The upshot, it's both fast and accurate.
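
The trick being referred to is the focal loss from the RetinaNet paper. Here is a hedged sketch of the idea in its standard published form, with the commonly cited default constants (not necessarily the exact configuration used in the demo): the usual cross-entropy is scaled by (1 - p_t)^gamma so that easy, confidently classified background boxes contribute almost nothing to the total.

```python
import tensorflow as tf

def focal_loss(labels, logits, alpha=0.25, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)^gamma; labels are float 0/1 targets."""
    probs = tf.sigmoid(logits)
    ce = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
    p_t = labels * probs + (1.0 - labels) * (1.0 - probs)   # prob. of the true class
    alpha_t = labels * alpha + (1.0 - labels) * (1.0 - alpha)
    return tf.reduce_sum(alpha_t * tf.pow(1.0 - p_t, gamma) * ce)
```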

  • So let me go back here.

  • I actually want to say a word now about what exactly I did

  • when I launched this demo.

  • I guess most of you are familiar with the Google Cloud Platform.

  • So here I am opening the Google Cloud Platform console.

  • And in the Google Cloud Platform,

  • I have a tool called AI platform, which,

  • for those who know it, has had a facility for running training

  • jobs and for deploying models behind the REST API

  • for serving.

  • But there is a new functionality called Notebooks.

  • In AI Platform, you can today provision a ready-made, all-installed

  • notebook for working in--

  • yeah, so let me switch to this one--

  • for working either in TensorFlow

  • or in PyTorch, with GPUs.

  • It's literally a one click operation.

  • NEW INSTANCE, I want a TensorFlow instance

  • with Jupyter notebook installed, and what you get

  • here is an instance that is running, with a link

  • to open Jupyter.

  • For example, this one-- and it will open Jupyter,

  • but it's already open.

  • So it's asking me to select something else, but it's here.

  • And here, you can actually work normally

  • in your Jupyter environment with a powerful accelerator.

  • You might have noticed that I don't have a TPU

  • option, actually not here, but here,

  • for adding an accelerator.

  • That's coming.

  • But here I am using Jupyter notebook instances

  • that are powered by a TPU v3 128-core pod.

  • How did I do it?

  • It's actually possible on the command line.

  • I give you the command line here.

  • There is nothing fancy about it.

  • There is one gcloud compute command line

  • to start the instance and a second gcloud compute command

  • line to start the TPU.

  • You provision a TPU just as you would a virtual machine

  • in Google's cloud.

  • So this is what I've done.

  • And that is what is running right now.

  • So let's see where we are.

  • Here it's still running.

  • As you can see, it says "Enqueue next 100 batches."

  • And it's training.

  • We are at step 4,000 out of roughly 6,000.

  • So we'll check back on this demo at the end of the session.

  • While preparing this demo to run on stage,

  • I was also able to run a comparison of how fast our

  • TPU v3s are versus v2s.

  • In theory, v3s are roughly twice as powerful as v2s,

  • but that only works if you feed them enough

  • work to make use of all the hardware.

  • So here on RetinaNet, you can train

  • on images of various sizes.

  • Of course, if you train on smaller images, 256

  • pixel images, it will be much faster,

  • in terms of images per second.

  • And I've tried both--

  • TPU v2s and v3s.

  • You see with small images, you get a little bump

  • in performance from TPU v3s, but nowhere near double.

  • But as you get to bigger and bigger images,

  • you are feeding the hardware with more work.

  • And on 640 pixel images, the speed up you get from TPU v3

  • is getting close to the theoretical x2 factor.

  • So for this reason, I am running this demo here

  • at the 512 pixel image size on a TPU v3 pod.

  • I'm talking about pods.

  • But what are these pods, exactly?

  • To show you more about TPU pods, I

  • would like to give the lectern to Kaz.

  • Thank you Kaz.

  • KAZ SATO: Thank you, Martin.

  • [APPLAUSE]

  • So in my part, I'd like to introduce Cloud TPU pods.

  • What are pods?

  • A pod is a large cluster of Cloud TPUs.

  • The version two pod is now available as public beta, which

  • provides 11.6 petaflops, with 512 TPU cores.

  • The next generation version three pod

  • is also public beta now, which achieves

  • over 100 petaflops with 2,048 TPU cores.

  • So those performance numbers are as high as the greatest

  • supercomputers.

  • So Cloud TPU pods are AI supercomputers

  • that Google has built from scratch.

  • But some of you might think, what's

  • the difference between a bunch of TPU instances and a Cloud

  • TPU pod?

  • The difference is the interconnect.

  • Google has developed ultra high-speed interconnect

  • hardware, derived from supercomputer technology,

  • for connecting thousands of TPUs with very short latency.

  • What does it do for you?

  • As you can see on the animation, every time you update

  • a single parameter on a single TPU,

  • that will be synchronized with all the other thousands

  • of TPUs, in an instant, by the hardware.

  • So in short, TensorFlow users can use the whole pod

  • as a single giant machine with thousands

  • of TPU cores inside it.

  • It's as easy as using a single computer.

  • And you may wonder, because it's an AI supercomputer,

  • whether it also comes at a super high cost.

  • But it does not.

  • You can get started with using TPU pods

  • with 32 cores at $24 per hour, without any initial cost.

  • So you don't have to pay millions of dollars

  • to build your own supercomputer from scratch.

  • You can just rent it for a couple of hours from the cloud.

  • The version three pod can also be provisioned with 32 cores.

  • That costs only $32 per hour.

  • For larger sizes, you can ask our service contact

  • for the pricing.

  • What is the cost benefit of TPU pods over GPUs?

  • Here's a comparison result. With a full version two pod,

  • with 512 TPU cores, you can train the same ResNet-50 model

  • 27 times faster, at 38% lower cost.

  • This shows the clear advantage of the TPU pods

  • over typical GPU-based solutions.

  • And there are other benefits you could get from the TPU pods.

  • Let's take a look at eBay's case.

  • eBay has over 1 billion product listings.

  • And to make it easier to search specific products

  • from 1 billion products, they built a new visual search

  • feature.

  • And to train the models, they have used 55 million images.

  • So it's a really large scale training for them.

  • And they have used Cloud TPU pods, and eBay

  • was able to get a 100 times faster training time,

  • compared with their existing GPU-based service.

  • And they also got a 10% accuracy boost.

  • Why is that?

  • TPU itself is not designed to increase the accuracy

  • that much.

  • But if you can increase the training speed

  • 10 times or 100 times, that means the data

  • scientists or researchers can have 10 or 100 times more

  • iterations for their trials, such as trying out

  • different combinations of the hyperparameters

  • or different preprocessing, and so on.

  • So that ended up as at least a 10% accuracy boost in eBay's case.

  • Let's see what kind of TensorFlow code

  • you would write to get those benefits from TPU pods.

  • And before taking a look at the actual code,

  • let's look back.

  • What were the efforts required, in the past,

  • to implement large scale distributed training?

  • Using many GPUs or TPUs for a single machine

  • learning training run-- that is so-called distributed training.

  • And there are two ways.

  • One is data parallel and another is model parallel.

  • Let's talk about the data parallel first.

  • With data parallel, as you can see on the diagram,

  • you have to split the training data across the multiple GPU

  • or TPU nodes.

  • And also you have to share the same parameter set, the model.

  • And to do that, you have to set up a cluster of GPUs or TPUs

  • by yourself.

  • And also you have to set up a parameter server that

  • shares all the updates of the parameters

  • among all the GPUs or TPUs.

  • So it's a complex setup.

  • And also, in many cases,

  • there's going to be synchronization overhead.

  • So if you have hundreds or thousands

  • of TPUs or GPUs in a single cluster,

  • that's going to be a huge overhead.

  • And that limits the scalability.

  • But with TPU pods, the hardware takes care of it.

  • The high-speed interconnect synchronizes

  • all of the parameter updates on a single TPU

  • with the other thousands of TPUs in an instant,

  • with very short latency.

  • So there's no need to set up a parameter server,

  • and no need to set up a large cluster of GPUs

  • by yourself.

  • And also, you can get almost linear scalability

  • as you add more TPU cores to your training.

  • And Martin will show you the actual scalability

  • result later.

  • And as I mentioned earlier, TensorFlow users

  • can use the whole TPU pods as a single giant computer

  • and with thousands of TPU cores inside it.

  • So it's as easy as using a single computer.

  • For example, if you have Keras code running on a single TPU,

  • it also runs on 2,000 TPU cores without any changes.

  • This is exactly the same code Martin showed earlier.

  • So under the hood, all the complexity for the data

  • parallel training, such as splitting the training

  • data across the multiple TPUs, or sharing

  • the same parameters, is

  • taken care of by the TPU pods' interconnect,

  • the XLA compiler, and the new TPUStrategy

  • API in TensorFlow 1.14.

  • The one thing you may want to change is the batch size.

  • As Martin mentioned, a TPU core has a matrix processor

  • that multiplies 128 by 128 matrices.

  • So usually, you will get the best performance

  • by setting the batch size to 128 times the number of TPU

  • cores.

  • So if you have 10 TPU cores, that's going to be 1,280.
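
As a small illustration of that rule of thumb (a heuristic, not a hard requirement):

```python
def suggested_global_batch_size(num_tpu_cores, per_core=128):
    """Rule-of-thumb global batch size: 128 examples per TPU core."""
    return per_core * num_tpu_cores

print(suggested_global_batch_size(8))    # one TPU board       -> 1024
print(suggested_global_batch_size(128))  # 128-core pod slice  -> 16384
print(suggested_global_batch_size(512))  # full v2 pod         -> 65536
```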

  • The benefit of TPU pods is not only faster training times.

  • They also enable the training of giant models

  • by using the Mesh TensorFlow library.

  • Data parallel has been a popular way of distributed training,

  • but there's one downside.

  • It cannot train a big model.

  • Because all the parameters are shared with all the GPUs

  • or TPUs, you cannot train a big model that doesn't fit

  • into the memory of a single GPU or TPU.

  • So there's another way of distributed training called

  • a model parallel.

  • With model parallel, you can split the giant model

  • across multiple GPUs or TPUs so that you

  • can train much larger models.

  • But that has not been a popular way.

  • Why?

  • Because it's much harder to implement.

  • As you can see on the diagrams, you

  • have to implement all the communication

  • between the fractions of the model.

  • It's like stitching the model pieces together.

  • And again, you have to set up a complex cluster,

  • and in many cases, the communication

  • between the model pieces becomes the bottleneck.

  • If you have hundreds or thousands

  • of CPU, GPU, or TPU cores, that's

  • going to be a huge overhead.

  • So those are the reasons why model parallel has not

  • been so popular.

  • To solve those problems, TensorFlow team

  • has developed a new library called Mesh TensorFlow.

  • It's a new way of distributed training,

  • with multiple computing nodes,

  • such as TPU pods, or multiple GPUs, or multiple CPUs.

  • Mesh TensorFlow provides an abstraction layer

  • that sees those computing nodes as a logical n-dimensional

  • mesh.

  • Mesh TensorFlow is now available as open source

  • code on the TensorFlow GitHub repository.

  • To see how it works, imagine

  • you have a simple neural network

  • like this for recognizing MNIST digits.

  • This network has a batch size of 512,

  • a data dimension of 784, one hidden layer

  • with 100 nodes, and an output of 10 classes.

  • And if you want to train that network with the model

  • parallel, you can just tell Mesh TensorFlow, "I

  • want to split the parameters across four TPUs,"

  • and that's it.

  • You don't have to think about how

  • to implement the communication between the model shards,

  • or worry about the communication overhead.

  • What kind of code would you write?

  • Here is the code to use the model parallel.

  • First, you have to define the dimensions

  • of both the data and the model.

  • In this code, you are defining the batch dimension as 512,

  • the data dimension as 784, the hidden layer

  • as 100 nodes, and the output as 10 classes.

  • And then you define your own network

  • by using Mesh TensorFlow APIs, such as two sets of weights,

  • one hidden layer, and the logits and loss function,

  • by using those dimensions.

  • Finally, you define how many TPUs or GPUs you have in the mesh,

  • and what layout rule you want to use.

  • In this code example, it is using the hidden layer

  • dimension for splitting the model parameters across the four

  • TPUs.

  • And that's it.

  • Mesh TensorFlow can take this code

  • and automatically split the model parameters across the four

  • TPUs.

  • And it shares the same training data with all the TPUs.
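
A hedged sketch of what that Mesh TensorFlow code might look like, adapted from the style of the mesh_tensorflow examples (API names and exact calls may differ by version; the dimensions are the ones quoted in the talk, and tf_images / tf_labels are assumed input tensors):

```python
import mesh_tensorflow as mtf
import tensorflow as tf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# Dimensions quoted in the talk: batch 512, data 784, hidden 100, 10 classes.
batch_dim   = mtf.Dimension("batch", 512)
io_dim      = mtf.Dimension("io", 784)
hidden_dim  = mtf.Dimension("hidden", 100)
classes_dim = mtf.Dimension("classes", 10)

# tf_images: float32 [512, 784], tf_labels: int32 [512] (assumed to exist).
images = mtf.import_tf_tensor(mesh, tf_images, shape=[batch_dim, io_dim])
labels = mtf.import_tf_tensor(mesh, tf_labels, shape=[batch_dim])

w1 = mtf.get_variable(mesh, "w1", [io_dim, hidden_dim])
w2 = mtf.get_variable(mesh, "w2", [hidden_dim, classes_dim])

hidden = mtf.relu(mtf.einsum([images, w1], output_shape=[batch_dim, hidden_dim]))
logits = mtf.einsum([hidden, w2], output_shape=[batch_dim, classes_dim])
loss = mtf.reduce_mean(mtf.layers.softmax_cross_entropy_with_logits(
    logits, mtf.one_hot(labels, classes_dim), classes_dim))

# Model parallelism: split the "hidden" dimension across 4 processors.
mesh_shape   = [("all_processors", 4)]
layout_rules = [("hidden", "all_processors")]
```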

  • You can also combine both data and the model parallel.

  • For example, you can define the 2D mesh like this.

  • And you use the rows of the mesh for data parallelism

  • and the columns of the mesh for model parallelism,

  • so that you can get the benefits from both of them.

  • And again, it's easy to define with Mesh TensorFlow.

  • You can just specify batch dimension

  • for the rows and hidden layer dimensions for the columns.
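
In Mesh TensorFlow terms, that combined layout might be sketched like this (the axis names and mesh sizes are illustrative, not from the talk):

```python
# 2D mesh: 8 x 4 = 32 cores, for example.
mesh_shape = [("rows", 8), ("cols", 4)]

layout_rules = [
    ("batch", "rows"),    # data parallel: split the batch across mesh rows
    ("hidden", "cols"),   # model parallel: split the hidden layer across columns
]
```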

  • This is an example where you are using the Mesh

  • TensorFlow for training a transformer model.

  • The transformer is a very popular language model architecture,

  • and I won't go deeper into it here.

  • But as you can see, it's so easy to map

  • each layer of a transformer model

  • to the layout rules of Mesh TensorFlow,

  • so that you can efficiently map the large data and large model

  • onto hundreds or thousands of TPU cores

  • by using Mesh TensorFlow.

  • So what's the benefit?

  • By using Mesh TensorFlow running on TPU pods,

  • the Google AI team was able to train language models

  • and translation models at the billion-word scale.

  • And they were able to achieve state-of-the-art scores,

  • as you can see from those numbers.

  • So for those use cases, the larger the model, the better

  • the accuracy you get.

  • Model parallelism with TPU pods gives a big advantage

  • in achieving those state-of-the-art scores.

  • Let's take a look at another use case of large scale model

  • parallelism, called BigGAN.

  • And I won't go deeper into what a GAN is or how GANs work.

  • But here's the basic idea.

  • You have two networks.

  • One is called the discriminator, D, and another

  • is called the generator, G. And you define a loss function

  • so that D is trained to recognize whether an image is

  • a fake image or a real image.

  • And at the same time, the generator is trained

  • to generate images so realistic that D cannot tell

  • they are fakes.

  • It's like a minimax game you are playing with those two

  • networks.

  • And eventually, you will have a generator G

  • that can generate photo-realistic fake images,

  • artificial images.
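
A minimal sketch of that two-player objective, using standard GAN losses in TensorFlow/Keras (not the exact BigGAN formulation or hyperparameters):

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    # D should say "real" (1) for real images and "fake" (0) for generated ones.
    return (bce(tf.ones_like(real_logits), real_logits) +
            bce(tf.zeros_like(fake_logits), fake_logits))

def generator_loss(fake_logits):
    # G is trained so that D labels its outputs as real.
    return bce(tf.ones_like(fake_logits), fake_logits)
```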

  • Let's take a look at the demo video.

  • So this is not a big spoiler.

  • I have already loaded the BigGAN model

  • that was trained on the TPU pod.

  • And as you can see, these are all

  • artificially synthesized images, at high quality.

  • You can also specify the category

  • of the generated images, such as ostrich,

  • so that you can generate the ostrich images.

  • These are all synthesized artificial images.

  • None of them are real.

  • And because BigGAN has a so-called latent space that

  • holds the seeds used to generate those images,

  • you can interpolate between two seeds.

  • In this example, it is interpolating

  • between a golden retriever and a Lhasa Apso.

  • And you can try out different combinations

  • of the interpolation, such as West Highland white terrier

  • and golden retriever.

  • Again, those are all fake images.

  • So this BigGAN model was trained on a TPU version three

  • pod with 512 cores.

  • And that took 24 hours to 48 hours.

  • Why does BigGAN take so many TPU cores and so much time?

  • The reasons are the model size and the batch size.

  • The quality of a GAN model

  • is measured by the inception score, or IS score.

  • That represents how much an inception model

  • thinks those images are real.

  • And that also represents the variety of generated images.

  • The BigGAN paper says that you get

  • a better IS score when you have

  • more parameters in the model and when

  • you use a larger batch size for the training.

  • So that means large scale model parallelism

  • on hundreds of TPU cores is crucial for the BigGAN model

  • to increase the quality of those generated images.

  • So we have seen two use cases.

  • A BigGAN use case and a language model use case.

  • And those are the first applications of model

  • parallelism on TPU pods.

  • But they are only the start.

  • TPU pods are available to everyone from now on.

  • So we expect to see more and more exciting

  • use cases and applications coming

  • from new TPU pod users.

  • So that's it for my part.

  • Back to Martin.

  • MARTIN GORNER: So now it's time to check on our demo.

  • Did our model actually train?

  • Checking here, yeah, it looks like it has finished training.

  • A saved model has been saved.

  • So the only thing left to do is

  • to verify if this model can actually predict something.

  • So on a second machine I will reload the exact same model.

  • OK.

  • I believe that's the one.

  • And let's go and reload it.

  • So I'll skip training this time and just go here

  • to inference and loading.

  • Whoops, sorry about that.

  • I just hope the demo gods will be with me today.

  • All right.

  • That's because I'm loading the wrong directory.

  • The demo gods are almost with me.

  • It's this one where my model has been saved.

  • All right.

  • Yes.

  • Indeed.

  • It wasn't the same.

  • Sorry about that.

  • No training, just inference.

  • And this time, it looks like my model is loading.

  • And once it's loaded, I will see if it can actually

  • detect animals in images, and here we are.

  • So this leopard is actually a leopard.

  • This bird is a bird.

  • The lion is a lion.

  • This is a very tricky image.

  • So I'm showing you images that are not cherry-picked.

  • This is a model I have trained on stage, here with you.

  • No model is perfect.

  • We will see bad detections, like this one.

  • But that's a tricky one.

  • It's artwork.

  • It's not an actual lion.

  • The leopard is spot on.

  • The lion is spot on.

  • And see that the boxing actually works very well.

  • The leopard has been perfectly identified in the image.

  • So let's move to something more challenging.

  • Even this inflatable artwork lion has been identified,

  • which is not always the case.

  • This is a complicated image--

  • a flock of birds.

  • So you see it's not seeing all of them.

  • But all of them at least are birds,

  • which is a pretty good job.

  • The leopard is fine.

  • Oh, and this is the most complex one we have.

  • There is a horse and cattle.

  • Well, we start seeing a couple of bad detections here.

  • Of course, that cow is not a pig.

  • As I said, no model is perfect.

  • But here the tiger is the tiger, and we

  • have our two cute pandas.

  • And those two cute pandas are actually quite difficult,

  • because those are baby pandas.

  • And I don't believe that this model

  • has had a lot of baby animals in its 300,000 images data set.

  • So I'm quite glad that it managed to find the two pandas.

  • So moving back, let me finish by giving you

  • a couple of feeds and speeds on those models.

  • So here, this model has a ResNet-50 backbone,

  • plus all the detection layers that produce the boxes.

  • And we have been training it on a TPU v3 pod with 128 cores.

  • It did finish in 20 minutes.

  • You don't have to just believe me for that.

  • Let me show you.

  • Here I had a timer in my script.

  • Yep, 19 minutes and 18 seconds.

  • So I'm not cheating.

  • This was live.

  • But I could also have run this model on a smaller pod.

  • Actually, I tried it on a TPU v2-32.

  • On this chart, you see the speed on this axis

  • and the time on this axis.

  • This is to show you that a TPU v2-32 is actually

  • a very useful tool to have.

  • We've been talking about huge models up to now.

  • But it's debatable whether this is a huge model.

  • This definitely was a huge model a year ago.

  • Today, with better tools, I can train it

  • in an hour on a fairly modest TPU v2 32-core pod.

  • So even as an individual data scientist,

  • that is a very useful tool for me to have handy when

  • I need to do a round of trainings on a model like this,

  • because someone wants an animal detection model.

  • And bringing the training down to the one hour

  • space, or 20 minutes space, allows

  • me to work a lot faster and iterate a lot faster

  • on the hyperparameters, on the fine tuning, and so on.

  • You see on a single TPU v3, it's the bottom line.

  • And if we were to train this on a GPU--

  • so remember our rule of thumb from the beginning.

  • One TPU v2, roughly five GPUs.

  • Therefore 1 TPU v3, roughly 10 GPUs.

  • So the GPU line would be one tenth of the lowest

  • line on this graph.

  • I didn't put it because it would barely register there.

  • That shows you the change of scale

  • at which you can be training your models using TPUs.

  • You might be wondering about this.

  • So as you scale, one thing that might happen

  • is that you have to adjust your learning rate schedule.

  • So this is actually the learning rate schedule

  • I have used to train the model on the 128 core TPU pod.

  • Just a couple of words, because it might not

  • be the most usual learning rate schedule you have ever seen.

  • There is this ramp up.

  • So the second part is exponential decay.

  • That's fairly standard.

  • But the ramp up part, that is because we

  • are starting from ResNet-50, initialized

  • with pre-trained weights.

  • But we still leave those weights trainable.

  • So we are training the whole thing.

  • It's not transfer learning.

  • It's just fine tuning of pre-trained ResNet-50.

  • And when you do that, and you train

  • very fast, using big batches, as we do here,

  • the batch size here is 64 times 128.

  • So it's a very big batch size.

  • You might actually break those pre-trained weights

  • in ways that harm your precision.

  • So that's why it's quite usual to have a ramp up period

  • to make sure that the network, in its initial training

  • phases, when it doesn't know what it's doing,

  • does not completely destroy the information

  • in the pre-trained weights.
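
A hedged sketch of that kind of schedule, linear warm-up followed by exponential decay (the constants below are placeholders, not the values used for the demo):

```python
import tensorflow as tf

def learning_rate(step, base_lr=0.08, warmup_steps=500,
                  decay_steps=1000, decay_rate=0.7):
    """Linear ramp-up to base_lr, then exponential decay."""
    step = tf.cast(step, tf.float32)
    warmup = base_lr * step / warmup_steps
    decayed = base_lr * decay_rate ** ((step - warmup_steps) / decay_steps)
    return tf.where(step < warmup_steps, warmup, decayed)
```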

  • So we did it.

  • We did train this model here on stage in 20 minutes.

  • And the demo worked, I'm really glad about that.

  • So this is the end.

  • What we have seen is TPUs and TPU pods.

  • Fast, yes.

  • But mostly, cost effective.

  • A very cost effective option and a good tool

  • to have for any data scientist.

  • Also, and more specifically, for very large models--

  • or for what used to be large models in the past

  • and which are normal models today, such as a ResNet-50

  • [INAUDIBLE]--

  • they are very useful tools.

  • And then Cloud TPU pods, where you can actually

  • enable not only data, but model parallelism,

  • using this new library called Mesh TensorFlow.

  • A couple of links here with more information

  • if you would like to know more.

  • Yes, you can take a picture.

  • And if you have more questions, we

  • will be at the AI ML pod, the red one,

  • in front of one TPU rack.

  • So you can see this one live and get

  • a feel for what kind of computer it is.

  • And with that, thank you very much.

  • [APPLAUSE]

  • [MUSIC PLAYING]
