  • [MUSIC PLAYING]

  • JEFF DEAN: I'm excited to be here today to tell you

  • about how I see deep learning and how

  • it can be used to solve some of the really challenging problems

  • that the world is facing.

  • And I should point out that I'm presenting

  • the work of many, many different people at Google.

  • So this is a broad perspective of a lot of the research

  • that we're doing.

  • It's not purely my work.

  • So first, I'm sure you may have all noticed,

  • but machine learning is growing in importance.

  • There's a lot more emphasis on machine learning research.

  • There's a lot more uses of machine learning.

  • This is a graph showing how many Arxiv papers--

  • Arxiv is a preprint hosting service

  • for all kinds of different research.

  • And this is the subcategories of it

  • that are related to machine learning.

  • And what you see is that, since 2009, the number of papers posted

  • has been growing at a really fast exponential rate,

  • actually faster than the Moore's Law growth rate of computational power

  • that we got so nicely used to for 40 years but that has now slowed down.

  • So we've replaced the nice growth in computing performance

  • with growth in people generating ideas, which is nice.

  • And deep learning is this particular form

  • of machine learning.

  • It's actually a rebranding in some sense

  • of a very old set of ideas around creating

  • artificial neural networks.

  • These are these collections of simple trainable mathematical

  • units organized in layers where the higher layers typically

  • build higher levels of abstraction

  • based on things that the lower layers are learning.

  • And you can train these things end to end.

  • And the algorithms that underlie a lot of the work

  • that we're doing today actually were

  • developed 35, 40 years ago.

  • In fact, my colleague Geoff Hinton

  • just won the Turing Award this year along with Yann LeCun

  • and Yoshua Bengio for a lot of the work

  • that they did over the past 30 or 40 years.

  • And really the ideas are not new.

  • But what's changed is we got amazing results 30 or 40 years

  • ago on toyish problems but didn't

  • have the computational resources to make these approaches work

  • on real large scale problems.

  • But starting about eight or nine years ago,

  • we started to have enough computation to really make

  • these approaches work well.

  • And so think of a neural net as something

  • that can learn really complicated functions that

  • map from input to output.

  • Now that sounds kind of abstract.

  • You think of functions as like y equals x squared or something.

  • But really these functions can be very complicated

  • and can learn from very raw forms of data.

  • So you can take the pixels of an image

  • and train a neural net to predict

  • what is in the image as a categorical label like that's

  • a leopard.

  • That's one of my vacation photos.

  • From audio wave forms, you can learn

  • to predict a transcript of what is being said.

  • How cold is it outside?

  • You can learn to take input in one language-- hello,

  • how are you--

  • and predict the output being that sentence translated

  • into another language.

  • [SPEAKING FRENCH]

  • You can even do more complicated things

  • like take the pixels of an image and create a caption that

  • describes the image.

  • It's not just category.

  • It's like a simple sentence.

  • A cheetah lying on top of a car, which is kind of unusual

  • anyway.

  • Your prior for that should be pretty low.
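
To make the idea of a learned mapping from raw inputs to outputs concrete, here is a minimal sketch, not from the talk, of a small image classifier in TensorFlow/Keras; the image size, number of classes, and training data are hypothetical placeholders.

```python
import tensorflow as tf

# A tiny convolutional network that maps raw pixels to one of N category
# labels (e.g. "leopard"). Sizes and class count are illustrative only.
NUM_CLASSES = 1000

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(images, labels, epochs=5)  # images: float pixel arrays, labels: class ids
```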

  • And in the field of computer vision,

  • we've made great strides thanks to neural nets.

  • In 2011, the winning entry in the Stanford ImageNet contest,

  • which is a contest held every year,

  • did not use neural nets.

  • That was the last year the winning entry did not use neural nets.

  • It got 26% error.

  • And that won the contest.

  • We know this task is not a trivial one.

  • Humans themselves have about 5%

  • error, because you have to distinguish

  • among 1,000 different categories of things;

  • given a picture of a dog, you have to say which

  • of 40 breeds of dog it is.

  • So it's not a completely trivial thing.

  • And in 2016, for example, the winning entry got 3% error.

  • So this is just a huge fundamental leap

  • in computer vision.

  • You know, computers went from basically not

  • being able to see in 2011 to now we can see pretty darn well.

  • And that has huge ramifications for all kinds of things

  • in the world not just computer science

  • but like the application of machine learning and computing

  • to perceiving the world around us.

  • OK.

  • So the rest of this talk I'm going

  • to frame in a way of-- but in 2008, the US National

  • Academy of Engineering published this list of 14

  • grand engineering challenges for the 21st century.

  • And they got together a bunch of experts

  • across lots of different domains.

  • And they all collectively came up

  • with this list of 14 things, which

  • I think you can agree these are actually

  • pretty challenging problems.

  • And if we made progress on all of them,

  • the world would be a healthier place.

  • It would be a safer place.

  • We'd have more scientific discovery.

  • All of these are important problems.

  • And so given the limited time, what I'm going to do

  • is talk about the ones in boldface.

  • And we have projects in Google Research that are focused

  • on all the ones listed in red.

  • But I'm not going to talk about the other ones.

  • And so that's kind of the tour of the rest of the talk.

  • We're just going to dive in and off we go.

  • I think we start with restoring and improving

  • urban infrastructure.

  • Right.

  • We know the basic structure of cities

  • was designed quite some time ago.

  • But there's some changes that we're

  • on the cusp of that are going to really dramatically change how

  • we might want to design cities.

  • And, in particular, autonomous vehicles

  • are on the verge of commercial practicality.

  • This is from our Waymo colleagues, part of Alphabet.

  • They've been doing work in this space for almost a decade.

  • And the basic problem of an autonomous vehicle

  • is you have to perceive the world around you

  • from raw sensory inputs, things like lidar,

  • and cameras, and radar, and other kinds of sensors.

  • And you want to build a model of the world and the objects

  • around you and understand what those objects are.

  • Is that a pedestrian or a light pole?

  • Is it a car that's moving?

  • What is it?

  • And then also be able to predict both a short time from now,

  • like where is that car going to be in one second,

  • and then make a set of decisions about what actions

  • you want to take to accomplish your goals,

  • like getting from A to B without any trouble.

  • And it's really thanks to deep learning vision-based algorithms

  • and the fusing of all the sensor data

  • that we can actually build maps of the world

  • like this, build an understanding

  • of the environment around us, and actually

  • have these things operate in the real world.

  • This is not some distant far off dream.

  • Waymo is actually operating about 100 cars

  • with passengers in the back seat and no safety

  • drivers in the front seat in the Phoenix, Arizona area.

  • And so this gives a pretty strong sense

  • that this is pretty close to reality.

  • Now Arizona is one of the easier self-driving car environments.

  • It's like it never rains.

  • It's too hot so there aren't that many pedestrians.

  • The streets are very wide.

  • The other drivers are very slow.

  • Downtown San Francisco is harder,

  • but this is a sign that it's not that far off.

  • Obviously, if vision works, it's easier

  • to build robots that can do things in the world.

  • If you can't see, it's really hard to do things.

  • But if you can start to see, you can actually

  • have practical robotics things that

  • use computer vision to then make decisions about how

  • they should act in the world.

  • So this is a video of a bunch of robots practicing

  • picking things up, and then dropping them and picking

  • more things up, and essentially trying to grasp things.

  • And it turns out that one nice thing about robots

  • is you can actually collect the sensor data

  • and pool the experience of many robots,

  • and then collectively train on their collective experience,

  • and then get a better model of how

  • to actually grasp things, and then push that out

  • to the robots.

  • And then the next day they can all

  • practice with a slightly better grasping model,

  • because unlike human babies that you plop

  • on the carpet in your living room,

  • who don't get to pool their experience, robots can.

  • OK.

  • So in 2015, the success rate on a particular grasping task

  • of grasping objects that a robot has never seen before

  • was about 65%.

  • When we use this kind of arm farm--

  • that's what that thing is called.

  • I wanted to call it the armpit, but I was overruled.

  • Basically, by collecting a lot of experience,

  • we were actually able to get a pretty significant boost

  • in grasp success rate, up to 78%.

  • And then with further work on algorithms and more refinement

  • of the approach, we're now able to get a 96% grasp success

  • rate.

  • So this is pretty good progress in three years.

  • We've gone from failing to pick something up

  • a third of the time, which makes it very hard to string together

  • a whole sequence of actions and have robots actually

  • do things in the real world, to grasping working quite

  • reliably.

  • So that's exciting.

  • We've also been doing a lot of work

  • on how do we get robots to do things more easily.

  • Rather than having them practice themselves,

  • maybe we can demonstrate things to them.

  • So this is one of our AI residents at work.

  • They do fantastic machine learning research,

  • but they also film demonstration videos for these robots.

  • And what you see here is a simulated robot

  • trying to emulate from the raw pixels of the video what

  • it's seeing.

  • And on the right, you see a few demonstrations of pouring

  • and the robot using those video clips,

  • five or 10 seconds of someone pouring something,

  • and some reinforcement learning based trials to attempt

  • to learn to pour on its own.

  • After 15 trials and about 15 minutes of training,

  • it's able to pour that well, I would

  • say at the level of a four-year-old,

  • not an eight-year-old.

  • But still, in 15 minutes of effort, it's able to get

  • to that level of success, which is a pretty big deal.

  • OK.

  • One of the other areas that was in the grand challenges

  • was advanced health informatics.

  • I think you saw in the keynote yesterday

  • the work on lung cancer.

  • We've also been doing a lot of work

  • on an eye disease called diabetic retinopathy, which

  • is the fastest growing cause of blindness in the world.

  • There's 115 million people in the world with diabetes.

  • And each of them ideally would be screened every year

  • to see if they have diabetic retinopathy, which

  • is a degenerative eye disease that if you catch in time

  • it's very treatable.

  • But if you don't catch it in time,

  • you can suffer full or partial vision loss.

  • And so it's really important that we

  • be able to screen everyone that is at risk for this.

  • And that means regular screening.

  • This is the image that you get

  • to see as an ophthalmologist.

  • And in India, for example, there's

  • a shortage of more than 100,000 eye doctors

  • to do the necessary amount of screening of this disease.

  • And so 45% of patients suffer vision loss

  • before they're diagnosed, which is tragic,

  • because it's a completely preventable thing if you

  • catch it in time.

  • And basically, the way an ophthalmologist looks at this

  • is they look at these images and they grade it

  • on a five point scale, one, two, three, four, or five,

  • looking for things like these little hemorrhages

  • that you see on the right hand side.

  • And it's a little subjective.

  • So if you ask two ophthalmologists

  • to grade the same image, they agree on the score, one, two,

  • three, four, or five, 60% of the time.

  • And if you ask the same ophthalmologist

  • to grade the same image a few hours later,

  • they agree with themselves 65% of the time.

  • And this is why second opinions are useful in medicine,

  • because some of these things are actually quite subjective.

  • And it's actually a big deal, because the difference

  • between a two and a three is the difference between go away and come

  • back in a year versus we'd better get you

  • into the clinic next week.

  • Nonetheless, this is actually a computer vision problem.

  • And so instead of having a classification of a thousand

  • general categories of dogs and leopards,

  • you can actually just have five categories of the five

  • levels of diabetic retinopathy and train

  • the model on eye images and an assessment

  • of what the score should be.

  • And if you do that, you can actually

  • get the images labeled by several ophthalmologists, six

  • or seven, so that you reduce the variance that you

  • see between ophthalmologists assessing the same image.

  • If five of them say it's a two

  • and two of them say it's a three, it's probably more like a two

  • than a three.

  • And if you do that, then you can essentially

  • get a model that is on par with or slightly better

  • than the average board-certified ophthalmologist

  • at doing this task, which is great.
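
A hedged sketch of the labeling idea just described: average several graders' one-to-five scores into a soft target and train a five-way classifier on retinal images. The grades, image size, and model here are illustrative placeholders, not the published pipeline.

```python
import numpy as np
import tensorflow as tf

def soft_label(grades, num_classes=5):
    """Turn several ophthalmologists' grades (1..5) into a soft target,
    so the model trains on the consensus rather than one noisy opinion."""
    counts = np.bincount(np.asarray(grades) - 1, minlength=num_classes)
    return counts / counts.sum()

# e.g. five graders say 2 and two say 3 -> mostly a "2"
print(soft_label([2, 2, 2, 2, 2, 3, 3]))  # approximately [0, 0.71, 0.29, 0, 0]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(299, 299, 3)),   # retinal fundus image (placeholder size)
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
# model.fit(images, soft_labels)  # soft_labels built with soft_label() above
```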

  • This is work published at the end of 2016

  • by my colleagues in "JAMA," which is a top medical journal.

  • We wanted to do even better, though.

  • So it turns out you can instead

  • get the images labeled by retinal specialists, who

  • have more training in retinal eye disease.

  • And instead of getting independent assessments,

  • you get three retinal specialists

  • in a room for each image.

  • And you essentially say, OK, you all

  • have to come up with an adjudicated number.

  • What number do you agree on for each image?

  • And if you do that, then you can train

  • on the output of this consensus of three retinal specialists.

  • And you actually now have a model

  • that is on par with retinal specialists, which

  • is the gold standard of care in this area,

  • rather than the not as good model

  • trained on an ophthalmologist's opinion.

  • And so this is something that we've

  • seen borne out: where you have really good, high-quality

  • training data, you can actually

  • then train a model on that and get

  • the expertise of retinal specialists into the model.

  • But the other neat thing is you can actually make

  • completely new discoveries.

  • So someone new joined the ophthalmology research team,

  • and as a warm-up exercise to understand

  • how our tools worked,

  • Lily Peng, who was on the stage yesterday,

  • said, oh, why don't you go see if you

  • can predict age and gender from the retinal image,

  • just to see if the person could get that

  • machine learning pipeline going?

  • And ophthalmologists can't predict gender

  • from an eye image.

  • They don't know how to do that.

  • And so Lily thought the AUC that you'd see on this

  • should be no better than flipping a coin.

  • You'd see a 0.5.

  • And the person went away and they

  • said, OK, I've got it done.

  • My AUC is 0.7.

  • And Lily is like, hmm, that's weird.

  • Go check everything and come back.

  • And so they came back and they said,

  • OK, I've made a few improvements.

  • It's now 0.8.

  • That got people excited because all of a sudden

  • we realized you can actually predict

  • a whole bunch of interesting things from a retinal image.

  • In particular, you can actually detect

  • someone's self-reported sex.

  • And you can predict a whole bunch of other things

  • like their age, things about their systolic and diastolic

  • blood pressure, their hemoglobin level.

  • And it turns out if you combine those things together,

  • you can get a prediction of someone's cardiovascular risk

  • at the same level of accuracy as a normally much more

  • invasive blood test, where you have to draw blood, send it off

  • to the lab, wait 24 hours, and get the lab test back.

  • Now you can just do that with a retinal image.

  • So there's real hope that this could be a new thing

  • that if you go to the doctor you'll get

  • a picture of your eye taken.

  • And we'll have a longitudinal history of your eye

  • and be able to learn new things from it.

  • So we're pretty excited about that.

  • A lot of the grand challenges were around understanding

  • molecules and chemistry better.

  • One is engineer better medicines.

  • But this work that I'm going to show you

  • might apply to some of these other things.

  • So one of the things quantum chemists want to be able to do

  • is predict properties of molecules.

  • You know, will this thing bind to this other thing?

  • Is it toxic?

  • What are its quantum properties?

  • And the normal way they do this is they

  • have a really computationally expensive simulator.

  • And you plug in this molecule configuration.

  • You wait about an hour.

  • And at the end of that you get the output, which says, OK,

  • here are the things the simulator told you.

  • But it's a slow process.

  • You can't consider as many different molecules

  • as you might like to.

  • It turns out you can use the simulator

  • as a teacher for a neural net.

  • So you can do that.

  • And then all of a sudden you have a neural net

  • that can basically learn to do what the simulator can

  • do but way faster.

  • And so now you have something that

  • is about 300,000 times faster.

  • And you can't distinguish the accuracy

  • of the output of the neural net versus the simulator.
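
A minimal sketch of this "simulator as teacher" pattern, with the expensive simulator replaced by a hypothetical stand-in function: run the slow simulator offline to label a training set, then fit a fast neural net surrogate and use it for large-scale screening.

```python
import numpy as np
import tensorflow as tf

def expensive_simulator(features):
    """Stand-in for an hour-long quantum-chemistry calculation."""
    return np.sum(np.sin(features), axis=-1, keepdims=True)

# 1. Use the slow simulator (offline) to label a training set.
x_train = np.random.rand(10_000, 64).astype("float32")   # molecule features (placeholder)
y_train = expensive_simulator(x_train).astype("float32")  # simulated property

# 2. Train a cheap surrogate network to approximate the simulator.
surrogate = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),
])
surrogate.compile(optimizer="adam", loss="mse")
surrogate.fit(x_train, y_train, epochs=5, verbose=0)

# 3. Screening a huge batch of candidates is now a fast forward pass.
candidates = np.random.rand(1_000_000, 64).astype("float32")
predicted_properties = surrogate.predict(candidates, batch_size=4096)
```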

  • And so that's a completely game changing thing

  • if you're a quantum chemist.

  • All of a sudden your tool is sped up by 300,000 times.

  • And all of a sudden that means you

  • can do a very different kind of science.

  • You can say, oh, while I'm going to lunch

  • I should probably screen 100 million molecules.

  • And when I come back, I'll have 1,000

  • that might be interesting.

  • So that's a pretty interesting trend.

  • And I think it's one that will play out

  • in lots and lots of different scientific fields

  • or engineering fields where you have this really

  • expensive simulator but you can actually

  • learn to approximate it with a much cheaper neural net

  • or machine learning based model and get

  • a simulator that's much faster.

  • OK.

  • Engineer the tools of scientific discovery.

  • I have a feeling this 14th one was just

  • kind of a vague catch-all thing that the panel of experts that

  • was convened decided to include.

  • But it's pretty clear that if machine learning is going

  • to be a big part of scientific discovery and engineering,

  • we want good tools to express machine learning algorithms.

  • And so that's the motivation for why

  • we created TensorFlow: we wanted to have tools

  • that we could use to express our own machine learning ideas

  • and share them with the rest of the world,

  • to have other researchers exchange machine learning ideas,

  • and to put machine learning models into practice in products

  • and other environments.

  • And so we released this at the end of 2015

  • with this Apache 2.0 license.

  • And basically it has this graph based computational model

  • that you can then optimize with a bunch of traditional compiler

  • optimizations and it then can be mapped

  • onto a variety of different devices.

  • So you can run the same computation

  • on CPUs or GPUs or our TPUs that I'll tell you about in

  • a minute.

  • Eager mode, which is coming in TensorFlow 2.0,

  • makes this graph implicit rather than explicit.
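
A small sketch of that graph-based model, assuming TensorFlow 2.x: eager mode runs operations immediately, while wrapping the same computation in `tf.function` traces it into a graph that can be optimized and mapped onto CPUs, GPUs, or TPUs.

```python
import tensorflow as tf

# Eager mode: this runs immediately, like ordinary Python.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
print(tf.matmul(x, x))

# tf.function traces the same computation into a graph, which TensorFlow
# can then optimize and place on different devices (CPU, GPU, TPU).
@tf.function
def matmul_twice(a):
    return tf.matmul(tf.matmul(a, a), a)

print(matmul_twice(x))
```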

  • And the community seems to have adopted

  • TensorFlow reasonably well.

  • And we've been excited by all the different things

  • that we've seen other people do, both in terms

  • of contributing to the core TensorFlow system

  • but also making use of it to do interesting things.

  • And so it's got some pretty good engagement stats.

  • Fifty million downloads for a fairly specialized programming

  • package is a fair number that seems

  • like a good mark of traction.

  • And we've seen people do things.

  • So I mentioned this in the keynote yesterday.

  • I like this one.

  • It's basically a company building fitness trackers

  • for cows so you can tell which of your 100 dairy cows

  • is behaving a little strangely today.

  • There is a research team at Penn State and the International

  • Institute of Tropical Agriculture in Tanzania

  • that is building a machine learning model that

  • can run on device on a phone in the middle of a cassava field

  • without any network connection to actually detect

  • does this cassava plant have disease

  • and how should I treat it.

  • I think this is a good example of how

  • we want machine learning to run in lots

  • and lots of environments.

  • In lots of places in the world, sometimes

  • you have connectivity and sometimes you don't.

  • In a lot of cases you want it to run on device.

  • And it's really going to be the future.

  • You're going to have machine learning models running

  • on tiny microcontrollers, all kinds of things like this.
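
As a sketch of what on-device deployment can look like, assuming the standard TensorFlow Lite converter (the model and its disease classes are hypothetical): a trained Keras model is converted into a compact `.tflite` file that can run on a phone with no network connection.

```python
import tensorflow as tf

# Assume `model` is a trained Keras model (e.g. a plant-disease classifier).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # hypothetical disease classes
])

# Convert to TensorFlow Lite for on-device inference, with optional
# quantization to shrink the model for phones or microcontrollers.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```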

  • OK.

  • I'm going to use the remaining time to take you on a tour

  • through some researchy projects and then sketch how they might

  • fit together in the future.

  • So I believe what we want is we want bigger machine learning

  • models than we have today.

  • But in order to make that practical,

  • we want models that are sparsely activated.

  • So think of a giant model, maybe with 1,000 different pieces.

  • But you activate 20 or 30 of those pieces for any given

  • example, rather than the entire set of 1,000 pieces.

  • We know this is a property that real organisms have

  • in their neural systems: most of their neural capacity

  • is not active at any given point.

  • That's partly how they're so power efficient.

  • Right.

  • So some work we did a couple of years ago at this point

  • is what we call a sparsely gated mixture of experts layer.

  • And the essential idea is these pink rectangles here

  • are normal neural net layers.

  • But between a couple of neural net layers,

  • we're going to insert another collection

  • of tiny little neural nets that we call experts.

  • And we're going to have a gating network that's

  • going to learn to activate just a few of those.

  • It's going to learn which of those experts

  • is most effective for a particular kind of example.

  • And the expert might have a lot of parameters.

  • It might be a pretty large matrix of parameters.

  • And we're going to have a lot of them.

  • So we have in total eight billion-ish parameters.

  • But we're going to activate just a couple of the experts

  • on any given example.

  • And you can see that when you learn to route things,

  • you try to learn to use the expert that

  • is most effective at this particular example.

  • And when you send it to multiple experts,

  • that gives you a signal to train the routing network,

  • the gating network so that it can learn that this expert is

  • really good when you're talking about language that

  • is about innovation and researchy things

  • like you see on the left hand side.

  • And this center expert is really good at talking

  • about playing a leading role and central role.

  • And the one on the right is really good at kind

  • of quicky adverby things.

  • And so they actually do develop very different kinds

  • of expertise.
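
A toy sketch of a sparsely gated mixture-of-experts layer under simplifying assumptions (a handful of dense experts, top-2 gating, no load-balancing loss): a gating network scores the experts, only the top few are selected per example, and their outputs are combined weighted by the gate.

```python
import tensorflow as tf

class SparseMoE(tf.keras.layers.Layer):
    """Toy mixture-of-experts layer: route each example to its top-k experts."""

    def __init__(self, num_experts=8, expert_units=64, k=2):
        super().__init__()
        self.k = k
        self.experts = [tf.keras.layers.Dense(expert_units, activation="relu")
                        for _ in range(num_experts)]
        self.gate = tf.keras.layers.Dense(num_experts)

    def call(self, x):
        gate_logits = self.gate(x)                          # [batch, num_experts]
        top_vals, top_idx = tf.math.top_k(gate_logits, k=self.k)
        weights = tf.nn.softmax(top_vals, axis=-1)          # [batch, k]
        # Dense mask of gate weights; zeros for experts that are not selected.
        mask = tf.reduce_sum(
            tf.one_hot(top_idx, depth=len(self.experts)) * weights[..., None],
            axis=1)                                         # [batch, num_experts]
        # For clarity we run every expert here; a real implementation would
        # compute only the selected experts to get the sparsity savings.
        expert_outs = tf.stack([e(x) for e in self.experts], axis=1)
        return tf.reduce_sum(mask[..., None] * expert_outs, axis=1)

layer = SparseMoE()
out = layer(tf.random.normal([4, 32]))   # shape [4, 64]
```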

  • And the nice thing about this is if you

  • compare this in a translation task with the bottom row,

  • you can essentially get a significant improvement

  • in translation accuracy.

  • That's the BLEU score there.

  • So a one-BLEU-point improvement is a pretty significant thing.

  • We really like one-BLEU-point improvements.

  • And because it has all this extra capacity,

  • we can actually make the sizes of the pink layers

  • smaller than they were in the original model.

  • And so we can actually shrink the amount

  • of computation used per word by about a factor of two,

  • so 50% cheaper inference.

  • And the training time goes way down because we just

  • have all this extra capacity.

  • And it's easier to train a model with a lot of parameters.

  • And so we have about 1/10 the training cost

  • in terms of GPU days.

  • OK.

  • We've also been doing a lot of work

  • on AutoML, which is this idea behind automating some

  • of the machine learning tasks that a machine learning

  • researcher or engineer does.

  • And the idea behind AutoML is currently

  • you think about solving a machine learning problem

  • where you have some data.

  • You have some computation.

  • And you have an ML expert sit down.

  • And they do a bunch of experiments.

  • And they kind of stir it all together

  • and run lots of GPU days worth of effort.

  • And you hopefully get a solution.

  • So what if we could use

  • more computation to replace some of the experimentation

  • that someone with a lot of machine learning experience

  • would actually do?

  • And one of the decisions that a machine learning expert makes

  • is what architecture, what neural network structure

  • makes sense for this problem.

  • You know, should I use a 13 layer model or a nine layer

  • model?

  • Should it have three by three or five by five filters?

  • Should it have skip connections or not?

  • And so if you're willing to say let's try to take this

  • up a level and do some meta learning,

  • then we can basically have a model that generates models

  • and then try those models on the problem we actually care about.

  • So the basic iteration of meta learning here

  • is we're going to have a model generating model.

  • We're going to generate 10 models.

  • We're going to train each of those models.

  • And we're going to see how well they each work

  • on the problem we care about.

  • And we're going to use the loss or the accuracy of those models

  • as a reinforcement learning signal for the model generating

  • model so that we can steer away from models that didn't seem

  • to work very well and towards models

  • that seem to work better.

  • And then we just repeat a lot.

  • And when we repeat a lot, we essentially

  • get more and more accurate models over time.

  • And it works.
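
A heavily simplified sketch of that meta-learning loop. The controller here is a plain REINFORCE-style policy over a few discrete architecture choices, and `train_and_score` is a hypothetical stand-in for actually training each candidate model and measuring its validation accuracy.

```python
import numpy as np

# Discrete choices the "model-generating model" can make.
CHOICES = {"layers": [3, 6, 9, 13], "filters": [32, 64, 128], "skip": [0, 1]}

# One learnable logit per option in each choice (a very small policy).
logits = {name: np.zeros(len(opts)) for name, opts in CHOICES.items()}

def sample_architecture():
    arch, idx = {}, {}
    for name, opts in CHOICES.items():
        p = np.exp(logits[name]); p /= p.sum()
        i = np.random.choice(len(opts), p=p)
        arch[name], idx[name] = opts[i], i
    return arch, idx

def train_and_score(arch):
    """Hypothetical stand-in: really this would train the candidate model
    and return its validation accuracy."""
    return 0.5 + 0.04 * arch["skip"] + 0.01 * arch["layers"] - 1e-4 * arch["filters"]

baseline, lr = 0.0, 0.5
for step in range(200):
    arch, idx = sample_architecture()
    reward = train_and_score(arch)
    baseline = 0.9 * baseline + 0.1 * reward
    # REINFORCE update: push up the log-probability of choices that beat the baseline.
    for name, opts in CHOICES.items():
        p = np.exp(logits[name]); p /= p.sum()
        grad = -p; grad[idx[name]] += 1.0          # d log p(choice) / d logits
        logits[name] += lr * (reward - baseline) * grad

print({name: CHOICES[name][int(np.argmax(l))] for name, l in logits.items()})
```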

  • And it produces models that are a little strange looking.

  • Like they're a little more unstructured

  • than a model that a human

  • might have designed.

  • So here we have all these crazy skip connections.

  • But they're analogous to some of the ideas

  • that machine learning researchers themselves

  • have come up with.

  • For example, the ResNet architecture

  • has a more structured style of skip connection.

  • But the basic idea is you want information

  • to be able to flow more directly from the input to the output

  • without going through as many intermediate computational

  • layers.

  • And the system seems to have developed

  • that intuition itself.

  • And the nice thing is these models actually

  • work pretty well.

  • So if you look at this graph, accuracy

  • is on the y-axis for the ImageNet problem.

  • And computational cost of the models,

  • which are represented by dots here, is on the x-axis.

  • So generally, you see this trend where

  • if you have a more computationally expensive

  • model, you generally get higher accuracy.

  • And each of these black dots here

  • is something that was a significant amount of effort

  • by a bunch of top computer vision researchers or machine

  • learning researchers that then they

  • published and advanced the state of the art at the time.

  • And so if you apply AutoML to this problem, what you see

  • is that you actually exceed the frontier of the hand

  • generated models that the community has come up with.

  • And you do this both at the high end,

  • where you care most about accuracy

  • and don't care as much about computational costs

  • so you can get a model that's slightly more accurate

  • with less computational cost.

  • And at the low end, you can get a model

  • that's significantly more accurate for a very small

  • amount of computational cost.

  • And that, I think, is a pretty interesting result.

  • It says that we should really let computers and machine

  • learning researchers work together

  • to develop the best models for these kinds of problems.

  • And we've turned this into a product.

  • So we have Cloud AutoML as a Cloud product.

  • And you can try that on your own problem.

  • So if you were maybe a company that

  • doesn't have a lot of machine learning researchers,

  • or machine learning engineers yourselves,

  • you can actually just take in a bunch of images

  • and the categories of things you want to predict-- maybe you

  • have pictures from your assembly line,

  • and you want to predict which part this is an image of.

  • You can actually get a high quality model for that.

  • And we've extended this to things more than just vision.

  • So you can do videos, and language, and translation.

  • And more recently we've introduced something

  • that allows you to predict relational data

  • from other relational data.

  • For example, you might want to predict whether this customer

  • will buy something given their past orders.

  • We've also obviously continued research in the AutoML field.

  • So we've got some work looking at the use of evolution

  • rather than reinforcement learning for the search,

  • learning the optimization update rule,

  • learning the nonlinearity function rather than just

  • assuming we should use ReLUs

  • or some other kind of activation function.

  • We've actually got some work on incorporating

  • both inference latency and the accuracy.

  • Let's say you want a really good model that has

  • to run in seven milliseconds.

  • We can find the most accurate model

  • that will run in your allowed time budget by using a more

  • complicated reward function, sketched below after this list.

  • We can learn how to augment data so that you can stretch

  • the amount of labeled data you have in interesting ways,

  • more effectively than hand-written data augmentation rules.

  • And we can explore lots of architectures

  • to make this whole search process a bit more efficient.
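
As a sketch of the latency-aware reward mentioned above, the specific functional form below is an assumption modeled on published multi-objective architecture-search work, not necessarily the exact reward used internally: accuracy is scaled by how far the measured latency is from the target budget.

```python
def search_reward(accuracy, latency_ms, target_ms=7.0, w=-0.07):
    """Multi-objective reward: trade accuracy against an inference-time budget.
    Models slower than the target are penalized, faster ones mildly rewarded."""
    return accuracy * (latency_ms / target_ms) ** w

print(search_reward(accuracy=0.76, latency_ms=7.0))    # exactly on budget: 0.76
print(search_reward(accuracy=0.78, latency_ms=14.0))   # 2x over budget: penalized
```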

  • OK.

  • But it's clear if we're going to try these approaches,

  • we're going to need more computational power.

  • And I think one of the truisms of machine learning

  • over the last decade or so is more computational

  • power tends to get better results

  • when you have enough data.

  • And so it's really nice that deep learning

  • is this really broadly useful tool

  • across so many different problem domains,

  • because that means you can start to think about specializing

  • hardware for deep learning but have

  • it apply to many, many things.

  • And so there are two properties that deep learning algorithms

  • tend to have.

  • One is they're very tolerant of reduced precision.

  • So if you do calculations to one decimal digit of precision,

  • that's perfectly fine with most of these algorithms.

  • You don't need six or seven digits of precision.

  • And the other thing is that they are all--

  • all these algorithms I've shown you are made up

  • of a handful of specific operations, things like matrix

  • multiplies, vector dot products, essentially

  • dense linear algebra.

  • So if you can build machines, computers,

  • that are really good at reduced precision dense linear algebra,

  • then you can accelerate lots of these machine learning

  • algorithms quite a lot compared to more general purpose

  • computers that have general purpose CPUs that

  • can run all kinds of things or even

  • GPUs which tend to be somewhat good at this but tend to have,

  • for example, higher precision than you might want.
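
A small illustration of the reduced-precision point, assuming TensorFlow is available: the inputs to a matrix multiply are rounded to bfloat16 precision, and the result barely changes relative to full float32, which is why accelerators can get away with much smaller, faster multiplier units.

```python
import tensorflow as tf

a = tf.random.normal([256, 256])
b = tf.random.normal([256, 256])

full = tf.matmul(a, b)  # ordinary float32 matrix multiply

# Round the inputs to bfloat16 precision (roughly 2-3 decimal digits),
# then multiply: this mimics what a reduced-precision accelerator computes.
low_a = tf.cast(tf.cast(a, tf.bfloat16), tf.float32)
low_b = tf.cast(tf.cast(b, tf.bfloat16), tf.float32)
low = tf.matmul(low_a, low_b)

rel_err = tf.norm(full - low) / tf.norm(full)
print(float(rel_err))   # typically a fraction of a percent
```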

  • So we started to think about building

  • specialized hardware when I did this kind of thought

  • exercise in 2012.

  • We were starting to see the initial success

  • of deep neural nets for speech recognition

  • and for image recognition and starting

  • to think about how would we deploy

  • these in some of our products.

  • And so there was this scary moment

  • where we realized that if speech started to work really well,

  • and at that time we couldn't run it

  • on device because the devices didn't

  • have enough computational power, what

  • if 100 million users started talking to their phones

  • for three minutes a day, which is not implausible if speech

  • starts to work a lot better.

  • And if we were running the speech models on CPUs,

  • we would need to double the number of computers in Google data

  • centers, which is a slightly terrifying prospect just to launch

  • one feature in one product.

  • And so we started to build these specialized

  • processors for the deep learning algorithms we wanted to run,

  • and TPU v1, which has been in production use

  • since 2015, was really the outcome of that thought

  • exercise.

  • And it's used in production on every search query you do,

  • on every translation you do, for speech processing and image

  • processing, and AlphaGo used a collection of these.

  • These are the actual racks of machines that

  • competed in the AlphaGo match.

  • You can see the little Go board we've

  • commemorated them with on the side.

  • And then we started to tackle the bigger problem of not just

  • inference, where you already have a trained model

  • and you just want to apply it, but how you actually do

  • training in an accelerated way.

  • And so the second version of TPUs

  • are for training and inference.

  • And that's one of the TPU devices,

  • which has four chips on it.

  • This is TPU V3, which also has four chips on it.

  • It's got water cooling.

  • So it's slightly scary to have water in your computers,

  • but we do.

  • And then we designed these systems

  • to be configured together into larger configurations we

  • call pods.

  • So this is a TPU V2 pod.

  • This is a bigger TPU V3 pod with water cooling.

  • You can actually see one of the racks of this in the machine

  • learning dome.

  • And really these things actually do provide

  • a lot of computational power.

  • Individual devices with four chips

  • are up to 420 teraflops and have a fair amount of memory.

  • And then the actual pods themselves are

  • up to 100 petaflops of compute.

  • This is a pretty substantial amount of compute

  • and really lets you very quickly try machine

  • learning research experiments, train very large production

  • models on large data sets, and these are also

  • now available through our cloud products.

  • As of yesterday, I think we announced them to be in beta.

  • One of the keys to performance here

  • is that the network interconnect between the chips in the pods

  • is actually a super-high-speed 2D

  • mesh with wraparound links.

  • That's what makes it toroidal.

  • And that means you can essentially program this thing

  • as if it's a single computer.

  • And the software underneath the covers

  • takes care of distributing the computation appropriately

  • and can do very fast all-reduce operations

  • and broadcast operations.
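
A sketch of how that "program it as a single computer" experience looks from user code, assuming the TensorFlow 2.x TPUStrategy API; the TPU name and the model are placeholders.

```python
import tensorflow as tf

# Connect to the TPU (the name is a placeholder) and let the runtime handle
# placing the computation and doing the cross-chip all-reduces.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Variables created here are mirrored across the TPU cores; gradients
    # are combined with fast all-reduce operations under the covers.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(224, 224, 3)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1000, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# model.fit(dataset, epochs=...)  # the same code scales from a few cores to a pod
```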

  • And so, for example, you can use a full TPU V2 pod

  • to train an ImageNet model in 7.9 minutes.

  • Versus the same problem using eight GPUs,

  • you get 27 times faster training at lower cost.

  • The V3 pod is actually even substantially larger.

  • You can train an ImageNet model from scratch

  • in less than two minutes, processing more than a million images

  • per second in training, which is essentially the entire ImageNet

  • data set every second.

  • And you can train very large BERT language models,

  • for example, as I was discussing on stage

  • in the keynote yesterday in about 76 minutes

  • on a fairly large corpus of data which normally would take days.

  • And so that really helps make our researchers

  • and ML production systems more productive

  • by being able to experiment more quickly.

  • If you can run an experiment in two minutes, that's

  • a very different kind of science and engineering

  • you do than if that same experiment would

  • take you a day and a half.

  • Right.

  • You just think about running more experiments,

  • trying more things.

  • And we have lots of models already available.

  • OK.

  • So let's take some of the ideas we talked about

  • and think about how they might fit together.

  • So I said we want these really large models

  • but have them be sparsely activated.

  • I think one of the things we're doing wrong in machine learning

  • is we tend to train a machine learning model

  • to do a single thing.

  • And then when we have a different problem,

  • we tend to train a different model to do that other thing.

  • And I think really we should be thinking about how can we

  • train models that do many, many things

  • and leverage the expertise that they have

  • in doing many things to then be able to take on a new task

  • and learn to do that new task more quickly and with less

  • data.

  • This is, essentially, multi task learning.

  • But often multi task learning in practice today

  • means three or four or five tasks, not

  • thousands or millions.

  • I think we really want to be thinking bigger and bolder

  • about really doing in the limit one model for all of the things

  • we care about.

  • And obviously, we're going to try

  • to train this large model using fancy ML hardware.

  • OK.

  • So how might this look?

  • So I imagine we've trained a model

  • on a bunch of different tasks.

  • And it's learned these different components,

  • which can be sometimes shared across different tasks,

  • sometimes independent, specialized

  • for a particular task.

  • And now a new task comes along.

  • So with the AutoML style reinforcement learning,

  • we should be able to use an RL algorithm to find pathways

  • through this model that actually get us

  • into a pretty good state for that new task,

  • because it hopefully has some commonalities with other things

  • we've already learned.

  • And then we might have some way to add capacity to the system

  • so that for a task where we really care about accuracy,

  • we can add a bit of capacity and start to use that for this task

  • and have that pathway be more specialized for that task

  • and therefore hopefully more accurate.

  • And I think that's an interesting direction to go in.

  • How can we think more about building a system like that,

  • rather than the current kind of models we have today, where

  • we tend to fully activate the entire model for every example

  • and tend to have a model just for a single task?
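
To make the "pathways through shared components" picture slightly more concrete, here is a purely illustrative toy of my own construction, not a description of any existing system: a shared pool of component layers, with each task owning a small route that selects which few components it activates.

```python
import tensorflow as tf

# A shared pool of candidate components (most are inactive for any one task).
components = [tf.keras.layers.Dense(64, activation="relu") for _ in range(16)]

# Per-task routes: which few components a task activates, plus its own head.
# In the research vision, these routes would be found by an RL-style search.
task_routes = {
    "translate": [0, 3, 7],
    "caption":   [0, 5, 9],
    "new_task":  [0, 3, 9, 15],   # reuses components 0 and 3, adds capacity via 15
}
task_heads = {name: tf.keras.layers.Dense(10) for name in task_routes}

def run_task(task_name, x):
    """Activate only the components on this task's pathway."""
    h = x
    for i in task_routes[task_name]:
        h = components[i](h)
    return task_heads[task_name](h)

x = tf.random.normal([2, 64])
print(run_task("new_task", x).shape)   # (2, 10)
```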

  • OK.

  • I want to close on how we should be thinking about using machine

  • learning and all the different places

  • that we might consider using it.

  • And I think one of the things that I'm

  • really proud of as a company is that last year we published

  • a set of principles by which we think

  • about how we're going to use machine learning

  • for different things.

  • And I think about these seven things: when

  • we look at using machine learning in any of our products

  • or settings, we think carefully about how we're actually

  • fulfilling these principles by using

  • machine learning in this way.

  • And I think there's more on the actual principles website

  • that you can go find, but I think this is really, really

  • important.

  • And I'll point out that some of these things

  • are evolving research areas as well as principles

  • that we want to apply.

  • So for example, number two, avoid creating or reinforcing

  • unfair bias.

  • And bias in machine learning models

  • is a very real problem that you get from a variety of sources.

  • Could be you have biased training data.

  • Could be you're training on real-world data

  • and the world itself is biased

  • in ways that we don't want.

  • And so there is research that we can apply and extend on

  • how to reduce or eliminate bias

  • from machine learning models.

  • And so this is an example of some of the work

  • we've been doing on bias and fairness.

  • But what we try to do in our use of ML models

  • is apply the best known practices

  • for our actual production use but also

  • advance the state of the art in understanding bias and fairness

  • and making it better.

  • And so with that, in conclusion, deep neural nets and machine

  • learning are really tackling some of the world's

  • great challenges I think.

  • I think we're really making progress in a number of areas.

  • There's a lot of interesting problems

  • to tackle and to still work on.

  • And they're going to affect not just computer science.

  • Right.

  • We're affecting many, many aspects of human endeavor

  • like medicine, science, other kinds of things.

  • And so I think it's a great responsibility

  • that we have to make sure that we do these things right

  • and to continue to push for the state of the art

  • and apply it to great things.

  • So thank you very much.

  • [MUSIC PLAYING]
