
  • RAYMOND BLUM: Hi, everybody.

  • So I'm Raymond.

  • I'm not television's Raymond Blum,

  • that you may remember from "The Facts of Life."

  • I'm a different Raymond Blum.

  • Private joke.

  • So I work in site reliability at Google

  • in technical infrastructure storage.

  • And we're basically here to make sure that things, well,

  • we can't say that things don't break.

  • What we can say is that we can recover from them.

  • Because, of course, we all know things break.

  • Specifically, we are in charge of making sure

  • that when you hit Send, it stays in your sent mail.

  • When you get an email, it doesn't go away

  • until you say it should go away.

  • When you save something in Drive,

  • it's really there, as we hope and expect forever.

  • I'm going to talk about some of the things

  • that we do to make that happen.

  • Because the universe and Murphy and entropy

  • all tell us that that's impossible.

  • So it's a constant, never-ending battle

  • to make sure things actually stick around.

  • I'll let you read the bio.

  • I'm not going to talk about that much.

  • Common backup strategies that work maybe outside of Google

  • really don't work here, because they typically scale effort

  • with capacity and demand.

  • So if you want twice as much data backed up,

  • you need twice as much stuff to do it.

  • Stuff being some product of time, energy, space, media, et

  • cetera.

  • So that maybe works great when you go from a terabyte

  • to two terabytes.

  • Not when you go from an exabyte to two exabytes.

  • And I'm going to talk about some of the things we've tried.

  • And some of them have failed.

  • You know, that's what the scientific method is for,

  • right?

  • And then eventually, we find the things that work,

  • when our experiments agree with our expectations.

  • We say, yes, this is what we will do.

  • And the other things we discard.

  • So I'll talk about some of the things we've discarded.

  • And, more importantly, some of the things

  • we've learned and actually do.

  • Oh, and there's a slide.

  • Yes.

  • Solidifying the Cloud.

  • I worked very hard on that title, by the way.

  • Well, let me go over the outline first, I guess.

  • Really what we consider, and I personally

  • obsess over this, is that it's much more important.

  • You need a much higher bar for availability of data

  • than you do for availability of access.

  • If a system is down for a minute, fine.

  • You hit Submit again on the browser.

  • And it's fine.

  • And you probably blame your ISP anyway.

  • Not a big deal.

  • On the other hand, if 1% of your data goes away,

  • that's a disaster.

  • It's not coming back.

  • So really durability and integrity of data

  • is our job one.

  • And Google takes this very, very seriously.

  • We have many engineers dedicated to this.

  • Really, every Google engineer understands this.

  • All of our frameworks, things like Bigtable and formerly

  • GFS, now Colossus, all are geared towards ensuring this.

  • And there's lots of systems in place

  • to check and correct any lapses in data availability

  • or integrity.

  • Another thing we'll talk about is redundancy,

  • which people think makes stuff recoverable,

  • but we'll see why it doesn't in a few slides from now.

  • Another thing is MapReduce.

  • Both a blessing and a curse.

  • A blessing that you can now run jobs

  • on 30,000 machines at once.

  • A curse that now you've got files

  • on 30,000 machines at once.

  • And you know something's going to fail.

  • So we'll talk about how we handle that.
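
To put rough numbers on that "something's going to fail" intuition, here is a quick illustrative calculation; the per-machine failure probability below is an assumption made up for the example, not an actual Google figure.

```python
# Illustrative arithmetic: with 30,000 machines, even a tiny per-machine
# failure probability makes at least one failure during a job nearly certain.
machines = 30_000
p_fail = 1e-4  # assumed per-machine chance of failing during the job (hypothetical)

p_at_least_one = 1 - (1 - p_fail) ** machines
print(f"P(at least one of {machines} machines fails) = {p_at_least_one:.3f}")
# -> roughly 0.95 with these assumed numbers
```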

  • I'll talk about some of the things

  • we've done to make the scaling of the backup resources

  • not a linear function of the demand.

  • So if you have 100 times the data,

  • it should not take 100 times the effort to back it up.

  • I'll talk about some of the things

  • we've done to avoid that horrible linear slope.

  • Restoring versus backing up.

  • That's a long discussion we'll have in a little bit.

  • And finally, we'll wrap up with a case study, where Google

  • dropped some data, but luckily, my team at the time

  • got it back.

  • And we'll talk about that as well.

  • So the first thing I want to talk about is what I said,

  • my personal obsession.

  • In that you really need to guarantee

  • the data is available 100% of the time.

  • People talk about how many nines of availability

  • they have for a front end.

  • You know, if I have 3 9s, 99.9% of the time, that's good.

  • 4 9s is great.

  • 5 9s is fantastic.

  • 7 9s is absurd.

  • It's femtoseconds of outage a year.

  • It's just ridiculous.
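
As a rough reference for what those nines translate to, here is the standard back-of-the-envelope conversion from availability "nines" to allowed downtime per year (a quick illustrative calculation):

```python
# Convert "N nines" of availability into allowed downtime per year.
SECONDS_PER_YEAR = 365 * 24 * 3600

for nines in (3, 4, 5, 7):
    availability = 1 - 10 ** -nines       # e.g. 3 nines -> 0.999
    downtime_s = SECONDS_PER_YEAR * 10 ** -nines
    print(f"{nines} nines ({availability:.7f}): ~{downtime_s:,.1f} s/year of outage")
# 3 nines -> ~8.8 hours/year, 5 nines -> ~5.3 minutes/year, 7 nines -> ~3.2 s/year
```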

  • But with data, it really can't even be 100% minus epsilon.

  • Right?

  • It has to be there 100%.

  • And why?

  • This pretty much says it all.

  • If I lose 200 k of a 2 gigabyte file,

  • well, that sounds great statistically.

  • But if that's an executable, what's

  • 200 k worth of instructions?

  • Right?

  • I'm sure that the processor will find some other instruction

  • to execute for that span of the executable.

  • Likewise, these are my tax returns

  • that the government's coming to look at tomorrow.

  • Eh, those numbers couldn't have been very important.

  • Some small slice of the file is gone.

  • But really, you need all of your data.

  • That's the lesson we've learned.
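
As quick arithmetic on why "sounds great statistically" is meaningless here: 200 K out of a 2 gigabyte file is only about one hundredth of a percent of the bytes, yet it can be all of what makes the file usable.

```python
# 200 KB missing from a 2 GB file: a tiny fraction of the bytes,
# but potentially all of the file's usefulness.
lost_bytes = 200 * 1024
file_bytes = 2 * 1024 ** 3
print(f"Fraction lost: {lost_bytes / file_bytes:.6%}")  # ~0.009537%
```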

  • It's not the same as front end availability,

  • where you can get over it.

  • You really can't get over data loss.

  • A video garbled, that's the least of your problems.

  • But it's still not good to have.

  • Right?

  • So, yeah, we go for 100%.

  • Not even minus epsilon.

  • So a common thing that people think,

  • and that I thought, to be honest, when I first

  • got to Google, was, well, we'll just make lots of copies.

  • OK?

  • Great.

  • And that actually is really effective against certain kinds

  • of outages.

  • For example, if an asteroid hits a data center,

  • and you've got a copy in another data center far away.

  • Unless that was a really, really great asteroid, you're covered.

  • On the other hand, picture this.

  • You've got a bug in your storage stack.

  • OK?

  • But don't worry, your storage stack

  • guarantees that all writes are copied

  • to all other locations in milliseconds.

  • Great.

  • You now don't have one bad copy of your data.

  • You have five bad copies of your data.

  • So redundancy is far from the same thing as recoverability.

  • Right?

  • It handles certain things.

  • It gives you location isolation, but really, there

  • aren't as many asteroids as there

  • are bugs in code or user errors.

  • So it's not really what you want.
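
A minimal sketch of that failure mode, assuming a toy synchronous-replication setup (the names, regions, and the bug itself are hypothetical, not Google's storage stack): a buggy write path corrupts the data, and replication faithfully copies the corruption to every replica.

```python
# Toy model: synchronous replication copies every write, good or bad,
# to all replicas -- so a corrupting bug poisons every copy at once.
replicas = {"us-east": {}, "eu-west": {}, "asia-east": {}}

def buggy_write(key, value):
    corrupted = value[:-1]          # hypothetical bug: silently drops the last byte
    for replica in replicas.values():
        replica[key] = corrupted    # replication dutifully spreads the bad write

buggy_write("tax_return.pdf", b"...important bytes...")
# Every replica now holds the same corrupted value; no copy can restore the original.
assert all(r["tax_return.pdf"] == b"...important bytes.." for r in replicas.values())
```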

  • Redundancy is good for a lot of things.

  • It gives you locality of reference

  • for I/O. Like if your only copy is in Oregon,

  • but you have a front end server somewhere in Hong Kong.

  • You don't want to have to go across the world

  • to get the data every time.

  • So redundancy is great for that, right?

  • You can say, I want all references

  • to my data to be something fairly local.

  • Great.

  • That's why you make lots of copies.
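
One way to picture that locality benefit, as a sketch under assumptions rather than how Google actually routes reads (the region names and latencies below are made up): with replicas in several regions, a front end simply reads from the closest copy instead of crossing the world every time.

```python
# Sketch: choose the replica with the lowest assumed round-trip latency
# from the serving front end, instead of always reading from one region.
REPLICA_RTT_MS = {            # hypothetical latencies from a Hong Kong front end
    "oregon": 140,
    "taiwan": 30,
    "belgium": 220,
}

def closest_replica(rtt_ms):
    return min(rtt_ms, key=rtt_ms.get)

print(closest_replica(REPLICA_RTT_MS))  # -> "taiwan"
```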

  • But as I said, to say, I've got lots of copies of my data,

  • so I'm safe.

  • You've got lots of copies of your mistaken deletes