
  • KENNY YU: So, hi, everyone.

  • I'm Kenny.

  • I'm a software engineer at Facebook.

  • And today, I'm going to talk about this one problem--

  • how do you deploy a service at scale?

  • And if you have any questions, we'll save them for the end.

  • So how do you deploy a service at scale?

  • So you may be wondering how hard can this actually be?

  • It works on my laptop, how hard can it be to deploy it?

  • So I thought this too four years ago when I had just

  • graduated from Harvard as well.

  • And in this talk, I'll talk about why it's hard.

  • And we'll go on a journey to explore many

  • of the challenges you'll hit when you actually

  • start to run a service at scale.

  • And I'll talk about how Facebook has approached some of these challenges

  • along the way.

  • So a bit about me--

  • I graduated from Harvard in the class of 2014, concentrating in computer science.

  • I took CS50 in the fall of 2010 and then TF'd it the following year.

  • And I TF'd other classes at Harvard as well.

  • And I think one of my favorite experiences at Harvard was TFing.

  • So if you have the opportunity, I highly recommend it.

  • And after I graduated, I went to Facebook,

  • and I've been on this one team for the past four years.

  • And I really like this team.

  • My team is Tupperware, and Tupperware is Facebook's cluster management

  • system and container platform.

  • So those are a lot of big words, and my goal

  • is that by the end of this talk, you'll

  • have a good overview of the challenges we face in cluster management

  • and how Facebook is tackling some of these challenges

  • and then once you understand these, how this relates

  • to how we deploy services at scale.

  • So our goal is to deploy a service in production at scale.

  • But first, what is a service?

  • So let's first define what a service is.

  • So a service can have one or more replicas,

  • and it's a long-running program.

  • It's not meant to terminate.

  • It responds to requests and gives a response back.

  • And as an example, you can think of a web server.

  • So if you're running Python, it's a Python web server,

  • or if you're running PHP, an Apache web server.

  • It responds to requests, and you get a response back.

  • And you might have multiple of these, and multiple of these

  • together compose the service that you want to provide.
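
As a concrete sketch of such a replica, here is a minimal long-running Python web server; the port and the response body are arbitrary choices for illustration.

```python
# Minimal sketch of a single service replica: a long-running web server
# that answers requests until it is stopped. Port 8080 is arbitrary.
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Respond to every request with a small payload.
        body = b"hello from one replica\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Not meant to terminate: serve_forever() loops until the process is killed.
    HTTPServer(("0.0.0.0", 8080), HelloHandler).serve_forever()
```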

  • So as an example, Facebook uses Thrift for most of its back-end services.

  • Thrift is open source, and it makes it easy to do something

  • called Remote Procedure Calls or RPCs.

  • And it makes it easy for one service to talk to another.
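
For illustration, here is a rough sketch of what calling another service over Thrift can look like with Apache Thrift's open-source Python bindings; ProfileService, getProfile, and the profile_service module are hypothetical stand-ins for code generated from a Thrift IDL, not Facebook's actual services.

```python
# Sketch of a Thrift-style remote procedure call using Apache Thrift's
# standard Python bindings. The generated stub is hypothetical.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from profile_service import ProfileService  # hypothetical generated stub

def fetch_profile(host, port, user_id):
    transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = ProfileService.Client(protocol)
    transport.open()
    try:
        # Looks like a local function call, but executes on the remote service.
        return client.getProfile(user_id)
    finally:
        transport.close()
```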

  • So as an example, service at Facebook, let's take the website as an example.

  • So for those of you that don't know, the entire website

  • is pushed as one monolithic unit every hour.

  • And the thing that actually runs the website is hhvm.

  • It runs our version of PHP, called Hack, which is a type-safe language.

  • And both of these are open source.

  • And the way the website is deployed is that there

  • are many, many instances of this web server running in the world.

  • This service might call other services in order to fulfill your request.

  • So let's say I hit the home page for Facebook.

  • It might need to get my profile and render some ads.

  • So the service will call maybe the profile service or the ad service.

  • Anyhow, the website is updated every hour.

  • And more importantly, as a Facebook user, you don't even notice this.

  • So here's a picture of what this all looks like.

  • First, we have the Facebook web service.

  • We have many copies of our web server, hhvm, running.

  • Requests from Facebook users-- so either from your browser or from your phone.

  • They go to these replicas.

  • And in order to fulfill the responses for these requests,

  • it might have to talk to other services-- so profile service or ad

  • service.

  • And once it's gotten all the data it needs,

  • it will return the response back.

  • So how did we get there?

  • So we have something that works on our local laptop.

  • Let's say you're starting a new web app.

  • You have something working-- a prototype working-- on your laptop.

  • Now you actually want to run it in production.

  • So there are some challenges there to get that first instance running

  • in production.

  • And now let's say your app takes off.

  • You get a lot of users.

  • A lot of requests start coming to your app.

  • And now that single instance you're running

  • can no longer handle all the load.

  • So now you'd have multiple instances in production.

  • And now let's say your app-- you start to add more features.

  • You add more products.

  • The complexity of your application grows.

  • In order to simplify that, you might want

  • to extract some of the responsibilities into separate components.

  • And now instead of just having one service in production,

  • you have multiple services in production.

  • So each of these transitions involves lots of challenges,

  • and I'll go over each of these challenges along the way.

  • First, let's focus on the first one.

  • From your laptop to that first instance in production,

  • what does this look like?

  • So the first challenge you might hit when you

  • want to start that first copy in production

  • is reproducing the same environment as your laptop.

  • So some of the challenges you might hit is let's

  • say you're running a Python web app.

  • You might have various packages of Python libraries

  • or Python versions installed on your laptop,

  • and now you need to reproduce the same exact versions and libraries

  • on that production environment.

  • So versions and libraries-- you have to make sure

  • they're installed on the production environment.

  • And then also, your app might make assumptions

  • about where certain files are located.

  • So let's say my web app needs some configuration file.

  • It might be stored in one place on my laptop,

  • and it might not even exist in a production environment.

  • Or it may exist in a different location.

  • So the first challenge here is you need to reproduce

  • this environment that you have on your laptop on the production machine.

  • This includes all the files and the binaries that you need to run.

  • The next challenge is how do you make sure that other stuff on the machine

  • doesn't interfere with my work and vice versa?

  • Let's say there's something more important running on the machine,

  • and I want to make sure my dummy web app doesn't interfere with that work.

  • So as an example, let's say my service--

  • the dotted red box--

  • it should use four gigabytes of memory, maybe two cores.

  • And something else on the machine wants to use

  • two gigabytes of memory and one core.

  • I want to make sure that that other service doesn't take more memory

  • and start using some of my service's memory

  • and then cause my service to crash or slow down and vice versa.

  • I don't want to interfere with the resources used by that other service.

  • So this is a resource isolation problem.

  • You want to ensure that no other workload on the machine

  • interferes with my workload and vice versa.

  • Another problem with interference is protection.

  • Let's say I have my workload in a red dotted box,

  • and something else running on the machine, the purple dotted box.

  • One thing I want to ensure is that that other thing doesn't somehow

  • kill or restart or terminate my program accidentally.

  • Let's say there's a bug in the other program that goes haywire.

  • The effects of that service should be isolated in its own environment

  • and also that other thing shouldn't be touching important

  • files that I need for my service.

  • So let's say my service needs some configuration file.

  • I would really like it if something else doesn't touch that

  • file that I need to run my service.

  • So I want to isolate the environment of these different workloads.

  • The next problem you might have is how do you ensure that a service is alive?

  • Let's say you have your service up.

  • There's some bug, and it crashes.

  • If it crashes, this means users will not be able to use your service.

  • So imagine if Facebook went down and users are unable to use Facebook.

  • That's a terrible experience for everyone.

  • Or let's say it doesn't crash.

  • It's just misbehaving or slowing down, and then restarting it

  • might help mitigate the issue temporarily.

  • So what I'd really like is if my service has an issue,

  • please restart it automatically so that user impact is at a minimum.

  • And one way you might be able to do this is to ask the service, hey,

  • are you alive?

  • Yes.

  • Are you alive?

  • No response.

  • And then after a few seconds of that, if there's still no response,

  • restart the service.

  • So the goal is the service should always be up and running.
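
A minimal sketch of that aliveness check might look like the following supervisor loop; the /health URL, the failure threshold, and the restart command are illustrative assumptions, not any real system's API.

```python
# Sketch of an aliveness checker: poll a health endpoint, and restart the
# service if it stays unresponsive for several checks in a row.
import subprocess
import time
import urllib.request

def is_alive(url="http://localhost:8080/health", timeout=2):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors and timeouts.
        return False

def supervise(max_failures=3):
    failures = 0
    while True:
        failures = 0 if is_alive() else failures + 1
        if failures >= max_failures:
            # Unresponsive for too long: restart and reset the counter.
            subprocess.run(["systemctl", "restart", "my-web-app"])  # example command
            failures = 0
        time.sleep(5)
```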

  • So here's a summary of challenges to go from your laptop

  • to one copy in production.

  • How do you reproduce the same environment as your laptop?

  • How do you make sure that once you're running on a production machine,

  • no other workload is affecting my service,

  • and my service isn't affecting anything critical on that machine?

  • And then how do I make sure that my service is always up and running?

  • Because the goal is to have users be able to use your service all the time.

  • So there are multiple ways to tackle this issue.

  • Two typical ways that companies have approached this problem

  • is to use virtual machines and containers.

  • So for virtual machines, the way that I think about it is you

  • have your application.

  • It's running on top of an operating system

  • and that operating system is running on top of another operating system.

  • So if you ever use dual boot on your Mac,

  • you're running Windows inside a Mac-- that's very similar idea.

  • There are some issues with this.

  • It's usually slower to create a virtual machine,

  • and there is also an efficiency cost in terms of CPU.

  • Another approach that companies take is to create containers.

  • So we can run our application in some isolated environment that

  • provides all the same guarantees as before and run it directly

  • on the machine's operating system.

  • We can avoid the overhead of a virtual machine.

  • And this tends to be faster to create and more efficient.

  • And here's a diagram that shows how these relate to each other.

  • On the left, you have my service--

  • the blue box-- running on top of a guest operating

  • system, which itself is running on top of another operating system.

  • And there's some overhead because you're running two operating systems

  • at the same time versus the container--

  • we eliminate that extra overhead of that middle operating system

  • and run our application directly on the machine with some protection around it.

  • So the way Facebook has approached these problems is to use containers.

  • For us, the overhead of using virtual machines

  • is too much, and so that's why we use containers.

  • And to do this, we have a program called the Tupperware agent

  • running on every machine at Facebook, and it's

  • responsible for creating containers.

  • And to reproduce the environment, we use container images.

  • And our way of using container images is based on btrfs snapshots.

  • Btrfs is a file system that makes it very fast to create copies

  • of entire subtrees of a file system, and this makes it very fast

  • for us to create new containers.
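
To give a sense of why snapshots make container creation fast, here is a sketch of materializing a container's root filesystem with the standard btrfs command-line tool; the paths are made up for the example, and this is not Tupperware's actual implementation.

```python
# Sketch: create a container's root filesystem as a btrfs snapshot of a base
# image subvolume. Snapshots are copy-on-write, so this is nearly instant
# regardless of image size. Paths are illustrative only.
import os
import subprocess

def create_container_rootfs(image_subvolume, container_id):
    container_dir = f"/containers/{container_id}"
    os.makedirs(container_dir, exist_ok=True)
    rootfs = f"{container_dir}/rootfs"
    subprocess.run(
        ["btrfs", "subvolume", "snapshot", image_subvolume, rootfs],
        check=True,
    )
    return rootfs

# e.g. create_container_rootfs("/images/python-webapp-v42", "web-0")
```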

  • And then for resource isolation, we use a feature of Linux

  • called control groups that allow us to say, for this workload,

  • you're allowed to use this much memory, CPU, whatever resources and no more.

  • If you try to use more than that, we'll throttle you,

  • or we'll kill your workload to prevent it from harming

  • the other workloads on the machine.
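
As a rough sketch of what such a limit looks like at the Linux level, the cgroup v2 filesystem interface lets you write limits into files like memory.max and cpu.max; the group name and the 4 GB / 2 CPU numbers below are just the earlier example, and this is not Tupperware's actual code.

```python
# Sketch: cap a workload at 4 GB of memory and 2 CPUs using the cgroup v2
# filesystem interface (assumes cgroup v2 is mounted at /sys/fs/cgroup and
# that we are running with sufficient privileges).
import os

def limit_container(name, pid, memory_bytes=4 * 1024**3, cpus=2):
    cg = f"/sys/fs/cgroup/{name}"
    os.makedirs(cg, exist_ok=True)
    with open(f"{cg}/memory.max", "w") as f:    # hard memory limit in bytes
        f.write(str(memory_bytes))
    with open(f"{cg}/cpu.max", "w") as f:       # "quota period" in microseconds
        f.write(f"{cpus * 100000} 100000")
    with open(f"{cg}/cgroup.procs", "w") as f:  # move the process into the group
        f.write(str(pid))
```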

  • And for our protection, we use various Linux namespaces.

  • So I'm not going to go over too much detail here.

  • There's a lot of jargon here.

  • If you want more detailed information, we

  • have a public talk from our Systems of Scale Conference in July 2018

  • that will talk about this more in depth.

  • But here's a picture that summarizes how this all fits together.

  • So on the left, you have the Tupperware agent.

  • This is a program that's running on every machine at Facebook that

  • creates containers and ensures that they're all running and healthy.

  • And then to actually create the environment for your container,

  • we use container images, and that's based on btrfs snapshots.

  • And then the protection layer we put around

  • the container includes multiple things.

  • This includes control groups to control resources and various namespaces

  • to ensure that the environments of two containers

  • are sort of invisible to each other, and they can't affect each other.

  • So now that we have one instance of the service in production,

  • how can we get many instances of the service in production?

  • There are new sets of challenges that this brings.

  • So the first challenge you'll have is, OK,

  • how do I start multiple replicas of a container?

  • So one approach you may take is, OK, given one machine,

  • let's just start multiple on that machine.

  • And that works until that machine runs out of resources,

  • and you need to use multiple machines to start multiple copies of your service.

  • So now you have to use multiple machines to start your containers.

  • And now you're going to hit a new set of classic problems

  • because now this is a distributed systems problem.

  • You have to get multiple machines to work together to accomplish some goal.

  • And in this diagram, what is the component that

  • creates the containers on the multiple machines?

  • There needs to be something that knows to tell the first machine to create

  • containers and the second machine to create containers, or to stop containers as well.

  • And now what if a machine fails?

  • So let's say I have two copies of my servers running.

  • The two copies are running on two different machines.

  • For some reason, machine two loses power.

  • This happens in the real world all the time.

  • What happens then?

  • I need two copies of my service running at all times in order

  • to serve all the traffic my service has.

  • But now that machine two is down, I don't have enough capacity.

  • Ideally, I would want something to notice, hey, the copy on machine two

  • is down.

  • I know machine three has available resources.

  • Please start a new copy on machine three for me.

  • So ideally, some component would have all this logic

  • and do all this automatically for me, and this problem

  • is known as a failover.

  • So when real-world failures happen, we

  • ideally want to be able to restart that workload on a different machine,

  • and that's known as a failover.

  • So now let's look at this problem from the caller's point of view.

  • The callers or clients of your service have a different set of issues now.

  • So in the beginning, there's two copies of my servers running--

  • there's a copy on machine one and a copy on machine two.

  • The caller knows that it's on machine one and machine two.

  • Now machine two loses power.

  • The caller still thinks that a copy is running on machine two.

  • It's still going to send traffic there.

  • The requests are going to fail.

  • Users are going to have a hard time.

  • Now let's say there is some automation that knows,

  • hey, machine two is down. Please start another one on machine three.

  • How's the client made aware of this?

  • The client still thinks the replicas are on machine one and machine two.

  • It doesn't know that there's a new copy on machine three.

  • So something needs to tell the client, hey, the copies are now

  • on machine one and machine three.

  • And this problem is known as a service discovery problem.

  • So for service discovery, the question it tries to answer

  • is where is my service running?

  • So now another problem we might face is how do you deploy your service?

  • So remember I said the website is updated every hour.

  • So we have many copies of the service.

  • And it's updated every hour, and users never

  • even notice that it's being updated.

  • So how is this even possible?

  • The key observation here is that you never

  • want to take down all the replicas of your service at the same time

  • because if all of your replicas are down in that time period,

  • requests to Facebook would fail, and then users would have a hard time.

  • So instead of taking them all down at once,

  • one approach you might take is to take down only a percentage at a time.

  • So as an example, let's say I have three replicas of my service.

  • I can tolerate one replica being down at any given moment.

  • So let's say I want to update my containers from blue to purple.

  • So what I would do is take down one, start a new one with the new software

  • update, and wait until that's healthy.

  • And then once that's healthy and traffic is back to normal again,

  • now I can take down the next one and update that.

  • And then once that's healthy, I can take down the next one.

  • And now all my replicas are healthy, and users have not had any issues

  • throughout this whole time period.
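
The loop behind that kind of rolling update can be sketched as follows; the stop, start, and health-check callbacks stand in for the real container machinery and are not a real API.

```python
# Sketch of a rolling update: replace one replica at a time, waiting for the
# new replica to become healthy before touching the next one, so users never
# see more than one replica down.
import time

def rolling_update(replicas, new_version,
                   stop_replica, start_replica, is_healthy):
    for replica in replicas:
        stop_replica(replica)                 # take down one replica
        start_replica(replica, new_version)   # bring it back on the new version
        while not is_healthy(replica):        # wait until it serves traffic again
            time.sleep(1)
```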

  • Another challenge you might hit is what if your traffic spikes?

  • So let's say at noon, the number of users

  • that use your app increases by 2x.

  • And now you need to increase the number of replicas to handle the load.

  • But then at nighttime, when the number of users decreases,

  • it becomes too expensive to run the extra replicas,

  • so you might want to tear down some of the replicas

  • and use those machines to run something else more important.

  • So how do you handle this dynamic resizing of your service

  • based on traffic spikes?
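
One simple way to size the service for traffic spikes is to derive the replica count from observed load, bounded by a floor and a ceiling; all of the numbers below are made up for illustration.

```python
# Sketch of traffic-based resizing: pick a replica count from observed
# requests per second, bounded by a minimum and maximum.
import math

def desired_replicas(current_rps, rps_per_replica=1000,
                     min_replicas=2, max_replicas=20):
    needed = math.ceil(current_rps / rps_per_replica)
    return max(min_replicas, min(max_replicas, needed))

# e.g. at noon: desired_replicas(6000) -> 6; at night: desired_replicas(900) -> 2
```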

  • So here's a summary of some of the challenges

  • you might face when you go from one copy to many copies.

  • First, you have to actually be able to start

  • multiple replicas on multiple machines.

  • So there needs to be something that coordinates that logic.

  • You need to be able to handle machine failures

  • because once you have many machines, machines will fail in the real world.

  • And then if containers are moving around between machines, how are clients

  • made aware of this movement?

  • And then how do you update your service without affecting clients?

  • And how do you handle traffic spikes?

  • How do you add more replicas?

  • How do you spin down replicas?

  • So these are all problems you'll face when you

  • have multiple instances in production.

  • And our approach for solving this at Facebook

  • is we introduce a new component, the Tupperware control

  • plane, that manages the lifecycle of containers across many machines.

  • It acts as a central coordination point between all the Tupperware

  • agents in our fleet.

  • So this solves the following problems.

  • It is the thing that will start multiple replicas across many machines.

  • If a machine goes down, it will notice, and then it

  • will be its responsibility to recreate that container on another machine.

  • It is responsible for publishing service discovery information

  • so that clients will now be made aware that the container is

  • running on a new machine, and it handles deployments

  • of the service in a safe way.

  • So here's how this all fits together.

  • You have the Tupperware control plane, which is this green box.

  • It's responsible for creating and stopping containers.

  • You have this service discovery system.

  • And I'll just draw a cloud as a black box.

  • It provides an abstraction where you give it a service name,

  • and it will tell you the list of machines

  • that my service is running on.

  • So right now, there are no replicas of my service,

  • so it returns an empty list for my service.
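
A toy, in-memory version of that abstraction might look like this; a real service discovery system is itself a replicated, highly available service, but the interface idea is the same.

```python
# Toy sketch of the service discovery abstraction: map a service name to the
# list of machines its replicas are currently running on.
class ServiceDiscovery:
    def __init__(self):
        self._entries = {}  # service name -> set of "host:port" strings

    def register(self, service, address):
        self._entries.setdefault(service, set()).add(address)

    def unregister(self, service, address):
        self._entries.get(service, set()).discard(address)

    def lookup(self, service):
        # Returns an empty list if no replicas are running yet.
        return sorted(self._entries.get(service, set()))

# discovery.lookup("my-service") -> [] before any replicas have started
```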

  • Clients of my service--

  • they want to talk to my replicas.

  • But the first thing they do is they first ask the service discovery system,

  • hey, where are the replicas running?

  • So let's say we start two replicas for the first time on machine

  • one and machine two.

  • So you have two containers running.

  • The next step is to update the service discovery system

  • so that clients know that they're running on machine one and machine two.

  • So now things are all healthy and fine.

  • And now let's say machine two loses power.

  • Eventually, the control plane will notice

  • because it's trying to heartbeat with every agent in the fleet.

  • It sees that machine two is unresponsive for too long.

  • It deems the workload on machine two as dead,

  • and it will update the service discovery system

  • to say, hey, the service is no longer running on machine two.

  • And now clients will stop sending traffic there.

  • Meanwhile, it sees that machine three has available resources

  • to create another copy of my service.

  • It will create a container on machine three.

  • And once the container on machine three is healthy,

  • it will update the service discovery system

  • to let clients know you can send traffic to machine three now as well.

  • So this is how a failover and service discovery are managed

  • by the Tupperware control plane.
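
Here is a much-simplified sketch of that failover logic; the machine, scheduler, and discovery objects and the 60-second threshold are illustrative assumptions, not Tupperware's actual design.

```python
# Simplified sketch of failover in a control plane: if a machine misses
# heartbeats for too long, declare its containers dead, reschedule them on
# machines with free resources, and update service discovery.
import time

DEAD_AFTER_SECONDS = 60  # illustrative threshold

def handle_failures(machines, discovery, scheduler, now=None):
    now = now if now is not None else time.time()
    for machine in machines:
        if now - machine.last_heartbeat < DEAD_AFTER_SECONDS:
            continue  # machine is still heartbeating; nothing to do
        for container in machine.containers:
            # Stop sending traffic to the dead machine...
            discovery.unregister(container.service, machine.address)
            # ...and recreate the container on a machine with free resources.
            new_machine = scheduler.place(container)
            new_machine.start(container)
            discovery.register(container.service, new_machine.address)
```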

  • I'll save questions for the end.

  • So what about deployments?

  • Let's say I have three replicas already running and healthy.

  • Clients know it's on machine 1, 2, and 3.

  • And now I want to push a new version of my service,

  • and I can tolerate one replica being down at once.

  • So let's say I want to update the replica on machine one first.

  • The first thing I want to do is make sure clients stop sending traffic there

  • before I tear down the container.

  • First, I'm going to tell the service discovery

  • system to say machine one is no longer available,

  • and now clients will stop sending traffic there.

  • Once clients stop sending traffic there, I

  • can tear down the container on machine one

  • and create a new version using the new software update.

  • And once that's healthy, I can update the service discovery system

  • to say, hey, clients you can send traffic to machine one again.

  • And in fact, you'll be getting the new version of the service there.

  • The process repeats for machine two.

  • We disable machine two, stop the container, recreate that container,

  • and then publish that information as well.

  • And now we repeat again for machine three.

  • We disable the entry so that clients stop sending traffic there.

  • We stop the container and then recreate the container.

  • And now clients can send traffic there as well.

  • So now at this point, we've updated all the replicas of our service.

  • We've never had more than one replica down,

  • and users are totally unaware that any issue has happened in this process.
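
A sketch of updating a single replica in that discovery-aware order might look like this, reusing the toy ServiceDiscovery interface from earlier; stop_container, start_container, and is_healthy are hypothetical helpers, not a real API.

```python
# Sketch of the discovery-aware update of one replica: drain traffic first,
# then replace the container, then re-advertise it to clients.
import time

def update_replica(service, machine, new_version,
                   discovery, stop_container, start_container, is_healthy):
    discovery.unregister(service, machine)   # clients stop sending traffic here
    stop_container(machine)                  # now it is safe to tear it down
    start_container(machine, new_version)    # bring up the new version
    while not is_healthy(machine):           # wait for it to become healthy
        time.sleep(1)
    discovery.register(service, machine)     # clients may send traffic again
```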

  • So now we are able to start many replicas of our service

  • in production. We can update them.

  • We can handle failovers, and we can scale to handle load.

  • And now let's say your web app gets more complicated.

  • The number of features or products grow.

  • It gets a bit complicated having one service,

  • so you want to separate out responsibilities

  • into multiple services.

  • So your app is now multiple services that you need to deploy to production.

  • And you'll hit a different set of challenges now.

  • So first, to understand the challenges,

  • here's some background about Facebook.

  • Facebook has many data centers in the world.

  • And this is an example of a data center--

  • a bird's eye view.

  • Each building has many thousands of machines serving the website or ads

  • or databases to store user information.

  • And they are very expensive to create--

  • so the construction costs, purchasing the hardware, electricity,

  • and maintenance costs.

  • This is a big deal.

  • This is a very expensive investment for Facebook.

  • And now separately, there are many products at Facebook.

  • And as a result, there are many services to support those products.

  • So given that data centers are so expensive,

  • how can we utilize all the resources efficiently?

  • And also, another problem we have is reasoning about physical infrastructure

  • is actually really hard.

  • There are a lot of machines.

  • A lot of failures can happen.

  • How can we hide as much of this complexity from engineers as possible

  • so that engineers can focus on their business logic?

  • And so the first problem-- how can we effectively use the resources

  • we have-- becomes a bin-packing problem.

  • And the Tupperware logo is actually a good illustration of this.

  • Let's say this square represents one machine.

  • Each container represents a different service or work we want to run.

  • And we want to stack as many containers that

  • will fit as possible onto a single machine

  • to most effectively utilize that machine's resources.

  • So it's kind of like playing Tetris with machines.

  • And yeah, so our approach to solving this

  • is to stack multiple containers onto as few machines as possible, resources

  • permitting.
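
That Tetris-style stacking corresponds to a classic bin-packing heuristic; here is a first-fit sketch over CPU and memory, with made-up capacities and requests.

```python
# Sketch of first-fit bin packing: place each container on the first machine
# that still has enough CPU and memory, so workloads stack onto as few
# machines as possible.
def pack(containers, machines):
    placements = {}
    for c in containers:
        for m in machines:
            if m["free_cpu"] >= c["cpu"] and m["free_mem"] >= c["mem"]:
                m["free_cpu"] -= c["cpu"]
                m["free_mem"] -= c["mem"]
                placements[c["name"]] = m["name"]
                break
        else:
            placements[c["name"]] = None  # no machine has room; needs more capacity
    return placements

# e.g. pack([{"name": "web", "cpu": 2, "mem": 4}],
#           [{"name": "host1", "free_cpu": 8, "free_mem": 16}])
```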

  • And now our data centers are spread out geographically across the world,

  • and this introduces a different set of challenges.

  • So as an example, let's say we have a West Coast data center and an East

  • Coast data center, and my service just so

  • happens to only be running in the East Coast data center.

  • And now a hurricane hits the East Coast and takes down the data center.

  • Our data center loses power.

  • Now suddenly, users of that service cannot use that service until we create

  • new replicas somewhere else.

  • So ideally, we should spread our containers across these two data

  • centers so that when disaster hits one, the service continues to operate.

  • And so the property we would like is to spread replicas

  • across something known as fault domains, and a fault domain

  • is a group of things likely to fail together.

  • So as an example, a data center is a fault domain of machines

  • because they're located geographically together,

  • and they might all lose power at the same time together.
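
One simple way to get that spreading property is to assign replicas to fault domains round-robin; the sketch below uses data center names as the fault domains, which is only one possible grouping.

```python
# Sketch of spreading replicas across fault domains (here, data centers):
# assign replicas round-robin so that losing one domain leaves the others
# still serving.
from itertools import cycle

def spread_replicas(num_replicas, fault_domains):
    assignment = {domain: [] for domain in fault_domains}
    for replica_id, domain in zip(range(num_replicas), cycle(fault_domains)):
        assignment[domain].append(replica_id)
    return assignment

# spread_replicas(4, ["east-coast", "west-coast"])
# -> {"east-coast": [0, 2], "west-coast": [1, 3]}
```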

  • So another issue you might have is hardware fails in the real world.

  • Data center operators need to frequently put machines into repair

  • because their disk drives are failing.

  • The machine needs to be rebooted for whatever reason.

  • The machines need to be replaced with a newer generation of hardware.

  • And so they frequently need to say, I need

  • to take those 1,000 machines into maintenance, but there

  • might be many different teams' services running on those machines.

  • Without automation, the data center operators would need

  • to interact with all those different teams

  • in order to have them safely move the replicas away before taking

  • down all 1,000 machines.

  • So the goal is how can we safely replace those 1,000 machines in an automated

  • way and have as little involvement with service owners as possible?

  • So in this example, a single machine might be running containers

  • from five different teams.

  • And so if we had no automation, five different teams

  • would need to do work to move those containers elsewhere.

  • And this might be challenging for teams because sometimes a container might

  • store local state on the machine, and that local state needs

  • to be copied somewhere else before you take the machine down.

  • Or sometimes a team might not have enough replicas elsewhere.

  • So if you take down this machine, they will actually

  • be unable to serve all their traffic.

  • So there are a lot of issues in how we do this safely.

  • So recap of some of the issues we face here--

  • we want to efficiently use all the resources that we have.

  • We want to make sure replicas are spread out in a safe way.

  • And hardware will fail in the real world,

  • and we want to make repairing hardware as safe and as seamless as possible.

  • And the approach Facebook has taken here is to provide abstractions.

  • We provide abstractions to make it easier for engineers

  • to reason about physical infrastructure.

  • So as an example, we can stack multiple containers on a machine,

  • and users don't need to know how that works.

  • We provide some abstractions to allow users

  • to say I want to spread across these fault domains,

  • and it will take care of that.

  • They don't need to understand how that actually works.

  • And then we allow engineers to specify some policies

  • on how to move containers around in the fleet,

  • and we'll take care of how that actually works.

  • And we provide them a high-level API for them to do that.

  • So here's a recap.

  • We have a service running on our local laptop or developer environment.

  • We want to start running in production for real.

  • And suddenly, we have more traffic than one instance can handle,

  • so we start multiple replicas.

  • And now our app gets more complicated.

  • So instead of just one service, we have many services in production.

  • And all the problems we faced in this talk

  • are problems we face in the cluster management space.

  • And these are the problems my team is tackling.

  • So what exactly is cluster management?

  • The way you can understand cluster management

  • is by understanding the stakeholders in this system.

  • So for much of this talk, we've been focusing on the perspective

  • of service developers.

  • They have an app they want to easily deploy to production as reliably

  • and safely as possible.

  • And ideally, they should focus most of their energy on the business logic

  • and not on the physical infrastructure and the concerns around that.

  • But our service needs to run on real-world machines,

  • and in the real world, machines will fail.

  • And so data center operators--

  • what they want from the system is being able to automatically and safely fix

  • machines and have the system move containers around as needed.

  • And a third stakeholder is efficiency engineers.

  • They want to make sure we're actually using

  • the resources we have as efficiently as possible

  • because data centers are expensive, and we

  • want to make sure that we're getting the most bang for our buck.

  • And the intersection of all these stakeholders is cluster management.

  • The problems we face and the challenges working with these stakeholders

  • are all in the cluster management space.

  • So to put it concisely, the goal of cluster management is

  • make it easy for engineers to develop services

  • while utilizing resources as efficiently as possible

  • and making the services run as safely as possible

  • in the presence of real-world failures.

  • And for more information about what my team is working on,

  • we have a public talk from the Systems of Scale Conference in July,

  • and it's talking about what we're currently working on

  • and what we will be working on for the next few years.

  • So to recap, this is my motto now--

  • coding is the easy part.

  • Productionizing is the hard part.

  • It's very easy to write some code and get a prototype working.

  • The real hard part is to make sure it's reliable, highly available, efficient,

  • debuggable, testable so you can easily make changes in the future,

  • handling all existing legacy requirements

  • because you might have many existing users

  • or use cases, and then be able to withstand requirements changing

  • over time.

  • So let's say you get it working now.

  • What if the numbers just grow by 10x or 100x--

  • will this design still work as intended?

  • So cluster management systems and container platforms

  • make at least the productionizing part a bit easier.

  • Thank you.

  • [APPLAUSE]
