
  • SPEAKER: We've been sort of ambitious

  • here in terms of all the different kinds of distribution

  • we want to support.

  • In particular, I think I hear a lot

  • from people who think of tf.distribute.Strategy

  • as a way to get access to an all-reduce programming

  • model, a distribution model.

  • But really we are trying to do something

  • that's much broader in scope.

  • This introduces a number of challenges and is a bit more

  • restrictive in terms of the programming model

  • that users of tf.distribute.Strategy can use.

  • And I'll go into that a little bit more.

  • But I believe it's the right move,

  • because it will allow not just a variety of different use cases,

  • but a broad range of maybe future work.

  • What I am going to talk about is how easy it is to use

  • and how we've made it so that you

  • can switch between distribution strategies,

  • because the strategy code is really not entangled with user

  • code, with a few exceptions, which I'll go into.

  • It's also-- we've done a lot of work

  • to get very good performance.

  • But I am really not going to talk about that today.

  • It's very helpful to go through what these different training

  • architectures look like.

  • So you sort of have some idea of the range of different use

  • cases that we're trying to accommodate on the distribution

  • side.

  • Right now, we are doing only data parallelism

  • in tf.distribute.Strategy.

  • That means we are dividing up our input into a bunch

  • of different pieces.

  • The model itself is replicated across maybe 100 devices

  • or workers.

  • And there is some shared view of the model variables

  • that all of those model copies will update via some mechanism

  • that may be specific to the particular strategy.

  • So the oldest approach here is parameter servers and workers,

  • which was what was used by DistBelief prior to TensorFlow's

  • creation and was the model assumed by TensorFlow libraries

  • prior to tf.distribute.Strategy.

  • For example, Estimator pretty much

  • assumes a parameter server sort of model.

  • And here what is going on is that each worker task

  • is going to be communicating with the parameter server tasks.

  • Each variable is stored on a single parameter server task.

  • You might have multiple parameter server tasks, though,

  • if you have lots of variables, or they're big,

  • since otherwise a single one would become a bottleneck.

  • This model was very well-suited to the early

  • uses of TensorFlow at Google, which were characterized

  • by the fact that we had huge quantities of CPUs in a data center.

  • And we had large embedding matrices.

  • And we would do very sparse lookups

  • in those large embedding matrices.

  • The main thing that's different about the tf.distribute.Strategy

  • version of ParameterServerStrategy

  • is that it has much better support

  • for multiple GPUs on a machine.

  • We call this-- it's a between graph

  • strategy, which is sort of a term from TensorFlow 1,

  • where we talk a lot more about graphs.

  • In TensorFlow 2 land, we don't really

  • have a better term for this.

  • But it's just that we're--

  • each worker is going to be running its own main function

  • that's going to sort of operate independently and schedule

  • operations on itself and the parameter server

  • tasks that it's communicating with.

  • This allows these workers to run asynchronously.

  • And that combination of working between-graph and async

  • makes it easy to tolerate failures.

  • So you can run preemptible workers

  • at lower priority or whatever.

  • As long as your parameter servers stay up,

  • you can keep going without really any interruption.
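
As a rough sketch of how this kind of between-graph, asynchronous setup might be configured; the host addresses are placeholders, and the experimental module path reflects the API of roughly this era of TensorFlow:

```python
import json
import os

import tensorflow as tf

# Each process sees the whole cluster plus its own role via TF_CONFIG.
# Parameter server tasks hold the variables; worker tasks each run their
# own main loop asynchronously. Addresses below are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
        "ps": ["ps0.example.com:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.experimental.ParameterServerStrategy()
```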

  • There is a lot of RPC communication here.

  • And this sometimes shows up as a performance blip

  • at the beginning of a step when it's reading variables,

  • particularly if the runtime isn't clever about

  • the order in which it reads the variables.

  • There are some operations it can do on all the variables,

  • but it really needs the variables for the first layers first.

  • So we've seen some performance hiccups

  • as a result of that.

  • Moving on to what I call the new CentralStorageStrategy, which

  • I think is still in review; basically, I'm going to be talking

  • about the sort of vision of where we want to be.

  • This is the ParameterServerStrategy writ small:

  • it restricts everything to a single machine.

  • And now, we're distributing not across machines so much

  • as across the accelerators within a machine.

  • But again, the implementation is almost identical.

  • It's just the configuration is different,

  • because all of the variables are stored

  • as a single copy on the CPU.

  • And this is also known as PSCPU in the TF CNN

  • benchmark language.

  • So on one machine, there's no need to have multiple clients

  • or worry about asynchrony.

  • This can all run synchronously.

  • So that's a little bit different.

  • And you might use this even on one machine

  • if you have large embeddings that won't fit on your GPU,

  • or if your particular model or hardware

  • is particularly well-suited to this.

  • I think the TF CNN benchmarks have this giant matrix of what's

  • the best way for any particular model and hardware combination.
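
A minimal sketch of opting into this single-machine, variables-on-CPU setup, assuming the experimental CentralStorageStrategy API mentioned above:

```python
import tensorflow as tf

# A single copy of each variable lives on the CPU; the compute is
# replicated across the local accelerators (all visible GPUs by default).
strategy = tf.distribute.experimental.CentralStorageStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)
```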

  • It's something to experiment with.

  • Probably the thing that is most well-known

  • for tf.distribute.Strategy, though, is the all-reduce sync

  • training.

  • So this is great where you have

  • good availability and a good connection between all

  • of your devices, because they're going to be operating basically

  • in lockstep.

  • We're going to do one training step across a whole job.

  • And it's all going to be coordinated

  • via a single client.

  • And so this is used both for TPUs and multi-GPUs

  • within one machine, using mirrored strategy or TPU

  • strategy.

  • And in this strategy, we mirror the variables of the model

  • onto each of the devices.

  • So let's say you had a variable--

  • let's call it A--

  • in your model.

  • And we're going to replicate that by each device

  • having its own copy of the variable locally

  • so that it can just read it.

  • And there's no delay.

  • Together, these form a conceptual mirrored variable;

  • there's actually a MirroredVariable object that we return

  • to the user, instead of these component variables,

  • to store as their model member variables.

  • And then, we keep these in sync by applying identical updates.
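
As a small illustration, creating a variable under a MirroredStrategy scope gives you one of these mirrored variables; the device list here is illustrative:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])

with strategy.scope():
    # Looks like an ordinary tf.Variable to user code, but each device
    # holds its own local copy that the strategy keeps in sync.
    a = tf.Variable(1.0, name="A")

print(a)  # A MirroredVariable wrapping one component variable per device.
```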

  • And this is where we're going to need some way of communicating

  • between the different devices in order

  • to make sure those updates are all the same, which brings us

  • to all-reduce, which is a standard CS

  • algorithm for communicating between devices in a network

  • efficient manner.

  • So here, all means we're going to be communicating

  • from every device to every device.

  • And the reduce means we're going to do basically a sum or mean.

  • There's some other things like max and min

  • and so forth that you can also do.

  • But this is really where the synchronization happens

  • between the devices, and where we're also

  • going to spend a significant amount of work

  • adding new capabilities to the runtime.

  • So how does synchronous training actually look?

  • So let's say we have a model, two layers on two devices.

  • And each of those layers has a variable.

  • So there are really four variable components

  • here, because we're keeping separate copies of each

  • of the two variables on each of the two devices.

  • Each device gets a subset of the training data

  • and does a forward pass, using just those local copies

  • of the variables.

  • And then, in the backward pass, we, again, compute

  • gradients using those local variable values.

  • And then, we take those gradients and aggregate them

  • by using an all-reduce that communicates

  • a single aggregated gradient value out to all the replicas.

  • And each replica then applies that gradient

  • to its local variable.

  • Since the all-reduce produces the same value

  • in all the replicas, the updates are all the same.

  • And the values stay in sync across all

  • the different replicas.

  • So the next forward pass here can start immediately.

  • You know, there's no delay waiting for the values,

  • because by the end of the step we have local copies

  • of the updated values for all variables.

  • And furthermore, we can get some parallelism here

  • by overlapping the all-reduce of one layer's gradients

  • with the computation of the gradients of other layers.

  • So these can be pipelined,

  • keeping both the network communication

  • and the computation parts of whatever hardware

  • you have busy at the same time.

  • That's great for throughput and performance.

  • And this does perform-- we observe

  • that on many, many models, this performs well.

  • And we want to scale this up to multi-machine.

  • And so there's members of our team--

  • [INAUDIBLE] and [INAUDIBLE] have made collective ops

  • and a strategy for doing this on multiple machines,

  • so using these new collective ops.

  • So we have multi-worker mirrored strategy that implements this.

  • This is a little bit experimental.

  • We're working on it.

  • But it works today.

  • You can use it.

  • And it employs these new collective ops

  • with, again, between graph so that each worker is only

  • scheduling the ops that are running on that worker, which

  • is good for scalability.

  • But again, it's the same model as the mirrored strategy, where

  • everything's running in lockstep, synchronizing

  • the gradients on every step.

  • We have a couple of different all-reduce implementations.

  • I think it can use NCCL.

  • But also, you can do a ring all-reduce within

  • a machine.

  • You can also do it across machines,

  • very similar to the multi-GPU situation.

  • Or you can aggregate within each machine,

  • communicate across machines, and then broadcast

  • in a sort of hierarchical way.

  • And in different situations, those--

  • one will perform better than the other.
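
Choosing between those collective implementations might look like this sketch; the CollectiveCommunication option names come from the experimental API of that era:

```python
import tensorflow as tf

# RING does ring all-reduce over the network; NCCL uses NVIDIA's library
# for the GPU portion; AUTO lets the runtime pick.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.NCCL)
```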

  • The last strategy is OneDeviceStrategy, which

  • we currently don't expose.

  • But it's good for when you want

  • to be able to supply a strategy to a function that's going

  • to open that strategy's scope,

  • and you want it to work in a non-distributed context as well.

  • So now, I'm going to talk a little bit about things

  • we don't have today but maybe are coming, possibly

  • depending upon interest.

  • So if you would like us to prioritize some of this stuff,

  • let us know.

  • We have some GitHub issues, where

  • we can collect use cases and interest in these things.

  • So once upon a time, now deprecated,

  • there was the SyncReplicasOptimizer,

  • which combined the sort of parameter server

  • style of variable storage with synchronized variable

  • updates.

  • You might also want to have a sort of a hybrid strategy,

  • where you have mirrored variables in all-reduce

  • for most of your variables.

  • But if you have a large embedding that

  • won't fit on your GPU, maybe you just put that on the CPU,

  • as in the central storage strategy.

  • So we have GitHub issues tracking both

  • of those possible features.

  • Similarly, model parallelism is something

  • where we have some infrastructure in place so that

  • we can eventually add it, but it is not there yet.

  • The idea is you would specify a number of logical devices,

  • and then manually place particular ops

  • or layers or whatever onto those logical devices.

  • And then that computation would be spread across a number

  • of actual physical devices equal to the logical devices

  • times the number of replicas.

  • Now today, if you want to do model parallelism,

  • I'm going to recommend Mesh-TensorFlow.

  • It actually works and is out there.

  • And it has a different model of doing model parallelism, where

  • instead it's splitting operations

  • across lots of devices.

  • And in general, I think that's a good fit for backprop training,

  • because you have to keep all these intermediate values

  • from the forward step around in order to do the backward step.

  • And if you just work it out, I think that's a better fit

  • for that type of training.

  • However, there is one case, where

  • the other sort of parallelism is really natural,

  • which would be to do input pre-processing

  • on a separate job.

  • And that is a good fit, because there's no gradients involved

  • in input processing, so no need to hold

  • on to your intermediate values from the forward pass.

  • If you're interested in this, another GitHub issue--

  • express your interest.

  • Explain your constraints.

  • There are some questions, for example,

  • when we implement this, like, do we

  • need to support complete reproducibility or

  • deterministic allocation?

  • Or maybe we could just have, like,

  • a bunch of queues and threads running as fast as they can.

  • And you get the records in the order that they come.

  • It would be good to know, if you're interested in this,

  • what you need.

  • OK.

  • So that puts us into the next stage.

  • How do these strategies actually work under the covers?

  • When we were doing this, we took

  • a very simple view: we just basically tried

  • to figure out what the code looked

  • like written with MirroredStrategy

  • and with ParameterServerStrategy.

  • And we saw what changed; anything

  • that changed had to be the responsibility of the strategy.

  • And what we learned doing this exercise

  • is that keeping state and updating state

  • are the things that change.

  • And so when you switch strategies, things like,

  • you know, variables, batch norm updates, metrics--

  • all those things sort of need some little bit of support

  • to become more distributed and work well

  • in a distributed setting.

  • And so we need a little bit of help to say, OK.

  • We're about to compute an update for a variable.

  • This is something you're going to need to do specially.

  • So what we do is we've made the TF library, in general,

  • responsible for including any changes needed in order

  • to identify what is going to have to change

  • when you distribute it.

  • We can't really, at this point, just take

  • the sort of sea of ops that you get

  • from maybe saving your model to disk as a GraphDef

  • and magically understand the intent behind all those ops

  • in order to efficiently distribute it.

  • We really need those hints, which the library provides

  • by calling strategy APIs.

  • But the good news is that almost all of that

  • is restricted to TensorFlow libraries

  • and in many cases, just base classes and not subclasses.

  • And I'll get more into that later.

  • So the simplest case is

  • if you're using something like Keras or Estimator,

  • or Keras with Estimator,

  • and we control your training loop in the TensorFlow library.

  • And in that case, we can make the experience very easy.

  • But we are very interested in enabling

  • new use cases, where the user wants

  • to control the training loop.

  • And in that case, you might need to do a little bit more.

  • But hopefully, we also gave you new capabilities and lots

  • of good performance.

  • And so you'll be happy to add a few things to your code

  • in order to distribute your custom training loop.

  • I will talk a bit about also if you're a TensorFlow developer

  • and you want to make your library work with distribution

  • strategy, because it somehow interacts with state

  • that needs to be distributed.

  • I'll talk about that some.

  • I'm probably not going to--

  • I'm not going to talk really directly

  • about making a new strategy.

  • If you want to make a new strategy, you're going to need,

  • at this time, pretty heavy involvement of the [INAUDIBLE]

  • team.

  • And so talk to us.

  • We'll help you.

  • Right now, we don't expect there to be a huge number of them.

  • We have reached out to the Horovod team.

  • And we have had some initial work

  • on making a strategy with them.

  • So there are different API surfaces

  • for each of these use cases.

  • If you're making a new strategy,

  • you have the whole of the library,

  • but there are much smaller slices in these other cases.

  • So in the simplest case of Keras and Estimator,

  • you just basically need to know the constructor

  • of these strategies and this scope thing.

  • So if you just add these two lines to your Keras training

  • loop using compile and fit, then, modulo bugs

  • and feature requests and things like making everything

  • work, the intent is that that just works.

  • You get mirroring for free.

  • You get a big performance win.

  • Basically, when you put

  • everything inside the strategy scope,

  • you're selecting this strategy, saying

  • it's the current strategy,

  • and it's the strategy that should be used for everything.

  • Most important part of that is taking

  • control of variable creation.

  • So when you say, you know,

  • tf.keras.applications.ResNet50, it's

  • creating a bunch of variables.

  • Or maybe it waits until you call model.compile or fit.

  • But whenever it creates the variables,

  • it's inside the strategy scope.

  • And so we control the variable creation.

  • We can use mirrored variables instead of regular variables.

  • And all of these Keras libraries and library function calls

  • have been made distribute-aware.

  • So really, there's not much else for the user to do.
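
Those two lines might look like the following sketch; the synthetic dataset is a placeholder for a real input pipeline, batched with the global batch size:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Variables created in here become mirrored variables.
    model = tf.keras.applications.ResNet50(weights=None)
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy")

# Tiny synthetic stand-in for a real dataset, batched with the global
# batch size; Keras and the strategy split each batch across replicas.
images = tf.zeros([8, 224, 224, 3])
labels = tf.zeros([8], dtype=tf.int64)
train_dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(4)

model.fit(train_dataset, epochs=1)
```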

  • With Estimator, again, it's about two lines;

  • most of the API is just the constructor.

  • So all we really need to do is pass that distribution strategy

  • to the Estimator's RunConfig, either for training

  • or eval or both.

  • And when the Estimator calls the user's model function,

  • it will call it once per replica inside the strategy scope

  • automatically, and as a result, it will use mirrored variables.

  • It also uses a special variable creator,

  • because we're going to call this function once per replica.

  • We want to make sure it uses the same variables each time,

  • maybe different components of the mirrored variables,

  • but the same mirrored variables.

  • And Estimator has been extended to know

  • how to merge the result of all of these model function

  • calls into a single and coherent answer.
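
In code, that is roughly the following; my_model_fn and my_input_fn stand in for whatever model and input functions you already have:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Pass the strategy for training, eval, or both through RunConfig.
config = tf.estimator.RunConfig(train_distribute=strategy,
                                eval_distribute=strategy)

estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=config)
estimator.train(input_fn=my_input_fn, steps=1000)
```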

  • So this sort of is an introduction to--

  • or a good time to talk about one of the concepts

  • that we have inside distribution strategy, which is mirrored

  • versus per-replica values.

  • Mirrored values are the same across all replicas.

  • So mirrored variables are an example

  • of this, where we go through a lot of effort

  • to make sure that they are in sync.

  • But there are other things that can be mirrored.

  • For example, the output of reduction--

  • when we aggregate the gradients, those

  • will also be mirrored across all the replicas.

  • And in general, when we update a mirrored variable,

  • we need to update it with mirrored values.

  • So we know that the updates are all the same.

  • Per-replica values are sort of almost everything else:

  • basically, all the things that are going to be different

  • on every replica,

  • like the different inputs, the different activations,

  • and so forth.

  • And we wrap these values, either the mirrored

  • or per-replica values, into aggregate containers that have

  • a single value per replica;

  • mirrored variables are an example of these container types.

  • They are something we actually can hand to the user;

  • for example, the mirrored variables

  • will be stored in the model as the model's member variables.

  • And we've done some operator overloading so that,

  • in the cases where the operations

  • are safe to do on the mirrored variables,

  • we will do them for you.

  • Not all of the operations will be safe.

  • And I'll go into that later.

  • And then, there's this shared variable creator, which

  • is, in addition to us intercepting variable creation

  • in order to create mirrored variables instead

  • of regular variables, we want to make sure

  • that each call to the model function

  • produces the same variables.

  • And there's a number of heuristics

  • that we use to make sure that when you create

  • a variable in each model call that we

  • return the same variable as was created for the first model

  • call.

  • So going on to custom training loops,

  • we're going to now start getting a little bit more

  • into the programming model that distribution strategy expects

  • all callers to conform to.

  • So-- and it's a little more exposed

  • when you're writing a custom training loop--

  • so you start from data sources, which are typically

  • variables or data sets.

  • I guess, they could also be constants, which is not

  • the most interesting example.

  • But then, each replica (again, this is data parallelism)

  • is going to be running a computation on that data.

  • And then, we have to somehow combine the results.

  • And we have basically one tool here,

  • which is a reduction, or potentially concatenation too,

  • but that turns out not to be the common case.

  • Now, what do you do with that reduction?

  • So if you are using an optimizer, what

  • you do is add up the gradients

  • you get from all the different replicas

  • and then broadcast that reduced value

  • to where all the variables are.

  • Now, in the mirrored strategy case,

  • the variables are in the same place as the computation,

  • so that just becomes an all-reduce.

  • And then we use the result to update the variables.

  • Another really common thing is people want,

  • like, the average loss or something like that.

  • So we just take the reduced value, return it to the user,

  • print it out.

  • Great.

  • A sort of new capability here when you're using distribution

  • strategy is to broadcast the aggregate value

  • from all the replicas back to all the replicas

  • and then do some more computation.

  • So hopefully, this is going to unlock some doors

  • by allowing more complicated distributed

  • algorithms from researchers.

  • So we'll hopefully see the sort of MPI style of distributed

  • computation a lot more now that distribution strategy is

  • available.

  • So this is just a picture representation

  • of what I was just talking about.

  • Now, the dotted arrows are-- or dashed arrows--

  • are per-replica values that--

  • you know, activations and so forth--

  • that can be the input to a reduce in order

  • to become a mirrored value, which, as I said,

  • can be used to update a variable or just return to the user.

  • Now, I'm going to have several slides

  • here, where I'm going to go in detail to an example of using

  • the custom training loop.

  • There's going to be a little bit here

  • that's going to be future work.

  • But it's all in the plan.

  • This example is going to take a few slides.

  • So I can go into some detail.

  • But I'm going to show you how to make a custom training loop.

  • Like in the Keras example before, we create a strategy,

  • open its scope.

  • Not every operation actually has to be inside the scope.

  • But it's much simpler if we just put everything inside the scope

  • since that works.

  • And that's just a simple rule.

  • And in the future, you won't have

  • to worry about what goes in and what goes out.

  • Just put everything inside--

  • works great.

  • So the first thing we do is, in this example,

  • is create a data set.

  • And we're going to pass it the global batch size,

  • just like in the Keras case.

  • It's the strategy's job to split that across replicas.

  • Now for now, we need users to explicitly wrap their data

  • sets using a strategy method we call

  • experimental_distribute_dataset.

  • In the future, we'll do this automatically

  • for any data set iterated inside a strategy scope.

  • If the automatic splitting algorithm

  • is inappropriate for whatever reason,

  • you can manually specify how to split your data

  • set, using a function that takes an input context

  • and returns a data set with a per-replica batch size.
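
A sketch of both paths, automatic wrapping and the manual per-replica function; the strategy and the toy dataset are stand-ins, and the plural method name reflects the experimental API of that era:

```python
import tensorflow as tf

GLOBAL_BATCH_SIZE = 64
strategy = tf.distribute.MirroredStrategy()

# Automatic splitting: batch with the global batch size and let the
# strategy divide each batch across replicas.
dataset = tf.data.Dataset.from_tensor_slices(tf.range(1000.0))
dist_dataset = strategy.experimental_distribute_dataset(
    dataset.batch(GLOBAL_BATCH_SIZE))

# Manual splitting: the function receives an InputContext and must return
# a dataset batched with the per-replica batch size.
def make_dataset(input_context):
    per_replica_batch = input_context.get_per_replica_batch_size(
        GLOBAL_BATCH_SIZE)
    ds = tf.data.Dataset.from_tensor_slices(tf.range(1000.0))
    ds = ds.shard(input_context.num_input_pipelines,
                  input_context.input_pipeline_id)
    return ds.batch(per_replica_batch)

dist_dataset2 = strategy.experimental_distribute_datasets_from_function(
    make_dataset)
```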

  • So just like in the Keras case, again, the scope

  • controls variable creation.

  • So whenever you create your model,

  • it's best to do that inside the scope

  • so that any variables will be created

  • using the policy dictated by the strategy.

  • Originally, we tried making the Keras loss

  • classes automatically scale the loss values according

  • to the number of replicas.

  • We found that that did lead to some user confusion.

  • So for now, we've switched to requiring users

  • to explicitly specify a NONE reduction

  • and do the reduction as part of a later step

  • that you'll see in a future slide.

  • Or alternatively, you can just use any tensor

  • to tensor function directly.
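
With a Keras loss, that might look like this sketch; the strategy is assumed from the earlier dataset example:

```python
with strategy.scope():
    # Reduction.NONE: the loss object returns one loss value per example
    # and leaves all of the averaging to the training step (see the step
    # sketch below, where tf.nn.compute_average_loss divides by the
    # global batch size).
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
```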

  • In addition, optimizers have been made distribute-aware.

  • I'll talk about that in detail later.

  • So here, we define the function with the main computation

  • of our training loop that we're going to perform every step.

  • This function will be called once per replica, at least

  • in the mirrored strategy case.

  • And the model function may create variables, at least

  • in the first call.

  • And that's important, because that's frequently

  • something Keras will do if the input shape was unavailable

  • at the time the model was created.

  • But this is fine since we're going

  • to be running this inside the strategy scope.

  • And variables will still use the strategy's policy.

  • Here's where we're going to average

  • the loss using the global batch size.

  • And that's a good policy independent of how many

  • replicas you have or whatever.

  • For regularization losses, we use the scale regularization

  • loss API, which divides by the number of replicas

  • so that when you add up across all the replicas,

  • you are going to get something that scales

  • with just the variables, not how many replicas you have.

  • By having an explicit call, we hope

  • to reduce the confusion that we saw with our earlier approach,

  • where we tried to automatically scale losses

  • by the number of replicas on user's behalf.

  • We're going to compute a gradient,

  • using ordinary TensorFlow 2 APIs.

  • This gradient is going to be local to each replica

  • and then passed to the optimizer, which

  • is distribute-aware and is going to deal

  • with aggregating gradients, among other things, which

  • I will go into in detail later.
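
Putting that together, a per-replica step might look roughly like this sketch; model, optimizer, loss_object, and GLOBAL_BATCH_SIZE are assumed to have been created inside the strategy scope as in the earlier sketches:

```python
def replica_step(inputs):
    features, labels = inputs
    with tf.GradientTape() as tape:
        logits = model(features, training=True)
        per_example_loss = loss_object(labels, logits)
        # Average by the global batch size for the gradient computation.
        loss = tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
        if model.losses:
            # Regularization losses get divided by the number of replicas.
            loss += tf.nn.scale_regularization_loss(tf.add_n(model.losses))
    # Gradients are local to this replica; the distribute-aware optimizer
    # aggregates them across replicas inside apply_gradients.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    # Return the per-example losses; the epoch loop reduces them later.
    return per_example_loss
```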

  • So those first two slides of the custom training loop

  • were demonstrating computation in cross-replica mode.

  • And this last slide was computation in replica mode.

  • In replica mode, we're operating on ordinary tensors.

  • And we can use the full TensorFlow

  • API to specify the computation that

  • is going to be repeated on each replica device.

  • Cross-replica mode-- instead, you

  • are operating on aggregate values, which

  • are maps from the replica to the tensor or variable

  • on that particular replica.

  • In the future, we're going to actually add a logical device

  • component in order to support model parallelism, where

  • you actually split a model across multiple logical

  • devices.

  • We also have an update mode that is

  • going to be used inside the optimizer

  • to update each variable.

  • It's going to run code on whatever devices

  • that variable resides.

  • In ParameterServerStrategy, this will

  • be a single device, but maybe a different device

  • for each variable, whereas in mirrored strategy,

  • this will run all on the same devices as the computation.

  • Moving on with our example, here we're

  • going to train a whole epoch.

  • We currently recommend running this

  • in graph mode, which we get with the tf.function decorator

  • there at the top.

  • We have tests, though, that verify that our APIs

  • work in eager mode.

  • You'll likely want the performance of running

  • different replicas in parallel.

  • If you want to do some per-step processing that requires eager,

  • we recommend using a tf.py_function.

  • So we're going to iterate over our data set.

  • Right now we are depending upon the call

  • to experimental distribute data set from the earlier slide

  • to split the data set across all the replicas.

  • The plan is to do this automatically

  • whenever you iterate inside a strategy scope.

  • Note that this is particularly tricky

  • in the multi-worker case.

  • In the one machine case, this is just splitting each batch.

  • But with multiple machines, we want

  • to do some sort of decentralized splitting

  • so you're not getting the same input on different workers.

  • In the body of the loop, we're going

  • to transition between cross-replica and replica mode,

  • which involves explicitly using strategy APIs.

  • The replica step function from the earlier slide

  • will be called in replica mode, once per replica

  • on different input shards.

  • There's a tricky situation here when we

  • are at the end of a data set.

  • So we don't have enough data to fill the batches

  • on all replicas.

  • In that situation, we need to pad the input with batch size

  • zero inputs to make sure all replicas perform

  • the same number of steps.

  • This way, all-reduce doesn't freak out

  • waiting for something from a replica that

  • isn't running a step at all.

  • Note that the all-reduce operations that we're

  • going to be doing inside the step are on gradients,

  • and those gradients are going to have

  • the same shape as the variables and not dimension

  • that depends on the batch size.

  • In those replicas where there's a batch-size-zero input,

  • we're going to have a zero gradient,

  • but at least it'll be the right shape.

  • experimental_run_v2 returns a per-replica value combining

  • the return value of replica step from each replica.

  • In this case, each replica is returning a vector

  • with a per example loss value with size

  • equal to the per-replica batch size, where we then

  • use the reduce API to average the loss

  • into an ordinary tensor.

  • By specifying axis equals zero, it

  • will average across the batch dimension

  • and across all the replicas to convert a global batch of loss

  • values into a single scalar.
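
A sketch of that epoch function, using the experimental_run_v2 and reduce APIs named above; strategy, replica_step, and the distributed dataset come from the earlier sketches:

```python
@tf.function
def train_epoch(dist_dataset):
    total_loss = 0.0
    num_batches = 0
    for inputs in dist_dataset:
        # Runs replica_step once per replica, each on its own shard.
        per_replica_losses = strategy.experimental_run_v2(
            replica_step, args=(inputs,))
        # axis=0 averages over the batch dimension as well as across
        # replicas, turning a global batch of losses into one scalar.
        total_loss += strategy.reduce(
            tf.distribute.ReduceOp.MEAN, per_replica_losses, axis=0)
        num_batches += 1
    return total_loss / tf.cast(num_batches, tf.float32)
```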

  • Lastly, here is a simple, standard outer loop.

  • We're going to iterate through all the epochs

  • that we're executing.

  • It runs outside of the function in eager mode,

  • so you have a lot of flexibility to run whatever logic you want.

  • For example, you could put early stopping logic here.

  • You can also after each epoch, checkpoint or maybe eval.

  • This is completely straightforward,

  • since mirrored variables implement the checkpoint saving protocol.

  • So they save the same way as normal variables.

  • We have tests that verify that the resulting checkpoints can

  • be loaded by a non-distributed model and vice versa.

  • So I talked about using strategy.reduce at the end,

  • after the experimental run call.

  • There are some alternatives. strategy.concat--

  • not quite implemented yet--

  • but it's another way of getting values out

  • in a way that doesn't really depend upon how it was split up

  • into different pieces for the data parallel computation.

  • You might also want to just get the results

  • on this local worker.

  • And that's really important if you're

  • going to do some further computation

  • and you're in one of these between-graph settings

  • where you have multiple main loops,

  • and you don't want your main loop using the data

  • from any other worker.

  • The last thing I'm going

  • to cover is making a library that needs to be distribution-aware,

  • because it operates on [INAUDIBLE], and what

  • APIs you might use.

  • So the first, easiest thing to know about is

  • tf.distribute.get_strategy, which is how you get a strategy.

  • And the important thing to know about it

  • is that it always returns you something implementing

  • the strategy API, even if you're not inside a strategy scope,

  • because there is a default strategy that does something

  • moderately sensible, even though it doesn't have the knowledge

  • that comes from being inside a strategy

  • scope with a specific configuration.

  • So distribution strategy aware code is

  • just code that uses this get_strategy API

  • and does its work via those APIs.
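
A tiny sketch of what that looks like in library code; because get_strategy always returns something, the same helper works with or without a real strategy in scope:

```python
import tensorflow as tf

def scale_by_replicas(value):
    # Inside a strategy scope this returns the real strategy; outside one,
    # the default strategy reports a single replica, so this is a no-op.
    strategy = tf.distribute.get_strategy()
    return value / strategy.num_replicas_in_sync
```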

  • And most of the work is already done for you

  • for the normal cases, as long as you're just

  • implementing a new metric, optimizer, or loss.

  • You just have to inherit from the base

  • class that has done all of the work to be distribution-enabled.

  • There are new capabilities available to you,

  • though, if you want to be distribution-aware,

  • and I'll talk a little bit about that, those new APIs

  • and options available to you.

  • But first, I'm just going to sort of explain

  • the implementation.

  • For losses, we provide helpers for per example

  • and regularization losses that are distributed-aware.

  • If you can supply the global batch size,

  • there is no actual need to do anything distribute-specific,

  • and we can just scale the loss by the value that

  • is constant for all batches, including

  • the last partial batch, and weigh each example equally.

  • Otherwise, we compute the per-replica batch size

  • from the tensor shape and scale it by the number of replicas

  • from the current strategy to get the global batch size.

  • This might be slightly wrong in that it'll

  • weight the last batch slightly more

  • but is the best we can do without knowing

  • the global batch size.

  • So now going into Optimizer:

  • for Optimizer, we made much bigger changes.

  • So we're going to look at the apply_gradients call,

  • which is passed a parallel list of gradients and variables.

  • And again, we get the strategy and then

  • we do the thing called a merge call.

  • Now merge call is the sort of secret weapon

  • we developed at the beginning of creating distribute strategy.

  • It's basically our main tool

  • for doing things across replicas when

  • inside a per-replica call.

  • And so we can do synchronization or communication

  • inside a merge call.

  • The way it works is when the mirrored strategy is running

  • something on each replica, it's actually

  • running each of those functions in a separate thread.

  • So each thread corresponds to one replica,

  • but we only are running one replica thread at a time.

  • We use [INAUDIBLE] and so forth so

  • that the first replica runs until completion

  • or it gets to a merge call.

  • If it gets to a merge call we say, OK, pause that thread.

  • Run the next replica thread until it gets up

  • to the same point in the code, and it gets up

  • to that merge call.

  • Then repeat that until we've gotten

  • all of the replica threads up to that merge call point.

  • And we have args from each thread,

  • and now we aggregate all the args

  • produced on all those threads into per-replica values.

  • And then we call the function that was passed to the merge call

  • once, with these aggregate values

  • from all of the merge calls.

  • Now that function can do things

  • like reductions and whatever across all those replicas,

  • and whatever is returned by that function is then

  • returned as the return value of all

  • of the merge calls.

  • And then we resume the first replica thread

  • until it finishes, and so on, and so forth.

  • So this distributed_apply function

  • is the argument, the thing that's passed to the merge call.

  • So it's only executed once, even though apply_gradients is

  • executed for each replica.

  • And the grads and vars value here

  • is now, instead of being a list of variables

  • and a list of gradients, a list of mirrored variables

  • and a list of per-replica gradients.

  • Now, we want those per-replica gradients

  • to be aggregated across all the replicas,

  • so we do a reduction, where we add them up.

  • And this batch_reduce_to will reduce

  • all of the gradients at once, and it'll put each gradient,

  • after aggregation, on the devices where

  • the corresponding variable lives.

  • So in the mirrored case, this is an all reduction,

  • but in the parameter server case,

  • each gradient might be going to a variable living

  • on a different parameter server, potentially.

  • And so it takes the variable as the destination,

  • so it knows where to put the aggregated gradient value.

  • And then with those reduced gradients,

  • we call update, which calls this

  • apply-gradient-to-update-variable function once

  • per device where the variable is.

  • And the gradient values

  • here are now mirrored values, and update validates

  • that they are mirrored values, so

  • that we can be sure that the update is

  • the same across all copies of the variable.
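
A simplified sketch of that pattern, not the real Optimizer code: a gradient-descent-style update invoked from inside a replica context, using merge_call plus the extended batch_reduce_to and update APIs described above:

```python
import tensorflow as tf

def apply_gradients(grads_and_vars):
    # Runs once per replica; merge_call pauses the replica threads and
    # calls _distributed_apply once with the per-replica args combined.
    tf.distribute.get_replica_context().merge_call(
        _distributed_apply, args=(list(grads_and_vars),))

def _distributed_apply(strategy, grads_and_vars):
    # Sum each per-replica gradient and place the result wherever the
    # corresponding variable lives: every replica for mirrored variables,
    # or a parameter server task for ParameterServerStrategy.
    reduced_grads = strategy.extended.batch_reduce_to(
        tf.distribute.ReduceOp.SUM, grads_and_vars)
    variables = [v for _, v in grads_and_vars]
    updates = []
    for grad, var in zip(reduced_grads, variables):
        # update() runs the assignment once per device holding the
        # variable and expects the gradient to be a mirrored value.
        # 0.01 is a stand-in learning rate for this sketch.
        updates.append(strategy.extended.update(
            var, lambda v, g: v.assign_sub(0.01 * g), args=(grad,)))
    return updates
```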

  • So this is a sort of subset of the programming model,

  • the state transition diagram that you can see,

  • restricted to just normal variable updates.

  • This introduces another concept, which

  • is the sort of locality of the devices

  • where all these things are running.

  • Replica devices are the devices where we're

  • doing most of the computation.

  • Variable devices are devices where the variable lives,

  • which may be one or many, depending

  • on if it's mirrored or not.

  • And the reduce_to API is the bridge that

  • gets us from one to the other.

  • We also have this non-slot concept

  • that is needed in order to match the behavior of the Adam

  • optimizer.

  • So we have the standard pattern that is generally

  • how we update state.

  • The merge_call is taking tensors for each replica

  • and producing per-replica containers.

  • Then reduce_to is turning those

  • per-replica containers into aggregate values that are mirrored.

  • And then we update, taking the mirrored values

  • to update the variables.

  • And we know, because they have the mirrored type,

  • that the updates are identical.

  • So I see that we're a little bit low on time,

  • so I'm just going to breeze through this.

  • This is the fact that we've overloaded the operations

  • on mirrored variables so that, for example,

  • assign operations will do all of those steps for you

  • as long as you set an aggregation.

  • That aggregation ends up being the reduce operation

  • that we do.

  • One of the new things you can do now with distribution

  • strategy, though, is to say we're going

  • to actually opt into a different model

  • for how we're going to update the variables.

  • Instead of syncing on write, like we

  • do for mirrored variables, we're going

  • to sync on read, which means these variables are

  • going to be totally independent at write time.

  • And we're going to keep writing, writing, writing, writing,

  • writing to them, assuming reads are rare.

  • And then when we read, that's when

  • we do the reduction to aggregate the value

  • across all the different variables

  • on the different replicas.

  • These aren't trainable, but they're

  • really great for at least a couple of cases we've seen.

  • One is metrics, and another is batch norm statistics

  • that are used only in eval.

  • So we have the synchronization and aggregation arguments

  • that we've added as new APIs in order

  • to access these new capabilities.

  • And so you can set, for example, synchronization to ON_READ

  • in order to get this behavior for metrics and batch norm statistics.

  • And the aggregation can be set for sync-on-read variables,

  • but really for mirrored variables too,

  • and it lets you say what reduction

  • to do when you do operations.

  • You don't need to set the aggregation, though,

  • if you're only using optimizers.

  • Optimizers don't rely on this.
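
Those two arguments look like this when you create variables directly; a sketch, since the metric and batch norm base classes normally set them for you (strategy is assumed from earlier):

```python
with strategy.scope():
    # Sync-on-read: each replica writes its own copy independently, and
    # the copies are summed only when the variable is read.
    total = tf.Variable(
        0.0,
        trainable=False,
        synchronization=tf.VariableSynchronization.ON_READ,
        aggregation=tf.VariableAggregation.SUM)

    # A mirrored (sync-on-write) variable; the aggregation says how
    # assignments from different replicas get combined.
    bias = tf.Variable(
        0.0,
        synchronization=tf.VariableSynchronization.ON_WRITE,
        aggregation=tf.VariableAggregation.MEAN)
```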

  • I'm just going to breeze through this.

  • OK, so here's an example.

  • The Metric base class just changes the add_weight method

  • to change the defaults so that SUM is the default

  • aggregation [INAUDIBLE] and ON_READ is the default

  • synchronization.

  • And then when you implement a subclass of it,

  • you'll see we just compute the numerator and denominator

  • in the update method, and those use variables

  • created using add_weight.

  • This is just updating the per-replica values

  • within each replica, and we may be doing a parallel eval

  • on a bunch of different replicas.

  • And then at the end, we do this result function,

  • and only then do we aggregate across replicas.

  • And so we get a total numerator, a total denominator, and divide

  • and we get a value.

  • And you'll notice that we didn't have to use

  • anything strategy specific.

  • It's purely the operator overloading all the logic

  • in the sub needed for distributed enabling

  • is in the parent class.
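
A sketch of such a metric subclass; the class itself is hypothetical example code, but the pattern (add_weight in the constructor, per-replica updates, aggregation only when result reads the variables) follows what is described above:

```python
import tensorflow as tf

class MeanValue(tf.keras.metrics.Metric):
    def __init__(self, name="mean_value", **kwargs):
        super(MeanValue, self).__init__(name=name, **kwargs)
        # add_weight defaults to SUM aggregation and ON_READ
        # synchronization in the Metric base class, so nothing
        # distribution-specific is needed here.
        self.numerator = self.add_weight("numerator", initializer="zeros")
        self.denominator = self.add_weight("denominator", initializer="zeros")

    def update_state(self, values):
        # Runs per replica and only touches that replica's local copies.
        values = tf.cast(values, tf.float32)
        self.numerator.assign_add(tf.reduce_sum(values))
        self.denominator.assign_add(tf.cast(tf.size(values), tf.float32))

    def result(self):
        # Reading the ON_READ variables aggregates them across replicas,
        # giving a total numerator and denominator before the division.
        return self.numerator / self.denominator
```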

  • You could imagine layers also wanting to use all-reduce.

  • Here's some code that will do it.

  • This does introduce synchronization across replicas

  • and could be used optionally for batch norm layers.

  • Batch norm layers are

  • one of the very few cases where you will have interaction

  • between batch elements, and so if you

  • don't have enough examples on each replica,

  • you want to communicate across replicas to get good batch

  • norm statistics.

  • They also use sync-on-read for the eval statistics,

  • as mentioned earlier.
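
The cross-replica piece of such a layer could be sketched like this, a simplified stand-in for real sync batch norm code, assuming the ReplicaContext.all_reduce API:

```python
import tensorflow as tf

def cross_replica_mean(local_value):
    replica_ctx = tf.distribute.get_replica_context()
    if replica_ctx is None:
        # Not running inside a replica context; nothing to synchronize.
        return local_value
    # Averages this replica's statistics with every other replica's, so
    # each replica sees moments computed over the global batch.
    return replica_ctx.all_reduce(tf.distribute.ReduceOp.MEAN, local_value)
```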

  • A few more obscure methods: we generally

  • hide these in the strategy extended API

  • so users don't have to see them.

  • So variable_created_in_scope is used for error checking

  • by Keras.

  • colocate_vars_with is used for slot variables in optimizers.

  • That's it.

  • Sorry I don't have a great conclusion,

  • except for if you have more to ask about this,

  • we have a component label on GitHub called dist-strat.

  • Ask away.

  • [APPLAUSE]

  • AUDIENCE: I have a question.

  • Say I want to implement a custom layer.

  • When I call add_weights, what do I

  • need to care about to make it work

  • with distribution strategy?

  • SPEAKER: Usually nothing.

  • The defaults are all set up in the base class

  • so that unless there's something weird about your layer

  • where like, it has interaction between different batch

  • elements, like in the batch norm case,

  • your layer probably just doesn't have to care.

  • AUDIENCE: I see.

  • SPEAKER: Because all that logic is in the base class.

  • AUDIENCE: I see.

  • And is mirrored strategy only have--

  • does it only have mirrored variables?

  • SPEAKER: Right now, you will get mirrored variables by default

  • if you use mirrored strategy, but you

  • can opt into these weird sync-on-read variables

  • by setting options away from their default values.

  • AUDIENCE: OK, is optimizer.iterations

  • finding [INAUDIBLE] sync-on-read variable?

  • SPEAKER: I think it's supposed to be a mirrored variable.

  • AUDIENCE: It is a mirrored variable.

  • SPEAKER: It should be.

  • It should be mirrored variable, unless there's a bug.

  • AUDIENCE: I see.

  • AUDIENCE: It is a mirrored variable.

  • SPEAKER: OK, great.

  • We have our expert here to tell us.

  • AUDIENCE: Cool.

  • AUDIENCE: Do parameters servers work with accelerators?

  • Because you can imagine in the limit

  • that your accelerators have really, really

  • high interconnect, which I know a lot of them

  • do and are moving towards, that like,

  • mirrored would be too conservative.

  • And you'd like to say, round-robin your convolution

  • variables around the accelerators.

  • Like, can you do that, or is that planned?

  • SPEAKER: If you have exotic ideas, we can talk to you.

  • Right now, we are more focused on more basic use cases.

  • For accelerators with good interconnect,

  • we see that the all-reduce style has been more efficient.

  • The problem you might run into, though, is either running out

  • of memory or if your steps take widely different times

  • because, like, maybe you're doing an RNN

  • with widely different steps, that the synchronization

  • overhead of saying that we're doing everything in lockstep

  • might be a high cost.

  • AUDIENCE: In particular, I was thinking

  • about the memory of having [INAUDIBLE] copy on

  • [INAUDIBLE].

  • SPEAKER: Yeah, it's sort of hard to avoid that.

  • You can do model parallelism to reduce the memory,

  • but Mesh-TensorFlow is probably a better path there.
