  • [MUSIC PLAYING]

  • ZONGWEI ZHOU: Good morning and good afternoon, researchers and TensorFlow developers.

  • My name is Zongwei Zhou, and I'm from the TensorFlow performance team.

  • Today, I'm going to tell you more about TensorFlow distributed training.

  • Let me start the talk by sharing a piece of user experience that you might have encountered before.

  • Suppose you have an awesome prototype machine learning model that you worked so hard to make run efficiently on a single host with multiple GPUs.

  • Now it's time to really get it running end to end with some more resources.

  • So you start up four of your beefy cloud virtual machines, each with multiple modern GPUs, connected by a fancy 100-gigabit network.

  • With all these resources, you deploy your model and hope to see it run blazingly fast.

  • But wait, why am I only getting 1.5X faster than a single machine while I am using four of them?

  • Yes, I know.

  • This is really frustrating.

  • If you have had a similar experience and similar questions, this talk is for you.

  • In today's talk, you are going to learn how to scale out your TensorFlow 2 Keras model to multiple machines and multiple GPUs.

  • And, using the optimizations we will ship in the upcoming TensorFlow 2.2 release, you are going to dramatically improve your training throughput.

  • I will use one case study, BERT SQuAD, for today's talk.

  • BERT, of course, is the revolutionary NLP model that has been very popular over the past year.

  • I use the fine-tuning task, the SQuAD reading comprehension task, as the example.

  • Your model is going to be given some reading materials and questions, and it needs to answer the questions.

  • The answers can then be evaluated for accuracy.

  • The model we use today is available in the TensorFlow 2 Official Model Garden, where you can download it via this link and try it out yourself.

  • And here is a quick demonstration of the BERT SQuAD training throughput.

  • Starting from the first line, this is the training throughput using TensorFlow 2.1 out of the box.

  • With the three optimizations in TensorFlow 2.2 that I'm going to introduce today, the model now runs about 2.5X faster.

  • Aren't you excited to see these optimizations?

  • But before diving into these optimizations, let me provide some background on how the model is synchronously trained on multiple hosts with multiple GPUs.

  • Here, we are leveraging the native TensorFlow distributed training support, which is MultiWorkerMirroredStrategy.

  • In the figure here, we have two GPU devices on two different hosts.

  • And we are using a simple deep learning model, which has two layers, A and B, with one variable each.

  • Each GPU receives a subset of the training data and computes the forward pass using its local copy of the model variables.

  • It then runs the backward pass to calculate the gradients of each layer.

  • After the gradients are calculated, all the devices start communicating among themselves using an allreduce algorithm to aggregate the gradients.

  • After the gradient aggregation, each device gets back the same set of aggregated gradients and uses them to update its local variables.

  • We call it synchronous because every device needs to aggregate the gradients, get the same set of aggregated gradients, and update its variables before it can proceed to the forward pass of the next training step.
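
To make this setup concrete, here is a minimal sketch, assuming the TF 2.2-era experimental API, of how each worker describes the cluster and creates MultiWorkerMirroredStrategy. The host addresses and the tiny two-layer model are hypothetical stand-ins, not the Model Garden code.

```python
import json
import os

import tensorflow as tf

# Hypothetical two-host cluster; each worker sets its own task index.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},
})

# Collective allreduce across all GPUs on all workers.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are mirrored on every device;
    # their gradients are aggregated with allreduce at each training step.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),  # "Layer A"
        tf.keras.layers.Dense(1),                      # "Layer B"
    ])
```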

  • So with so many phases in deep learning training, how do we actually identify where the bottleneck is?

  • Is it in the forward pass, the backward pass, the gradient aggregation, or the variable updates? How do we know where to optimize?

  • Yes, of course, we are going to use the TensorFlow Profiler, which Quimin just introduced to you today.

  • So let's start a run of the BERT SQuAD model, log in to one VM, fire up TensorBoard, take a profile, and then open the Trace Viewer.
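
As a side note, a profile like this can also be captured programmatically; here is a small sketch assuming the TF 2.2 profiler API (the talk uses the TensorBoard UI's capture button instead). The log directory and the train_step/train_iter names are hypothetical.

```python
import tensorflow as tf

logdir = "/tmp/bert_squad_logs"  # hypothetical log directory

# Record a handful of training steps, then stop; the trace appears under
# the Profile tab in TensorBoard (Trace Viewer tool).
tf.profiler.experimental.start(logdir)
for _ in range(10):
    train_step(next(train_iter))  # hypothetical training step and iterator
tf.profiler.experimental.stop()

# Then launch: tensorboard --logdir /tmp/bert_squad_logs
```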

  • And here is the trace you will get from the Trace Viewer.

  • In this view, you see a pink bar with the number 5 at the bottom of the profile, which means that this is training step number 5 in your profile, and these are all the operations within this training step.

  • The first five rows are events related to the GPU.

  • Specifically, in the first row, you will see the GPU computation, which includes the forward pass, the backward pass, and the variable updates.

  • And you can also see a big blue bar called the NCCL allreduce kernel, which is the gradient aggregation.

  • It's called NCCL because every machine learning framework, including TensorFlow, uses the NVIDIA NCCL allreduce library to aggregate gradients.

  • By now, you may have already spotted the problem, right?

  • This is exactly the power of the TensorFlow Profiler, which makes the problem so visual and obvious.

  • The problem we are facing here is that the gradient aggregation can actually dominate the entire step time.

  • All you see is this big blue box of NCCL allreduce.

  • So how do we resolve this issue and optimize the training performance?

  • Here come the three optimizations.

  • For the first optimization: from the profiler, you can get the total time spent in NCCL allreduce.

  • And since you know your model, you know the total size of the model variables and gradients.

  • From these pieces of information, you can calculate your actual NCCL allreduce throughput.

  • You also know how many machines you are using and how many GPUs are on each machine.

  • Using that information and following NVIDIA's NCCL tuning guidelines, you can calculate the ideal NCCL throughput.

  • In this case, we found that the real NCCL allreduce throughput is actually much smaller than the ideal throughput.

  • So we know that the first optimization is to tune NCCL to fully utilize the underlying cross-host network and improve the performance.
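
As a rough illustration of that calculation, here is a back-of-the-envelope sketch with hypothetical numbers; the bus-bandwidth formula is the one reported by NVIDIA's nccl-tests for allreduce, and the ideal value comes from your network's line rate.

```python
# Hypothetical values pulled from the profiler and the model definition.
gradient_bytes = 1.3e9      # total gradient size (e.g. ~330M float32 params)
allreduce_seconds = 0.5     # total time spent in NCCL allreduce per step
num_gpus = 8                # e.g. 4 hosts x 2 GPUs

# Algorithm bandwidth: bytes moved divided by time.
algbw = gradient_bytes / allreduce_seconds

# Bus bandwidth, as reported by nccl-tests for ring allreduce.
busbw = algbw * 2 * (num_gpus - 1) / num_gpus

print(f"measured bus bandwidth: {busbw / 1e9:.2f} GB/s")
# Compare against the ideal: a 100 Gbit/s NIC is at most 12.5 GB/s per host,
# so a measured value far below that suggests NCCL needs tuning.
```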

  • For the second optimization: if you look at the big blue bar here, it says float32, which means that the gradient aggregation is in the full-precision float32 format.

  • But we know that the model can be trained with mixed precision, so the gradients can actually be in a lower precision like float16.

  • So can we aggregate the gradients in float16, which would effectively cut the network data being transferred in half and thus improve the performance?

  • For the third one: if you noticed, the NCCL allreduce exchanges data across the network, and it doesn't use the GPU itself very much.

  • That's why you see empty space above the blue bar.

  • So can we push the NCCL allreduce forward a little bit, so that it overlaps with the GPU computation happening in the backward pass?

  • This way we can reduce the GPU idle time and improve the training step time.

  • So here are the three ideas, and they all look good.

  • Now I'm going to show you how to implement these ideas using TensorFlow 2.2.

  • The first optimization: NCCL throughput tuning.

  • TensorFlow 2.2 ships with the latest NVIDIA NCCL libraries.

  • And we have run a lot of experiments on Google Cloud VMs to identify a set of recommended parameters to help you reach the peak throughput.

  • Users can prepend those parameters when running their models, such as the NCCL_SOCKET_NTHREADS parameter here, placed before the model's main.py command, as in the sketch below.
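
Here is a hedged sketch of setting such NCCL parameters: NCCL reads these environment variables at initialization, so set them before TensorFlow initializes NCCL, or export them on the command line before the script. The specific values are placeholders, not the official recommendation.

```python
import os

# NCCL environment variables must be set before NCCL is initialized.
# Equivalent shell form: NCCL_SOCKET_NTHREADS=8 python main.py ...
os.environ["NCCL_SOCKET_NTHREADS"] = "8"    # placeholder value; tune for your network
os.environ["NCCL_NSOCKS_PERTHREAD"] = "4"   # another commonly tuned NCCL knob

import tensorflow as tf  # import after the variables are set
```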

  • If you have a different network environment than these Cloud VMs, you might need to run the experiments yourself to find the optimal parameters.

  • This is suboptimal, and we are looking to improve it: we are working with NVIDIA to see if we can autotune the NCCL parameters, so that in a future TensorFlow release users don't need to manually test and set any of these parameters.

  • For now, with the optimal NCCL parameters, we see a dramatic 30% to 40% throughput improvement on BERT SQuAD.

  • That's pretty good for just one optimization.

  • So are you excited to see the second optimization?

  • OK, here it comes.

  • If you remember, we were looking to aggregate the gradients in a lower precision, which is float16.

  • The prerequisite is that the model is already trained with the Keras mixed precision API.

  • We have an online tutorial, at this link, that explains the technical details and usage of this API.

  • Generally speaking, mixed precision uses two types of floating-point number representation during training, float32 and float16.

  • As you can see from the figure, float32 uses more bits, so it can represent a larger range of numbers with higher precision than float16.

  • But float16 has its own advantage: float16 computation on modern GPUs can be 2 to 4X faster than FP32 computation.

  • So this is really a trade-off between model accuracy and computing efficiency, but the mixed precision API tries to give you the best of both worlds.
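
For reference, here is a minimal sketch of turning on Keras mixed precision, assuming the experimental policy API that shipped around TF 2.2 (the tutorial linked above is the authoritative guide); the loss-scaling wrapper is the usual extra step for custom training loops.

```python
import tensorflow as tf
from tensorflow.keras.mixed_precision import experimental as mixed_precision

# Compute in float16, keep variables in float32.
policy = mixed_precision.Policy("mixed_float16")
mixed_precision.set_policy(policy)

# For custom training loops, wrap the optimizer so the loss can be scaled
# to avoid float16 underflow in the gradients.
optimizer = tf.keras.optimizers.Adam()
optimizer = mixed_precision.LossScaleOptimizer(optimizer, loss_scale="dynamic")
```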

  • To show how mixed precision works under the hood, I will show the BERT SQuAD custom training loop code.

  • With this code, the user gets the maximum flexibility to change every aspect of the training.

  • Here is the training loop code in TensorFlow 2.1.

  • With mixed precision, your model variables are still kept in float32 for the best model accuracy, but the computation, including the gradient computation here, is all done in float16.

  • After the gradient computation, the gradients are converted back to FP32, and then the gradient updates are applied to the FP32 model variables.

  • The apply_gradients API here implicitly aggregates the FP32 gradients for you.

  • This is why you saw the gradients being aggregated in FP32 in the previous profile.
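
The actual Model Garden loop is longer, but a simplified sketch of that TF 2.1-style step looks roughly like this (loss scaling is omitted for brevity, and strategy.run was named experimental_run_v2 in 2.1; model, loss_fn, and optimizer are placeholders). The point is that apply_gradients implicitly allreduces the float32 gradients.

```python
@tf.function
def train_step(inputs, labels):
    def step_fn(inputs, labels):
        with tf.GradientTape() as tape:
            predictions = model(inputs, training=True)  # float16 compute
            loss = loss_fn(labels, predictions)
        grads = tape.gradient(loss, model.trainable_variables)
        # Make sure the gradients are float32 before applying them
        # (a no-op cast if they already are).
        grads = [tf.cast(g, tf.float32) for g in grads]
        # Under a distribution strategy, this call also aggregates
        # (allreduces) the float32 gradients across all replicas.
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    return strategy.run(step_fn, args=(inputs, labels))
```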

  • With TensorFlow 2.2, you can now change the custom training loop code a bit to enable gradient aggregation in float16.

  • First, we modified the optimizer's apply_gradients API: we added an argument called all_reduce_sum_gradients.

  • When you set this argument to False, it essentially tells the optimizer API not to do the gradient aggregation for you.

  • Instead, the user can manually call the gradient aggregation API from the distribution strategy to do the allreduce themselves.

  • And the best thing is that, with this manual allreduce, the user can apply any customized gradient operations before and after the allreduce, including what we want to do here: gradient aggregation in float16.

  • So you simply need to cast the gradients to FP16 before the allreduce, do the gradient allreduce, and then cast the gradients back to FP32.

  • That is all you need to do, as in the sketch below.
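
Here is a hedged sketch of that change, following the talk's description: the argument is named all_reduce_sum_gradients as presented here (later TensorFlow releases renamed it to experimental_aggregate_gradients), and model, loss_fn, and optimizer are placeholders.

```python
# Per-replica step function, run under the distribution strategy
# (e.g. via strategy.run).
def step_fn(inputs, labels):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(inputs, training=True))
    grads = tape.gradient(loss, model.trainable_variables)

    # Cast to float16 so the allreduce transfers half as much data.
    grads_fp16 = [tf.cast(g, tf.float16) for g in grads]

    # Manually allreduce the float16 gradients across all replicas.
    replica_ctx = tf.distribute.get_replica_context()
    reduced = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM, grads_fp16)

    # Cast back to float32 before updating the float32 variables.
    reduced_fp32 = [tf.cast(g, tf.float32) for g in reduced]

    # Skip the optimizer's built-in aggregation, since we already did it.
    optimizer.apply_gradients(
        zip(reduced_fp32, model.trainable_variables),
        all_reduce_sum_gradients=False)
```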

  • As you can see, there are a lot of gradient casts here back and forth between FP32 and FP16, which is OK, because the TensorFlow graph optimization will actually remove these redundant casts under the hood.

  • You are left with only one gradient cast, from FP16 to FP32, which you have to do anyway, because you are applying the gradients to the FP32 model variables.

  • So the casts here are more for code readability and will not degrade your performance.

  • I also want to make one more point.

  • Advanced users who want to customize gradient operations, including allreduce in FP16, can use a custom training loop to get maximum flexibility.

  • But for the average user who just wants allreduce in FP16 out of the box, we are working on supporting this in Keras Compile and Fit in upcoming releases.

  • So in a future release, you will be able to get allreduce in FP16 out of the box using the Keras Compile and Fit API.

  • With the FP16 allreduce, we now see that the BERT SQuAD training throughput is further increased by about 35%.

  • We are now at almost 2.2X the throughput of TensorFlow 2.1 so far.

  • But wait, we still have one more optimization.

  • The third optimization: if you still remember, we want to reduce the GPU idle time by overlapping the gradient aggregation with the backward computation on the GPU.

  • Let's take a look at the deep learning model figure we saw previously.

  • In this model, after TensorFlow calculates the gradients of Layer B, we can immediately send out the gradients of Layer B for aggregation.

  • At the same time, the GPU can still be calculating the gradients of Layer A.

  • So as you can see, the network communication of allreducing the gradients of Layer B now runs in parallel with the gradient calculation of Layer A, which lets us use both the GPU and the network resources more efficiently.

  • To enable this overlap, let's make some more changes to the same custom training loop code from the previous optimization.

  • In TensorFlow 2.2, we introduced collective hints to the distribution strategy allreduce API: by passing a bytes_per_pack argument to the allreduce API, you tell TensorFlow to break your model gradients into multiple packs.

  • The TensorFlow runtime then sends out each pack as soon as that pack of gradients becomes available in the backward computation.

  • And it is as easy as it looks, just giving it a bytes_per_pack number, as in the sketch below.
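
A hedged sketch of that change, continuing the step-function sketch above and assuming the experimental CollectiveHints API from TF 2.2 (later versions renamed it to communication options); the 32 MiB pack size is just an example value, to be tuned as described next.

```python
# Break the gradients into ~32 MiB packs so each pack's allreduce can start
# as soon as those gradients are ready, overlapping with the backward pass.
hints = tf.distribute.experimental.CollectiveHints(
    bytes_per_pack=32 * 1024 * 1024)

replica_ctx = tf.distribute.get_replica_context()
reduced = replica_ctx.all_reduce(
    tf.distribute.ReduceOp.SUM, grads_fp16, experimental_hints=hints)
```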

  • But the next question would be, what is the optimal bytes_per_pack to achieve the maximum training throughput?

  • In TensorFlow 2.2, the user needs to do some simple experiments to identify the optimal parameter.

  • NVIDIA provides official NCCL allreduce benchmarks, and users can run these benchmarks across their hosts in their own networking environment, varying the allreduce data size.

  • Typically, what the user will see is that as the allreduce data size increases, the NCCL allreduce throughput also increases.

  • But beyond a certain allreduce data size, the NCCL throughput starts to plateau, which means NCCL has reached the limits of your underlying network.

  • And that data size is exactly the optimal pack size, which is just large enough to saturate your underlying network.

  • If you set the pack size smaller, each gradient pack cannot fully utilize the network, so you are wasting network bandwidth.

  • If you set the gradient pack larger, TensorFlow needs to wait longer for the first pack of gradients, which means you get less overlap.

  • So the optimal pack size is the allreduce data size at which the throughput reaches its plateau (see the toy sketch below).
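
As a toy illustration of that recipe (the measurements here are made up, not real benchmark numbers), you could record the benchmark's throughput at each message size and pick the smallest size that gets close to the plateau:

```python
# Hypothetical results: message size in bytes -> measured bandwidth in GB/s,
# e.g. collected from NVIDIA's nccl-tests at increasing sizes.
measurements = {
    4 * 2**20: 4.1,
    16 * 2**20: 8.9,
    32 * 2**20: 10.8,
    64 * 2**20: 11.1,
    128 * 2**20: 11.2,
}

plateau = max(measurements.values())
# Smallest size within 5% of the plateau: large enough to saturate the
# network, small enough to keep the overlap with the backward pass.
bytes_per_pack = min(
    size for size, bw in measurements.items() if bw >= 0.95 * plateau)

print(f"bytes_per_pack = {bytes_per_pack // 2**20} MiB")  # 32 MiB here
```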

  • But this still requires some work from the user.

  • So for a future release, we are also working on autotuning this pack size, so that the user can get the optimal performance out of the box.

  • So far, we have seen these three optimizations, which together improve the BERT SQuAD training throughput by 2.5X.

  • These optimizations are especially useful in the public cloud, where the networking between the Cloud VMs is limited.

  • We are also working on more improvements, as I mentioned throughout the talk: support for Keras Compile and Fit, and autotuning of the NCCL parameters and the pack size, all coming in future releases of TensorFlow.

  • So stay tuned for these and more improvements.

  • [MUSIC PLAYING]
