
  • My name is Priya.

  • And this is Anjali.

  • We're both software engineers on the TensorFlow team,

  • working on distributed TensorFlow.

  • We're so excited to be here today to tell you about distributed TensorFlow training.

  • Let me grab the clicker.

  • Okay.

  • Hopefully, most of you know what TensorFlow is.

  • It's an open source machine learning framework used extensively both inside and outside Google.

  • For example, if you've tried the Smart Compose feature that was launched a couple of days ago, that feature uses TensorFlow.

  • TensorFlow allows you to build, train, and predict using neural networks such as this one.

  • In training, we learn the parameters of the network using data.

  • Training complex neural networks with large amounts of data can often take a long time.

  • In the graph here, you can see the training time on the x-axis and the accuracy of predictions on the y-axis.

  • This is taken from training an image recognition model on a single GPU.

  • As you can see, it took more than 80 hours to get to 75% accuracy.

  • If you have some experience running complex machine learning models, this might sound rather familiar to you.

  • And it might make you feel something like this.

  • If your training takes only a few minutes to a few hours, you'll be productive and happy, and you can try out new ideas faster.

  • When it starts to take a few days, maybe you can still manage and run a few things in parallel.

  • When it starts to take a few weeks, your progress will slow down, and it becomes expensive to try out every new idea.

  • And when it starts to take more than a month, it's not even worth thinking about.

  • And this is not an exaggeration.

  • Training complex models such as the ResNet-50 that we'll talk about later in the talk can take up to a week on a single but powerful GPU, like a Tesla P100.

  • So a natural question to ask is: how can we make training fast?

  • There are a number of things you can try.

  • You can use a faster accelerator, such as a TPU, or Tensor Processing Unit.

  • I'm sure you've heard all about them in the last couple of days here.

  • Or your input pipeline might be the bottleneck, so you can work on that and make it faster.

  • There are a number of guidelines on the TensorFlow website that you can follow to improve the performance of your training.

  • In this talk, we'll focus on distributed training, that is, running training in parallel on multiple devices, such as CPUs, GPUs, or TPUs, in order to make your training faster.

  • With the techniques that we'll talk about in this talk, you can bring down your training time from weeks to hours, with just a few lines of code and a few powerful GPUs.

  • In the graph here, you can see the images per second processed while training an image recognition model.

  • As you can see, as we increase the number of GPUs from one to four to eight, the images per second processed almost double every time.

  • We'll come back to these performance numbers later with more details.

  • So before diving into the details of how you can get that kind of scaling in TensorFlow, first I want to cover a few high-level concepts and architectures in distributed training.

  • This will give us a strong foundation with which to understand the various solutions. Since our focus is on training today, let's take a look at what a typical training loop looks like.

  • Let's say you have a simple model like this with a couple of hidden layers.

  • Each layer has a bunch of weights and biases, also called the model parameters, or trainable variables.

  • A training step begins with some processing on the input data. We then feed this input into the model and compute the predictions in the forward pass.

  • We then compare the predictions with the input labels to compute the loss.

  • Then, in the backward pass, we compute the gradients, and finally we update the model's parameters using these gradients.

  • This whole process is known as one training step, and the training loop repeats this training step until you reach the desired accuracy.
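To make the step concrete, here is a minimal sketch of one such training step, written with today's TensorFlow eager/Keras APIs (the talk itself predates these). The toy model, loss, and optimizer are hypothetical choices for illustration, not the speakers' code.

```python
import tensorflow as tf

# Hypothetical toy model, loss, and optimizer for illustration only.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

def training_step(features, labels):
    with tf.GradientTape() as tape:
        predictions = model(features)          # forward pass
        loss = loss_fn(labels, predictions)    # compare predictions with the labels
    # Backward pass: compute gradients of the loss with respect to the parameters.
    grads = tape.gradient(loss, model.trainable_variables)
    # Update the model's parameters using these gradients.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# The training loop simply repeats training_step over batches of input data.
```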

  • Let's say you begin your training on a simple machine under your desk with a multi-core CPU.

  • Luckily, TensorFlow handles scaling onto a multi-core CPU for you automatically.

  • Next, you may speed things up by adding an accelerator to your machine, such as a GPU or a TPU.

  • With distributed training, you can go even further.

  • We can go from one machine with a single device, to one machine with multiple devices, and finally to multiple machines with possibly multiple devices each, connected over the network.

  • With a number of techniques, it's eventually possible to scale to hundreds of devices, and that's indeed what we do in a lot of Google systems.

  • By the way, in the rest of this talk, we'll use the terms device, worker, or accelerator to refer to processing units such as GPUs or TPUs.

  • So how does distributed training work?

  • Like everything else in software engineering, there are a number of ways to go about it when you think about distributing your training.

  • Which approach you pick depends on the size of your model, the amount of training data you have, and the available devices.

  • The most common architecture in distributed training is what is known as data parallelism.

  • In data parallelism, we run the same model and computation on each worker, but with a different slice of the input data.

  • Each device computes the loss and the gradients.

  • We use these gradients to update the model's parameters, and the updated model is then used in the next round of computation.

  • There are two common approaches when you think about how to update the model using these gradients.

  • The first approach is what is known as the asynchronous parameter server approach.

  • In this approach, we designate some devices as parameter servers, as shown in blue here.

  • These servers hold the parameters of your model.

  • Others are designated as workers, as shown in green here.

  • Workers do the bulk of the computation.

  • Each worker fetches the parameters from the parameter server.

  • It then computes the loss and the gradients.

  • It sends the gradients back to the parameter server, which then updates the model's parameters using these gradients.

  • Each worker does this independently, so this allows us to scale this approach to a large number of workers.

  • This has worked well for many models at Google, where training workers might be preempted by high-priority production jobs, where there is asymmetry between the different workers, or where machines might go down for regular maintenance.

  • And all of this doesn't hurt the scaling, because the workers are not waiting on each other.

  • The downside of this approach, however, is that workers can get out of sync. They compute the gradients on stale parameter values, and this can delay convergence.

  • The second approach is what is known as synchronous all-reduce.

  • This approach has become more common with the rise of fast accelerators such as TPUs or GPUs.

  • In this approach, each worker has its own copy of the parameters.

  • There are no special parameter servers.

  • Each worker computes the loss and gradients based on a subset of the training samples.

  • Once the gradients are computed, the workers communicate among themselves to propagate the gradients and update the model's parameters.

  • All the workers are synchronized, which means that the next round of computation doesn't begin until each worker has received the updated gradients and updated its model.

  • When you have fast devices in a controlled environment, the variance of step time between the different workers can be small.

  • When combined with strong communication links between the different devices, the overall overhead of synchronization can be small.

  • So whenever practical, this approach can lead to faster convergence.

  • A class of algorithms called all-reduce can be used to efficiently combine the gradients across the different workers.

  • All-reduce aggregates the values from the different workers, for example by adding them up, and then copies them back to the different workers.

  • It's a fused algorithm that can be very efficient, and it can reduce the overhead of synchronizing gradients by a lot.

  • There are many all-reduce algorithms available, depending on the type of communication available between the different workers.

  • One common algorithm is what is known as ring all-reduce.

  • In ring all-reduce, each worker sends its gradients to its successor on the ring and receives gradients from its predecessor.

  • There are a few more such rounds of gradient exchanges.

  • I won't be going into the details here, but at the end of the algorithm, each worker has received a copy of the combined gradients.

  • Ring all-reduce uses network bandwidth optimally, because it uses both the upload and the download bandwidth at each worker.

  • It can also overlap the gradient computation at the lower layers of the network with the transmission of gradients at the higher layers, which means it can further reduce the training time.

  • Ring all-reduce is just one approach.

  • Some hardware vendors supply specialized implementations of all-reduce for their hardware.

  • For example, NVIDIA's NCCL. We have a team at Google working on fast implementations of all-reduce for various device topologies.

  • The bottom line is that all-reduce can be fast when working with multiple devices on a single machine, or a small number of machines. A toy illustration of the algorithm follows below.
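To make the data movement concrete, here is a toy, single-process NumPy simulation of ring all-reduce: a reduce-scatter phase followed by an all-gather phase. It is only an illustration under simplifying assumptions, not NCCL's or TensorFlow's implementation.

```python
import numpy as np

def ring_allreduce(worker_grads):
    """Simulate ring all-reduce over a list of equal-length per-worker gradient vectors."""
    n = len(worker_grads)
    # Each worker splits its gradient vector into n chunks.
    chunks = [np.array_split(g.astype(float), n) for g in worker_grads]

    # Reduce-scatter: in round t, worker i sends chunk (i - t) % n to its successor,
    # which adds it to its own copy of that chunk.
    for t in range(n - 1):
        sends = [chunks[i][(i - t) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[i][(i - 1 - t) % n] += sends[(i - 1) % n]
    # Now worker i holds the fully reduced chunk (i + 1) % n.

    # All-gather: the completed chunks are circulated around the ring so that
    # every worker ends up with the full combined gradient.
    for t in range(n - 1):
        sends = [chunks[i][(i + 1 - t) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[i][(i - t) % n] = sends[(i - 1) % n]

    return [np.concatenate(c) for c in chunks]

# Example: 4 workers, each with a gradient vector; every worker ends with the sum.
grads = [np.full(8, fill_value=w) for w in range(4)]
results = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in results)
```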

  • So given these two broad architectures in data parallelism, you may be wondering which approach you should pick.

  • There isn't one right answer.

  • The parameter server approach is preferable if you have a large number of not-so-powerful or not-so-reliable machines,

  • for example, if you have a large cluster of machines with just CPUs.

  • The synchronous all-reduce approach, on the other hand, is preferable if you have fast devices with strong communication links, such as TPUs or multiple GPUs on a single machine.

  • The parameter server approach has been around for a while, and it has been supported well in TensorFlow.

  • TPUs, on the other hand, use the all-reduce approach out of the box.

  • In the next section of this talk, we'll show you how you can scale your training using the all-reduce approach on multiple GPUs with just a few lines of code.

  • Before I get into that, I just want to mention another type of distributed training known as model parallelism, which you may have heard of.

  • A simple way to think about model parallelism is when your model is so big that it doesn't fit in the memory of one device, so you divide the model into smaller parts and do those computations on different workers with the same training samples. For example, you could put different layers of your model on different devices.

  • These days, however, most devices have big enough memory that most models can fit in it.

  • So in the rest of this talk, we'll continue to focus on data parallelism.

  • Now that you're armed with the fundamentals of distributed training architectures, let's see how you can do this in TensorFlow.

  • As I already mentioned, we're going to focus on scaling to multiple GPUs with the all-reduce architecture.

  • In order to do so easily, I'm pleased to introduce the new Distribution Strategy API.

  • This API allows you to distribute your training in TensorFlow with very little modification to your code.

  • With the Distribution Strategy API, you no longer need to place your ops or parameters on specific devices.

  • You don't need to worry about structuring your model in a way that the gradients or losses across devices are aggregated correctly; Distribution Strategy does that for you.

  • It is easy to use and fast to train.

  • Now let's look at some code to see how we can use this API.

  • In our example, we're going to be using TensorFlow's high-level API called Estimator.

  • If you've used this API before, you might be familiar with the following snippet of code to create a custom estimator. It requires three arguments.

  • The first one is a function that defines your model: the parameters of your model, how you compute the loss and the gradients, and how you update the model's parameters.

  • The second argument is a directory where you want to persist the state of your model, and the third argument is a configuration called RunConfig, where you can specify things like how often you want to checkpoint, how often summaries should be saved, and so on. In this case, we use the default RunConfig.

  • Once you've created the estimator, you can start your training by calling the train method with the input function that provides your training data. A sketch of this snippet follows below.
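Here is a minimal sketch of that Estimator setup; model_fn, input_fn, and the model directory are hypothetical stand-ins for the model function, input function, and path shown on the slide, not the speakers' exact code.

```python
import tensorflow as tf

# model_fn defines the model, the loss, and how the parameters are updated;
# input_fn provides the training data. Both are hypothetical placeholders here.
classifier = tf.estimator.Estimator(
    model_fn=model_fn,                # function that defines your model
    model_dir="/tmp/resnet_model",    # where the model's state is persisted
    config=tf.estimator.RunConfig(),  # default RunConfig
)

classifier.train(input_fn=input_fn)   # start training
```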

  • So given this code to do the training on one device, how can you change it to run on multiple GPUs?

  • You simply need to add one line of code: instantiate something called MirroredStrategy and pass it to the RunConfig call.

  • That's it.

  • That's all the code changes you need to scale this code to multiple GPUs, as sketched below.
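A hedged sketch of that one-line change follows. At the time of this talk MirroredStrategy lived in tf.contrib.distribute (in later TensorFlow releases it is tf.distribute.MirroredStrategy), and model_fn and input_fn remain hypothetical placeholders.

```python
import tensorflow as tf

# The one-line change: create a MirroredStrategy and pass it to RunConfig.
strategy = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=strategy)

classifier = tf.estimator.Estimator(
    model_fn=model_fn,               # unchanged model function
    model_dir="/tmp/resnet_model",
    config=config,                   # the only difference from the previous snippet
)
classifier.train(input_fn=input_fn)  # unchanged training call
```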

  • MirroredStrategy is a type of distribution strategy, the API that I just mentioned.

  • With this API, you don't need to make any changes to your model function, your input function, or your training loop.

  • You don't even need to specify your devices.

  • If you want to run on all available devices, it will automatically detect that and run your training on all available GPUs.

  • So with that said, those are all the code changes you need.

  • This API is available in tf.contrib now, and you can try it out today.

  • Let me quickly talk about what MirroredStrategy does.

  • MirroredStrategy implements the synchronous all-reduce architecture that we talked about, out of the box for you.

  • In MirroredStrategy, the model's parameters are mirrored across the various devices, hence the name MirroredStrategy.

  • Each device computes the loss and gradients based on a subset of the input data.

  • The gradients are then aggregated across the workers, using an all-reduce algorithm that is appropriate for your device topology.

  • As I already mentioned, with MirroredStrategy you don't need to make any changes to your model or your training loop.

  • This is because we have changed the underlying components of TensorFlow to be distribution-aware, for example the optimizer, batch norm, summaries, and so on.

  • You don't need to make any changes to your input function either, as long as you're using the recommended TensorFlow Dataset API.

  • Saving and checkpointing work seamlessly, so you can save with one or no distribution strategy and resume with another.

  • And summaries work as expected as well, so you can continue to visualize your training in TensorBoard.

  • MirroredStrategy is just one type of distribution strategy, and we're working on a few others for a variety of use cases.

  • I'll now hand it off to Anjali to show you some cool demos and performance numbers.

  • Thanks, Priya, for the great introduction to MirroredStrategy.

  • Before we run the demo, let's get familiar with a few configurations.

  • I'm going to be running the ResNet-50 model from the TensorFlow model garden.

  • ResNet-50 is an image classification model that has 50 layers.

  • It uses skip connections for efficient gradient flow.

  • The TensorFlow model garden is a repo that has a collection of different models written in TensorFlow's high-level APIs.

  • So if you're new to TensorFlow, this is a great resource to start with.

  • I'm going to be using the ImageNet dataset as input to model training.

  • The ImageNet dataset is a collection of over a million images that have been categorized into 1,000 labels.

  • I'm going to instantiate an n1-standard instance on GCE and attach eight NVIDIA Tesla V100, or Volta, GPUs.

  • Let's run the demo.

  • Now, as I mentioned, I'm creating an n1-standard instance and attaching eight NVIDIA Tesla V100, or Volta, GPUs.

  • I also attach an SSD disk.

  • This contains the ImageNet dataset, which is the input to our model training.

  • To run the ResNet-50 model, we need to install a few drivers and pip packages, and here is a gist with all the commands required.

  • I'm going to make this gist public so you can set up an instance yourself and try running the model.

  • Let's open an SSH connection to the instance by clicking on the button here. That should bring up a terminal like this.

  • So I've already cloned the model garden repo.

  • We're going to be running this command inside the ResNet directory.

  • We're going to run the ImageNet main file.

  • So we're using the ImageNet dataset, with a batch size of 1024, or 128 per GPU.

  • Our model directory is going to point to the GCS bucket that's going to hold the checkpoints and summaries that we want to save.

  • We point our data directory to the SSD disk, which has the ImageNet dataset, and the number of GPUs is eight, over which we want to distribute and train our model.

  • So let's run this model now.

  • And as the model is starting to train, let's take a look at some of the code changes that are involved in changing the ResNet model function.

  • So this is the ResNet main function in the model garden repo.

  • First, we instantiate the MirroredStrategy object.

  • Then we pass it to the RunConfig as part of the train_distribute argument.

  • We create an Estimator object with the RunConfig, and then we call train on this Estimator object, and that's it.

  • Those are all the code changes you need to distribute the ResNet model.

  • Let's go back and see how our training is going.

  • So we've run it for a few hundred steps. At the bottom of the screen, you should see a few metrics: the loss, which is decreasing, steps per second, and the learning rate.

  • Let's look at TensorBoard.

  • So this is from a run where I've run the model for 90,000 steps, a little over that.

  • So it's not the run we just started.

  • The orange and red lines are the training and evaluation losses, so as the number of steps increases, you see the loss decreasing.

  • Let's look at the evaluation accuracy, and this is when we're training ResNet-50 over eight GPUs.

  • So we see that around 91,000 steps, we're able to achieve 75% accuracy.

  • Let's see what this looks like when we run it on a single GPU.

  • So let's toggle the TensorBoard buttons on the left and look at the train and evaluation loss curves when we train our model on one GPU.

  • So the blue lines are one GPU, and the red and orange are eight, and you can see that the loss doesn't decrease as rapidly as it does with eight GPUs.

  • Here are the evaluation accuracy curves. We're able to achieve a higher accuracy when we distribute our model across eight GPUs as opposed to one.

  • Let's compare using wall time.

  • So we've run the same model for the same amount of time.

  • And when we run it over multiple GPUs, we're able to achieve higher accuracy faster, or train our model faster.

  • Let's look at a few performance benchmarks on the DGX-1.

  • The DGX-1 is a machine on which we run deep learning models.

  • We're running mixed-precision training with a per-GPU batch size of 256.

  • It has eight Volta V100 GPUs.

  • The graph shows the number of GPUs on the x-axis and images per second on the y-axis.

  • So as we go from one GPU to eight, we're able to achieve a speedup of 7x, and this is performance right out of the box with no tuning.

  • We're actively working on improving performance so that you're able to achieve more speedup and get more images per second when you distribute your model across multiple GPUs.

  • So far, we've been talking about the core part of model training and distributing your model using MirroredStrategy.

  • Okay, so let's say now you have deployed your model on multiple GPUs.

  • You're going to expect to see the same kind of boost in images per second when you do that.

  • But you may find that you're not able to process as many images per second as you expected compared to one GPU; you may not see the boost in performance, and the reason for that is often the input pipeline.

  • When you run a model on a single GPU, the input pipeline is preprocessing the data and making the data available on the GPU for training.

  • But GPUs and TPUs, as you know, process and compute data much faster than a CPU.

  • This means that when you distribute your model across multiple GPUs, the input pipeline is often not able to keep up with the training.

  • It quickly becomes a bottleneck.

  • For the rest of the talk, I'm going to show you how TensorFlow makes it easy for you to use tf.data APIs to build efficient and performant input pipelines.

  • Here's a simple input pipeline for ResNet-50.

  • We're going to use tf.data APIs because datasets are awesome.

  • They help us build complex pipelines using simple, reusable pieces.

  • When you have lots of data, in different data formats, and you want to perform complex transformations on this data, you want to be using tf.data APIs to build your input pipeline.

  • First, we're going to use the list_files API to get the list of input files that contain your images and labels.

  • Then we're going to read these files using the TFRecordDataset reader.

  • We're going to shuffle the records and repeat them a few times, depending on whether you want to run your model for a couple of epochs.

  • Then we apply a map transformation. This processes each record and applies transformations such as cropping, flipping, and image decoding.

  • And finally, we batch the input into the batch size that you desire. A sketch of this pipeline follows below.
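Here is a minimal sketch of that input pipeline; the file pattern and the parse_and_preprocess function are hypothetical stand-ins for the real ImageNet files and preprocessing code.

```python
import tensorflow as tf

def input_fn(batch_size=128, num_epochs=2):
    # List the input files that contain the images and labels (hypothetical pattern).
    files = tf.data.Dataset.list_files("/data/imagenet/train-*")
    # Read the records from the files.
    dataset = tf.data.TFRecordDataset(files)
    # Shuffle the records and repeat them for a few epochs.
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.repeat(num_epochs)
    # Process each record: decoding, cropping, flipping, etc. (hypothetical function).
    dataset = dataset.map(parse_and_preprocess)
    # Batch the input into the desired batch size.
    dataset = dataset.batch(batch_size)
    return dataset
```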

  • The input pipeline can be thought of as an ETL process: extract, transform, and load.

  • In the extract phase, we're reading from persistent storage, which can be local or remote.

  • In the transform phase, we're applying the different transformations like shuffle, repeat, map, and batch.

  • And finally, in the load phase, we're providing this processed data to the accelerator for training.

  • So how does this apply to the example that we just saw?

  • In the extract phase, we list the files and read them using the TFRecordDataset reader.

  • In the transform phase, we apply the shuffle, repeat, map, and batch transformations.

  • And finally, in the load phase, we tell TensorFlow how to grab the data from the dataset.

  • This is what our input pipeline looks like.

  • We have the extract, transform, and load phases happening sequentially, followed by the training on the accelerator.

  • This means that while the CPU is busy preprocessing the data, the accelerator is idle, and while the accelerator is training your model, the CPU is idle.

  • But the different phases of the ETL process use different hardware resources.

  • For example, the extract step uses the persistent storage.

  • The transform step uses the different cores of the CPU, and finally the training happens on the accelerator.

  • So if we can parallelize these different phases, then we can overlap the preprocessing of data on the CPU with the training of the model on the GPU.

  • This is called pipelining, so we can use pipelining and some parallelization techniques to build more efficient input pipelines.

  • Let us look at a few of these techniques, and then at a sketch of the resulting pipeline below.

  • First, you can parallelize file reading.

  • Let's say you have a lot of data that is sharded across a cloud storage service.

  • You want to read multiple files in parallel, and you can do this using the num_parallel_reads argument when you call the TFRecordDataset API.

  • This allows you to increase your effective throughput.

  • We can also parallelize the map function for transformations.

  • You can parallelize the different transformations of the map function by using the num_parallel_calls argument.

  • Typically, the argument we provide is the number of cores of the CPU.

  • And finally, you want to call prefetch at the end of your input pipeline.

  • Prefetch decouples the time the data is produced from the time it is consumed.

  • This means that you can prefetch data for the next training step while the accelerator is still training the current step.
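Here is the same sketch with those three techniques applied; as before, the file pattern and parse_and_preprocess are hypothetical, and the exact parallelism values are just illustrative choices.

```python
import multiprocessing
import tensorflow as tf

def input_fn(batch_size=128, num_epochs=2):
    files = tf.data.Dataset.list_files("/data/imagenet/train-*")
    # Read multiple files in parallel to increase effective throughput.
    dataset = tf.data.TFRecordDataset(files, num_parallel_reads=8)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.repeat(num_epochs)
    # Parallelize the map transformation across the CPU cores.
    dataset = dataset.map(parse_and_preprocess,
                          num_parallel_calls=multiprocessing.cpu_count())
    dataset = dataset.batch(batch_size)
    # Prefetch so the next batch is prepared while the current step is training.
    dataset = dataset.prefetch(buffer_size=1)
    return dataset
```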

  • This is what we had before, and this is the improvement we can get.

  • Here, the different phases of the input pipeline are happening in parallel with the training.

  • We're able to see that the CPU is preprocessing data for training step two while the accelerator is still training step one.

  • Neither the CPU nor the accelerator is idle for long periods of time.

  • The training time is now the maximum of the preprocessing time and the training time on the accelerator.

  • As you can see, the accelerator is still not 100% utilized.

  • There are a few advanced techniques that we can add to our input pipeline to improve this.

  • We can use fused transformation ops for some of these API calls, as sketched below.

  • Shuffle and repeat, for example, can be replaced by their equivalent fused op.

  • This parallelizes buffering elements for epoch n+1 while producing elements for epoch n.

  • We can also replace map and batch with their equivalent fused op.

  • This parallelizes the map transformation and the batching of the inputs.

  • With these techniques, we're able to process data much faster, make it available to the accelerator for training, and improve the training speed.
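A hedged sketch of the fused ops follows; at the time of this talk they lived in tf.contrib.data (in later releases under tf.data.experimental), and parse_and_preprocess remains a hypothetical placeholder.

```python
import tensorflow as tf

def input_fn(batch_size=128, num_epochs=2):
    files = tf.data.Dataset.list_files("/data/imagenet/train-*")
    dataset = tf.data.TFRecordDataset(files, num_parallel_reads=8)
    # Fused shuffle + repeat: buffers elements for the next epoch while
    # still producing elements for the current one.
    dataset = dataset.apply(
        tf.contrib.data.shuffle_and_repeat(buffer_size=10000, count=num_epochs))
    # Fused map + batch: overlaps the map transformation with batching.
    dataset = dataset.apply(
        tf.contrib.data.map_and_batch(parse_and_preprocess, batch_size=batch_size))
    dataset = dataset.prefetch(buffer_size=1)
    return dataset
```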

  • I hope this gives you a good idea of how you can use tf.data APIs to build efficient and performant input pipelines when you train your model.

  • So far we've been talking about training on a single machine and multiple devices.

  • But what if you want to train on multiple machines? You can use the Estimator's train_and_evaluate API.

  • train_and_evaluate uses the async parameter server approach.

  • This API is used widely within Google, and it scales well to a large number of machines.

  • Here's a link to the API where you can learn more on how to use it; a sketch follows below.
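For orientation, here is a minimal sketch of that API. The model function, input functions, bucket path, and step counts are hypothetical, and each machine in the cluster would also need a TF_CONFIG environment variable describing the cluster and its own role.

```python
import tensorflow as tf

# Hypothetical model and input functions, as in the earlier sketches.
classifier = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir="gs://my-bucket/resnet_model",  # shared location visible to all workers
)

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=100000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)

# Runs training and evaluation; on a cluster, each task picks up its role
# (chief, worker, parameter server, evaluator) from the TF_CONFIG environment variable.
tf.estimator.train_and_evaluate(classifier, train_spec, eval_spec)
```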

  • We're also excited to be working on a number of new distribution strategies.

  • We're working on a multi-machine mirrored strategy, which allows you to distribute your model across many machines with many devices.

  • We're also working on adding distribution strategy support to TPUs and directly into Keras.

  • In this talk, we've talked a lot about the different concepts related to distributed training: architectures and APIs.

  • But when you go home today, here are three things for you to keep in mind.

  • When you train your model, distribute your training to make it faster.

  • To do this, you want to use the Distribution Strategy APIs.

  • They're easy to use, and they're fast.

  • Input pipeline performance is important.

  • Use the tf.data APIs to build efficient input pipelines.

  • Here are a few TensorFlow resources.

  • First, we have the Distribution Strategy API.

  • You can try using MirroredStrategy to train your model across multiple GPUs.

  • Here's a link to the ResNet-50 model garden example.

  • So you can try running this example.

  • It has MirroredStrategy support enabled.

  • Here's a link also to the input pipeline performance guide, which has more techniques that you can use to build efficient input pipelines.

  • And here's the link to the gist that I mentioned in the demo.

  • You can try setting up your own instance and running the ResNet-50 model garden example.

  • Thank you for attending our talk, and we hope you have a great I/O.
