[MUSIC PLAYING]

YANG FAN: I'm Yang Fan. I'm a senior engineer on the machine learning platform team at Cruise. Today, I'd like to tell you about some work we have done over the last year: how the platform team built machine learning training at Cruise.

Cruise is a San Francisco-based company. We are building the world's most advanced self-driving ridesharing technology, operating on the streets of San Francisco. If you visit San Francisco, you may have a chance to see our orange test cars running on the streets.

In order to operate a driverless ridesharing service in San Francisco, our cars have to handle many complex situations. Every day, our test vehicles run through many busy streets, and they interact with double-parked cars, cyclists, delivery trucks, emergency vehicles, pedestrians, and even their pets. So we see many interesting things on streets all over the city. Our cars are designed to handle most of those situations on their own. Our cars carry multiple cameras and [INAUDIBLE] sensors, detecting the surrounding environment and making decisions at the same time. Many of those tasks are mission critical and driven by machine learning models.

Those machine learning models are trained on a lot of data. Here you can see some of the raw data. Our training data formats are complex and highly dimensional. We have two work streams in model development. One is to continuously retrain with new parameters and data. The other is to develop experiments and new models. The chart on the right side shows the number of model training jobs by week. As you can see, the number of training jobs varies from week to week. In the meantime, we want to train machine learning models fast, but not at too high a cost.

As a platform team, we want to fulfill the requirements of both our machine learning engineers and the company. On the engineering side, engineers want to train models at any time.
The tools and supported frameworks should be flexible, without too many constraints. Our machine learning engineers want their jobs to start as soon as they submit them, and they want to see results as early as possible. More importantly, they want an extremely natural and easy experience using our platform.

On the company side, we need to make sure we spend every penny wisely, which means we need to run our model training jobs efficiently. More importantly, we need to allow our engineers to focus on building the mission-critical software that impacts car performance.

In order to fulfill those requirements, we decided to build our model training on top of Google's Cloud AI Platform. AI Platform offers a fully managed training service through command-line tools and web APIs. We can launch our jobs on a single machine or on as many machines as our quota allows. Therefore, we can let our machine learning engineers scale up their training jobs if they want to. AI Platform also provides very customizable hardware. We can launch our training jobs on a combination of different GPU, CPU, and memory configurations, as long as they are compatible with each other. The AI Platform training service also provides good connectivity to other Google services, for example, Cloud Storage and BigQuery.

In order to train models on multiple machines efficiently, we also need a good distributed training strategy. We decided to use Horovod. Horovod is a distributed training framework open sourced by Uber. With a few lines changed in our model training code, we can scale our training jobs beyond the single-machine limit. So far, we have tested training models on anywhere from 16 GPUs to 256 GPUs. As we scale up the training cluster, we notice that efficiency decreases. That is mostly because of communication overhead: the more GPUs there are, the more communication there is between them.
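The "few lines changed" that Horovod requires follow a standard recipe from its public documentation: initialize Horovod, scale the learning rate by the number of workers, and wrap the optimizer so gradients are averaged across workers. Here is a minimal sketch; the import guard exists only so the snippet runs outside a GPU training image, and the optimizer object is a placeholder:

```python
# Sketch of the standard Horovod integration steps (hvd.init, learning-rate
# scaling, DistributedOptimizer), per Horovod's public docs. Real training
# code would import horovod unconditionally inside the training image.
try:
    import horovod.tensorflow as hvd
except ImportError:
    hvd = None


def scaled_lr(base_lr, world_size):
    """Linear learning-rate scaling: N workers -> N x the base learning rate."""
    return base_lr * world_size


def distribute(optimizer, base_lr):
    """Wrap an optimizer for Horovod, returning (optimizer, learning_rate)."""
    if hvd is None:
        return optimizer, base_lr  # single-process fallback for this sketch
    hvd.init()  # one process per GPU; each process learns its rank and world size
    lr = scaled_lr(base_lr, hvd.size())
    # DistributedOptimizer averages gradients across workers via ring-allreduce.
    return hvd.DistributedOptimizer(optimizer), lr
```

The remaining standard steps, broadcasting initial variables from rank 0 and writing checkpoints only on rank 0, are also part of Horovod's recommended recipe.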
To find the most efficient setup for a training job, at least two factors need to be considered. One is the per-unit cost; the other is the total training time. The chart on the right side demonstrates one example training job: training on one million images at one image per second per GPU, using NVIDIA [INAUDIBLE] GPUs on a high-memory machine type with 64 cores. As you can imagine, using more GPUs reduces the total training time. However, too many GPUs also means reduced per-GPU efficiency, so the blue line on the chart flattens from left to right as the number of GPUs increases. And while more GPUs save training time, the average cost increases, so the red line, the total cost, goes up from left to right on the chart. For this particular job, we found that using between 40 and 50 GPUs was the most cost effective.

We decided to build an automated system to provide the best setup to our users. This is the high-level architecture diagram of our distributed training system. Users interact with a command-line tool. The command-line tool packages the code and dependencies, together with the parameters, into a training job, then submits the job to our governance service. The service examines the job parameters to prevent any accidental misconfiguration, for example, using too many GPUs or too much memory. At the same time, the service translates the compute requirements into an actual AI Platform machine type, so users don't need to memorize all the different machine types or the cost model themselves.

All of our training jobs are registered in a [INAUDIBLE] tracking system, where we keep references to the job's source code and its parameters. When we want to analyze model performance, we can trace back to the job information if needed. The training inputs and outputs are stored in Google Cloud Storage.
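The tradeoff described above can be sketched with a toy cost model. The efficiency curve and hourly prices below are made-up assumptions for illustration, not Cruise's actual numbers or real GCP prices; what matters is the shape, with training time flattening while total cost keeps rising:

```python
# Toy model of the time-vs-cost tradeoff: 1,000,000 images at 1 image/sec/GPU,
# with per-GPU efficiency decaying as communication overhead grows. All
# constants are illustrative assumptions.

def efficiency(num_gpus):
    """Per-GPU scaling efficiency, decaying with cluster size (toy model)."""
    return 1.0 / (1.0 + 0.005 * num_gpus)

def training_hours(num_gpus, images=1_000_000, imgs_per_sec_per_gpu=1.0):
    """Wall-clock hours to process `images` once on `num_gpus` GPUs."""
    throughput = num_gpus * imgs_per_sec_per_gpu * efficiency(num_gpus)
    return images / throughput / 3600.0

def total_cost(num_gpus, gpu_hourly=2.50, host_hourly=4.00, gpus_per_host=8):
    """Total job cost: (GPU + host) hourly rates times wall-clock hours."""
    hosts = -(-num_gpus // gpus_per_host)  # ceiling division
    hourly_rate = num_gpus * gpu_hourly + hosts * host_hourly
    return training_hours(num_gpus) * hourly_rate

for n in (16, 32, 48, 64, 128):
    print(n, "GPUs:", round(training_hours(n), 1), "h,",
          round(total_cost(n)), "$")
```

With numbers like these, time keeps shrinking but with diminishing returns, while cost climbs steadily, so a knee appears somewhere in the middle of the range, which is the kind of analysis behind the 40-to-50-GPU sweet spot for the job in the talk.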
That's GCS for short. GCS provides cost-efficient storage and high throughput for our model training jobs. Once a job starts to produce metrics and results, users can also view them through the [INAUDIBLE] service.

While a job is running, our monitoring service constantly pulls job metrics through the AI Platform APIs and feeds them into our metrics store. So far, we care about GPU and CPU utilization and job duration. When utilization is too low, it could mean that the job is stuck and doesn't need the resources. Then we can notify the users to inspect the job or adjust the machine setup to save some cost.

Our internal framework is a bridge between the training system and our users. It provides [INAUDIBLE] and [INAUDIBLE] interactions with the training backend, so engineers can focus on writing model code. We can also switch the distributed training strategy by changing one line in the configuration. As I mentioned on the previous slide, the command-line tool automatically packages the code and submits the training job. The framework allows users to specify resources by the number of GPUs; the training framework then has the backend figure out the actual machine setup. The framework also provides the interface for the governance and monitoring services.

This slide demonstrates to users how to turn on Horovod training in the config. Our framework automatically wraps the optimizer in Horovod's DistributedOptimizer. The framework also changes some other parts of the model's behavior. One important change is how the process starts.
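The utilization check that the monitoring service performs can be sketched as a simple threshold rule. The threshold and window size below are illustrative assumptions, not Cruise's actual settings; the real service reads these metrics from the AI Platform APIs:

```python
# Illustrative stuck-job detector: flag a training job whose recent GPU
# utilization stays below a threshold, so the owner can be notified to
# inspect it or shrink the machine setup. Constants are assumptions.

def is_likely_stuck(gpu_util_samples, threshold=0.10, min_samples=5):
    """Return True if the recent average GPU utilization is suspiciously low."""
    if len(gpu_util_samples) < min_samples:
        return False  # not enough data yet to judge
    avg = sum(gpu_util_samples) / len(gpu_util_samples)
    return avg < threshold

# A healthy job vs. one that is probably stuck (e.g., waiting on input data).
healthy = [0.82, 0.79, 0.88, 0.91, 0.85]
stuck = [0.02, 0.01, 0.03, 0.02, 0.01]
```

In practice an alert like this would also consider job duration, since a brief utilization dip during checkpointing or data loading is normal.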