  • [MUSIC PLAYING]

  • YANG FAN: I'm Yang Fan.

  • I'm a senior engineer on the machine learning platform team

  • at Cruise.

  • Today, I'd like to tell you about some work

  • we have done over the last year: how the platform team built

  • machine learning training at Cruise.

  • So [INAUDIBLE] Cruise is a San Francisco-based company.

  • We are building the world's most advanced self-driving

  • ridesharing technology, operating on the streets of San Francisco.

  • If you visit San Francisco, you

  • may have a chance to see our orange test

  • cars running on the streets.

  • In order to operate the driverless ridesharing service

  • in San Francisco, our cars have to handle

  • many complex situations.

  • Every day, our test vehicles run through many busy streets.

  • And they interact with double-parked cars, cyclists,

  • delivery trucks, emergency vehicles, pedestrians,

  • and even with their pets.

  • So we see many interesting things on streets

  • all over the city.

  • Our cars are designed to handle most of those situations

  • on their own.

  • So our cars have multiple cameras and [INAUDIBLE] sensors

  • detecting the surrounding environment and making

  • decisions at the same time.

  • Many [INAUDIBLE] tasks are mission critical and driven

  • by machine learning models.

  • Those machine learning models are trained on a lot of data.

  • Here you can see some of the raw data.

  • Our training data format is very complex and high-dimensional.

  • So we have two work streams in model development.

  • One is to continuously retrain models using new parameters

  • and data.

  • The other is to run experiments and develop new models.

  • So on the right side, the chart shows

  • the number of model training jobs per week.

  • As you can see, the number of training jobs

  • varies week to week.

  • And in the meantime, we want to train machine learning models fast

  • but not at too high a cost.

  • As a platform, we want to fulfill requirements

  • for both our machine learning engineers and for the company.

  • On the engineers' side, engineers want

  • to train models at any time.

  • They want the tools and supported frameworks

  • to be flexible, without too many constraints.

  • Our machine learning engineers also want

  • their jobs to start as soon as they submit training

  • jobs, so they are able to see the results as

  • early as possible.

  • More importantly, they want to have

  • an extremely natural and easy experience using our platform.

  • On the company side, we need to make sure

  • that we spend every penny wisely.

  • It means we need to run our model

  • training jobs efficiently.

  • More importantly, we need to allow our engineers

  • to focus on building the mission-critical software

  • that impacts car performance.

  • So in order to fulfill those requirements,

  • we decided to build our model training on top of Google's

  • Cloud AI Platform.

  • AI Platform offers a fully managed training service

  • through command-line tools and web APIs.

  • So we can launch our jobs on a single machine

  • or on as many machines as our quota allows.

  • Therefore, we can let our machine learning engineers

  • scale up their training jobs if they want to.

  • AI Platform also provides very customizable hardware.

  • We can launch our training jobs

  • on a combination of different GPU, CPU, and memory

  • configurations, as long as they are compatible with each other.

  • The AI Platform training service also

  • provides good connectivity to other Google services,

  • for example, Cloud Storage and BigQuery.
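
For orientation, here is a minimal, hedged sketch of what submitting such a job through the AI Platform Training web API can look like in Python. The project, bucket, package, and module names are placeholders, and the machine and accelerator choices are only examples, not Cruise's actual setup.

```python
# Minimal sketch: submit a training job via the AI Platform Training REST API.
# Project, bucket, package, and module names are placeholders; authentication
# is assumed to come from Application Default Credentials.
from googleapiclient import discovery

ml = discovery.build("ml", "v1")

job_spec = {
    "jobId": "example_training_job_001",
    "trainingInput": {
        "region": "us-central1",
        "scaleTier": "CUSTOM",
        "masterType": "n1-highmem-64",            # CPU / memory choice
        "masterConfig": {
            "acceleratorConfig": {                # GPU choice
                "type": "NVIDIA_TESLA_V100",
                "count": 8,
            }
        },
        "packageUris": ["gs://example-bucket/packages/trainer-0.1.tar.gz"],
        "pythonModule": "trainer.task",
        "args": ["--train-data", "gs://example-bucket/data/train"],
        "runtimeVersion": "2.3",
        "pythonVersion": "3.7",
    },
}

response = ml.projects().jobs().create(
    parent="projects/example-project", body=job_spec
).execute()
print(response["state"])        # newly created jobs start in the QUEUED state
```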

  • In order to train models on multiple machines efficiently,

  • we also need a good distributed training strategy.

  • We decided to use Horovod.

  • Horovod is a distributed training framework open-sourced by Uber.

  • With a few lines changed in our model training code,

  • we can scale up our training jobs

  • beyond the single-machine limit.
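
A hedged sketch of what those "few lines" typically look like in a TensorFlow/Keras training script; the model and dataset below are placeholders, not Cruise's actual code.

```python
# Typical Horovod additions to a Keras training script (placeholder model/data).
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                           # start Horovod

# Pin each training process to a single local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])   # placeholder model

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged across workers with ring all-reduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="mse")

callbacks = [
    # Broadcast initial variables from rank 0 so every worker starts identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Placeholder dataset.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]), tf.random.normal([1024, 10]))
).batch(32)

model.fit(dataset, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```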

  • So far, we have tested training models

  • using from 16 GPUs up to 256 GPUs.

  • As we scale up the training cluster,

  • we noticed the efficiency decreased.

  • That is mostly because of the communication overhead.

  • The more GPUs there are, the more communication

  • overhead there is.

  • To find the most efficient setup for the training job, at least

  • two factors need to be considered.

  • One is the [INAUDIBLE] unit cost.

  • The other is the total training time.

  • The chart on the right side

  • demonstrates one training job example.

  • Say we are training one million images

  • at one image per second per GPU using NVIDIA [INAUDIBLE],

  • and the machine type is the highmem [INAUDIBLE]

  • with 64 cores.

  • As you can imagine, using more GPUs

  • could reduce the total training time.

  • However, too many GPUs would also reduce

  • the [INAUDIBLE] efficiency.

  • So the blue line on the chart becomes flat from left

  • to right as the number of GPUs increases.

  • While more GPUs can save training time,

  • the average cost increases.

  • So the red line, showing the total cost,

  • goes up from left to right on the chart.

  • For this particular job, we found

  • that using between 40 and 50 GPUs could be the most cost-effective.
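
As a hedged back-of-the-envelope illustration of that trade-off (all throughput, price, and scaling-efficiency numbers below are invented for the example, not Cruise's actual figures):

```python
# Toy model of the training-time vs. cost trade-off; every number is illustrative.
NUM_IMAGES = 1_000_000           # images to process, as in the example
IMAGES_PER_SEC_PER_GPU = 1.0     # single-GPU throughput
GPU_PRICE_PER_HOUR = 2.50        # hypothetical hourly GPU price (USD)
HOST_PRICE_PER_HOUR = 3.80       # hypothetical hourly 64-core host price (USD)
GPUS_PER_HOST = 8

def scaling_efficiency(num_gpus: int) -> float:
    """Toy efficiency curve: more GPUs means more communication overhead."""
    return 1.0 / (1.0 + 0.005 * num_gpus)

for num_gpus in (8, 16, 32, 48, 64, 128, 256):
    throughput = num_gpus * IMAGES_PER_SEC_PER_GPU * scaling_efficiency(num_gpus)
    hours = NUM_IMAGES / throughput / 3600
    hosts = -(-num_gpus // GPUS_PER_HOST)            # ceiling division
    cost = hours * (num_gpus * GPU_PRICE_PER_HOUR + hosts * HOST_PRICE_PER_HOUR)
    print(f"{num_gpus:4d} GPUs: {hours:6.1f} h, ~${cost:8.2f}")
```

Under these made-up numbers the time curve flattens while the total cost keeps rising, which is the same shape as the chart described above.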

  • We decided to build an automated system to provide the best

  • setup to our users.

  • This is the high-level architecture diagram

  • of our distributed training system.

  • Users interact with a command-line tool.

  • The command-line tool will package the code

  • and dependencies with parameters into a training job,

  • then submit the job to our governance service.

  • The service examines the job parameters

  • to prevent any accidentally misconfigured parameters.

  • For example, if a job is using too many GPUs or too much memory.

  • At the same time, the service will translate the compute

  • requirements into actual machine types on AI Platform.

  • So users don't need to memorize all the different machine

  • types, nor work out the cost model themselves.
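
A hedged sketch of what that translation and guard-rail step could look like; the limits, machine types, and mapping below are invented for illustration and are not Cruise's actual policy.

```python
# Illustrative governance step: validate a requested GPU count and translate it
# into an AI Platform machine configuration. Limits and mappings are invented.
MAX_GPUS = 256
GPUS_PER_WORKER = 8

def build_training_input(num_gpus: int, package_uri: str, module: str) -> dict:
    """Reject obviously misconfigured requests, then pick concrete machine types."""
    if not 0 < num_gpus <= MAX_GPUS:
        raise ValueError(f"Requested {num_gpus} GPUs; allowed range is 1-{MAX_GPUS}")
    if num_gpus % GPUS_PER_WORKER:
        raise ValueError(f"GPU count must be a multiple of {GPUS_PER_WORKER}")

    num_workers = num_gpus // GPUS_PER_WORKER
    machine = {"acceleratorConfig": {"type": "NVIDIA_TESLA_V100",
                                     "count": GPUS_PER_WORKER}}
    spec = {
        "region": "us-central1",
        "scaleTier": "CUSTOM",
        "masterType": "n1-highmem-64",
        "masterConfig": machine,
        "packageUris": [package_uri],
        "pythonModule": module,
    }
    if num_workers > 1:
        # Remaining GPUs go on identically configured worker machines.
        spec.update({"workerType": "n1-highmem-64",
                     "workerCount": num_workers - 1,
                     "workerConfig": machine})
    return spec
```

In this sketch a request for 64 GPUs, for example, would become one master plus seven workers with 8 GPUs each.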

  • All of our training jobs are registered

  • into a [INAUDIBLE] tracking system, where

  • we can keep a reference to the job's source

  • code and its parameters.

  • When we want to analyze the model performance,

  • we can trace back to the job information if that's needed.

  • The training input and output are

  • stored in Google Cloud Storage, the cloud storage service.

  • That's [INAUDIBLE] for short.

  • [INAUDIBLE] provides cost-efficient storage

  • and high throughput for our model training jobs.
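
One practical consequence, shown in the hedged snippet below, is that TensorFlow's file APIs can read and write gs:// paths directly, so training inputs and outputs can live in Cloud Storage; the bucket and paths here are hypothetical.

```python
# TensorFlow's file APIs accept gs:// paths directly, so training inputs and
# outputs can live in Cloud Storage. Bucket and paths below are hypothetical.
import tensorflow as tf

train_files = tf.io.gfile.glob("gs://example-bucket/training-data/*.tfrecord")
dataset = tf.data.TFRecordDataset(train_files)

# Outputs such as run metadata or checkpoints can be written back the same way.
with tf.io.gfile.GFile("gs://example-bucket/outputs/run-001/notes.txt", "w") as f:
    f.write("training run metadata\n")
```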

  • Once the job starts to produce some metrics and results,

  • you can also view them through the [INAUDIBLE] service.

  • While the job is running, our monitoring service

  • constantly pulls the job metrics through AI Platform

  • APIs and feeds them into a data log.

  • So far, we care about GPU and CPU [INAUDIBLE]

  • and job duration.

  • When [INAUDIBLE] is too low, it could mean the job is stuck,

  • or that it doesn't need that much resource.

  • Then we could notify the users to inspect the job

  • or adjust the machine type to save some cost.
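
A hedged, self-contained sketch of the kind of low-utilization check described here; the thresholds and the sample data are invented, and in practice the values would come from the polled AI Platform job metrics.

```python
# Illustrative low-utilization check; thresholds and sample data are invented.
import statistics
from typing import Sequence

LOW_UTILIZATION_THRESHOLD = 0.10   # 10% average GPU utilization
MIN_SAMPLES = 30                   # require enough recent samples before alerting

def should_alert(gpu_utilization: Sequence[float]) -> bool:
    """True if the job looks stuck or over-provisioned for its hardware."""
    if len(gpu_utilization) < MIN_SAMPLES:
        return False
    recent = gpu_utilization[-MIN_SAMPLES:]
    return statistics.mean(recent) < LOW_UTILIZATION_THRESHOLD

# Example: a job whose GPUs have been nearly idle for the last 30 samples.
samples = [0.85] * 100 + [0.02] * 30
if should_alert(samples):
    print("Notify the user: GPU utilization is low; the job may be stuck, "
          "or a smaller machine type could save cost.")
```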

  • Our internal framework is a bridge between the training

  • system and our users.

  • It provides our [INAUDIBLE] and the [INAUDIBLE] interactions

  • with the training backend.

  • So [INAUDIBLE] can focus on writing model code.

  • We also [INAUDIBLE] the distributed

  • training strategy by changing one line in the configuration.

  • As I mentioned on the previous slide,

  • the [INAUDIBLE] tool automatically

  • packages [INAUDIBLE] and submits the training job.

  • The framework allows the users to specify [INAUDIBLE]

  • by a number of GPUs.

  • Then our training framework will have the backend figure out

  • the actual machine setup.

  • The framework also provides the interfaces for the governance

  • and monitoring services.

  • This slide demonstrates to the users

  • how they turn on Horovod training in the config.

  • Our framework will automatically wrap the optimizer

  • into Horovod's DistributedOptimizer.

  • The framework will also change some other parts

  • of the model behavior.

  • One important change

  • is how the training processes are started.
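
A hedged sketch of how a framework can honor that single config switch; the flag name and the helper are hypothetical, and only the hvd.* calls come from Horovod's public API. Separately, the process-startup change mentioned here is that Horovod normally runs one training process per GPU, launched for example through horovodrun or MPI rather than as a single process.

```python
# Sketch: a framework-level helper that reacts to a single config flag.
# The flag name and helper are hypothetical; hvd.* calls are Horovod's real API.
import tensorflow as tf

def maybe_wrap_optimizer(optimizer, config: dict):
    """Wrap the optimizer for distributed training when the config asks for it."""
    if not config.get("use_horovod", False):       # the one-line switch
        return optimizer

    import horovod.tensorflow.keras as hvd
    hvd.init()                                     # framework, not user, starts Horovod

    # Pin this process to one local GPU: one training process per GPU.
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

    return hvd.DistributedOptimizer(optimizer)

# The user's training code stays the same whether the flag is on or off.
config = {"use_horovod": False}                    # set True on a multi-GPU cluster
opt = maybe_wrap_optimizer(tf.keras.optimizers.Adam(1e-3), config)
```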