
  • [MUSIC PLAYING]

  • MINGSHENG HONG: I'm Mingsheng Hong, tech lead and manager

  • of the TensorFlow runtime team.

  • Today, I'm excited to share with you our new project,

  • codenamed TFRT.

  • As you probably guessed, it stands for none other

  • than TensorFlow Runtime.

  • Some people might tell you that, in the world of TensorFlow,

  • runtime is what keeps tensors flowing.

  • I think that if runtime does this job,

  • you should never have to think about it.

  • But since we are here talking about the runtime,

  • let's first take a look at where runtime

  • fits into the TensorFlow stack.

  • Here's a diagram of the training workflow.

  • Runtime can be driven by eager APIs.

  • It can also execute graph programs

  • produced by a graph compiler.

  • Runtime is a low-level component that orchestrates all model

  • execution by calling into the relevant kernels that

  • implement machine learning primitives like matrix

  • multiplications.
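
As a concrete illustration of the two dispatch paths just described, here is a minimal sketch using standard TensorFlow 2.x Python APIs (ordinary public APIs, not TFRT-specific interfaces):

```python
import tensorflow as tf

# Eager path: each op is dispatched to the runtime as soon as it is called.
a = tf.random.normal([128, 128])
b = tf.random.normal([128, 128])
c = tf.matmul(a, b)  # the runtime immediately invokes a matmul kernel

# Graph path: tf.function traces a computational graph that the compiler
# can optimize before the runtime executes it.
@tf.function
def matmul_model(x, y):
    return tf.matmul(x, y)

d = matmul_model(a, b)  # executes the compiled graph program
```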

  • We're building TFRT to replace the existing runtime.

  • And let's first talk about why.

  • We talked to many of you, our TensorFlow users,

  • and heard your pain points and requests.

  • First, many of you are pushing the envelope in the performance

  • and scalability of model processing

  • across both eager and graph execution.

  • Second, you are making continuous innovations

  • through the addition of ops, kernels, and devices

  • to TensorFlow.

  • And we need to make such extension work more

  • streamlined and productive.

  • And once you are done with the model research and tuning,

  • you'll want to deploy TensorFlow everywhere,

  • across a diverse set of hardware platforms.

  • For those great reasons, we're building a new runtime

  • to help you, a new runtime that provides the performance,

  • extensibility, and unification that you all are looking for.

  • So how does TFRT fit into the workflow of an ML model?

  • Here, we see the TF training stack again.

  • Through TensorFlow APIs, your program

  • can either eagerly dispatch ops to the runtime,

  • as you can see from the blue arrows on the left side

  • of the diagram.

  • Or as the red arrows on the right side show,

  • in the case of graph execution, your program first

  • generates a computational graph, which

  • gets lowered to an optimized, target-specific program,

  • and is then dispatched to the runtime.

  • The optimization and lowering work

  • uses the MLIR compiler framework, which Jacques just

  • spoke about in his MLIR talk.

  • Finally, in both execution paths,

  • TFRT will call into a set of kernels

  • to complete a model execution, as the purple arrow shows.

  • Again, the term "kernel" here refers

  • to device-specific operations, like a GPU-based matrix

  • multiplication.

  • TFRT orchestrates the efficient kernel execution

  • over a heterogeneous set of hardware.
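
To make "device-specific kernel" concrete, the sketch below uses the public tf.device placement API: the same matmul op is served by a CPU kernel or a GPU kernel depending on where it is placed, which is the kind of heterogeneous execution the runtime orchestrates.

```python
import tensorflow as tf

x = tf.random.normal([256, 256])

# The same op, placed on different devices, is executed by different kernels.
with tf.device("/CPU:0"):
    y_cpu = tf.matmul(x, x)       # dispatched to a CPU matmul kernel

if tf.config.list_physical_devices("GPU"):
    with tf.device("/GPU:0"):
        y_gpu = tf.matmul(x, x)   # dispatched to a GPU matmul kernel
```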

  • Now let's dive a little more into the technical design

  • and look at how we realized the vision of building

  • a performant, extensible, and unified runtime.

  • First, to achieve high performance,

  • we built a lock-free graph executor

  • that supports concurrent op execution

  • with low synchronization overhead.

  • We have also made the eager op dispatch stack very, very thin.

  • And the eager API calls into the relevant kernels

  • with minimal runtime overhead.
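
TFRT itself is implemented in C++, so the following Python toy is only a rough illustration of the general idea behind a dependency-counting graph executor, not TFRT code: every node tracks how many inputs are still pending, and it becomes runnable the moment that count hits zero. All names here are hypothetical, and the Python lock stands in for the atomic counters a genuinely lock-free design would use.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ToyGraphExecutor:
    """Toy dependency-counting executor: a node runs once all of its inputs are done."""

    def __init__(self, nodes, edges):
        self.nodes = nodes                       # name -> callable (the "kernel")
        self.consumers = {n: [] for n in nodes}  # name -> downstream node names
        self.pending = {n: 0 for n in nodes}     # name -> number of unfinished inputs
        for src, dst in edges:
            self.consumers[src].append(dst)
            self.pending[dst] += 1
        self._lock = threading.Lock()            # stand-in for atomic counters
        self._done = threading.Event()
        self._remaining = len(nodes)

    def _dispatch(self, pool, name):
        self.nodes[name]()                       # run this node's kernel
        ready = []
        with self._lock:
            self._remaining -= 1
            for consumer in self.consumers[name]:
                self.pending[consumer] -= 1
                if self.pending[consumer] == 0:
                    ready.append(consumer)
            if self._remaining == 0:
                self._done.set()
        for consumer in ready:                   # newly runnable nodes go to the pool
            pool.submit(self._dispatch, pool, consumer)

    def run(self):
        sources = [n for n, count in self.pending.items() if count == 0]
        with ThreadPoolExecutor() as pool:
            for name in sources:
                pool.submit(self._dispatch, pool, name)
            self._done.wait()

# Example: "a" and "b" have no dependencies and can run concurrently; "c" waits for both.
ToyGraphExecutor(
    nodes={"a": lambda: print("ran a"),
           "b": lambda: print("ran b"),
           "c": lambda: print("ran c")},
    edges=[("a", "c"), ("b", "c")]).run()
```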

  • Second, to talk about extensibility,

  • let's first cover some background.

  • Host runtime is the component that

  • drives host CPU and I/O work, and it also

  • drives locally attached devices through the device runtimes.

  • TFRT keeps device runtimes separate from the host runtime,

  • so that when you add new device runtimes,

  • you don't have to extend the rest of the runtime.

  • The TFRT design also focuses on building

  • common abstractions, such as shape functions and kernels,

  • to be used in both graph and eager execution.

  • And this way, we get consistent behavior

  • between eager and graph, and also avoid

  • duplicated engineering efforts.
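
One way to picture this separation is a narrow interface between the host runtime and each device runtime; the sketch below is purely hypothetical (these class and method names are not TFRT APIs), but it shows why adding a new device runtime would not require touching the rest of the runtime.

```python
from abc import ABC, abstractmethod

class DeviceRuntime(ABC):
    """Hypothetical interface: everything the host runtime needs from a device runtime."""

    @abstractmethod
    def allocate(self, num_bytes):
        """Allocate device memory and return a handle."""

    @abstractmethod
    def execute(self, kernel_name, inputs):
        """Run a device-specific kernel and return its outputs."""

class HostRuntime:
    """Hypothetical host runtime that talks to devices only through the interface above."""

    def __init__(self):
        self.devices = {}

    def register_device(self, name, runtime: DeviceRuntime):
        # Supporting a new device type only requires a new DeviceRuntime subclass;
        # the host runtime itself does not change.
        self.devices[name] = runtime

    def dispatch(self, device_name, kernel_name, inputs):
        return self.devices[device_name].execute(kernel_name, inputs)
```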

  • Now, if you feel a bit lost in the last slide,

  • don't worry about it.

  • Let's step back and let's look at how these key design

  • decisions will benefit the core TensorFlow use cases.

  • For those of you who care about training,

  • you will see improved performance as well as

  • error reporting.

  • And that should make it easier to debug your models.

  • If you deploy TensorFlow models in production,

  • you'll be glad to see some improved performance

  • and some reduced CPU usage.

  • And I will show you this in a benchmarking study shortly.

  • TFRT will also support deployments

  • across diverse hardware platforms.

  • And in the next couple of slides,

  • I will show you some initial results on serving support.

  • TFRT is integrated into TensorFlow Serving

  • to form a flexible, high-performance serving system

  • for production environments.

  • If you follow the orange arrow, it

  • shows a pre-trained model that's loaded

  • into TFRT through the TensorFlow SavedModel API.

  • Now, the blue arrows show that serving clients

  • can send requests to the model

  • and get prediction results back.

  • We expect this TFRT integration to be largely

  • transparent to the end users.
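
As a rough sketch of that serving workflow using today's public APIs: the model is exported with tf.saved_model.save, loaded by a running TensorFlow Serving instance, and queried over its standard REST predict endpoint. The model choice, export path, model name, and port below are placeholder assumptions.

```python
import json
import requests
import tensorflow as tf

# Export a trained model in the SavedModel format that TensorFlow Serving loads.
model = tf.keras.applications.ResNet50(weights=None)   # placeholder model
tf.saved_model.save(model, "/tmp/resnet/1")             # version subdirectory "1"

# A serving client sends a prediction request to the standard REST endpoint,
# assuming a TensorFlow Serving instance is running locally on port 8501
# with this model registered under the name "resnet".
payload = {"instances": tf.random.normal([1, 224, 224, 3]).numpy().tolist()}
response = requests.post(
    "http://localhost:8501/v1/models/resnet:predict",
    data=json.dumps(payload))
print(response.json()["predictions"])
```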

  • So TFRT works for serving.

  • How does it perform?

  • In this benchmarking study, we used an MLPerf benchmark model,

  • ResNet-50, and measured the performance of GPU

  • inference over TFRT compared with the current stack.

  • We chose to use FP16 and a batch size of 1

  • to focus the performance study on the runtime-related op

  • dispatch overhead.
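
The exact MLPerf harness is not shown in the talk; the snippet below is only an approximate sketch of this kind of measurement with public Keras APIs: FP16 compute, batch size 1, a warm-up run, and wall-clock latency averaged over many inferences.

```python
import time
import tensorflow as tf

# Approximate the benchmark setup: ResNet-50, FP16 compute, batch size 1.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
model = tf.keras.applications.ResNet50(weights=None)
image = tf.random.normal([1, 224, 224, 3])

infer = tf.function(model)
infer(image)                       # warm-up run to exclude tracing/compilation time

runs = 100
start = time.perf_counter()
for _ in range(runs):
    infer(image).numpy()           # .numpy() forces the result to be materialized
elapsed_ms = (time.perf_counter() - start) / runs * 1000
print(f"average inference time: {elapsed_ms:.2f} ms")
```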

  • Let's now look at the numbers.

  • Can I have some drumrolls, please?

  • [DRUM ROLL]

  • Thank you.

  • I should first note that the current runtime is already

  • highly optimized for graph execution and serving needs.

  • Through multiple runs, it had a respectable average inference

  • time of 3.3 milliseconds.

  • In comparison, TFRT had an average inference time

  • of 2.4 milliseconds.

  • Bam.

  • There you have it.

  • This is a handsome improvement of 28%.

  • And there are more optimizations on the way.
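
For reference, the figure follows directly from the two average latencies; with the rounded numbers quoted above the arithmetic works out to roughly the same value (the quoted percentage presumably reflects the unrounded measurements):

$$\frac{3.3\,\text{ms} - 2.4\,\text{ms}}{3.3\,\text{ms}} \approx 0.27, \qquad \frac{3.3\,\text{ms}}{2.4\,\text{ms}} \approx 1.38$$

That is, average inference time drops by roughly 27-28%, equivalently about a 1.4x speedup.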

  • Our internal testing also showed that TFRT compares favorably

  • with alternatives to TensorFlow on this model.

  • We are very, very excited about this.

  • The performance improvements are due to the more efficient use

  • of multi-threaded CPUs, the asynchronous runtime

  • design, and a general focus on low-level efficiency.

  • While your actual mileage might vary depending

  • on your workloads, this encouraging result

  • helps validate our initial work, and prepares us

  • for the ongoing push to make TFRT production-ready.

  • I know.

  • You're excited too, right?

  • And you're probably wondering, when can I have it?

  • We will ship TFRT this year.

  • And this is going to be an exciting journey.

  • In addition to maintenance and selected enhancements

  • to the current stack, we plan to build out and integrate

  • TFRT with the TensorFlow stack.

  • As we roll out TFRT, we will initially make it available

  • through an opt-in flag, giving us

  • some time to fix issues and fine-tune the performance.

  • Eventually, it will become the default runtime.

  • We also plan to open-source this project in the near future.

  • We would love to get you all more involved.

  • We will keep you updated on our progress

  • through the developers@tensorflow.org

  • mailing list.

  • So please make sure to join it.

  • And also, if you would like to learn more about TFRT design,

  • please join our deep-dive tech talk at the March 19 MLIR

  • open design meeting.

  • The meeting is open to all members of the community.

  • Thank you all, and we look forward to following up.

  • [MUSIC PLAYING]
