
  • JARED DUKE: Thanks everybody for showing up.

  • My name is Jared.

  • I'm an engineer on the TensorFlow Lite team.

  • Today I will be giving a very high level overview

  • with a few deep dives into the TensorFlow Lite

  • stack, what it is, why we have it, what it can do for you.

  • Again, this is a very broad topic.

  • So there will be some follow up here.

  • And if you have any questions, feel free to interrupt me.

  • And you know, this is meant to be enlightening for you.

  • But it will be a bit of a whirlwind.

  • So let's get started.

  • First off, I do want to talk about some

  • of the origins of TensorFlow Lite

  • and what motivated its creation, why we have it

  • in the first place, and why we can't just use TensorFlow on devices.

  • I'll briefly review how you actually use TensorFlow Lite.

  • That means how you use the converter.

  • How you use the runtime.

  • And then talk a little bit about performance considerations.

  • How you can get the best performance on device

  • when you're using TensorFlow Lite.

  • OK.

  • Why do you need TensorFlow Lite in your life?

  • Well, again, here's some kind of boilerplate motivation

  • for why we need on device ML.

  • But these are actually important use cases.

  • You don't always have a connection.

  • You can't just always be running inference in the cloud

  • and streaming that to your device.

  • A lot of devices, particularly in developing countries,

  • have restrictions on bandwidth.

  • They can't just be streaming live video

  • to get their selfie segmentation.

  • They want that done locally on their phone.

  • There are issues with latency if you need

  • real-time object detection.

  • Streaming to the cloud, again, is problematic.

  • And then there's issues with power.

  • On a mobile device, often the radio

  • is using the most power on your device.

  • So if you can do things locally, particularly with a hardware

  • backend like a DSP or an NPU, you

  • will extend your battery life.

  • But along with mobile ML execution,

  • there are a number of challenges with memory constraints,

  • with the low-powered CPUs that we have on mobile devices.

  • There's also a very kind of fragmented and heterogeneous

  • ecosystem of hardware backends.

  • This isn't like the cloud where often

  • you have a primary provider of your acceleration backend

  • with, say, NVIDIA GPUs or TPUs.

  • There's a large class of different kinds

  • of accelerators.

  • And there's a problem of how we can

  • actually leverage all of these.

  • So again, TensorFlow works great on large, well-powered devices

  • in the cloud, or locally on beefy workstation machines.

  • But TensorFlow Lite is not focused on these cases.

  • It's focused on the edge.

  • So stepping back a bit, we've had TensorFlow

  • for a number of years.

  • And why couldn't we just trim this down

  • and run it on a mobile device?

  • This is actually what we call the TensorFlow Mobile project.

  • And we tried this.

  • And after a lot of effort, and a lot of hours,

  • and blood, sweat, and tears, we were

  • able to create kind of a reduced variant of TensorFlow

  • with a reduced operator set and a trimmed-down runtime.

  • But we were hitting a lower bound

  • on where we could go in terms of the size of the binary.

  • And there were also issues in how we

  • could make that runtime a bit more extensible,

  • how we could map it onto all these different kinds

  • of accelerators that you get in a mobile environment.

  • And while there have been a lot of improvements

  • in the TensorFlow ecosystem with respect to modularity,

  • it wasn't quite where we needed it

  • to be to make that a reality.

  • AUDIENCE: How small a memory do you need to get to?

  • JARED DUKE: Memory?

  • AUDIENCE: Yeah.

  • Three [INAUDIBLE] seem too much.

  • JARED DUKE: So this is just the binary size.

  • AUDIENCE: Yeah.

  • Yeah.

  • [INAUDIBLE]

  • JARED DUKE: So in app size.

  • In terms of memory, it's highly model dependent.

  • So if you're using a very large model,

  • then you may be required to use lots of memory.

  • But there are different considerations

  • that we've taken into account with TensorFlow Lite

  • to reduce the memory consumption.

  • AUDIENCE: But your size, how small is it?

  • JARED DUKE: With TensorFlow Lite?

  • AUDIENCE: Yeah.

  • JARED DUKE: So the core interpreter runtime

  • is 100 kilobytes.

  • And then with our full set of operators,

  • it's less than a megabyte.

  • So TFMini was a project that shares

  • some of the same origins as TensorFlow Lite.

  • And this was, effectively, a toolchain

  • where you could take your frozen model.

  • You could convert it.

  • And it did some kind of high-level operator fusing.

  • And then it would do code gen. And it would kind of

  • bake your model into your actual binary.

  • And then you could run this on your device and deploy it.

  • And it was well-tuned for mobile devices.

  • But again, there are problems with portability

  • when you're baking the model into an actual binary.

  • You can't always stream this from the cloud

  • and rely on this being a secure path.

  • And it's often discouraged.

  • And this is more of a first-party solution

  • for a lot of vision-based use cases and not a general-purpose

  • solution.

  • So enter TensorFlow Lite.

  • Lightweight machine learning library

  • for mobile and embedded devices.

  • The goals behind this were making ML easier,

  • making it faster, and making the kind of binary size and memory

  • impact smaller.

  • And I'll dive into each of these a bit more in detail

  • in terms of what it looks like in the TensorFlow Lite stack.

  • But again, the chief considerations

  • were reducing the footprint in memory and binary size,

  • making conversion straightforward,

  • having a set of APIs that were focused primarily on inference.

  • So you've already crafted and authored your models.

  • How can you just run and deploy these on a mobile device?

  • And then taking advantage again of mobile-specific hardware

  • like these ARM CPUs, like these DSPs and NPUs

  • that are in development.

  • So let's talk about the actual stack.

  • TensorFlow Lite has a converter where

  • you ingest the GraphDef, the SavedModel, the frozen graphs.

  • You convert it to a TensorFlow Lite-specific model file

  • format.

  • And I'll dig into the specifics there.

  • There's an interpreter for actually executing inference.

  • There's a set of ops.

  • We call it the TensorFlow Lite dialect

  • of operators, which is slightly different than

  • the core TensorFlow operators.

  • And then there's a way to plug in these different hardware

  • accelerators.

  • Just walking through this briefly, again,

  • the converter spits out a TFLite model.

  • You feed it into your runtime.

  • It's got a set of optimized kernels

  • and then some hardware plugins.
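
To make that flow concrete, here's a minimal sketch using the TensorFlow Lite Python APIs, assuming TensorFlow 2.x; the SavedModel path, file names, and dummy input here are illustrative placeholders, not from the talk:

```python
import numpy as np
import tensorflow as tf

# Convert: ingest a SavedModel (illustrative path) and emit a .tflite model.
converter = tf.lite.TFLiteConverter.from_saved_model("./my_saved_model")
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)

# Run: load the converted model into the interpreter and execute inference.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input matching the model's expected shape and dtype.
dummy_input = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy_input)
interpreter.invoke()
result = interpreter.get_tensor(output_details[0]["index"])
```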

  • So let's talk a little bit more about the converter itself

  • and things that are interesting there.

  • It does things like constant folding.

  • It does operator fusing, where you're

  • baking the activations and the bias computation

  • into these high-level operators like convolution, which

  • we found to provide a pretty substantial speedup

  • on mobile devices.
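
As a hedged illustration of that fusing (assuming TensorFlow 2.x and the Keras APIs; the model below is made up for the example):

```python
import tensorflow as tf

# In the TensorFlow graph, the convolution, the bias add, and the ReLU
# activation below are separate operations.
inputs = tf.keras.Input(shape=(224, 224, 3))
outputs = tf.keras.layers.Conv2D(32, 3, use_bias=True, activation="relu")(inputs)
model = tf.keras.Model(inputs, outputs)

# After conversion, TensorFlow Lite can represent that pattern as a single
# convolution op with the bias and activation folded in.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
```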

  • Quantization was one of the chief considerations

  • with developing this converter, supporting

  • both quantization-aware training and post-training quantization.
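
As a rough sketch, post-training quantization is enabled with a flag on the converter (again, the SavedModel path is a placeholder):

```python
import tensorflow as tf

# Post-training quantization: ask the converter to quantize the model weights.
converter = tf.lite.TFLiteConverter.from_saved_model("./my_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
```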

  • And the model format is based on FlatBuffers.

  • So FlatBuffers are an analog to protobufs, which are

  • used extensively in TensorFlow.

  • But they were developed with more real-time considerations

  • in mind, specifically for video games.

  • And the idea is that you can take a FlatBuffer.

  • You can map it into memory and then read and interpret that