JARED DUKE: Thanks, everybody, for showing up. My name is Jared. I'm an engineer on the TensorFlow Lite team. Today I will be giving a very high level overview, with a few deep dives, into the TensorFlow Lite stack: what it is, why we have it, and what it can do for you. Again, this is a very broad topic, so there will be some follow-up here. And if you have any questions, feel free to interrupt me. This is meant to be enlightening for you, but it will be a bit of a whirlwind. So let's get started.

First off, I do want to talk about some of the origins of TensorFlow Lite and what motivated its creation: why we have it in the first place and why we can't just use TensorFlow on devices. I'll briefly review how you actually use TensorFlow Lite, that is, how you use the converter and how you use the runtime. And then I'll talk a little bit about performance considerations: how you can get the best performance on device when you're using TensorFlow Lite.

OK. Why do you need TensorFlow Lite in your life? Well, again, here's some kind of boilerplate motivation for why we need on-device ML. But these are actually important use cases. You don't always have a connection. You can't always be running inference in the cloud and streaming that to your device. A lot of devices, particularly in developing countries, have restrictions on bandwidth. They can't just be streaming live video to get their selfie segmentation; they want that done locally on their phone. There are issues with latency if you need real-time object detection; streaming to the cloud, again, is problematic. And then there are issues with power. On a mobile device, the radio is often using the most power, so if you can do things locally, particularly with a hardware backend like a DSP or an NPU, you will extend your battery life.

But along with mobile ML execution come a number of challenges: memory constraints, and the low-powered CPUs that we have on mobile devices. There's also a very fragmented and heterogeneous ecosystem of hardware backends. This isn't like the cloud, where you often have a primary provider of your acceleration backend with, say, NVIDIA GPUs or TPUs. There's a large class of different kinds of accelerators, and the problem is how we can actually leverage all of these. So again, TensorFlow works great on large, well-powered devices in the cloud and locally on beefy workstation machines. But TensorFlow Lite is not focused on those cases. It's focused on the edge.

So stepping back a bit: we've had TensorFlow for a number of years. Why couldn't we just trim it down and run it on a mobile device? This is actually what we called the TensorFlow Mobile project. And we tried this. After a lot of effort, a lot of hours, and blood, sweat, and tears, we were able to create a kind of reduced variant of TensorFlow, with a reduced operator set and a trimmed-down runtime. But we were hitting a lower bound on where we could go in terms of the size of the binary. And there were also issues in how we could make that runtime a bit more extensible, how we could map it onto all these different kinds of accelerators that you get in a mobile environment. And while there have been a lot of improvements in the TensorFlow ecosystem with respect to modularity, it wasn't quite where we needed it to be to make that a reality.

AUDIENCE: How small a memory do you need to get to?

JARED DUKE: Memory?

AUDIENCE: Yeah. Three [INAUDIBLE] seem too much.
JARED DUKE: So this is just the binary size.

AUDIENCE: Yeah. Yeah. [INAUDIBLE]

JARED DUKE: So in app size. In terms of memory, it's highly model dependent. If you're using a very large model, then you may be required to use lots of memory. But there are different considerations that we've taken into account with TensorFlow Lite to reduce the memory consumption.

AUDIENCE: But your size, how small is it?

JARED DUKE: With TensorFlow Lite?

AUDIENCE: Yeah.

JARED DUKE: So the core interpreter runtime is 100 kilobytes, and with our full set of operators it's less than a megabyte.

So TFMini was a project that shares some of the same origins with TensorFlow Lite. It was, effectively, a toolchain where you could take your frozen model, convert it, and it would do some high-level operator fusings and then code generation, baking your model into your actual binary, which you could then deploy and run on your device. And it was well tuned for mobile devices. But again, there are problems with portability when you're baking the model into an actual binary. You can't always stream this from the cloud and rely on this being a secure path, and it's often discouraged. And this was more of a first-party solution for a lot of vision-based use cases, not a general-purpose solution.

So enter TensorFlow Lite: a lightweight machine learning library for mobile and embedded devices. The goals behind this were making ML easier, making it faster, and making the binary size and memory impact smaller. And I'll dive into each of these in a bit more detail in terms of what it looks like in the TensorFlow Lite stack. But again, the chief considerations were reducing the footprint in memory and binary size, making conversion straightforward, and having a set of APIs focused primarily on inference. So you've already crafted and authored your models; how can you just run and deploy them on a mobile device? And then taking advantage, again, of mobile-specific hardware like these ARM CPUs, like these DSPs and NPUs that are in development.

So let's talk about the actual stack. TensorFlow Lite has a converter, where you ingest the GraphDef, the SavedModel, the frozen graphs, and convert them to a TensorFlow Lite-specific model file format. And I'll dig into the specifics there. There's an interpreter for actually executing inference. There's a set of ops, what we call the TensorFlow Lite dialect of operators, which is slightly different from the core TensorFlow operators. And then there's a way to plug in these different hardware accelerators. Just walking through this briefly: the converter spits out a TFLite model, you feed it into your runtime, and the runtime has a set of optimized kernels and then some hardware plugins.

So let's talk a little bit more about the converter itself and things that are interesting there. It does things like constant folding. It does operator fusing, where you're baking the activations and the bias computation into high-level operators like convolution, which we found to provide a pretty substantial speedup on mobile devices. Quantization was one of the chief considerations in developing this converter, supporting both quantization-aware training and post-training quantization. And it was based on FlatBuffers. So FlatBuffers are an analog to protobufs, which are used extensively in TensorFlow, but they were developed with more real-time considerations in mind, specifically for video games.
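As a rough sketch of the conversion flow described here, assuming a recent TensorFlow release where the converter is exposed as tf.lite.TFLiteConverter, and with a hypothetical SavedModel directory and output file name, the Python path might look something like this:

```python
import tensorflow as tf

# Hypothetical path to an already-trained SavedModel.
SAVED_MODEL_DIR = "/tmp/my_saved_model"

# Build a converter that ingests the SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)

# Optionally enable post-training quantization, one of the converter
# features mentioned above.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Produce the FlatBuffer-serialized .tflite model and write it to disk.
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```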
JARED DUKE: And the idea is that you can take a FlatBuffer, map it into memory, and then read and interpret that data in place, without a separate unpacking or deserialization step.
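Given a converted model, a minimal inference sketch with the Python tf.lite.Interpreter looks roughly like the following; the model file name and the zero-filled input are placeholders, and the same flow is available through the C++ and Java runtime APIs.

```python
import numpy as np
import tensorflow as tf

# Load the converted model; the runtime reads the FlatBuffer directly
# rather than unpacking it into a separate in-memory representation.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input with the expected shape and dtype, then run inference.
dummy_input = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy_input)
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]["index"])
```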