
  • JARED DUKE: Thanks everybody for showing up.

  • My name is Jared.

  • I'm an engineer on the TensorFlow Lite team.

  • Today I will be giving a very high level overview

  • with a few deep dives into the TensorFlow Lite

  • stack, what it is, why we have it, what it can do for you.

  • Again, this is a very broad topic.

  • So there will be some follow up here.

  • And if you have any questions, feel free to interrupt me.

  • And you know, this is meant to be enlightening for you.

  • But it will be a bit of a whirlwind.

  • So let's get started.

  • First off, I do want to talk about some

  • of the origins of TensorFlow Lite

  • and what motivated its creation, why we have it

  • in the first place, and why we can't just use TensorFlow on devices.

  • I'll briefly review how you actually use TensorFlow Lite.

  • That means how you use the converter.

  • How you use the runtime.

  • And then talk a little bit about performance considerations.

  • How you can get the best performance on device

  • when you're using TensorFlow Lite.

  • OK.

  • Why do you need TensorFlow Lite in your life?

  • Well, again, here's some kind of boilerplate motivation

  • for why we need on device ML.

  • But these are actually important use cases.

  • You don't always have a connection.

  • You can't just always be running inference in the cloud

  • and streaming that to your device.

  • A lot of devices, particularly in developing countries,

  • have restrictions on bandwidth.

  • They can't just be streaming live video

  • to get their selfie segmentation.

  • They want that done locally on their phone.

  • There's issues with latency if you need

  • real time object detection.

  • Streaming to the cloud, again, is problematic.

  • And then there's issues with power.

  • On a mobile device, often the radio

  • is using the most power on your device.

  • So if you can do things locally, particularly with a hardware

  • backend like a DSP or an NPU, you

  • will extend your battery life.

  • But along with mobile ML execution,

  • there are a number of challenges with memory constraints,

  • with the low powered CPUs that we have on mobile devices.

  • There's also a very kind of fragmented and heterogeneous

  • ecosystem of hardware backends.

  • This isn't like the cloud where often

  • you have a primary provider of your acceleration backend

  • with, say, NVIDIA GPUs or TPUs.

  • There's a large class of different kinds

  • of accelerators.

  • And there's a problem with how can we

  • actually leverage all of these.

  • So again, TensorFlow works great on large well-powered devices

  • in the cloud, locally on beefy workstation machines.

  • But TensorFlow Lite is not focused on these cases.

  • It's focused on the edge.

  • So stepping back a bit, we've had TensorFlow

  • for a number of years.

  • And why couldn't we just trim this down

  • and run it on a mobile device?

  • This is actually what we call the TensorFlow Mobile project.

  • And we tried this.

  • And after a lot of effort, and a lot of hours,

  • and blood, sweat, and tears, we were

  • able to create kind of a reduced variant of TensorFlow

  • with a reduced operator set and a trimmed down runtime.

  • But we were hitting a lower bound

  • on where we could go in terms of the size of the binary.

  • And there were also issues in how we

  • could make that runtime a bit more extensible,

  • how we could map it onto all these different kinds

  • of accelerators that you get in a mobile environment.

  • And while there have been a lot of improvements

  • in the TensorFlow ecosystem with respect to modularity,

  • it wasn't quite where we needed it

  • to be to make that a reality.

  • AUDIENCE: How small a memory do you need to get to?

  • JARED DUKE: Memory?

  • AUDIENCE: Yeah.

  • Three [INAUDIBLE] seem too much.

  • JARED DUKE: So this is just the binary size.

  • AUDIENCE: Yeah.

  • Yeah.

  • [INAUDIBLE]

  • JARED DUKE: So in app size.

  • In terms of memory, it's highly model dependent.

  • So if you're using a very large model,

  • then you may be required to use lots of memory.

  • But there are different considerations

  • that we've taken into account with TensorFlow Lite

  • to reduce the memory consumption.

  • AUDIENCE: But your size, how small is it?

  • JARED DUKE: With TensorFlow Lite?

  • AUDIENCE: Yeah.

  • JARED DUKE: So the core interpreter runtime

  • is 100 kilobytes.

  • And then with our full set of operators,

  • it's less than a megabyte.

  • So TFMini was a project that shares

  • some of the same origins with TensorFlow Lite.

  • And this was, effectively, a tool chain

  • where you could take your frozen model.

  • You could convert it.

  • And it did some kind of high level operator fusing.

  • And then it would do code gen. And it would kind of

  • bake your model into your actual binary.

  • And then you could run this on your device and deploy it.

  • And it was well-tuned for mobile devices.

  • But again, there are problems with portability

  • when you're baking the model into an actual binary.

  • You can't always stream this from the cloud

  • and rely on this being a secure path.

  • And it's often discouraged.

  • And this is more of a first party solution

  • for a lot of vision-based use cases and not a general purpose

  • solution.

  • So enter TensorFlow Lite.

  • Lightweight machine learning library

  • for mobile and embedded devices.

  • The goals behind this were making ML easier,

  • making it faster, and making the kind of binary size and memory

  • impact smaller.

  • And I'll dive into each of these a bit more in detail

  • in terms of what it looks like in the TensorFlow Lite stack.

  • But again, the chief considerations

  • were reducing the footprint in memory and binary size,

  • making conversion straightforward,

  • having a set of APIs that were focused primarily on inference.

  • So you've already crafted and authored your models.

  • How can you just run and deploy these on a mobile device?

  • And then taking advantage again of mobile-specific hardware

  • like these ARM CPUs, like these DSP and NPUs

  • that are in development.

  • So let's talk about the actual stack.

  • TensorFlow Lite has a converter where

  • you ingest the graph def, the saved model, the frozen graphs.

  • You convert it to a TensorFlow Lite specific model file

  • format.

  • And I'll dig into the specifics there.

  • There's an interpreter for actually executing inference.

  • There's a set of ops.

  • We call it the TensorFlow Lite dialect

  • of operators, which is slightly different than

  • the core TensorFlow operators.

  • And then there's a way to plug in these different hardware

  • accelerators.

  • Just walking through this briefly, again,

  • the converter spits out a TFLite model.

  • You feed it into your runtime.

  • It's got a set of optimized kernels

  • and then some hardware plugins.

  • So let's talk a little bit more about the converter itself

  • and things that are interesting there.

  • It does things like constant folding.

  • It does operator fusing where you're

  • baking the activations and the biased computation

  • into these high level operators like convolution, which

  • we found to provide a pretty substantial speed up

  • on mobile devices.

  • Quantization was one of the chief considerations

  • with developing this converter, supporting

  • both quantization-aware training and post-training quantization.

  • And it was based on flat buffers.

  • So flat buffers are an analog to protobufs, which are

  • used extensively in TensorFlow.

  • But they were developed with more real time considerations

  • in mind, specifically for video games.

  • And the idea is that you can take a flat buffer.

  • You can map it into memory and then read and interpret that

  • directly.

  • There's no unpacking step.

  • And this has a lot of nice advantages.

  • You can actually map this into a page and it's clean.

  • It's not a dirty page.

  • You're not dirtying up your heap.

  • And this is extremely important in mobile environments

  • where you are constrained on memory.

  • And often the app is going in and out of foreground.

  • And there's low-memory pressure.

  • And there's also a smaller binary size

  • impact when you use flat buffers relative to protobufs.

  • So the interpreter, again, was built from the ground

  • up with mobile devices in mind.

  • It has fewer dependencies.

  • We try not to depend on really anything at base.

  • We have very few absolute dependencies.

  • I already talked about the binary size here.

  • It's quite a bit smaller than--

  • the minimum binary size we were able to get with TensorFlow

  • Mobile was about three megabytes for just the runtime.

  • And that's without any operators.

  • It was engineered to start up quickly.

  • That's kind of a combination of being able to map your models

  • directly into memory but then also having a static execution

  • plan where there's--

  • during conversion, we basically map out

  • directly the sequence of nodes that will be executed.

  • And then for the memory planning,

  • basically there's a pass when you're running your model where

  • we prepare each operator.

  • And they kind of queue up a bunch of allocations.

  • And those are all baked into a single pass where we then

  • allocate a single block of memory and tensors

  • are just fused into that large contiguous block of memory.

  • We don't yet support control flow.

  • But I will be talking about that later in the talk.

  • It's something that we're thinking about and working on.

  • It's on the near horizon for actual shipping models.

  • So what about the operator set?

  • So we support float and quantized types

  • for most of our operators.

  • A lot of these are backed by hand-tuned NEON and assembly-based

  • kernels that are specifically

  • optimized for ARM devices.

  • Ruy is our newest GEMM backend for TensorFlow Lite.

  • And it was built from the ground up with mobile execution

  • in mind, a [INAUDIBLE] execution.

  • We support about 120 built-in operators right now.

  • You will probably realize that that's quite

  • a bit smaller than the set of TensorFlow ops, which

  • is probably into the thousands by now.

  • I'm not exactly sure.

  • So that can cause problems.

  • But I'll dig into some solutions we have on the table for that.

  • I already talked about some of the benefits

  • of these high level kernels having

  • fused activations and biases.

  • And then we have a way for you to kind of, at conversion time,

  • stub out custom operators that you would like.

  • Maybe we don't yet support them in TF Lite

  • or maybe it's a one off operator that's not yet

  • supported in TensorFlow.

  • And then you can plug in your operator implementation

  • at runtime.

  • So the hardware acceleration interface,

  • we call them delegates.

  • This is basically an abstraction that

  • allows you to plug in and accelerate

  • subgraphs of the overall graph.

  • We have NNAPI, GPU, EdgeTPU, and DSP backends on Android.

  • And then on iOS, we have a metal delegate backend.

  • And I'll be digging into some of these and their details

  • here in a few slides.

  • OK.

  • So what can I do with it?

  • Well, I mean this is largely a lot of the same things

  • that you can do with TensorFlow.

  • There's a lot of speech and vision-related use cases.

  • I think often we think of mobile inference

  • as being image classification and speech recognition.

  • But there are quite a few other use

  • cases that are being used now and are in deployment.

  • We're being used broadly across a number of both first party

  • and third party apps.

  • OK.

  • So let's start with models.

  • We have a number of models in this model repo

  • that we host online.

  • You can use models that have already

  • been authored in TensorFlow and feed those into the converter.

  • We have a number of tools and tutorials

  • on how you can apply transfer learning to your models

  • to make them more specific to your use case,

  • or you can author models from scratch

  • and then feed that into the conversion pipeline.

  • So let's dig into conversion and what that actually looks like.

  • Well, here's a brief snippet of how

  • you would take a saved model, feed that into our converter,

  • and output a TFLite model.
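
As a rough sketch of the conversion call being described, assuming the tf.lite.TFLiteConverter API and a placeholder saved model path:

```python
import tensorflow as tf

# Load a SavedModel and convert it to the TFLite flatbuffer format.
# "saved_model_dir" is a placeholder for your exported model directory.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
tflite_model = converter.convert()

# Write the serialized model to disk for deployment on device.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```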

  • It looks really simple.

  • In practice, we would like to say

  • that this always just works.

  • That's sadly not yet a reality.

  • There's a number of failure points that people run into.

  • I've already highlighted this mismatch

  • in terms of supported operators.

  • And that's a big pain point.

  • And we have some things in the pipeline to address that.

  • There's also different semantics in TensorFlow that aren't yet

  • natively supported in TFLite, things like control-flow, which

  • we're working on, things like assets,

  • hash tables, TensorLists, those kinds of concepts.

  • Again, they're not yet natively supported in TensorFlow Lite.

  • And then certain types we just don't support.

  • They haven't been prioritized in TensorFlow Lite.

  • You know, double execution or bfloat16-- none of those,

  • not even FP16 kernels, are natively supported

  • by the TFLite built-in operators.

  • So how can we fix that?

  • Well, a number of months ago, we started a project called--

  • well, the name is a little awkward.

  • It's using select TensorFlow operators in TensorFlow Lite.

  • And effectively, what this does is

  • it allows you to, as a last resort,

  • convert your model for the set of operators

  • that we don't yet support.

  • And then at runtime, you could plug in this TensorFlow

  • select piece of code.

  • And it would let you run these TensorFlow kernels

  • within the TFLite runtime at the expense of a modest increase

  • in your binary size.

  • What does that actually mean?

  • So the converter basically, it recognizes these TensorFlow

  • operators.

  • And if you say, I want to use them,

  • if there's no TFLite built-in counterpart,

  • then it will take that node def.

  • It'll bake it into the TFLite custom operator that's output.

  • And then at runtime, we have a delegate

  • which resolves this custom operator

  • and then does some data marshaling

  • into the eager execution of TensorFlow, which again would

  • be built into the TFLite runtime and then marshaling

  • that data back out into the TFLite tensors.

  • There's some more information that I've linked to here.

  • And the way you can actually take advantage of this, here's

  • our original Python conversion script.

  • You drop in this line basically saying

  • the target ops set includes these select TensorFlow ops.
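
The exact flag name has shifted between TensorFlow releases, but the select-ops option looks roughly like this sketch, assuming the tf.lite.OpsSet enum:

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# Prefer TFLite built-in ops, and fall back to select TensorFlow ops
# for anything that has no built-in counterpart.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()
```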

  • So that's one thing that can improve the conversion

  • and runtime experience for models that aren't yet

  • natively supported.

  • Another issue that we've had historically--

  • our converter was called TOCO.

  • And its roots were in this TFMini project,

  • which was trying to statically compute and bake

  • this graph into your runtime.

  • And it was OK for it to fail because it would all

  • be happening at build time.

  • But what we saw is that that led to a lot

  • of hard to decipher opaque error messages and crashes.

  • And we've since set out to build a new converter based

  • on MLIR, which is just basically tooling

  • that's feeding into this converter

  • helping us map from the TensorFlow dialect of operators

  • to a TensorFlow Lite dialect of operators

  • with far more formal mechanisms for translating

  • between the two.

  • And this, we think will give us far better debugging,

  • and error messages, and hints on how

  • we can actually fix conversion.

  • And the other reason that motivated this switch

  • to a new converter was to support control flow.

  • This will initially start by supporting functional control

  • flow forms, so if conditionals and while loops.

  • We're still considering how we can potentially

  • map legacy control flow forms to these new variants.

  • But this is where we're going to start.

  • And so far, we see that this will unlock

  • a pretty large class of useful models,

  • the RNN class type models that so far

  • have been very difficult to convert to TensorFlow Lite.

  • TensorFlow 2.0.

  • It's supported.

  • There's not a whole lot that changes on the conversion

  • end and certainly nothing that changes on the TFLite end

  • except for maybe the change that saved model

  • is now the primary serialization format with TensorFlow.

  • And we've also made a few tweaks and added some sugar

  • for our conversion APIs when using quantization.

  • OK.

  • So you've converted your model.

  • How do you run it?

  • Here's an example of our API usage in Java.

  • You basically create your input buffer, your output buffer.

  • It doesn't necessarily need to be a byte buffer.

  • It could be a single or multidimensional array.

  • You create your interpreter.

  • You feed it your TFLite model.

  • There are some options that you can give it.

  • And we'll get to those later.

  • And then you run inference.

  • And that's about it.
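
The slide shows the Java flow; the Python tf.lite.Interpreter API follows the same create, feed, and run pattern, roughly as in this sketch (the model path and dummy input are placeholders):

```python
import numpy as np
import tensorflow as tf

# Load the converted model and allocate its tensors.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input matching the model's expected shape and dtype.
input_data = np.zeros(input_details[0]["shape"],
                      dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], input_data)

# Run inference and read back the output tensor.
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]["index"])
```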

  • We have different bindings for different platforms.

  • Our first class bindings are Python, C++, and Java.

  • We also have a set of experimental bindings

  • that we're working on or in various states of both use

  • and stability.

  • But soon we plan to have our Objective C and Swift

  • bindings be stable.

  • And they'll be available as the normal deployment libraries

  • that you would get on iOS via CocoaPods.

  • And then for Android, you can use our JCenter or Bintray

  • AARs for Java.

  • But those are primarily focused on third party developers.

  • There are other ways you can actually reduce

  • the binary size of TFLite.

  • I mentioned that the core runtime is 100 kilobytes.

  • There's about 800 or 900 kilobytes

  • for the full set of operators.

  • But there are ways that you can basically

  • trim that down and only include the operators that you use.

  • And everything else gets stripped by the linker.

  • We expose a few build rules that help with this.

  • You feed it your TFLite model.

  • It'll parse that and output,

  • basically, a CC file, which does the actual op registration.

  • And then you can rely on your linker

  • to strip the unused kernels.

  • OK.

  • So you've got your model converted.

  • It's up and running.

  • How do you make it run fast?

  • So we have a number of tools to help with this.

  • We have a number of backends that I talked about already.

  • And I'll be digging into a few of these

  • to highlight how they can help and how you can use them.

  • So we have a benchmarking tool.

  • It allows you to identify bottlenecks when actually

  • deploying your model on a given device.

  • It can output profiles for which operator's

  • taking the most amount of time.

  • It lets you plug in different backends

  • and explore how this actually affects inference latency.

  • Here's an example of how you would

  • build this benchmark tool.

  • You would push it to your device.

  • You would then run it.

  • You can give it different configuration options.

  • And we have some helper scripts that kind of help

  • do this all atomically for you.

  • What does the output look like?

  • Well, here you can get a breakdown of timing

  • for each operator in your execution plan.

  • You can isolate bottlenecks here.

  • And then you get a nice summary of where time is actually

  • being spent.

  • AUDIENCE: In the information, there

  • is just about operation type or we also

  • know if it's the earlier convolution of the network

  • or the later convolutions in the network or something like that?

  • JARED DUKE: Yeah.

  • So there's two breakdowns.

  • One is the run order which actually

  • is every single operator in sequence.

  • And then there's the summary where

  • it coalesces each operator into a single class.

  • And you get a nice summary there.

  • So this is useful for, one, identifying bottlenecks.

  • If you have control over a graph and then the authoring side

  • of things, then you can maybe tailor

  • the topology of your graph.

  • But otherwise, you can file a bug on the TFLite team.

  • And we can investigate these bottlenecks

  • and identify where there's room for improvement.

  • But it also affords--

  • it affords you, I guess, the chance

  • to explore some of the more advanced performance techniques

  • like using these hardware accelerators.

  • I talked about delegates.

  • The real power, I think, of delegates

  • is that it's a nice way to holistically optimize

  • your graph for a given backend.

  • That is you're not just delegating each op one by one

  • to this hardware accelerator.

  • But you can take an entire subgraph of your graph

  • and run that on an accelerator.

  • And that's particularly advantageous for things

  • like GPUs or neural accelerators where

  • you want to do as much computation on the device as

  • possible with no CPU interop in between.

  • So NNAPI is the abstraction in Android for accelerating ML.

  • And it was actually developed fairly closely in tandem

  • with TFLite.

  • You'll see a lot of similarities between the high level op

  • definitions that are found in NNAPI

  • and those found in TFLite.

  • And this is effectively an abstraction layer

  • at the platform level that we can hook into on the TensorFlow

  • Lite side.

  • And then vendors can plug in their particular drivers

  • for DSP, for GPUs.

  • And with Android Q, it's really getting to a nice stable state

  • where it's approaching parity in terms of features and ops

  • with TensorFlow Lite.

  • And there are increasingly--

  • there's increased adoption both in terms of user base

  • but also in terms of hardware vendors

  • that are contributing to these drivers. More recently,

  • we've released our GPU back end, and we've also open sourced it.

  • This can yield a pretty substantial speedup

  • on many floating point convolution models,

  • particularly larger models.

  • There is a small binary size cost that you have to pay.

  • But if it's a good match for your model,

  • then it can be a huge win.

  • And this is-- we found a number of clients

  • that are deploying this with things like face detection

  • and segmentation.

  • AUDIENCE: Because if you're on top of [INAUDIBLE] GPU.

  • JARED DUKE: Yeah, so on Android, there's a GLES back end.

  • There's also an OpenCL back end that's

  • in development that will afford a kind of 2

  • to 3x speed up over the GLES back end.

  • There's also a Vulkan back end, and then

  • on iOS, it's metal-based.

  • There's other delegates and accelerators

  • that are in various states of development.

  • One is for the Edge TPU project, which can either

  • use kind of runtime on device compilation,

  • or you can use or take advantage of the conversion step

  • to bake the compiled model into the TFLite graph itself.

  • We also announced, at Google I/O,

  • support for Qualcomm's Hexagon DSPs

  • that we'll be releasing publicly soon-ish.

  • And then there's some more kind of exotic optimizations

  • that we're making for the floating point CPU back end.

  • So how do you take advantage of some of these back ends?

  • Well, here is kind of our standard usage

  • of the Java APIs for inference.

  • If you want to use NNAPI, you create your NNAPI delegate.

  • You feed it into your model options, and away you go.

  • And it's quite similar for using the GPU back end.
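
On the Python side, the generic delegate plumbing looks roughly like the sketch below; the shared library name is a placeholder, and the available options depend on the particular delegate:

```python
import tensorflow as tf

# Load a delegate from a shared library (placeholder name) and hand it
# to the interpreter; supported subgraphs then run on the accelerator.
delegate = tf.lite.experimental.load_delegate("libdelegate_placeholder.so")
interpreter = tf.lite.Interpreter(
    model_path="model.tflite",
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()
```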

  • There are some more sophisticated and advanced

  • techniques for both NNAPI and GPU interop.

  • This is one example where you can basically use a GL texture

  • as the input to your graph.

  • That way, you avoid needing to copy--

  • marshal data back and forth from CPU to GPU.

  • What are some other things we've been working on?

  • Well, the default out of the box performance

  • is something that's critical.

  • And we recently landed a pretty substantial speed up there

  • with this ruy library.

  • Historically, we've used what's called gemmlowp

  • for quantized matrix multiplication,

  • and then eigen for floating point multiplication.

  • Ruy was built from the ground up basically

  • to [INAUDIBLE] throughput much sooner

  • in terms of the size of the inputs to, say, a given matrix

  • multiplication operator, whereas more

  • desktop and cloud-oriented matrix multiplication libraries

  • are focused on peak performance with larger sizes.

  • And we found this, for a large class of convolution models,

  • is providing at least a 10% speed-up.

  • But then on kind of our multi-threaded floating point

  • models, we see two to three times the speed-up,

  • and then the same on more recent hardware that has these NEON

  • dot product intrinsics.

  • There's some more optimizations in the pipeline.

  • We're also looking at different types--

  • sparse and fp16 tensors to take advantage of mobile hardware,

  • and we'll be announcing related tooling

  • and features support soon-ish.

  • OK, so a number of best practices here

  • to get the best performance possible--

  • just pick the right model.

  • We find a lot of developers come to us with Inception,

  • and it's hundreds of megabytes.

  • And it takes seconds to run inference,

  • when they can get just as good accuracy,

  • sometimes even better, with an equivalent MobileNet model.

  • So that's a really important consideration.

  • We have tools to improve benchmarking and profiling.

  • Take advantage of quantization where possible.

  • I'm going to dig into this in a little bit

  • how you can actually use quantization.

  • And it's really a topic for itself,

  • and there will be, I think, a follow-up

  • session about quantization.

  • But it's a cheap way of reducing the size of your model

  • and making it run faster out of the box on CPU.

  • Take advantage of accelerators, and then

  • for some of these accelerators, you can also

  • take advantage of zero copy.

  • So with this kind of library of accelerators

  • and many different permutations of quantized or floating point

  • models, it can be quite daunting for many developers, probably

  • most developers, to figure out how

  • best to optimize their model and get the best performance.

  • So we're thinking of some more and working on some projects

  • to make this easy.

  • One is just accelerator whitelisting.

  • When is it better to use, say, a GPU or NNAPI versus the CPU,

  • both local tooling to identify that for, say,

  • a device you plugged into your dev machine

  • or potentially as a service, where we can farm this out

  • across a large bank of devices and automatically determine

  • this.

  • There's also cases where you may want to run parts of your graph

  • on different accelerators.

  • Maybe parts of it map better to a GPU or a DSP.

  • And then there's also the issue of when different apps are

  • running ML simultaneously, so you have hotword detection

  • running at the same time you're running selfie segmentation

  • with a camera feed.

  • And they're both trying to access the same accelerator.

  • How can you coordinate efforts to make sure everyone's playing

  • nicely?

  • So these are things we're working out.

  • We plan on releasing tooling that can improve this

  • over the next quarter or two.

  • So we talked about quantization.

  • There are a number of tools available now

  • to make this possible.

  • There are a number of things being worked on.

  • In fact, yesterday, we just announced

  • our new post-training quantization

  • that does full quantization.

  • I'll be talking about that more here

  • in the next couple of slides.

  • Actually, going back a bit, we've

  • long had what's called our legacy quantized training

  • path, where you would instrument your graph at authoring time

  • with these fake quant nodes.

  • And then you could use that to actually generate

  • a fully quantized model as the output from the TFLite

  • conversion process.

  • And this worked quite well, but it was--

  • it can be quite painful to use and quite tedious.

  • And we've been working on tooling

  • to make that a lot easier to get the same performance

  • both in terms of model size reduction

  • and runtime acceleration speed-up.

  • AUDIENCE: Is part about the accuracy-- it seems

  • like training time [INAUDIBLE].

  • JARED DUKE: Yeah, you generally do.

  • So we first introduced this post-training quantization

  • path, which is hybrid, where we are effectively just quantizing

  • the weights, and then dequantizing that at runtime

  • and running everything in fp32, and there

  • was an accuracy hit here.

  • It depends on the model, how bad that is,

  • but sometimes it was far enough off

  • the mark from quantization aware training

  • that it was not usable.

  • And so that's where--

  • so again, with the hybrid quantization,

  • there's a number of benefits.

  • I'm flying through slides just in the interest of time.

  • The way to enable that post-training quantization--

  • you just add a flag to the conversion paths,

  • and that's it.
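
That flag, in the Python converter API, looks roughly like this (the exact spelling has varied by TensorFlow release):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# A single optimization flag enables post-training (weight) quantization,
# shrinking the model and speeding up CPU inference.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
```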

  • But on the accuracy side, that's where we came up

  • with some new tooling.

  • We're calling it per axis or per channel quantization, where

  • with the weights, you wouldn't just

  • have a single quantized set of parameters

  • for the entire tensor, but it would

  • be per channel in the tensor.

  • And we found that that, in combination with feeding it

  • kind of an evaluation data set during conversion time,

  • where you would explore the range of possible quantization

  • parameters, we could get accuracy that's almost on par

  • with quantization aware training.

  • AUDIENCE: I'm curious, are some of these techniques

  • also going to be used for TensorFlow JS,

  • or did they not have this--

  • do they not have similarities?

  • They use MobileNet, right, for a browser?

  • JARED DUKE: They do.

  • These aren't yet, as far as I'm aware, used or hooked

  • into the TFJS pipeline.

  • There's no reason it couldn't be.

  • I think part of the problem is just very different tool

  • chains for development.

  • But--

  • AUDIENCE: How do you do quantized operations

  • in JavaScript?

  • [INAUDIBLE]

  • JARED DUKE: Yeah, I mean I think the benefit isn't

  • as clear, probably not as much as if you were just

  • quantizing to fp16.

  • That's where you'd probably get the biggest win for TFJS.

  • In fact, I left it out of these slides,

  • but we are actively working on fp16 quantization.

  • You can reduce the size of your model by half,

  • and then it maps really well to GPU hardware.
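
As a sketch only, since this was still being worked on at the time, float16 quantization roughly surfaces in the converter like this:

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# Sketch of float16 quantization: weights are stored as fp16, roughly
# halving model size while mapping well to GPU backends.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()
```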

  • But I think one thing that we want to have is

  • that quantization is not just a TFLite thing,

  • but it's kind of a universally shared concept

  • in the TensorFlow ecosystem.

  • And how can we take the tools that we already

  • have that are sort of coupled to TFLite

  • and make them more generally accessible?

  • So to use this new post-training quantization

  • path, where you can get comparable accuracy to training

  • time quantization, effectively, the only difference

  • here is feeding in this representative data

  • set of what the inputs would look like to your graph.

  • It can be a-- for an image-based model,

  • maybe you feed it 30 images.

  • And then it is able to explore the space of quantization

  • and output values that would largely

  • match or be close to what you would

  • get with training-aware quantization.
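
A minimal sketch of that flow, assuming the converter's representative_dataset hook and a hypothetical 224x224 image input shape:

```python
import numpy as np
import tensorflow as tf

# Hypothetical calibration set: in practice, yield ~30 real inputs that
# are representative of what the model will see on device.
calibration_samples = [
    np.random.rand(1, 224, 224, 3).astype(np.float32) for _ in range(30)
]

def representative_data_gen():
    for sample in calibration_samples:
        # Each yielded item is a list of input tensors for one invocation.
        yield [sample]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# The representative dataset lets the converter calibrate quantization
# ranges for activations, not just the weights.
converter.representative_dataset = representative_data_gen
tflite_model = converter.convert()
```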

  • We have lots of documentation available.

  • We have a model repo that we're going to be investing

  • heavily in to expand this.

  • What we find is that a lot of TensorFlow developers--

  • or not even TensorFlow developers-- app developers

  • will find some random graph when they search Google or GitHub.

  • And they try to convert it, and it fails.

  • And a lot of times, either we have

  • a model that's already been converted

  • or a similar model that's better suited for mobile.

  • And we would rather have a very robust repository

  • that people start with, and then only

  • if they can't find an equivalent model,

  • they resort to our conversion tools or even authoring tools.

  • AUDIENCE: Is there a TFLite compatible section in TFHub?

  • JARED DUKE: Yeah, we're working on that.

  • Talked about the model repo. Training.

  • So what if you want to do training on device?

  • That is a thing.

  • We have an entire project, the [INAUDIBLE]

  • Federated Learning team, who's focused on this.

  • But we haven't supported this in TensorFlow Lite

  • for a number of reasons, but it's something

  • that we're working on.

  • There's quite a few bits and components that still have yet

  • to land to support this, but it's something

  • that we're thinking about, and there is increasing demand

  • for this kind of on-device tuning or transfer

  • learning scenario.

  • In fact, this is something that was announced at WWDC, so.

  • So we have a roadmap up.

  • It's now something that we publish publicly

  • to make it clear what we're working on,

  • what our priorities are.

  • I touched on a lot of the things that

  • are in the pipeline, things like control flow and training,

  • improving our runtime.

  • Another thing that we want to make easier

  • is just to use TFLite in the kind of native types

  • that you are used to using.

  • If you're an Android developer, say, if you have a bitmap,

  • you don't want to convert it to a byte buffer.

  • You just want to feed us your bitmap, and things just work.

  • So that's something that we're working on.

  • A few more links here to authoring apps

  • with TFLite, different roadmaps for performance and model

  • optimization.

  • That's it.

  • So any questions, any areas you'd

  • like to dive into more deeply?

  • AUDIENCE: So this [INAUDIBLE].

  • So what is [INAUDIBLE] has more impact like a fully connected

  • [INAUDIBLE]?

  • JARED DUKE: Sorry.

  • What's--

  • AUDIENCE: For a speed-up.

  • JARED DUKE: Oh.

  • Why does it?

  • AUDIENCE: Yeah.

  • JARED DUKE: So certain operators have been,

  • I guess, more optimized to take advantage of quantization

  • than others.

  • And so in the hybrid quantization path,

  • we're not always doing computation in eight-bit types.

  • We're doing it in a mix of floating point and eight-bit

  • types, and that's why there's not always the same speed-up

  • you would get with like an LSTM and an RNN versus a [INAUDIBLE]

  • operator.

  • AUDIENCE: So you mentioned that TFLite is

  • on billions of mobile devices.

  • How many apps have you seen added to the Play Store that

  • have TFLite in them?

  • JARED DUKE: Tim would have the latest numbers.

  • It's-- I want to say it's into the tens of thousands,

  • but I don't know that I can say that.

  • It's certainly in the several thousands,

  • but we've seen pretty dramatic uptick, though, just

  • tracking Play Store analytics.

  • AUDIENCE: And in the near term, are you thinking more

  • about trying to increase the number of devices that

  • are using TFLite or trying to increase

  • the number of developers that are including it

  • in the applications that they build?

  • JARED DUKE: I think both.

  • I mean there are projects like the TF Micro, where

  • we want to support actual microcontrollers

  • and running TFLite on extremely restricted, low-power ARM

  • devices.

  • So that's one class of efforts on--

  • we have seen demand for actually running TFLite in the cloud.

  • There's a number of benefits with TFLite

  • like the startup time, the lower memory footprint that

  • do make it attractive.

  • And some developers actually want

  • to run the same model they're running on device in the cloud,

  • and so there is demand for having

  • like a proper x86 optimized back end.

  • But at the same time, I think one of our big focuses

  • is just making it easier to use-- meet developers

  • where they're at.

  • And part of that is a focus on creating

  • a very robust model repository and more idiomatic

  • APIs they can use on Android or iOS

  • and use the types they're familiar with,

  • and then just making conversion easy.

  • Right now, if you do take kind of a random model

  • that you found off the cloud and try

  • to feed it into our converter, chances

  • are that it will probably fail.

  • And some of that is just teaching developers

  • how to convert just the part of the graph they want,

  • not necessarily all of the training that's surrounding it.

  • And part of it is just adding the features and types

  • to TFLite that would match the semantics of TensorFlow.

  • I mean, I will say that in the long run,

  • we want to move toward a more unified path with TensorFlow

  • and not live in somewhat disjoint worlds,

  • where we can take advantage of the same core runtime

  • libraries, the same core conversion pipelines,

  • and optimization pipelines.

  • So that's things that we're thinking about for the longer

  • term future.

  • AUDIENCE: Yeah, and also [INAUDIBLE]

  • like the longer term.

  • I'm wondering what's the implication of the ever

  • increasing network speed on the [INAUDIBLE] TFLite?

  • [INAUDIBLE], which maybe [INAUDIBLE] faster than current

  • that we've [INAUDIBLE] take [INAUDIBLE] of this.

  • JARED DUKE: We haven't thought a whole lot about that,

  • to be honest.

  • I mean, I think we're still kind of betting on the reality

  • that there will always be a need for on device ML.

  • I do think, though, that 5G probably

  • unlocks some interesting hybrid scenarios, where you're

  • doing some on device, some cloud-based ML,

  • and I think for a while, the fusion of on-device hotword

  • detection, as soon as the OK, Google is detected,

  • then it starts feeding things into the cloud.

  • That's kind of an example of where there is room

  • for these hybrid solutions.

  • And maybe those will become more and more practical.

  • Everyone is going to run to your desk

  • and start using TensorFlow Lite after this?

  • AUDIENCE: You probably already are, right?

  • [INAUDIBLE] if you have one of the however many apps that was

  • listed on Tim's slide, right?

  • JARED DUKE: I mean, yeah.

  • If you've ever done, OK, Google, then you're

  • using TensorFlow Lite.

  • AUDIENCE: [INAUDIBLE].

  • Thank you.

  • JARED DUKE: Thank you.

  • [APPLAUSE]
