
  • [MUSIC PLAYING]

  • SACHIN JOGLEKAR: Hey.

  • I'm Sachin from the TensorFlow Lite team,

  • and I'm here to talk about delegates.

  • Before I go into the details, I would

  • like to go over some of the basics of what delegation is.

  • Typically, a user would start with a TensorFlow model

  • and use a converter to convert the model into the TFLite

  • format.

  • This TFLite file would then be handed down

  • to our interpreter, which runs the model on device.

  • By default, models run on the CPU.

  • So the interpreter would call out

  • to our CPU Op Kernels that are highly optimized for the ARM

  • Neon instruction set.
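
As a rough illustration of that default CPU path, here is a minimal sketch using the TFLite C++ API; the model path and the omitted input/output handling are placeholders, not part of the talk.

```cpp
// Minimal sketch of the default CPU path (no delegate).
// "model.tflite" is a placeholder for your converted model.
#include <memory>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

void RunOnCpu() {
  auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
  // BuiltinOpResolver maps each op to the NEON-optimized CPU kernels.
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  interpreter->AllocateTensors();
  // ... fill input tensors ...
  interpreter->Invoke();  // every op runs on the CPU kernels
  // ... read output tensors ...
}
```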

  • However, most devices these days, especially mobile phones,

  • have a lot of other chips, like mobile GPUs or DSPs.

  • And this is where delegates come in.

  • Our Delegate API acts like a bridge

  • between the TensorFlow Lite runtime and lower level

  • accelerated APIs.

  • For example, our NNAPI delegate acts

  • as an interface between TensorFlow Lite and Android's

  • neural network API.

  • Or the GPU delegate uses OpenCL and OpenGL

  • to run inference on mobile GPUs on Android devices.

  • A natural question here is: why would you use delegates at all?

  • The most obvious benefit is faster inference.

  • The classic example here is the GPU delegate.

  • Because of the highly parallelized nature of the GPU,

  • it is very good at performing matrix

  • math, such as convolutions or fully connected layers.

  • As a result, when we use our GPU delegate with TensorFlow Lite,

  • we observe up to 7x speedups with a lot of the vision models

  • that are currently used on mobile devices.

  • Another great benefit is lower power consumption.

  • A good example here is the DSP, or the digital signal

  • processor.

  • DSPs are meant for applications such as multimedia

  • and communication, which inherently require less power

  • consumption.

  • So when you use a DSP for inference,

  • it consumes up to 70% less power,

  • which is what we observed when we

  • used our delegate that leverages Qualcomm's Hexagon DSP to run

  • even some of the mobile-optimized models,

  • such as the MobileNet or the MobileNet SSD.

  • Now, suppose you have your own secret accelerator,

  • and you want to use our delegate API to write your own delegate.

  • Let's see how it would work in code.

  • So the bulk of how the interpreter delegates nodes

  • is in this function that we like to call DelegatePrepare.

  • This function gets an object called the TFLite context,

  • which is essentially an interface into the TensorFlow

  • Lite runtime for the delegate.

  • Using the context, the delegate first

  • gets the execution plan, which is

  • nothing but a list of nodes that are going

  • to be executed in sequence.

  • For each node, the delegate can look

  • at different kinds of information,

  • such as what op it executes, or what the types and shapes

  • of its input tensors are.

  • This lets the delegate make an informed decision

  • about which ops it can accept for delegation.

  • Once this list of supported nodes is populated,

  • the delegate calls a function called

  • ReplaceNodeSubsetsWithDelegateKernels.

  • This function takes two main arguments.

  • One is this list of supported nodes,

  • and the other is what we call the kernel registration.

  • We'll get to that in a minute.
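
As a hedged sketch of what such a DelegatePrepare might look like with the TFLite C API: MyDelegateSupportsNode and MyDelegateKernelRegistration are hypothetical placeholders for your own support check and kernel registration (sketched further below).

```cpp
// Hypothetical sketch of a DelegatePrepare, using the TFLite C API.
#include <vector>

#include "tensorflow/lite/c/common.h"

// Placeholders for your own code.
bool MyDelegateSupportsNode(TfLiteContext* context, const TfLiteNode* node,
                            const TfLiteRegistration* registration);
TfLiteRegistration MyDelegateKernelRegistration();

TfLiteStatus DelegatePrepare(TfLiteContext* context, TfLiteDelegate* delegate) {
  // 1. Get the execution plan: the list of node indices, in execution order.
  TfLiteIntArray* plan = nullptr;
  if (context->GetExecutionPlan(context, &plan) != kTfLiteOk) return kTfLiteError;

  // 2. Look at each node (its op, input tensor types, shapes, ...) and
  //    collect the ones the delegate accepts.
  std::vector<int> supported;
  for (int i = 0; i < plan->size; ++i) {
    const int node_index = plan->data[i];
    TfLiteNode* node = nullptr;
    TfLiteRegistration* registration = nullptr;
    if (context->GetNodeAndRegistration(context, node_index, &node,
                                        &registration) != kTfLiteOk) {
      return kTfLiteError;
    }
    if (MyDelegateSupportsNode(context, node, registration)) {
      supported.push_back(node_index);
    }
  }

  // 3. Hand the supported nodes back to the runtime, together with the
  //    kernel registration that will implement the delegated partitions.
  TfLiteIntArray* supported_array =
      TfLiteIntArrayCreate(static_cast<int>(supported.size()));
  for (int i = 0; i < supported_array->size; ++i) {
    supported_array->data[i] = supported[i];
  }
  const TfLiteStatus status = context->ReplaceNodeSubsetsWithDelegateKernels(
      context, MyDelegateKernelRegistration(), supported_array, delegate);
  TfLiteIntArrayFree(supported_array);
  return status;
}
```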

  • But let's look at what the runtime does when

  • the delegate calls this method.

  • Here we have a simple example of a model with Add and Mul ops,

  • and let's say our delegate only supports the Add operation.

  • So once the delegate calls this method with the nodes,

  • the runtime partitions all the nodes into two subsets,

  • or two types of partitions.

  • One is delegated, and the other is non-delegated.

  • Now, there are two reasons why this happens.

  • First is many of the delegates optimize inference

  • by fusing a lot of the ops together.

  • So this way, the delegate can maximize fusion and fuse

  • as many ops as possible.

  • Another great reason is that it's very expensive

  • to go back and forth between the CPU and the accelerator,

  • especially due to memory transfers.

  • And therefore, the fewer the partitions, the more optimized

  • the inference becomes.

  • Once this partitioning is done, the TensorFlow Lite runtime

  • replaces each delegated partition

  • with one single delegate op.

  • At this point, the delegate op behaves just

  • like any other TFLite node for the runtime.

  • And the behavior of this delegate

  • op is what is defined by the kernel implementation,

  • or the TfLiteRegistration that we saw in the previous slide.

  • The two main methods that need to be

  • implemented for this kernel registration--

  • the first one is Init.

  • This method is run at initialization time.

  • That is when delegation is happening.

  • Here the delegate, or the delegate kernel,

  • gets the delegate params, which

  • is essentially the nodes that it is responsible for,

  • the associated input and output tensors, and information like that.

  • With this information, the delegate

  • is free to initialize any opaque object that it can create.

  • The return type is void* so the runtime is completely agnostic

  • to what type of object is returned,

  • as long as it is not null.

  • Then at inference time, we run this method called Invoke.

  • In Invoke, the kernel gets back the object returned

  • during Init, and it is free to do whatever it wants to as long

  • as the implementation is semantically similar to what

  • the delegated partition would have done.
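
A hedged sketch of such a kernel registration, again using the TFLite C API; MyKernelState and RunPartitionOnMyAccelerator are hypothetical placeholders for your own accelerator-specific code.

```cpp
// Hypothetical sketch of the delegate kernel registration.
#include <vector>

#include "tensorflow/lite/c/common.h"

struct MyKernelState {
  std::vector<int> nodes;  // the nodes this kernel is responsible for
  // ... whatever else is needed to drive the accelerator ...
};

// Placeholder for your accelerator-specific execution logic.
TfLiteStatus RunPartitionOnMyAccelerator(TfLiteContext* context,
                                         TfLiteNode* node,
                                         MyKernelState* state);

TfLiteRegistration MyDelegateKernelRegistration() {
  TfLiteRegistration kernel{};
  kernel.custom_name = "MyDelegateKernel";

  // init: called at delegation time. The buffer is a TfLiteDelegateParams*
  // describing the partition: the replaced nodes, input and output tensors.
  kernel.init = [](TfLiteContext* context, const char* buffer,
                   size_t length) -> void* {
    const auto* params = reinterpret_cast<const TfLiteDelegateParams*>(buffer);
    auto* state = new MyKernelState();
    state->nodes.assign(
        params->nodes_to_replace->data,
        params->nodes_to_replace->data + params->nodes_to_replace->size);
    return state;  // any non-null opaque object; the runtime is agnostic to it
  };

  kernel.free = [](TfLiteContext* context, void* data) {
    delete reinterpret_cast<MyKernelState*>(data);
  };

  // prepare: called when tensors are (re)allocated; nothing to do here.
  kernel.prepare = [](TfLiteContext* context, TfLiteNode* node) -> TfLiteStatus {
    return kTfLiteOk;
  };

  // invoke: called at inference time. It gets back the object returned by
  // init (through node->user_data) and must produce the same result the
  // delegated partition would have produced on the CPU.
  kernel.invoke = [](TfLiteContext* context, TfLiteNode* node) -> TfLiteStatus {
    auto* state = reinterpret_cast<MyKernelState*>(node->user_data);
    return RunPartitionOnMyAccelerator(context, node, state);
  };

  return kernel;
}
```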

  • Now that we know what happens under the hood,

  • let's look at some of the delegation options in TFLite.

  • The first delegate that we have is

  • the NNAPI delegate, which supports

  • a lot of different accelerators, such as DSPs, GPUs, and NPUs,

  • from a variety of vendors.

  • It runs on Android P and above.

  • It supports more than 30 ops on Android P

  • and over 90 ops on Android Q. This

  • is one of the very few delegates that accepts both floating

  • point and integer models.

  • This is how you would typically run inference

  • with the NNAPI delegate using our Java interface.

  • The main idea is that you initialize the delegate

  • instance and you pass it on to our interpreters.

  • And the rest of your business logic

  • remains pretty much the same.

  • There's not much else you have to do for delegates, apart

  • from just these couple of lines of initialization and cleanup

  • at the end.
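
The slide shows the Java API; as a rough equivalent sketch of the same flow using the C++ API (assuming tflite::StatefulNnApiDelegate and a FlatBuffer model on disk), the pattern is: create the delegate, hand it to the interpreter, and leave the rest of the logic unchanged.

```cpp
// Rough C++ equivalent of the Java flow on the slide: initialize the
// delegate, pass it to the interpreter, and keep the rest of the logic as-is.
#include <memory>

#include "tensorflow/lite/delegates/nnapi/nnapi_delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

void RunWithNnapiDelegate(const char* model_path) {
  auto model = tflite::FlatBufferModel::BuildFromFile(model_path);
  tflite::ops::builtin::BuiltinOpResolver resolver;

  // The delegate should outlive the interpreter that uses it.
  tflite::StatefulNnApiDelegate nnapi_delegate;

  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  interpreter->ModifyGraphWithDelegate(&nnapi_delegate);

  interpreter->AllocateTensors();
  // ... fill inputs, interpreter->Invoke(), read outputs, exactly as before ...
}
```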

  • Then we have our GPU delegate, which, as I mentioned before,

  • gives up to a 7x speedup on a lot of the vision models

  • that involve a lot of convolutions

  • and fully connected layers.

  • It uses OpenCL and OpenGL on Android and Metal for iOS.

  • Currently it only accepts floating-point models, both 16-bit

  • and 32-bit.

  • We are working to add Vulkan support to the GPU delegate,

  • as well as inference for quantized models.

  • So stay tuned for that.

  • This is how you would do things with a GPU delegate.

  • The thing to note is that apart from the class name, which

  • is the GPU delegate instead of the NNAPI delegate,

  • everything else pretty much remains the same.
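
In C++, the equivalent change is likewise just the delegate construction and cleanup; a short sketch assuming the GPU delegate's C API (TfLiteGpuDelegateV2Create / TfLiteGpuDelegateV2Delete) and an interpreter built as in the earlier sketch:

```cpp
// Same flow as before; only the delegate construction and cleanup differ.
#include "tensorflow/lite/delegates/gpu/delegate.h"
#include "tensorflow/lite/interpreter.h"

TfLiteDelegate* ApplyGpuDelegate(tflite::Interpreter* interpreter) {
  TfLiteGpuDelegateOptionsV2 options = TfLiteGpuDelegateOptionsV2Default();
  TfLiteDelegate* gpu_delegate = TfLiteGpuDelegateV2Create(&options);
  interpreter->ModifyGraphWithDelegate(gpu_delegate);
  return gpu_delegate;  // call TfLiteGpuDelegateV2Delete once the interpreter is gone
}
```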

  • We are excited about the release of the Qualcomm Hexagon

  • DSP delegate that we announced a couple of weeks back.

  • This delegate provides up to a 25x speedup

  • for quantized uint8 models.

  • Our general recommendation is to use this delegate on Android O

  • and below, or in environments where you may not have

  • the Android operating system, and to use

  • the NNAPI delegate on Android P and above.

  • We are working with Qualcomm to add support

  • for models which are per-channel quantized.

  • So you can make use of our post-training quantization

  • tooling to run those same models with the Hexagon delegate.

  • Again, the inference is pretty similar.

  • The only difference now is that you

  • have to initialize the Hexagon delegate instead of the GPU

  • delegate object.
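
A short sketch of the same pattern with the Hexagon delegate's C API; the header path and the TfLiteHexagonInit / TfLiteHexagonDelegateCreate / TfLiteHexagonTearDown entry points are assumptions about the library layout at the time of the talk, with the interpreter built as before:

```cpp
// Same flow again; the Hexagon libraries need an explicit init/teardown.
#include "tensorflow/lite/experimental/delegates/hexagon/hexagon_delegate.h"
#include "tensorflow/lite/interpreter.h"

TfLiteDelegate* ApplyHexagonDelegate(tflite::Interpreter* interpreter) {
  TfLiteHexagonInit();  // load the Hexagon runtime libraries once
  TfLiteHexagonDelegateOptions params = {};
  TfLiteDelegate* delegate = TfLiteHexagonDelegateCreate(&params);
  interpreter->ModifyGraphWithDelegate(delegate);
  return delegate;
  // Later: TfLiteHexagonDelegateDelete(delegate); TfLiteHexagonTearDown();
}
```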

  • We are so excited to announce the Core ML delegate, which

  • uses Apple's Neural Engine to run faster inference on iOS

  • devices.

  • It runs on the A12 SoC and above,

  • and it provides up to 14x speedup

  • on a lot of the mobile models that are

  • used in on-device inference.

  • It is available on iOS 11 and later.

  • This is how you would run the inference with the Core ML

  • delegate using Swift, which is the language of choice

  • for iOS development.

  • The basic idea remains the same, that you initialize the object

  • and then you pass it on to an interpreter,

  • with the rest of the logic, apart from the inference,

  • remaining the same.
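
The Swift CoreMLDelegate shown on the slide wraps an underlying C API; as a hedged sketch of that same flow through the C interface (TfLiteCoreMlDelegateCreate and its header path are assumptions), with the interpreter built as before:

```cpp
// Same flow; the Swift CoreMLDelegate is a thin wrapper over this C API.
#include "tensorflow/lite/experimental/delegates/coreml/coreml_delegate.h"
#include "tensorflow/lite/interpreter.h"

TfLiteDelegate* ApplyCoreMlDelegate(tflite::Interpreter* interpreter) {
  TfLiteCoreMlDelegateOptions options = {};
  TfLiteDelegate* delegate = TfLiteCoreMlDelegateCreate(&options);
  interpreter->ModifyGraphWithDelegate(delegate);
  return delegate;  // later: TfLiteCoreMlDelegateDelete(delegate)
}
```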

  • Of course, it's not as easy as taking any random model

  • and giving it any delegate.

  • You have to think about a few questions

  • before you choose a delegate to use with your model.

  • So the first consideration is whether the model

  • is supported on the delegate.

  • For example, if you pass a floating point model to the DSP

  • delegate, nothing will happen.

  • Or if you give the GPU delegate a quantized model,

  • for now, it won't run.

  • It won't crash, but the delegate, if given a model

  • that it doesn't support, will simply reject all the nodes,

  • and everything will run on CPU.

  • So there'll be no improvement to performance at all.

  • The second question is about the trade-offs.

  • For example, a lot of the fixed-point delegates,

  • such as the DSP delegate,

  • tend to sacrifice a bit of accuracy

  • to gain a lot of speed, because of reasons

  • like using lower precision or fusing all of the operations

  • together.

  • So if your application requires a lot of precision,

  • this might be a problem for you.

  • Or with the GPU delegate, there is

  • an overhead in RAM usage during initialization time.

  • Also, all of the delegates come with some binary size

  • associated with them,

  • except the NNAPI delegate, which ships with the TFLite

  • runtime by default. So you have to keep an eye

  • on the binary size increase when you use a delegate.

  • All these numbers are provided in the documentation,

  • so be sure to check it out before you apply a delegate.

  • And the last question, obviously,

  • is whether the delegate actually improves performance.

  • Now, this depends on a lot of different factors,

  • such as supported ops.

  • If there are a lot of unsupported ops in your model,

  • there will be a lot of back and forth between the CPU

  • and the accelerator, which will sometimes

  • result in more latency.

  • So you have to take care of which ops are in your model.

  • Another factor is whether the environment

  • supports the delegate.

  • For example, if you give the Core ML delegate

  • a model on an old iPhone, it might not do you any good.

  • But the good news is that we have

  • some tools to help you to figure out

  • which delegate to use in any given

  • environment for your model.

  • We have our favorite benchmark_model tool,

  • which is used for latency profiling on Android devices.

  • So you basically build the binary using Bazel,

  • and you push it to the device.

  • And then you can run it to get a lot of statistics

  • about latency performance.

  • This is kind of the output that you get with the tool.

  • It tells you whether it applied the delegate,

  • and then it gives you a bunch of statistics

  • about latency in microseconds.

  • It also sometimes tells you about the CPU memory usage.

  • So if that is important for you, you can check that out.

  • Then we also have the inference_diff tool

  • that we released recently, which is basically

  • a way to compare the CPU's accuracy

  • with the accelerator's accuracy.

  • So what it does is it runs the model

  • in two different environments.

  • One is a CPU, and the other is the accelerator.

  • And it does this for a bunch of runs with random data,

  • and it compares the output tensors at the end.

  • The result looks something like this,

  • where you get a set of output statistics

  • for each output tensor.

  • So if you know what your output tensors mean in your model,

  • then you have a good idea of how close the accelerator

  • performance is to the CPU.

  • And we also have our recently released profilers for Android,

  • which are a great way to dig into how your model

  • behaves on Android devices.

  • Let's take an example.

  • Suppose you're using Perfetto, which

  • is a great tool for Android debugging.

  • You see that you have delegation occurring,

  • but you also see that the latency is higher than what

  • you would typically expect.

  • You zoom in and you see that there is a fully connected op

  • which is running after the delegate,

  • and you know that your delegate only supports one partition.

  • So it cannot delegate that fully connected op.

  • You dig in further, and you see that there

  • is a squeeze op between the delegate partition and fully

  • connected, which is causing this problem.

  • So if this op was supported on the delegate,

  • then the entire thing would run on the same partition.

  • And this is a real example of a ResNet with our GPU delegate.

  • So if this squeeze op was substituted with a reshape op,

  • the entire thing would run on the GPU delegate.

  • So this is how you can use profilers and a tool

  • like Perfetto to figure out why performance might not

  • be what you expect on an Android device.

  • In the coming months, we are working

  • on better tooling for delegates for you

  • to figure out how and why performance

  • is different from what you would expect.

  • We're also working on improved performance across all

  • our delegates, and improved model support with more ops

  • and different kinds of models, such as floating-point and quantized

  • models.

  • And we are also working on revamping our documentation so

  • that you have better support for using and writing

  • your own delegates.

  • That's all.

  • You can look at our documentation

  • at tensorflow.org/lite/performance/delegates for all things delegates,

  • the different options, and how to write your own.

  • If you have any questions, feel free to reach out

  • to us at tflite@tensorflow.org.

  • Thank you.

  • [MUSIC PLAYING]
