  • SUHARSH SIVAKUMAR: We'll get started.

  • So hi, everyone, I'm Suharsh, and I'm here

  • to talk about the TensorFlow Model Optimization

  • Toolkit, where we have techniques

  • for quantization and pruning.

  • And feel free to ask questions or interrupt along the way.

  • I want this to be super interactive.

  • So what we're going to talk about today

  • is the high level of what quantization is, the challenge

  • it poses, and why it matters.

  • And then more of the specifics on,

  • in TensorFlow, what we are doing to work

  • on quantization and pruning.

  • So overall, quantization, the idea

  • is that you have your floating point network

  • with your inference graph, which is a floating point program.

  • And we're going to make modifications

  • to this program in the general sense

  • that we take these floating point calculations

  • and make them lower precision.

  • And the goal is to get as close in accuracy

  • as possible while providing some performance improvements.

  • So usually, this involves--

  • this is very general--

  • there's some function from the floating point to the integer

  • value.

  • There's a process to do the conversion

  • to make it valid for a particular hardware.

  • And then there's various algorithms

  • we have to get these parameters needed

  • for this function in the most efficient way.

  • So this is really general, and it may not make sense now,

  • but we'll make it more specific later.

  • Mhm?

  • AUDIENCE: Do the same conversion functions

  • work for mobile devices as well as specialized hardware?

  • SUHARSH SIVAKUMAR: No, and that's one of the challenges.

  • And we'll get to all the challenges.

  • That's a really good question [INAUDIBLE]..

  • Mhm?

  • AUDIENCE: I had another question.

  • Will you also be [? motivating ?] soon why

  • this is not as simple as a [? downcast ?] from float

  • to int [INAUDIBLE]?

  • SUHARSH SIVAKUMAR: Yes, it'll all make sense, I hope.

  • AUDIENCE: We're obviously very interested.

  • SUHARSH SIVAKUMAR: So why does this matter?

  • So the first thing is that the ML programs have

  • lots of parameters, and by using lower precision,

  • we can instantly get these models a lot smaller, which

  • can help with memory bandwidth and network

  • costs of downloading models.

  • Second, if you have all your calculations in integers,

  • you could have lots of optimizations that

  • make the execution super fast.

  • Third, integers are super power efficient.

  • So on mobile, this is really important.

  • And then finally, this lets us explore a whole new avenue

  • of hardware design, where we can make custom chips like seastar,

  • which was the first, then Edge TPU, and the new TPUs [? are ?]

  • getting integer operations.

  • And this can get us cheap, power efficient, fast hardware.

  • [INAUDIBLE]

  • So--

  • AUDIENCE: I think it helps if instead

  • of saying integer operations, you

  • say fixed point [? fraction ?] operations.

  • [INAUDIBLE]

  • SUHARSH SIVAKUMAR: So I avoid it because it's

  • only kind of fixed point.

  • It's not like-- it is fixed point, but when I--

  • so I've said fixed point in the past,

  • and then folks always say it's not truly fixed point

  • because fixed point applies a rescale every time you combine

  • the two values, and sometimes I get pushback.

  • So I'm going to avoid--

  • because I used to say quantization,

  • and then people would say there's

  • a hundred steps of quantization.

  • So the integers are the key here because that's

  • what's providing the acceleration that's

  • specific to what we're doing in the TensorFlow stack.

  • And the specifics, I guess, will make sense

  • after we go into the equations.

  • So why is quantization hard?

  • And this was your point that we have different chips.

  • So each chip has its own specific tradeoffs

  • it chose to make.

  • Some may only support int8, some may support int16,

  • some may want power of 2 rescales.

  • All these really one-off decisions

  • make the deployment story of how

  • you take a general TensorFlow program and

  • put it on one of these chips really hard.

  • For float, we started to get to a world

  • where we can just say float can run anywhere.

  • But for these things, there's not

  • a lot of standardization on how to do this.

  • The second reason it's hard is it often requires custom tooling

  • because you need extra metadata that often can only

  • be gathered by running inferences to know

  • how to quantize the values.

  • And we'll get more into that in detail.

  • So there's often an extra step in the process.

  • And then finally, for every specific ML problem,

  • we don't have a good answer for how

  • quantization will affect it.

  • You can use the same architecture,

  • but just do something else for your particular task

  • with the outputs of that architecture,

  • and quantization may help or hurt.

  • And it's pretty empirical right now,

  • where we just try it and see.

  • And we're still in the process of gathering a lot of examples.

  • But one of the goals we need to work on in ML research

  • is to understand these models more to determine how quantization

  • error will impact things.

  • So now more into the detail.

  • So currently, what most hardware implements,

  • and what the TensorFlow and TensorFlow Lite stack

  • implements, is affine quantization, which

  • is like us milking y = mx + b

  • from seventh grade for the rest of our lives.

  • [LAUGHING]

  • So basically, you uniformly distribute your range

  • into fewer chunks than you had before,

  • and then bucketize them.

  • And this is effectively what all quantization is.
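
To make that bucketing concrete, here is a minimal NumPy sketch of affine quantization and dequantization (an illustration of the idea, not the toolkit's actual code; the function names are made up):

```python
import numpy as np

def quantize(x, x_min, x_max, num_bits=8):
    """Affine-quantize a float array onto an integer grid defined by [x_min, x_max]."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (x_max - x_min) / (qmax - qmin)             # float width of one bucket
    zero_point = int(np.round(qmin - x_min / scale))    # integer standing in for 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map the integers back to approximate floats: y = scale * (q - zero_point)."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([0.0, 0.1, 2.5, 5.9], dtype=np.float32)
q, scale, zp = quantize(x, x_min=0.0, x_max=6.0)
print(q, dequantize(q, scale, zp))    # values snap to the nearest of 256 buckets
```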

  • Currently, we have different ways of gathering statistics

  • to determine how to quantize.

  • So going back to this picture for a second,

  • we need some sort of min and max value to know how to quantize.

  • So this implies that we need tooling

  • to get this information.

  • And we have two types of tooling right now.

  • There's during training tooling, where

  • you can incorporate this as part of your training pipeline.

  • At the end of the day, you have a trained model

  • that has information on how to quantize it.

  • And you can also do this post-training,

  • and we'll talk about the trade-offs later.

  • AUDIENCE: I have a question.

  • So why are you [INAUDIBLE] not beyond the boundary

  • of your possible values?

  • Why do you choose-- or do you purposefully choose

  • to leave some values out?

  • SUHARSH SIVAKUMAR: So, yeah.

  • The min and max, it's kind of an open question

  • on what is the optimal min-max given a tensor, if I answered

  • the question right.

  • So you could choose to put your min-max much smaller

  • than your actual value seen, and you'll get some clipping.

  • And depending on the model and the problem,

  • we wouldn't really know if it's useful or not.

  • Because sometimes models don't care

  • about those extraneous values, and sometimes, they're

  • the most important thing in the whole model.

  • AUDIENCE: The tricky thing is that when you set your min

  • and max, and if you're using int8,

  • you only have 255 values between the min and max.

  • Every [? activation ?] has to be cast

  • into one of those 255 values.

  • If you [INAUDIBLE] minus infinity [INAUDIBLE]

  • plus infinity, that's really useless.

  • But if your min is 0 and your max is 0.01,

  • you can represent computations with a lot of precision,

  • so it's the trade-off.

  • SUHARSH SIVAKUMAR: Yeah.

  • And we do different types of these depending on the model.

  • And we've seen weird things where--

  • and it's always this battle between how much

  • does the network care about these extreme values versus how

  • much does it care about the average rounding

  • error along the way.

  • So it's always this rounding versus clipping-- that's all--

  • we just play with this a lot.

  • AUDIENCE: You mentioned about min and max being primarily

  • influenced from training.

  • But [? I do ?] like to also do this [? add ?] [? infinite-- ?]

  • there's a constant feedback loop from--

  • SUHARSH SIVAKUMAR: So it's training or post-training.

  • So post-training might influence for like model compilation

  • time, and it doesn't stop.

  • AUDIENCE: Could I just clarify?

  • So the point is that it's no-- it wouldn't be considered

  • quantization if you just reduced float32

  • to float16, for example-- float8 or whatever.

  • So you still have a separate exponent

  • and you have just kind of fewer bits.

  • That's not considered quantization?

  • SUHARSH SIVAKUMAR: So technically, it is.

  • So the textbook term of quantization,

  • it is quantization.

  • But the quantization we're talking about here

  • is this integer quantization where

  • you have a shared min-max.

  • AUDIENCE: Where you really don't want it to have that--

  • SUHARSH SIVAKUMAR: [? We're ?] using that scale.

  • AUDIENCE: ----[INAUDIBLE] the exponent.

  • So that's the only thing that's useful for the hardware

  • [INAUDIBLE].

  • SUHARSH SIVAKUMAR: Exactly.

  • So in other like DSP literature, it's

  • sometimes called "block floating point," where

  • you have the exponent shared across all values of a tensor

  • rather than one exponent per element.

  • So in a way, float is just per element quantization.

  • Yeah.

  • So during training, the idea of during training quantization

  • is that you want to somehow get this network

  • to be robust to this error that quantization introduces.

  • So you emulate the effect of quantization

  • in the forward pass.

  • So if you ever see these TensorFlow fake_quant

  • operations, or the [? contrib ?] [? quantize ?] rewriter tool,

  • this is its goal.

  • It's saying given a graph, we'll rewrite the forward pass

  • to emulate the error due to quantization,

  • and then in the backward pass, we'll

  • do some tricks to skip over those non-differentiable parts

  • that quantization introduces.

  • And then the goal is that [? backprop ?] will magically

  • make the weights better for quantization.
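
As a rough illustration of what those fake_quant rewrites do in the forward pass, the public tf.quantization op can be called directly (a standalone sketch, not the rewriter itself):

```python
import tensorflow as tf

x = tf.constant([[-0.49, 0.02, 1.37]], dtype=tf.float32)

# Emulate 8-bit quantization error in the forward pass: the tensor stays float,
# but every value is snapped onto the 256-level grid spanned by [min, max].
x_fq = tf.quantization.fake_quant_with_min_max_args(x, min=-1.0, max=2.0, num_bits=8)
print(x_fq.numpy())   # values rounded to the nearest representable level
```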

  • And this can often get the best accuracy given

  • a particular schema of quantization,

  • but it's also really hard to train sometimes.

  • And machine learning, as we all know,

  • is the art of making as few changes

  • to your training as possible to get it to converge.

  • And the second you do more, oftentimes, you

  • won't even converge if you go to too low of a precision.

  • And you just have to play around a lot

  • if you try the training route.

  • Additionally, the error introduced in training

  • is specific for a particular target.

  • So if you want the result of your training

  • to be portable and work across many different chips,

  • you're kind of in trouble now if they

  • have different characteristics.

  • AUDIENCE: So by emulating quantization,

  • does that mean on the forward pass after every op,

  • you just apply the quantization?

  • SUHARSH SIVAKUMAR: Yeah, and it's

  • a bit trickier than after every op,

  • because it's after every rescale that the hardware expects.

  • So a specific example is like, in TensorFlow,

  • you have conv, bias add, ReLU. In most

  • of these inference backends, those are fused into one

  • fat conv-bias-ReLU.

  • And your rescales are only at the inputs of the conv

  • and the outputs of the ReLU. So you

  • should only emulate quantization there.

  • So you kind of need knowledge of what the target's expectations

  • are to decide where to put it.

  • So it's not just before and after every op.

  • AUDIENCE: And do you just use the current running max

  • and min?

  • SUHARSH SIVAKUMAR: Yeah.

  • So right now, we do moving average.

  • For certain models, we played with absolute min

  • and absolute max.

  • And it's really-- sometimes, we use schedules to slowly

  • manually constrain it.

  • And this is where the art part comes in, and it's not really

  • well understood how to do that generally.

  • So right now, for all the mobile--

  • like, all the vision models, we do moving average.

  • And it seems to work pretty well,

  • but we don't know if that's optimal or not.

  • It just turns out backprop is kind of magical.

  • AUDIENCE: And backprop you don't apply this at all.

  • SUHARSH SIVAKUMAR: Backprop, we use

  • this thing called "straight through estimator," which

  • the main problem with this quantization

  • is it's a step function, so it's not differentiable.

  • So we pretend it's an identity, and we just

  • pass the gradient right through, and this gets [? it to ?]

  • train.
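
A minimal sketch of that straight-through estimator trick using tf.custom_gradient (illustrative, not the toolkit's internal implementation): the forward pass rounds, the backward pass pretends the op was the identity.

```python
import tensorflow as tf

@tf.custom_gradient
def round_ste(x):
    # Forward pass: the non-differentiable rounding step of quantization.
    y = tf.round(x)
    # Backward pass: pretend the rounding was the identity and pass the gradient through.
    def grad(dy):
        return dy
    return y, grad

w = tf.Variable([0.2, 1.7, -0.6])
with tf.GradientTape() as tape:
    # Toy "quantize to steps of 0.25" in the forward pass.
    loss = tf.reduce_sum(round_ste(w * 4.0) / 4.0)
print(tape.gradient(loss, w).numpy())   # gradients flow as if there were no rounding
```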

  • AUDIENCE: It works in practice.

  • SUHARSH SIVAKUMAR: Mhm?

  • AUDIENCE: And just to clarify, there's

  • never a case where quantization is used in training just

  • to speed up the training.

  • It's only used in training because of the idea

  • that it would speed up inference.

  • Is that correct?

  • SUHARSH SIVAKUMAR: So there is some work.

  • I don't know if it's ever used in practice,

  • but there's been a few papers over the years that

  • do do quantization for speeding up training as well.

  • But this particular one is always--

  • everything in this talk, the goal is for inference.

  • And so this is purely to emulate what's happening at inference.

  • And oftentimes, it will be slower than--

  • slower to train these models than to actually

  • just train in floating point.

  • AUDIENCE: And just to be sure, I thought

  • more than actually speeding up inference,

  • the goal with quantization [? over ?] training

  • is to actually reduce errors?

  • SUHARSH SIVAKUMAR: Yeah, to reduce the accuracy

  • that you get--

  • that you lose when you eventually go to inference.

  • But the ultimate goal of this whole tooling

  • is to enable inference performance

  • for some particular hardware.

  • So that being said, we've been trying

  • to work really hard to avoid the need for this

  • in most general cases.

  • During training will always be the most accurate

  • because you're letting that effort make up for it,

  • but we think we can get pretty far

  • with post-training techniques.

  • With post-training, the trade-offs

  • are that you can't rely on this magical, huge hammer

  • of backpropagation to fix all your accuracies,

  • but you can do some things.

  • And additionally, the main benefit

  • is that the user doesn't have to retrain,

  • which is a pain because oftentimes, it won't converge,

  • you have to mess with hyperparameters,

  • your portability is gone.

  • So here, there's a compile step.

  • Or sometimes, like you were saying, even at runtime,

  • there's a step to collect these statistics to do that min-max.

  • So the second technique we have-- so

  • we'll get back to quantization for the majority of the talk,

  • but I just want to mention pruning.

  • So the other technique you have is

  • pruning, which the goal is to result in tensors in your model

  • that have many zeros.

  • And these-- so if you do arbitrary pruning,

  • where your resulting model has many zeros,

  • it's much more compressible.

  • And additionally, if you have a certain structure

  • to your pruning, or a certain percentage of sparsity,

  • you can have optimized kernels that accelerate things.

  • So the benefit is that you have so many repeated values in them

  • that you can just zip your file and you're good to go.

  • And then if you actually have hardware support for sparsity,

  • you can get faster kernels.

  • And one more point on pruning, which I think is kind of cool,

  • is that all the zeros--

  • since you have so many repeated zeros,

  • and zero is represented exactly in quantization,

  • it actually works really, really well with quantization,

  • and often helps quantization, which is kind of-- they're

  • like, compressing in two orthogonal ways, which

  • is kind of neat.

  • So now we'll talk about all the tools.

  • So yeah, last year, we released this model optimization toolkit

  • which is a suite of TensorFlow and TensorFlow Lite tools

  • that aim to make all these techniques doable,

  • and let us play around with trying out new things

  • with quantization and pruning.

  • So you can check that out here.

  • So here's my world famous hand.

  • This went on Twitter, and this is my hand [INAUDIBLE]..

  • AUDIENCE: You have [? tweeted ?] your hand, I think.

  • SUHARSH SIVAKUMAR: Yeah, that's true.

  • [LAUGHING]

  • We've been reusing these pictures way too much.

  • So we have quantization and sparsity.

  • So first, we'll deep dive in all the tools in quantization

  • in a bit more detail on how we actually do quantization.

  • So the first thing we've done in TensorFlow Lite

  • is try to understand for many of the canonical models

  • all the operations that are in there.

  • And what are some standard recipes

  • on how to implement these fixed point quantized kernels?

  • And the goal here is that we want some sort of endorsement

  • for a new hardware that comes in.

  • And we know that this is going to be a work in progress

  • because new chips are coming all the time.

  • They have different constraints, and they don't

  • want to listen to one standard.

  • But we want to be like some reference point

  • to where we can compare, oh, this new quantization scheme,

  • how does it compare to this?

  • So the goal with this is to have a bunch of CPU reference ops

  • that have been tried on many models,

  • and we understand them to some extent.

  • So this is a bit more detail on how we actually

  • do the quantization.

  • So the bottom number line is the floating point scale,

  • and that histogram is a pretend distribution of values

  • in a particular tensor.

  • And the idea of quantization is instead of wasting all our bits

  • representing this range that we don't even use,

  • let's figure out only the part that the histogram lies in,

  • and only represent that with a smaller number of bits.

  • So the top number line is the integer equivalent

  • of that, where we took that histogram

  • and we just use these 255 buckets to represent the number

  • line.

  • So this is just that same affine equation.

  • At inference time, we actually have--

  • we change this min-max to two different things called

  • "scale" and "zero point."

  • And scale is the floating point size of every bucket,

  • and zero point is an integer value that corresponds exactly

  • to floating point 0.

  • And this turns out to be really important.
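
A hedged sketch of how a calibrated (min, max) pair might be turned into those two inference-time parameters, including forcing the range to contain 0.0 so that float zero maps exactly onto an integer (the function name and clamping details are illustrative, not the toolkit's internals):

```python
import numpy as np

def scale_and_zero_point(x_min, x_max, qmin=-128, qmax=127):
    # Stretch the range to include 0.0 so zero (e.g. padding) is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)            # float width of one bucket
    zero_point = int(np.clip(np.round(qmin - x_min / scale), qmin, qmax))
    return scale, zero_point                           # dequantizing zero_point gives exactly 0.0

# e.g. an activation range observed during calibration
print(scale_and_zero_point(-1.0, 6.0))
```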

  • [? C ?] started to do this, and it

  • resulted in a lot of bias issues,

  • where for every multiply accumulate you have,

  • if you don't represent 0 exactly,

  • you just push this bias.

  • And then it also has a convenient thing of--

  • oftentimes, in the models, we do padding,

  • and it's just zero is just a special number

  • that we have to represent.

  • But the main thing is the [? cumulated ?] thing.

  • So this is just to give some insight

  • into what these tools are actually doing and

  • why we need the information.

  • So we won't go too much into depth here,

  • but here's the summary of our quantization spec.

  • And we have per-axis symmetric weights, per-layer asymmetric

  • activations, and then the zero point

  • is-- all these things are in a signed integer value.

  • And I'll explain each of these, actually, because right now it

  • won't make any sense.

  • So the first part of the specification is symmetry.

  • And the idea here is, do you want

  • to make your scale be able to represent values that are

  • really not centered around 0?

  • And this means often that that zero point--

  • I'll go back to the equation real quick--

  • that zero point here, do we want to have

  • the cost of that addition?

  • And depending on where this happens in your math,

  • it can be really expensive or not too expensive.

  • And so for symmetry, we've decided

  • to make weights symmetric, and the reason

  • is that since weights are constants,

  • the zero point is multiplied by the dynamic activations.

  • So this is a cost that you'd have

  • to do that's dependent on the input every time.

  • So having weights be asymmetric, every inference

  • has a cost that's additional.

  • And so weights being symmetric avoids this whole zero point

  • multiplication of activations.
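
A small worked expansion (standard affine-quantization algebra written out for clarity, not a quote from the slides) shows why the weight zero point is the expensive one. With $r = S\,(q - Z)$ for activations ($a$) and weights ($w$), each multiply-accumulate term is

$$ r_a\, r_w \;=\; S_a S_w \left( q_a q_w \;-\; Z_w q_a \;-\; Z_a q_w \;+\; Z_a Z_w \right). $$

The $Z_w q_a$ term depends on the runtime activations, so a nonzero weight zero point adds work to every single inference, while $Z_a q_w$ only touches the constant weights and can be folded in ahead of time; setting $Z_w = 0$ (symmetric weights) removes the activation-dependent term entirely.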

  • And we can answer more later, but I won't

  • go too much in depth here.

  • So it's faster if we make weight symmetric.

  • And activations, they're only multiplied by a constant value,

  • so having them have this zero point is not too expensive.

  • So we leave them asymmetric, and the activations

  • are often [INAUDIBLE] and stuff, which are super asymmetric.

  • So we'd be throwing away a bit if we don't do that.

  • So the second thing we can play around with in quantization

  • is the granularity in which we decide to have these min-maxes

  • or scales.

  • And traditionally, we were doing per layer quantization--

  • or per tensor quantization.

  • For a given tensor, you only have one min-max.

  • But it turns out for convolutions and [INAUDIBLE]

  • convolutions, often, each channel of the convolution

  • has a really different distribution.

  • And when you only have one scale or one

  • min-max for the entire tensor, you're

  • doing a really poor job in each of these distributions.

  • So the idea of per channel quantization

  • is you have a min-max per channel.

  • And since this is not in the inner loop of your kernels,

  • it's really not too expensive, and gets a huge benefit

  • in accuracy--

  • effectively like an extra bit.
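
A short NumPy sketch of per-channel symmetric scales for a conv kernel, one scale per output channel instead of one for the whole tensor (layout and names are assumptions, not the TF Lite kernel code):

```python
import numpy as np

def per_channel_symmetric_scales(kernel):
    """One symmetric int8 scale per output channel (last axis) of a conv kernel."""
    flat = kernel.reshape(-1, kernel.shape[-1])       # [everything_else, out_channels]
    max_abs = np.max(np.abs(flat), axis=0)            # per-channel absolute range
    return max_abs / 127.0                            # symmetric, so zero_point is 0

# A pretend conv kernel whose channels have very different magnitudes.
kernel = (np.random.randn(3, 3, 16, 32) * np.linspace(0.01, 1.0, 32)).astype(np.float32)
scales = per_channel_symmetric_scales(kernel)
q = np.clip(np.round(kernel / scales), -127, 127).astype(np.int8)   # broadcasts over channels
```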

  • So now to the tools.

  • So the tool fragmentation is all about,

  • how do we get these min-max values

  • that we need to do the quantization?

  • And so for weights, it's super easy.

  • Weights are static, so we can anytime just

  • look at the weights, read the min-max,

  • and quantize using those min-max.

  • So the problem always comes in dynamic values and activations

  • that you can only get an idea of the distribution

  • by actually running realistic inputs.

  • So the first, most naive, simplest idea

  • on how to do quantization is let's

  • read the quantization at the second we know it,

  • which is right at inference.

  • So during runtime, our graph is actually different.

  • Before our expensive multiplies, our matmuls,

  • we take the float input value, measure the min-max,

  • use those to quantize on the fly.

  • So this is like an O(n) operation of quantizing on the fly.

  • Then get the speedup of doing an int8 by int8 multiply

  • on your matmul, and then go back

  • to float at the end.
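
A rough NumPy sketch of that hybrid flow for a single matmul: the weights are quantized ahead of time, the activations are quantized on the fly from their observed range, and the result is rescaled back to float (an illustration of the idea, not TF Lite's kernel):

```python
import numpy as np

def hybrid_matmul(x_float, w_int8, w_scale):
    # 1. Measure this inference's activation range and quantize on the fly.
    x_scale = np.max(np.abs(x_float)) / 127.0
    x_int8 = np.clip(np.round(x_float / x_scale), -127, 127).astype(np.int8)
    # 2. Cheap integer matmul, accumulating into int32.
    acc = x_int8.astype(np.int32) @ w_int8.astype(np.int32)
    # 3. Rescale back to float for the rest of the (float) graph.
    return acc.astype(np.float32) * (x_scale * w_scale)

w = np.random.randn(64, 32).astype(np.float32)
w_scale = np.max(np.abs(w)) / 127.0
w_int8 = np.clip(np.round(w / w_scale), -127, 127).astype(np.int8)
x = np.random.randn(1, 64).astype(np.float32)
print(np.max(np.abs(hybrid_matmul(x, w_int8, w_scale) - x @ w)))   # small quantization error
```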

  • So the idea here is you get the most realistic

  • min-max range for your activations

  • because you're using the one for this particular inference.

  • The flaws are that you can only really do

  • this on chips that have float support.

  • The second time we could do this is--

  • if we want the whole graph to be integer,

  • we want to avoid this runtime cost of measuring

  • the min-max because we don't want any float

  • on any edge of this graph.

  • So what we can do is simply move that to compile time.

  • And so you have your float model,

  • and we want to do some post-training figuring out

  • of what the values are for all these dynamic values.

  • So to do this, we need some representative data

  • that we can run through the model,

  • collect ranges then, and then fix those min-max values

  • for the activations.

  • And this means that we're not using the perfect min-max,

  • like we were for hybrid quantization before.

  • But we are working on getting a representative one,

  • and we never have to have float in our inference graphs,

  • so this can go to all those integer accelerators.

  • AUDIENCE: So wait, I had a question kind

  • of related to the previous slide.

  • So the choice of whether to do hybrid or not,

  • is that multifaceted based on improving accuracy

  • because now you get better min-maxes,

  • but also the hybrid needs to support the float biases,

  • right?

  • SUHARSH SIVAKUMAR: Yeah.

  • So it's really problem specific.

  • So we'll get a little bit into that later as well,

  • but the short answer is, yes, it's

  • multifaceted in that it usually is a good choice if you're

  • going to CPU.

  • It's a bad choice if you have models

  • that have large activations.

  • Like image models don't get a huge benefit from hybrid

  • because your cost of doing this on the fly quantization

  • is pretty big.

  • And then accuracy really improves

  • for models with small activations

  • because you're kind of getting a more representative range

  • for that small tensor.

  • AUDIENCE: And also if you want truly low latency inference,

  • maybe it's harder [INAUDIBLE].

  • SUHARSH SIVAKUMAR: Yeah.

  • Mhm?

  • AUDIENCE: I was going to ask, how much [INAUDIBLE] do you

  • get from the hybrid approach?

  • And that's pretty expensive if you have to--

  • SUHARSH SIVAKUMAR: Yeah.

  • It can be, and it really depends on the model.

  • So I think we have some specific numbers.

  • But it really shines in models that

  • are kind of memory bound, because your main cost is

  • this n cubed thing.

  • Your activations may not be too big,

  • but you're getting this huge benefit of really driving

  • that matmul.

  • So then the third tool is integer-only quantization--

  • or during training integer-only quantization. So

  • this results in the same compatible graph

  • as that post-training integer quantization

  • in the previous slide, but the difference

  • is we're introducing the quantization

  • into the training, like we talked about before.

  • So we're working on Keras APIs [INAUDIBLE].

  • So the way this looks in--

  • the way this will look is you build your model as before,

  • and you just wrap it in this quantize wrapper.

  • And there'll be-- there's parameters too.

  • We won't go in too much detail.
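
For reference, this is roughly how that quantize wrapper looks with the tensorflow_model_optimization package today; the exact arguments and layer support may differ from the API being previewed in the talk:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Build the model exactly as before...
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])

# ...then wrap it so the forward pass emulates quantization during training.
q_aware_model = tfmot.quantization.keras.quantize_model(model)
q_aware_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])
# Train as usual, then convert with the TF Lite converter.
```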

  • For hybrid quantization, the way it looks

  • is you train your normal graph for TensorFlow,

  • and then you just enable a flag in the TF Lite converter.

  • So right now, we have hybrid and the post-training

  • only enabled in TF Lite because we want to make it general,

  • but right now we only have specifics on the hardware

  • capabilities of TF Lite, and we need

  • to know these to be able to do this.

  • So the way this looks is your normal TF Lite converter

  • invocation, and you just add this optimizations default

  • flag.

  • And under the hood, this is just doing this hybrid quantization

  • of just quantizing all the weights

  • and leaving the activations in float.
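
Sketched with the 2.x-style TF Lite converter API (the saved-model path is a placeholder), the hybrid path is just that one optimizations flag:

```python
import tensorflow as tf

# "/tmp/my_saved_model" is a placeholder path for a normally trained float model.
converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# With no representative dataset attached, this quantizes the weights to int8
# and leaves the activations in float (the hybrid / dynamic-range path).
tflite_model = converter.convert()
with open("model_hybrid.tflite", "wb") as f:
    f.write(tflite_model)
```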

  • So performance.

  • First off, all these approaches get similar model size

  • reduction, in that you're simply taking 32 bits,

  • going to 8 bits, so you're getting a 4x reduction in size.

  • For latency, like here, we see the--

  • we do get a speedup in these image models,

  • but for a lot of them, we don't see as much of a speedup

  • as we would expect from quantization.

  • And it's because the on-the-fly cost is actually pretty high.

  • AUDIENCE: What hardware is this?

  • Is this just like a CPU?

  • SUHARSH SIVAKUMAR: This is all CPU.

  • So like on accelerators, this will be--

  • the integer ones will really shine on custom accelerators.

  • So accuracy, we do see an accuracy drop

  • in a lot of these models.

  • And a lot of this, we are working on ways

  • to nudge weights at different times

  • during compilation to fix these accuracy issues.

  • And so all these, this is not like the gold standard

  • in what quantization can get in these techniques.

  • It's just a starting point.

  • So yeah, 4x reduction.

  • You see a 10% to 50% speed increase on convolution models on the CPU.

  • And then for memory bound models,

  • you really see a lot more.

  • And you often get most of the bang

  • of the buck of quantization from hybrid

  • in those models versus needing the full integer.

  • That being said, for accelerators,

  • you'd still need to go the full integer route.

  • So post-training integer quantization.

  • So this is also enabled in TF Lite.

  • You train in TensorFlow the normal way

  • you would a float graph, and then you provide one more

  • option into the converter.

  • And the way that looks is you do the same flag as before--

  • [? Optimize ?] default.

  • But now we need some data to figure out those dynamic ranges

  • at compile time rather than at runtime.

  • So this data generator you provide

  • needs to yield examples that you would

  • expect to see in practice.

  • And so for like image models, we just grab a few images

  • from [INAUDIBLE].

  • And usually, we see a couple hundred works well enough,

  • but it's probably very problem specific.

  • So under the hood, this is doing that post-training quantization

  • where we measure the absolute min and absolute max we

  • see for particular activations.

  • Mhm?

  • AUDIENCE: Why would a hybrid model be [INAUDIBLE]??

  • I mean, ultimately, you still have inferences still coming

  • in, so even if maybe the first one--

  • like the first 1,000 is slow, after 1,000,

  • you definitely have those statistics.

  • Why would you ever not just [INAUDIBLE] at that point?

  • SUHARSH SIVAKUMAR: That's the question, yeah.

  • And so--

  • AUDIENCE: [INAUDIBLE]

  • SUHARSH SIVAKUMAR: You could do that.

  • So oftentimes, it turns out these--

  • for like the RNN models, you actually get

  • an accuracy benefit from hybrid, which because--

  • AUDIENCE: Even if you had a bunch of data?

  • SUHARSH SIVAKUMAR: Even if you had a bunch

  • because each activation actually is getting a really unique

  • [? range. ?]

  • AUDIENCE: Because it's float.

  • SUHARSH SIVAKUMAR: Yeah.

  • And also because you can imagine in RNN,

  • that same op is actually going to change its distribution

  • based on which time step you're on.

  • And so it really ends up being problem specific there.

  • But you're right, for like image models,

  • we absolutely could be doing that.

  • So yeah, the example of representative dataset

  • is just how you would normally load data.

  • And you just yield examples of these images.
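
A sketch of that representative dataset hookup with the 2.x-style converter API; the random arrays stand in for real preprocessed images and the path is a placeholder:

```python
import numpy as np
import tensorflow as tf

def representative_dataset():
    # Yield a few hundred inputs that look like real inference data;
    # random arrays here stand in for actual preprocessed images.
    for _ in range(200):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_model = converter.convert()   # activation ranges get fixed at compile time
```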

  • So now some numbers.

  • So before we had released this, the [? contrib ?]

  • quantize rewriter-- which I'm not talking about in this talk

  • because it's deprecated for a more friendly 2.0 capable API.

  • But those are kind of the gold standard

  • in quantization accuracy numbers for these image classification

  • problems.

  • And what we've seen is that with these changes, like per

  • channel, in our quantization scheme,

  • post-training integer quantization, which

  • is the right column, gets pretty comparable on all these models

  • that matter at the moment.

  • And this is without anything fancy.

  • So Denali has been looking into a lot of cool tricks

  • that are figuring out how to get--

  • where the accuracy is going in post-training.

  • So these numbers should be improving as well.

  • But the takeaway here is that most things--

  • 8-bit-- maybe we're good enough with post-training,

  • and only the experts really need to use

  • quantization-aware training.

  • So this is an example of quantization not working well--

  • in the first column.

  • Where SSD, it's the same base structure of MobileNet,

  • but what you're doing with your [? logits ?] is a lot more.

  • So quantizing actually introduces a lot more error

  • here, and we see over a percent drop

  • in post-training versus quantization-aware training.

  • And this "higher is better" label is [? wrong. ?]

  • [LAUGHING]

  • So the other two columns are new models,

  • and no one ever went about doing quantization-aware training

  • here because it was just too much work,

  • and because they tried post-training.

  • These were released after post-training was released,

  • and post-training did really well accuracy wise,

  • so they just didn't bother with quantization-aware training.

  • More models.

  • Style transfer, we got good results on quantization,

  • although there's not really a good metric for style transfer.

  • The metric is like, look at it and it looks good enough.

  • And then some speech models do really good.

  • Everything's great.

  • [INAUDIBLE]

  • So the benefit of post-training integer quantization

  • is similar size reduction and a similar speedup

  • on the CPU for RNNs and convolutions.

  • Even better for convolutions because you

  • don't have this on-the-fly cost.

  • But the main thing this enables is all these integer

  • microcontrollers, all these integer accelerators can now--

  • we can run on them.

  • So here's the summary of the three tools.

  • And the flow should usually look like you try hybrid,

  • you see how you do on CPU.

  • If you want to go to an accelerator

  • or you want more in CPU, you do the post-training

  • where you just add some representative data set.

  • And then only as a last resort, once you see post-training not

  • getting good accuracy for you, try

  • quantization-aware training.

  • So similarly, we have tools for connection pruning, which

  • are during training techniques.

  • And so they have a similar API to the quantization-aware

  • training API.

  • And so the flow usually is

  • you build your Keras model, you apply the pruning

  • API, and you train.

  • And often, these pruning APIs are doing a lot less like--

  • they're very localized to your weights.

  • So they're not really tearing apart your graph

  • like quantization is.

  • And [INAUDIBLE] can attest to this, where the pruning was

  • a lot simpler implementation wise than quantization-aware

  • training, because for training-- for quantization,

  • you have to understand all the fusions of your backend,

  • whereas pruning is local to the weights.

  • And so the flow here is you train like normal,

  • and the resulting graph has many tensors

  • that have lots of zeros.

  • And right now, the flow is that you can compress your file

  • and it's smaller.

  • And in the future, we're working on TensorFlow Lite runtime

  • support for these sparse tensors and kernel support.

  • So additionally, you'll get out of the box size reduction

  • instead of having to do this manual compression,

  • and you'll get speed up--

  • AUDIENCE: For the sparsity, in the future,

  • when you say that it might be faster,

  • is it in the case of structured sparsity where you force

  • [INAUDIBLE] sparsity, or is it for arbitrary sparsity that it

  • might--

  • SUHARSH SIVAKUMAR: So it really--

  • yeah, so this is something where we're trying

  • to figure out two things.

  • Those particular questions for a given hardware, what do we

  • want, and how do we expose this in a way that

  • makes sense when all this-- there's

  • so much fragmentation for hardware and problems.

  • So for certain problems, if you do arbitrary sparsity,

  • you probably need like 99.9% sparsity

  • to get a speed up on a particular hardware.

  • And for CPUs, and particularly speech models,

  • we've already been doing structured sparsity

  • with certain block sizes like you're saying.

  • And so this training tool has the ability

  • to set your block size.

  • And right now, we're working on--

  • where we need to work on in the future, for a given

  • hardware, what is the standard block size you need for that.

  • And so yeah, you're absolutely right.

  • There's fragmentation [? too. ?] It's like,

  • will the problem allow this level of sparsity

  • that you desire, and is the hardware you

  • target going to support that?

  • So yeah, for CPU, usually [INAUDIBLE]..

  • So the API here is similar to the quantization API.

  • You provide parameters on your schedule

  • for how you want to prune.

  • And here, that final sparsity is an important number.

  • It's basically saying at the end of training,

  • how many values in all your weights do you want to be 0?
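
Roughly what that schedule looks like with the tensorflow_model_optimization pruning API (the step counts are illustrative and the exact arguments may differ across versions):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])

# Ramp sparsity up during training until 75% of each pruned weight tensor is zero.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.75,
    begin_step=0, end_step=10000)                     # step counts are illustrative
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=schedule)

# The masks are refreshed during training via this callback.
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
```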

  • So yeah, [INAUDIBLE].

  • The coverage we found is very--

  • it works on a lot of models.

  • It seems to be a very general technique.

  • And as I said before, it works really well with quantization

  • as well.

  • So here's a graph that's kind of a confusing graph.

  • But it's how your accuracy is affected on MobileNet.

  • This is an example based on how much pruning you do.

  • And what we noticed is that there is often a lot of pruning

  • you get for free, and then there's a sudden cliff.

  • So the goal here is to, for your problem,

  • to play with the parameters and figure out

  • where is that sudden cliff, or where do you

  • want to lie on this curve?

  • So here, we see around like 75-ish percent.

  • You're doing pretty good until then.

  • AUDIENCE: What technique was used

  • to actually do the pruning?

  • SUHARSH SIVAKUMAR: So here, we do pruning based

  • on the low magnitude value.

  • So there's a mask, and then you update that occasionally,

  • and the mask is updated based on which values of your tensor

  • are closer to 0.
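
A minimal NumPy sketch of that low-magnitude masking step (illustrative only; the real tooling updates the mask on a schedule during training):

```python
import numpy as np

def magnitude_prune_mask(weights, sparsity):
    """Mask that zeroes out the lowest-magnitude fraction of a weight tensor."""
    k = int(sparsity * weights.size)                      # how many values to drop
    threshold = np.sort(np.abs(weights).ravel())[k]
    return (np.abs(weights) >= threshold).astype(weights.dtype)

w = np.random.randn(128, 128).astype(np.float32)
mask = magnitude_prune_mask(w, sparsity=0.75)
w_pruned = w * mask                                       # ~75% of entries are now zero
```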

  • So more numbers-- works great.

  • Skip.

  • And then so in summary, quantization is hard

  • because it is problem specific, hardware specific,

  • and the tools have lots of trade-offs

  • depending on which problem with which hardware.

  • And then pruning, we're starting to get

  • into the space of accelerating pruning.

  • And right now, it's a great technique

  • for reducing model size, and we need

  • to explore how that's going to look for various hardware and

  • how we're going to expose this in a general way.

  • So otherwise, any questions I can answer?

  • AUDIENCE: So how does a CPU actually do the--

  • make [? its ?] multiplication given two inputs

  • with a min and max for each?

  • SUHARSH SIVAKUMAR: So the way it looks--

  • I don't know if I have anything to look at.

  • [INAUDIBLE]

  • Yeah, I'll try that.

  • AUDIENCE: It's like, the min and max are--

  • SUHARSH SIVAKUMAR: So I'll say it in words first.

  • If it doesn't make sense, I can try to find something.

  • So the way it actually looks at the--

  • works [? at ?] inference.

  • So let's ignore zero point for a moment,

  • because it just gets in the way.

  • So say we're just doing a matrix multiplication.

  • So your input has a certain range

  • which corresponds to a particular scale.

  • Your second input, your weight has a certain range which

  • corresponds to a certain scale.

  • So you have one scale, another scale, and then

  • your output has a third independent range

  • on the third scale.

  • So what we do is your int8 matrix multiplication actually

  • gets accumulated into int32 values.

  • So if you imagine that all those int32 values

  • in the accumulator, they have an implicit scale--

  • because you just multiply it--

  • of these two scales multiplied.

  • If you wanted to recover the float

  • value from these int32 values, you just

  • multiply by these two scales.

  • So that's not how it actually works, but I'm just

  • explaining the math.

  • And so then our goal is to eventually output int8 values

  • that lie on the output scale.

  • So what we do in practice is we want

  • to get from this int32 value that

  • has implicit scale of this scale and this scale--

  • s1 and s2-- and go to s3.

  • So we just multiply by s1 and s2 and divide by s3.

  • So we make a new scale that's those three values--

  • that fraction.

  • So that's how the-- so in practice, the inference just

  • looks like int8 times int8, int32,

  • do this one rescale, which is s1 times s2 over s3.

  • And then you're [INAUDIBLE] value, if that makes sense.

  • I could-- yeah.

  • AUDIENCE: So you don't have to do an integer division?

  • SUHARSH SIVAKUMAR: Yeah.

  • And so that rescale is a floating point value,

  • and we don't want to do that.

  • So we do decompose that into two integers, and sometimes a shift

  • depending if you're like-- sometimes,

  • your target only supports power of 2 scales

  • because it just wants to implement that as a shift.

  • So there's lots of-- that's a whole other thing, where

  • there's lots of ways to implement that

  • rescale [INAUDIBLE] trade-offs.

  • So what TF Lite does by default is we decompose it into two

  • integers, and do like a--

  • we almost emulate float.
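
A hedged sketch of that rescale: the combined real multiplier s1*s2/s3 is decomposed into a fixed-point integer multiplier and a shift so the whole path stays in integer arithmetic (this mirrors the idea rather than TF Lite's exact rounding rules):

```python
import math
import numpy as np

def quantize_multiplier(real_multiplier):
    """Decompose M in (0, 1) as q * 2**(shift - 31), with q a fixed-point int32."""
    mantissa, shift = math.frexp(real_multiplier)     # M = mantissa * 2**shift, mantissa in [0.5, 1)
    q = int(round(mantissa * (1 << 31)))              # 31-bit fixed-point multiplier
    return q, shift

s1, s2, s3 = 0.02, 0.005, 0.1                         # input, weight, output scales (made up)
acc = np.int64(12345)                                 # an int32 accumulator value
q_mult, shift = quantize_multiplier(s1 * s2 / s3)
# Integer-only rescale: multiply, shift back down; the result now lives on scale s3.
rescaled = (acc * q_mult) >> (31 - shift)
print(int(rescaled), acc * (s1 * s2 / s3))            # integer path vs. float reference
```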

  • AUDIENCE: [INAUDIBLE] being used in training.

  • Is that consistent with what--

  • I mean, [INAUDIBLE] in training--

  • could you describe it with something orthogonal?

  • SUHARSH SIVAKUMAR: Yeah.

  • There's a lot of techniques that we need to start including.

  • And so right now, these techniques

  • have been these kind of end-to-end

  • get-something-working type techniques.

  • Where first, there was no training,

  • so we went to quantization during training.

  • But more and more, like [? wesde ?]

  • I think you're talking about.

  • Like there's [? no such wesde, ?] where

  • the idea is if you're given a particular min-max,

  • what is the perfect range--

  • perfect distribution of values such as to decrease quantization

  • error?

  • And the answer is the uniform distribution.

  • So [? wesde ?] tries to do this by introducing loss

  • into your training.

  • And we-- these things all are compatible with when

  • you train on the float model, but they're not

  • offered out of the box.

  • Because we have noticed things-- in some of my experiments,

  • I noticed that [? wesde ?] only works

  • well for a particular model after you've trained for a bit.

  • We still don't have general knowledge

  • on when exactly to use it.

  • So we should be offering all of these, and we plan to

  • in this toolkit as choices for users.

  • But yeah, that's a great technique.

  • AUDIENCE: I'm sorry, [INAUDIBLE] asking [INAUDIBLE] question.

  • You mentioned in-training and post-training techniques,

  • and then also you can do it hybrid or you can do pure int.

  • So in this [? quadrant, ?] one kind of option was missing.

  • So you didn't-- you showed three examples,

  • but you implied that you wouldn't be doing in-training

  • quantization combined with the hybrid.

  • Why is that?

  • SUHARSH SIVAKUMAR: You're absolutely right.

  • And it's just that the tooling is not doing that right now.

  • But that's exactly the direction we want to go.

  • That use-- get some metrics on what error

  • the quantization is introducing, and use that to drive things like,

  • should we be doing one or the other?

  • Should we be doing 8 bits, or should we--

  • for this one tensor, does it make sense

  • to leave it in float, does it make sense

  • to bump it up to 16?

  • But that's absolutely right, where--

  • AUDIENCE: So there's nothing inherently wrong with it.

  • It's just another option [INAUDIBLE]??

  • SUHARSH SIVAKUMAR: Absolutely.

  • And for context, like what we've added

  • now is that if you have ops in your graph that

  • don't support quantization, we just leave them in float.

  • So we're already starting to get in the direction

  • of partial quantization, but that's exactly the direction.

  • And the piece that's kind of missing

  • is these two information hooks, where

  • one is what is quantization doing to your problem task--

  • like, your error for your actual problem.

  • We can get things, like signal-to-noise ratio,

  • but oftentimes that's not too representative of what

  • it's doing to your task problem.

  • So one thing we need is for this op,

  • what is it doing to the problem?

  • And then we can make decisions like this.

  • And the other thing is some pluggable specification

  • of hardware that says for this hardware,

  • does it even support hybrid, because then it's

  • not an option.

  • But yeah, that's exactly what we need to be working on.

  • AUDIENCE: Thank you.

  • [MUSIC PLAYING]
