TensorFlow model optimization: Quantization and pruning (TF World '19)

  • RAZIEL ALVEREZ: Hi, my name is Raziel.

  • I lead TensorFlow model optimization.

  • And today I will talk about our toolkit and, in particular,

  • about techniques around quantization

  • and neural connection pruning.

  • So first I'll introduce, what is model optimization?

  • What is our toolkit?

  • And then the reason about why we think it's important, why

  • we're investing in this area.

  • Then I'll cover the tools that we have available.

  • And at the end, I will give a quick overview of the roadmap for the short term and the longer term.

  • And hopefully at the end of the presentation,

  • we still have some minutes to go over Q&A.

  • So our toolkit implements techniques

  • that should allow you to optimize machine learning

  • models for deployment and execution.

  • We think this is important because machine learning is

  • everywhere, right.

  • It's a very important field, and we

  • think that there is a lot of room to make it more efficient.

  • And this has some implications, starting with economic ones-- you can make all these applications better in quality, or cheaper to execute.

  • But also we can enable some new models, and new deployments,

  • and new products that otherwise are not possible

  • even if you just tried to execute these machine learning

  • models on servers.

  • So currently machine learning runs either on the server

  • or on the Edge.

  • On the server, you may think that there

  • is a lot of capacity, that there's

  • a lot of compute and memories.

  • What is the benefit of optimizing these models?

  • Well, applications are still bound by latency.

  • There is still a very important metric

  • for a lot of applications, or you

  • want to improve the throughput-- how many tasks

  • can run on your server.

  • And these two are also directly correlated to money, right.

  • So everybody will want to save money,

  • and potentially, we're talking a lot of money.

  • Now on the Edge it is a little bit more obvious why we need optimization.

  • These are very resource-constrained environments, even for applications in general.

  • We need to deal with reduced memory and compute, and power consumption is typically an issue.

  • Bandwidth is too, both for downloading models from the Cloud and, even within the chips, for transferring parameters from memory to the processor, which can be a problem if the model is too large.

  • Plus, we have a wide variety of hardware,

  • more than in the server, and we need

  • to make sure that these models run

  • efficiently on all these different types of hardware.

  • So it follows that if we are optimizing these models, and we have better models, eventually it starts translating into new products that otherwise couldn't exist if we were just running these models on a server.

  • And these opportunities are larger than just

  • on smartphones.

  • Machine learning is trickling down into more environments.

  • We have machine learning models, for example,

  • that are used to detect failures in machinery in factories,

  • or we use it in self-driving cars.

  • We use it in the office to scan documents and try

  • to understand them.

  • And just to give you some numbers,

  • right-- like, the size of the smartphone market

  • is really a fraction of the potential for the Edge devices,

  • in general.

  • So basically those are the two reasons we want to make machine learning models efficient.

  • It's already very important for the servers,

  • but it is pretty crucial for embedded devices.

  • So we started this toolkit about a year ago.

  • We initially launched post-training quantization

  • with this hybrid type of quantization,

  • and I'll go in more detail later in the presentation.

  • Then earlier this year, we launched the API

  • for neural connection pruning, then

  • we created this specification of quantized operations,

  • integer quantized operations for TensorFlow Lite.

  • And we also launched post-training quantization targeting this specification.

  • More recently, we added support for reduced float precision, and hopefully soon we're going to be launching the quantization-aware training API and also adding support to TF Lite for sparse computation.

  • So now I'll go into these techniques and these tools

  • in a little bit more detail.

  • So let's start with quantization.

  • But first I think it's important we have at least a basic understanding of what quantization is, why it's hard, and why we are approaching our tools the way that we are.

  • So let's start with a simple example.

  • You know, matrix multiply, it's a basic operation

  • for machine learning models.

  • You have two matrices, two tensors A and B,

  • and then you do some multiplication accumulation,

  • and then you get a third tensor, C. So each tensor is just a bunch of values, and the first two produce the third.

  • Then, just as a little reminder about how matrix multiply works, each one of these entries of the tensor C is computed as multiplications and accumulations.

  • So if we look at one of them, and then

  • if we think how we're training these models, typically

  • in a higher precision--

  • let's say float 32--

  • then it follows that the operations

  • of the multiplications are float 32 in precision.

  • And then the product will be also a float 32, right.

  • And then the accumulation will be also float 32.

  • So this is fairly straightforward.

  • There is some loss in precision, but machine learning

  • is pretty good at dealing with it at least

  • at this level of precision.

  • So, no problem.

  • Now what does this have to do with quantization?

  • Well, let's go back to what our goals for quantization are.

  • We want to be able to address all these restrictions, and also we want to be able to deploy to as much hardware as possible.

  • So a common thing that we do is, let's reduce the precision

  • that we operate in.

  • Let's say, for example, go from the 32-bit floats

  • to 8-bit integers, and then let's operate,

  • let's say, entirely within integer operations.

  • And this will be good because then we are going from 32 bits

  • to 8 bits, so we reduce the memory.

  • You know, the models are four times smaller.

  • Then integer operations are typically faster to execute.

  • They also consume less power.

  • And then because the parameters and also the dynamic values-- the activations-- are smaller, we reduce bandwidth pressure.

  • It means that in the pipes in the chips,

  • there is more room for things to flow around, which

  • can also translate into faster compute and reduced power.

  • And then integer operations seem to be a fairly common

  • denominator across hardware.

  • CPUs, DSPs, different NPUs-- they all support integer operations.

  • So OK, we are going to reduce the precision,

  • so how do we convert the 32-bit float to 8-bit integers?

  • Well, right now we do something very simple.

  • So we have this linear mapping, where we say,

  • OK, we take the values from a tensor.

  • We compute the minimum and the maximum value,

  • and then based on that, we spread them evenly

  • on the 8-bit range.

  • This basically is very simple, right.
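
To make that mapping concrete, here is a minimal NumPy sketch of an asymmetric, per-tensor affine quantizer; the helper names are just for illustration, not the actual TensorFlow implementation:

```python
import numpy as np

def quantize_tensor(x, num_bits=8):
    """Minimal affine (asymmetric) per-tensor quantization sketch."""
    qmin, qmax = 0, 2 ** num_bits - 1               # e.g. 0..255 for 8 bits
    x_min = min(float(x.min()), 0.0)                # keep 0.0 exactly representable
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # guard against an all-zero tensor
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_tensor(q, scale, zero_point):
    # Approximate reconstruction of the original float values.
    return (q.astype(np.float32) - zero_point) * scale
```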

  • So is that all that we need to do?

  • Well, I wish.

  • It's not that simple, so let's go back to the example.

  • So we have the matrix multiply.

  • And now let's say that we quantized the values,

  • and they are already 8-bit integers.

  • So the operands and the multipliers are 8-bit integers.

  • And then the multiplication, the products, now you

  • need 16 bits to represent it, and then you

  • need to accumulate.

  • And you probably want 32 bits to accumulate on that, right.

  • So what is the problem?

  • Well, the problem is that now your output tensor C is full of 32-bit values.

  • And that is not great when you want to feed that into another matrix multiply that you really want to execute as 8-bit integers, because, as we already said, those are more resource efficient, right.

  • So what do you do?

  • So you scale them back down, right.

  • So we just sort of quantize them on the fly now back down

  • to 8-bit integers.

  • So then you can feed them to your next 8-bit matrix multiply, and now it's all good.
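
A rough sketch of that flow, assuming symmetric quantization and illustrative scale names (real TensorFlow Lite kernels do the rescaling with integer-only arithmetic, but the idea is the same):

```python
import numpy as np

def quantized_matmul(a_q, b_q, a_scale, b_scale, out_scale):
    """int8 x int8 -> int32 accumulate, then requantize back down to int8.
    Assumes symmetric quantization (zero points of 0) to keep the sketch short."""
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)   # 32-bit accumulator
    # Each accumulated value is in units of (a_scale * b_scale); rescale it
    # into the output tensor's own 8-bit grid so the next op stays in int8.
    requantized = np.round(acc * (a_scale * b_scale) / out_scale)
    return np.clip(requantized, -128, 127).astype(np.int8)
```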

  • But what are the implications of all this process?

  • Well, so it means that we're changing the static values,

  • the parameters, the weights.

  • We are changing also the dynamic values,

  • the activations, because we're, you know,

  • scaling them-- quantizing them on the fly.

  • And it also means that we are changing

  • the computation, right.

  • In this case, it's a very simple example.

  • We just added a scaling operation,

  • but it can be a bit more involved.

  • So you could say, OK, that doesn't seem that hard, right.

  • Like, we just added another scaling operation, and it's easy, right.

  • Well, some math is a little bit more complicated than that.

  • This example actually is from layer normalization of an LSTM.

  • And this one, aside from looking a bit more complex,

  • it's an example of-- where if you just apply these naive

  • rules of operate, rescale, operate, rescale--

  • you actually end up in sort of a numeric black hole

  • where the scales cancel each other,

  • and basically just things don't work if you just

  • go naively about it.

  • And then it's more complicated because in your decisions

  • about how you're going to represent this computation

  • in integer form--

  • you know, quantization, we want to be efficient,

  • so lower precision is good.

  • But we also want to be accurate, which

  • means lower precision is bad.

  • So there are a lot of trade-offs that you have to make.

  • Then further complicating things,

  • we have heterogeneous hardware.

  • There are all different types of hardware with different capabilities, very different operations that each hardware supports, and also different preferences.

  • Some hardware is, you know, better at executing operations in reduced floats, you know, with different bit widths or different restrictions.

  • And we want to account for all these things when we are creating our quantization recipe, or quantized program.

  • Then there is the fact that machine learning

  • is hard to interpret.

  • We don't understand it.

  • Like, we don't understand how it works--

  • not to the level where we can have good proofs that the transformation we're doing to this model, to this program, will actually work and not result in a catastrophic error, right.

  • You don't want to take a model, you quantize it,

  • and then this model suddenly starts

  • giving you some weird results.

  • So it makes it much more complicated

  • to define these transformations.

  • And then finally, I will say that this

  • is a little more complicated, because the model is not

  • enough.

  • The program is not enough.

  • Depending on how the quantization is defined,

  • you might also need some extra data.

  • So in the example of the matrix multiply, we needed to compute the minimum and maximum values of the dynamic activations, and that can only be done if we run inferences through the model, which means that you need to provide some representative data.

  • So basically, it's just another hurdle

  • that you have to account for when quantizing these programs.

  • So, you know, basically, this means

  • that, when we talk about quantization,

  • we're really talking about rewriting,

  • transforming this machine learning program, into an approximate representation based on the operations

  • that you have available.

  • So now how are we addressing this in our toolkit?

  • Well, the first thing that we decided to do

  • was to try to scope down the problem and say,

  • OK, we're going to define the specifications

  • for common operations--

  • like, in this case, a diagram for convolution-- so that they have a well-defined quantization behavior.

  • So we know that now, with this information,

  • this low-level information that is relevant to quantization,

  • then hardware can target those specifications.

  • And then our tools can target that specification,

  • and then we can all work at this level.

  • And then we also get the other benefit

  • that, from the user point of view, you can quantize a model,

  • and then this model can run in different hardware

  • without any change.

  • So right now we support three different quantization types.

  • I'm including, here, reduced float as a quantization type.

  • It's just a much simpler thing where we just

  • typically go from float 32 to float 16 parameters

  • and computations, so that's pretty straightforward.

  • The next one is our hybrid quantization, which basically uses 8-bit integers for the parameters; biases and activations we leave as 32-bit floats.

  • And then we try to be as smart as possible

  • for how we execute this program.

  • So the goal being that, for example,

  • heavy operations like big matrix multiplies are executed in the integer domain, and then we

  • use floating point for things like activation functions.

  • So it's a nice trade-off between accuracy, and performance,

  • and optimizations.

  • Then the third one is integer quantization.

  • This means everything is integers.

  • All the parameters are integers, and all the operations

  • are integers.

  • This is obviously the more complicated one.

  • So the benefits of the reduced float are--

  • well, your models are now half size.

  • And then depending on the hardware support,

  • you may get some speed-ups, and then the accuracy losses

  • tend to be very minimal.

  • It pretty much always works.

  • I haven't seen, myself at least, an actual model trained in float 32 that doesn't work in float 16.

  • Hybrid quantization then pushes it further.

  • You now get a 4x reduction in size.

  • And then depending on the computations of the operations

  • that you're using, you may get different performance

  • improvements.

  • It tends to be larger for fully-connected models or RNNs.

  • And then the third one is the integer-only quantization,

  • so it has the same benefits as hybrid in terms of memory size.

  • But it's faster to execute, and it

  • has a great advantage that it has more hardware coverage.

  • So for example, typical NPUs-- some of them are integer-only, like our Edge TPU.

  • Now let's talk about the tools to actually quantize

  • the models based on those quantization types.

  • So we have two types of tools--

  • one that works post training.

  • So it works directly on the trained model.

  • And the other one that is a work-in-progress

  • is during training.

  • So let's talk about the post training.

  • The process is very simple.

  • You basically assume that you just have a trained model.

  • It doesn't really matter how you trained it-- you just have a TensorFlow model.

  • Then currently, via the TensorFlow Lite converter,

  • you just convert this model to TensorFlow Lite

  • and quantize it on the fly.

  • And then you just have a model that you

  • can execute on whatever hardware is supported

  • in that quantization type.

  • So now let's look at the specific quantization types.

  • So the first one is reduced float.

  • You just add a couple of flags: you set the optimizations to the default, and then the type that you're targeting is float 16.
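
With the current TF Lite converter API that looks roughly like this; saved_model_dir is a placeholder for wherever your trained model lives:

```python
import tensorflow as tf

# Post-training float16 quantization sketch; saved_model_dir is a placeholder path.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16_model = converter.convert()

with open("model_fp16.tflite", "wb") as f:
    f.write(tflite_fp16_model)
```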

  • And then basically, this will take

  • care of changing all the parameters and the computation,

  • and again, depending on the hardware that you're running this model on, you might get a speed-up right now.

  • For example, GPUs support float 16 natively,

  • so you might get some speed-up there

  • either because of the computation

  • or even just because the bandwidth in your chip

  • will be reduced.

  • Like I said, benefits-- the model size goes down to half.

  • And, you know, the accuracy drop is very minimal.

  • I will say within the noise.

  • Then the next one is our hybrid quantization.

  • So again, this is very easy.

  • You just set the flag-- this is the default for the TensorFlow Lite converter. You set the optimization to default, and then again, it will make sure to quantize all the parameters.
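
Roughly, the only flag needed is the optimizations default (again, saved_model_dir is a placeholder):

```python
import tensorflow as tf

# Post-training hybrid (dynamic-range) quantization sketch: weights become int8,
# biases and activations stay as floats.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_hybrid_model = converter.convert()
```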

  • And then operations that don't yet have a specification for their quantized form will be kept in their original precision.

  • And then you will get some speed-ups,

  • and you will be able to execute on whatever hardware complies with the specification.

  • So typically, this one works pretty well for CPUs.

  • And again, benefits-- 4x compression for the models.

  • And then you get some speed-ups.

  • All these are convolution-based models,

  • so that's why the speed-up is not as big.

  • And I will say these are one-year-old numbers,

  • so probably right now it's faster.

  • And the same for accuracy, accuracy is pretty good.

  • And actually we're working on some changes

  • for convolution models.

  • It will even be a bit more accurate soon.

  • Then the third one is the integer quantization.

  • So this one is the one that is a bit more complex, because now

  • you need to provide some data.

  • So you say, OK, I want to optimize the model, but I want to use the integer quantization.

  • So now you need to provide some data.

  • And by data, I mean unlabeled samples of what your neural network will typically see in reality.

  • So if it's an image processing model,

  • you need to feed some pre-processed images.

  • And we're not talking about a lot of data.

  • For the results that I'm going to show next,

  • we're just talking about a hundred samples.

  • That works pretty well.

  • So it is a bit more complicated, but it's not very complicated.
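
A sketch of what that looks like with the converter; representative_images and saved_model_dir are placeholders:

```python
import tensorflow as tf

def representative_dataset():
    # representative_images is a placeholder: ~100 unlabeled, preprocessed
    # samples of what the model will typically see in reality.
    for sample in representative_images[:100]:
        yield [tf.cast(sample[None, ...], tf.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Optionally insist that every op has an integer implementation:
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_int8_model = converter.convert()
```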

  • So these are some results from post training quantization

  • across different models.

  • As you see for the majority of models,

  • the loss is not that big with respect to the full-precision trained baseline.

  • The only one I will say is the MobileSSD model.

  • So that has a bit more meaningful drop,

  • but again, a variety of models work pretty well

  • with post training quantization.

  • Now I'll talk about during training,

  • because like I showed in the previous results, you know,

  • there are still some models that will benefit from doing quantization-aware training.

  • And by quantization-aware training, we mean we try to emulate the quantization

  • operations, the quantization losses,

  • during the forward pass of the neural network,

  • with the hope that the parameters will

  • be tuned to account for that.

  • So the process for doing quantization-aware training using our API is a little bit more involved.

  • We are, again, trying to make it very simple.

  • So we built this API in Keras, again,

  • to make it very easy to use.

  • So basically, we assume that you already have a Keras model,

  • and then you just need to call our API

  • to apply the quantization.

  • And this might change a little bit,

  • but it will look something like this.

  • So you just have a model that you already

  • built using Keras layers.

  • And why not?

  • And then the only thing that you need to do

  • is call our API on your model, and then you get a model that is rewritten to have all the emulation of quantization.

  • And then you just call your fit function, and that's it.

  • So then you just train your model as usual.
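
With the TensorFlow Model Optimization Keras API, the flow looks roughly like this (the exact call may still change, as noted above; model, train_images, and train_labels are placeholders):

```python
import tensorflow_model_optimization as tfmot

# 'model' is assumed to be an already-built Keras model.
qat_model = tfmot.quantization.keras.quantize_model(model)

# Train as usual; quantization losses are emulated in the forward pass.
qat_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
qat_model.fit(train_images, train_labels, epochs=5, validation_split=0.1)
```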

  • And then you can go through the TensorFlow Lite converter,

  • and then it will take this model that

  • was trained with quantization.

  • It will have all the data necessary to quantize it,

  • and then it will produce a quantized model

  • that, just like the post-training model,

  • you will be able to execute in different hardware.

  • These are some preliminary numbers from quantization-aware training.

  • As you can see, the delta is a little bit better than post-training quantization; it's not a very big difference except for the MobileSSD.

  • So before it was 4% for post-training quantization.

  • In this case, it's 2.9%.

  • So quantization-aware training is still a useful tool.

  • That's why we're building it.

  • Now you may wonder-- those were a lot of quantization types and tools, so which one should I use?

  • So my recommendation is, if you are just starting, start with the reduced float.

  • That's the first one to try.

  • It is just very easy to use.

  • It doesn't require any data.

  • The accuracy will probably be the same.

  • And then latency, depending on the hardware,

  • you might get some benefits--

  • reduced latency.

  • And then compatibility-- basically,

  • everywhere you can execute floating point operations,

  • you will be able to use it.

  • The next thing to try will be the hybrid quantization.

  • Again, there is no data requirements.

  • The accuracy will be still good, probably not as

  • good as float 16 in some cases, but it's still good.

  • It will be faster than the reduced float.

  • And basically, compatibility will be everywhere

  • that you have support for float and integer operations.

  • Then the third one to try is the integer quantization

  • with the post-training tool.

  • This one is a bit more complicated

  • just because you need to provide a little bit of data.

  • The accuracy will be worse or the same as hybrid,

  • but the latency of this will be the fastest.

  • And then it will also give you more hardware coverage.

  • And then the last thing to try will

  • be the integer quantization with quantization during training.

  • And basically, this is good.

  • This will be a little bit more involved, because now you're

  • doing training.

  • You're supposed to have now a training setup, a training

  • script.

  • But the accuracy will be better than doing just the post-training version,

  • and again, you get the benefits of being

  • the fastest one and the one with more hardware coverage.

  • So that was quantization.

  • And again, all these tools, we're

  • trying to make it very easy to use,

  • so it will be great if you try them out

  • and give us some feedback.

  • Then, connection pruning.

  • So what is neural connection pruning?

  • Well, the way that we have implemented it so far,

  • it is a training-time technique that, during the training process, will start dropping connections from the neural network.

  • And then these connections will--

  • the dropped connections basically just become

  • zeros in the tensors that you're training,

  • and then that means that you end up with sparse tensors.

  • Now sparse tensors are great, because you

  • can compress them and potentially

  • execute them faster.

  • So this is an example.

  • This is a tensor as it starts out, randomly initialized. Dark means values that are non-zero,

  • and white means values that are zero.

  • And then as the training progresses,

  • then it starts becoming sparser and sparser.

  • And if you see this tensor, it's basically

  • removing most of the parameters there.

  • The process for this API is very similar to the quantization-aware training API.

  • Again, we're trying to bring some consistency to our APIs.

  • So it's built on Keras, so it assumes

  • that you have a model that is trainable in Keras.

  • And then you're going to call our API

  • to apply the pruning logic.

  • And this again, we are trying to make this as simple

  • as possible.

  • So the only thing that you need to define

  • is a pruning schedule-- basically,

  • when you want to start dropping these connections,

  • and until when, and how fast, how aggressive

  • you want the pruning to be.

  • And then you just call our prune function,

  • which again will modify your graph to add all the pruning

  • operations internally.

  • And then you just call your fit function,

  • and you train as usual.
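
A sketch with the Keras pruning API; the schedule values here are arbitrary placeholders, not recommendations:

```python
import tensorflow_model_optimization as tfmot

# Ramp sparsity from 0% up to 80% between two training steps; the schedule
# numbers and the Keras 'model' are placeholders.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.80,
    begin_step=1000, end_step=5000)

pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
# UpdatePruningStep keeps the pruning logic in sync with the training step.
pruned_model.fit(train_images, train_labels, epochs=5,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
```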

  • So basically, you train as usual, and then once you've trained, you have two options now-- or soon, you will have two options.

  • You can just take the same model,

  • the TensorFlow saved model.

  • You can just compress it, gzip, and then the model

  • will be smaller.
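
For the first option, something like this works today (strip_pruning removes the training-time pruning wrappers before you save and compress; file names are placeholders):

```python
import gzip
import tensorflow_model_optimization as tfmot

# Remove the training-only pruning wrappers, leaving plain (now sparse) weights.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
final_model.save("pruned_model.h5")

# The zeroed-out weights compress very well even with a generic codec like gzip.
with open("pruned_model.h5", "rb") as f_in, gzip.open("pruned_model.h5.gz", "wb") as f_out:
    f_out.write(f_in.read())
```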

  • And soon, you will be able to convert it via TensorFlow Lite,

  • and you will get also a reduction in size

  • and potentially some speed-ups depending

  • on what prune configuration you're using

  • and the hardware that you're targeting.

  • So this should be done pretty soon.

  • Now what are the benefits of pruning?

  • We've tried it in a lot of tasks,

  • like really a lot of tasks-- on image, speech, audio.

  • And it worked pretty well.

  • A lot of techniques require hyperparameter tuning and, you know, careful restarting of your models, and things like that.

  • But pruning has worked pretty well

  • without a lot of babysitting.

  • Then it has potential for speed-ups

  • depending on hardware support.

  • And we also have pretty good results.

  • Like, we can make a lot of the parameters basically go away.

  • We see 50% to 90% with negligible accuracy loss.

  • And the other great thing is that it works well also

  • with quantization.

  • So a typical setup that we've tried is training with pruning, and then we use post-training quantization.

  • And basically, the accuracy is pretty good,

  • and you get the compound benefits of all techniques.
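
That combined recipe, sketched end to end with the placeholder names from the earlier snippets:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# 1. Train with pruning as above, then strip the pruning wrappers.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)

# 2. Apply post-training quantization on top of the sparse model.
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
sparse_quantized_model = converter.convert()
```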

  • These are some results, older now, that we had when we launched this.

  • So this is InceptionV3.

  • We see we can get all the way almost to 90%

  • sparsity with relatively small accuracy losses.

  • And the other--

  • GNMT's neural machine translation, where again,

  • we can take it to almost 90% pruning and also small accuracy

  • losses.

  • And we've done this, for example, for speech recognition.

  • We actually had, recently, the Google Pixel event,

  • where the speech recognition models

  • used pruning and quantization and were

  • able to have a model with server-side quality running

  • on a phone, which is pretty good.

  • OK, so now I'll finally cover, really quick, our roadmap.

  • Like I mentioned, quantization-- we're

  • working on a quantization training API,

  • so that should be ready soon.

  • And we are also working on our specs

  • for quantizing RNNs, which are typically trickier to quantize,

  • like LSTMs.

  • Then I didn't include it there, but we're

  • making some improvements to the hybrid quantization

  • to be more accurate, particularly

  • for convolution layers.

  • And then for sparsity, we're adding support

  • for sparse computation in TensorFlow Lite runtime.

  • Longer term, I don't know if you have heard about MLIR,

  • but it's a state-of-the-art compiler infrastructure, and this is particularly interesting to us

  • because it's a better way for us to write these transformations.

  • And at the end, like I said at the beginning of the talk,

  • we're taking a model.

  • We're transforming one program into another representation

  • of that program.

  • And some of the things that we want to enable

  • is better targeted hardware, so our specifications

  • are great because users can target

  • our specification and execute on different hardware.

  • But some users just want to [INAUDIBLE] hardware and get

  • the best out of it.

  • So we're hoping that, with the new infrastructure that we're

  • building on top of MLIR, it should be possible.

  • And finally, I really just want to encourage you to try it

  • out and give us feedback--

  • what techniques you would like to see.

  • You know, research-wise, there are techniques popping up all over the place.

  • And a lot of the work that we have to go through

  • is culling what's useful and what's not-- what is general

  • and what is very specific.

  • So we would love to hear your feedback about that

  • and also about the tools that we already have in the toolkit.

  • We're trying to make them as easy as possible to use.

  • We know that we still have a long way to go,

  • but any feedback that you can provide

  • will be really, really appreciated.

  • And I think there is a little bit of time for questions

  • if any of you have questions.

  • [CLAPPING]

  • Thanks.

  • AUDIENCE: Hi.

  • I have a question regarding the [INAUDIBLE]..

  • Hi.

  • Thank you for the presentation.

  • I have a question regarding the training

  • with integer quantizations.

  • In the pipeline, is that going to be true quantization

  • during training?

  • RAZIEL ALVEREZ: No.

  • So right now-- by true, you mean that you expect that all the operations happen in the integer domain?

  • AUDIENCE: Yes.

  • RAZIEL ALVEREZ: Not right now.

  • That's something I really want enabled,

  • because I want to make training faster as well.

  • But right now, the way that we are targeting is--

  • I don't know if you're familiar with TensorFlow APIs,

  • but we have this low-level API, unfortunately called

  • fake quantization, that basically just emulates

  • these losses.

  • And that one is still-- basically,

  • what we do there is we quantize parameters,

  • and then we de-quantize them, and then

  • we do the float operation.

  • So that's what we're using right now.

  • But yeah, longer term, we want to do true integer forward

  • passes.

  • AUDIENCE: Thank you.

  • AUDIENCE: Hi.

  • [INAUDIBLE]

  • Oh, I had just one question.

  • So after you do the quantization,

  • is there a way that you can also visualize the finished quantized model?

  • Yeah, that was one question, and I had another question.

  • Let me think about it.

  • But is there a way that you can also--

  • Oh, the other question was, what sort of tools

  • are you going to provide as far as to sort of do

  • model correctness and--

  • I mean, at least evaluate, you know,

  • whether this quantized model is sort of functionally

  • correct in a sense?

  • RAZIEL ALVEREZ: Yes.

  • Visualization, again, it depends where.

  • But for TensorFlow Lite, you have a visualizer,

  • so you can see the quantized model.

  • I don't know if it will give you a lot of information, depending

  • what you're looking for.

  • We also want to make our tooling a bit better, because perhaps,

  • for whatever reason, you want to get old research in and start

  • looking at the activations, and how they change, and all that.

  • AUDIENCE: Sure, yeah.

  • There's like inserted ops and so forth.

  • RAZIEL ALVEREZ: Yeah.

  • AUDIENCE: [INAUDIBLE]

  • RAZIEL ALVEREZ: So for sure with the TF Lite visualizer,

  • you can see how the graph changes.

  • So the second question about correctness, correctness

  • is really tricky.

  • Because in my experience, the only thing that really works

  • is to really evaluate on the real data

  • that you care to run your model on.

  • AUDIENCE: Yeah, that's right.

  • RAZIEL ALVEREZ: You know, like, we

  • tried to do things like ultra norms to approximate--

  • OK, versus the full precision one versus the quantized one.

  • And then it gives you a sense of maybe some really catastrophic

  • numerical errors, but otherwise, it's really just a guess,

  • right.

  • AUDIENCE: That's right.

  • RAZIEL ALVEREZ: Particularly, depending on the output layers,

  • you know, categories are easier to quantize, because, you know,

  • the error is not very meaningful as long

  • as you get the right category.

  • Regressions are much harder because now you really care

  • about the actual values.

  • Yeah, it's an open problem.

  • AUDIENCE: Yeah, it's a tough problem.

  • Thank you.

  • AUDIENCE: I have a question about the results

  • from the GNMT training with induced sparsity.

  • I was wondering if you had any insights on why

  • the training with 80% sparsity would perform better

  • than the original version?

  • Like, if you looked at the results.

  • RAZIEL ALVEREZ: You know, the hand-waving thing,

  • that we always say in these cases,

  • is some regularization happens.

  • [LAUGHTER]

  • Yeah.

  • And you know, I've seen the same with some quantized models.

  • I've never really had the chance to sit down and try to understand what the reasons are for all this.

  • Sometimes it's just because it's within the noise, right?

  • It all depends on your evaluation set, right.

  • If it's really not that big or not that meaningful,

  • then these jumps are all possible.

  • Like, I've seen some models where, oh, it looks great

  • after you quantize it.

  • Then you throw in a new data set, say from speech

  • recognition and noisier utterances,

  • and then you clearly see the difference

  • between one and the other.

  • So a lot of it can be just noise.

  • AUDIENCE: Hi.

  • You mentioned explainability.

  • And a technique could be like saliency maps.

  • Do you have any insights on how these techniques affect

  • the ability to calculate the gradients to calculate

  • the saliency maps, for example?

  • RAZIEL ALVEREZ: You know, like, that's something

  • that we want to invest more, and we haven't had that much time

  • to do it.

  • And I would love for the researchers who are trying to understand neural networks to get more excited about understanding neural networks that have been approximated,

  • but so far, I haven't gotten any luck

  • trying to get the people on that side excited about it.

  • But yeah, I really don't have any meaningful thing

  • to say because I haven't run many experiments over on it.

  • AUDIENCE: Thank you.

  • RAZIEL ALVEREZ: [INAUDIBLE].

  • AUDIENCE: Hey.

  • So what is the best way to handle

  • fragmentation of hardware?

  • So like, quantization depends on the target hardware.

  • And more often than not, mobile phones like Android,

  • you have so much [INAUDIBLE] hardware,

  • so what are the best practices there?

  • RAZIEL ALVEREZ: So one way that we

  • tried to do it was again with these specifications.

  • And like, I don't know to what extent

  • it makes our hardware partners happy, because we would like

  • to be able to target their hardware in the most

  • precise and efficient way.

  • But that's one way that we try to address it.

  • You know, with our knowledge of what hardware is there

  • and what is supported, we tried to create these specifications

  • that tried to accommodate for everybody, which again, is good

  • and at the same time is bad.

  • Then longer term, again, I don't want

  • to say too much, because I really don't have

  • a very concrete plan to share.

  • But part of the way we're building

  • with the MLIR infrastructure is we

  • want to be able to better target that hardware-- to better

  • partner with hardware vendors to understand

  • what are their hardware capabilities

  • and better create these transformations that

  • target that hardware.

  • But we were really trying to make it much better.

  • AUDIENCE: So for now, does it mean, like,

  • you go with the lowest common denominator

  • to maybe like a [INAUDIBLE]?

  • Like, imagine the Android app that you

  • have to apply in a lot of things too?

  • RAZIEL ALVEREZ: And that's why we have, like,

  • all these different quantization types.

  • Like, we have three types, right.

  • And soon, hopefully, we'll be able to even just mix and match

  • those different types, because at the end of the day,

  • it's a very arbitrary boundary.

  • Then we say, oh, this is all integer quantized,

  • and this one is hybrid.

  • And the reality is we should be able to take advantage

  • of mixing and matching up precisions

  • to get something better.

  • Thank you.

  • AUDIENCE: I have a question about pruning.

  • As a general rule in layers, operations

  • are converted to matrix multiply because of their efficiency.

  • With pruning, you're now passing in individual multiply

  • operations one by one.

  • There must be some crossover point

  • at which you need to prune by 10%, 15%, 20% before you're

  • crossing over and actually get an improvement.

  • Thoughts on where that is?

  • RAZIEL ALVEREZ: And I don't know if this

  • is exactly what you're asking.

  • So for example, our pruning API lets you specify what the pruning structure is.

  • So for example, we know that for CPUs [INAUDIBLE]

  • the instructions will typically have registers

  • that can accommodate 16 values.

  • So we know that if we want to speed up on CPU,

  • we expect you to set the setting to say,

  • oh, I want to prune in blocks of, say, 1 by 16.

  • And that's how we can get the speed-ups on CPU, for example.
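
In the Keras pruning API, that block structure is expressed through the block_size argument; the sparsity target and names here are just for illustration:

```python
import tensorflow_model_optimization as tfmot

# Prune in 1x16 blocks so the zeros line up with register-width loads on CPU;
# the 75% target and the Keras 'model' are placeholders.
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model,
    pruning_schedule=tfmot.sparsity.keras.ConstantSparsity(0.75, begin_step=0),
    block_size=(1, 16),
    block_pooling_type="AVG")
```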

  • And unfortunately right now, probably it's

  • going to be hardware dependent, but that's one thing

  • that you can do right now.
