
  • RAZIEL ALVEREZ: Hi, my name is Raziel.

  • I lead TensorFlow model optimization.

  • And today I will talk about our toolkit and, in particular,

  • about techniques around quantization

  • and neural connection pruning.

  • So first I'll introduce, what is model optimization?

  • What is our toolkit?

  • And then I'll explain why we think it's important, why

  • we're investing in this area.

  • Then I'll cover the tools that we have available.

  • And at the end, I will give a quick overview

  • about the roadmap for the short term and the longer term.

  • And hopefully at the end of the presentation,

  • we'll still have some minutes to go over Q&A.

  • So our toolkit implements techniques

  • that should allow you to optimize machine learning

  • models for deployment and execution.

  • We think this is important because machine learning is

  • everywhere, right.

  • It's a very important field, and we

  • think that there is a lot of room to make it more efficient.

  • And this has some implications, both economic -- like, you can make all these applications much better in quality, or cheaper to execute.

  • But also we can enable some new models, and new deployments,

  • and new products that otherwise are not possible

  • even if you just tried to execute these machine learning

  • models on servers.

  • So currently machine learning runs either on the server

  • or on the Edge.

  • On the server, you may think that there

  • is a lot of capacity, that there's

  • a lot of compute and memory.

  • What is the benefit of optimizing these models?

  • Well, applications are still bound by latency.

  • That is still a very important metric

  • for a lot of applications, or you

  • want to improve the throughput-- how many tasks

  • can run on your server.

  • And these two are also directly correlated to money, right.

  • So everybody will want to save money,

  • and potentially, we're talking a lot of money.

  • Now on the Edge is a little bit more obvious

  • why we need optimization.

  • These are a very resource-constrained

  • environment even if you're talking

  • about applications in general.

  • We need to deal with reduced memory and compute; power consumption is typically an issue; and bandwidth,

  • both for downloading models from the Cloud, and even within the chips, for transferring parameters

  • from memory to the processor -- this could be a problem if the model is too large.

  • Plus, we have a wide variety of hardware,

  • more than in the server, and we need

  • to make sure that these models run

  • efficiently on all these different types of hardware.

  • So it follows that if we are optimizing these models,

  • and we have better models, eventually,

  • it starts translating into enabling new products that otherwise

  • couldn't exist if we were just running

  • these models on a server.

  • And these opportunities are larger than just

  • on smartphones.

  • Machine learning, it is trickling down

  • into more environments.

  • We have machine learning models, for example,

  • that are used to detect failures in machinery in factories,

  • or we use it in self-driving cars.

  • We use it in the office to scan documents and try

  • to understand them.

  • And just to give you some numbers,

  • right-- like, the size of the smartphone market

  • is really a fraction of the potential for the Edge devices,

  • in general.

  • So basically those are the two reasons we want to make machine learning models efficient.

  • It's already very important for the servers,

  • but it is pretty crucial for embedded devices.

  • So we started this toolkit about a year ago.

  • We initially launched post-training quantization

  • with this hybrid type of quantization,

  • and I'll go in more detail later in the presentation.

  • Then earlier this year, we launched the API

  • for neural connection pruning, then

  • we created this specification of quantized operations,

  • integer quantized operations for TensorFlow Lite.

  • And we launched also post-training quantizations

  • for targeting this specification.

  • More recently, we added support for reduced float precision, and hopefully soon we're going to be launching the quantization-aware training API and also adding support to TF Lite for sparse computation.

  • So now I'll go into these techniques and these tools

  • in a little bit more detail.

  • So let's start with quantization.

  • But first I think it's important that we have at least a basic understanding of what quantization is, why it's hard, and why we are approaching our tools the way that we are.

  • So let's start with a simple example.

  • You know, matrix multiply, it's a basic operation

  • for machine learning models.

  • You have two matrices, two tensors A and B,

  • and then you do some multiplication accumulation,

  • and then you get a third tensor, C. So each tensor is just a bunch of values that produce the third tensor.

  • Then just a little reminder about how matrix

  • multiply works, each one of these results of the tensor C

  • are computed as multiplications and accumulations.

  • So if we look at one of them, and then

  • if we think how we're training these models, typically

  • in a higher precision--

  • let's say float 32--

  • then it follows that the operations

  • of the multiplications are float 32 in precision.

  • And then the product will be also a float 32, right.

  • And then the accumulation will be also float 32.

  • So this is fairly straightforward.

  • There is some loss in precision, but machine learning

  • is pretty good at dealing with it at least

  • at this level of precision.

  • So, no problem.

  • Now what does this have to do with quantization?

  • Well, let's go back to what are our goals for quantization.

  • We want to be able to address all these restrictions,

  • and also we want to be able to deploy

  • to as much hardware as possible.

  • So a common thing that we do is, let's reduce the precision

  • that we operate in.

  • Let's say, for example, go from the 32-bit floats

  • to 8-bit integers, and then let's operate,

  • let's say, entirely within integer operations.

  • And this will be good because then we are going from 32 bits

  • to 8 bits, so we reduce the memory.

  • You know, the models are four times smaller.

  • Then integer operations are typically faster to execute.

  • They also consume less power.

  • And then because the parameters and also the dynamic values -- the activations -- are smaller, we reduce bandwidth pressure.

  • It means that in the pipes in the chips,

  • there is more room for things to flow around, which

  • can also translate into faster compute and reduced power.

  • And then integer operations seem to be a fairly common

  • denominator across hardware.

  • CPUs, DSPs, different NPUs, they support the integer operations.

  • So OK, we are going to reduce the precision,

  • so how do we convert the 32-bit float to 8-bit integers?

  • Well, right now we do something very simple.

  • So we have this linear mapping, where we say,

  • OK, we take the values from a tensor.

  • We compute the minimum and the maximum value,

  • and then based on that, we spread them evenly

  • on the 8-bit range.
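
To make that mapping concrete, here is a minimal NumPy sketch of the min/max linear quantization described above; the function names are illustrative and the handling of rounding and edge cases is simplified compared with the real TensorFlow Lite implementation.

```python
import numpy as np

def quantize_tensor(x, num_bits=8):
    """Linearly map float values onto the unsigned range [0, 2**num_bits - 1]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # guard against constant tensors
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_tensor(q, scale, zero_point):
    """Map the 8-bit values back to an approximation of the original floats."""
    return (q.astype(np.float32) - zero_point) * scale
```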

  • This basically is very simple, right.

  • So is that all that we need to do?

  • Well, I wish.

  • It's not that simple, so let's go back to the example.

  • So we have the matrix multiply.

  • And now let's say that we quantized the values,

  • and they are already 8-bit integers.

  • So the operands and the multipliers are 8-bit integers.

  • And then for the multiplication, the product now needs 16 bits to represent it, and then you need to accumulate.

  • And you probably want 32 bits to accumulate that, right.

  • So what is the problem?

  • Well, the problem is that now your output tensor C is full of 32-bit values.

  • And that is not great when you want

  • to fit that into another matrix multiply,

  • that you really want to execute as 8-bit integers,

  • because, as we already said, those are more resource efficient, right.

  • So what do you do?

  • So you scale them back down, right.

  • So we just sort of quantize them on the fly now back down

  • to 8-bit integers.

  • So then you can feed them to your 8-bit matrix multiplier, and now it's all good.
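
As a rough illustration of that flow -- 8-bit operands, 32-bit accumulation, then rescaling back down -- here is a hedged NumPy sketch; it assumes symmetric quantization with zero points of 0, and the real TensorFlow Lite kernels handle zero points and rounding more carefully.

```python
import numpy as np

def quantized_matmul(a_q, b_q, a_scale, b_scale, out_scale):
    """int8 x int8 matmul: accumulate in int32, then requantize back to int8."""
    # 8-bit x 8-bit products need 16 bits; summing them needs a 32-bit accumulator.
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)
    # The accumulator lives at scale a_scale * b_scale; rescale it on the fly
    # to the 8-bit scale expected by the next operation.
    rescaled = np.round(acc * (a_scale * b_scale) / out_scale)
    return np.clip(rescaled, -128, 127).astype(np.int8)
```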

  • But what are the implications of all this process?

  • Well, so it means that we're changing the static values,

  • the parameters, the weights.

  • We are changing also the dynamic values,

  • the activations, because we're, you know,

  • scaling them-- quantizing them on the fly.

  • And it also means that we are changing

  • the computation, right.

  • In this case, it's a very simple example.

  • We just added a scaling operation,

  • but it can be a bit more involved.

  • So you could say, OK, that doesn't seem that hard, right.

  • Like, we just added another scaling operation, and it's just easy, right.

  • Well, some math is a little bit more complicated than that.

  • This example actually is from layer normalization of an LSTM.

  • And this one, aside from looking a bit more complex,

  • it's an example of-- where if you just apply these naive

  • rules of operate, rescale, operate, rescale--

  • you actually end up in sort of a numeric black hole

  • where the scales cancel each other,

  • and basically just things don't work if you just

  • go naively about it.

  • And then it's more complicated because in your decisions

  • about how you're going to represent this computation

  • in integer form--

  • you know, quantization, we want to be efficient,

  • so lower precision is good.

  • But we also want to be accurate, which

  • means lower precision is bad.

  • So it's a lot of trade-offs that you have to do.

  • Then further complicating things,

  • we have heterogeneous hardware.

  • There is all different types of hardware

  • with different capabilities, very different operations that each hardware supports, and also different preferences.

  • Some hardware is, you know, better at executing operations in reduced floats, with different bit widths or different restrictions.

  • And we want to account for all these things when we are creating our quantization recipe, or a quantized program.

  • Then there is the fact that machine learning

  • is hard to interpret.

  • We don't understand it.

  • Like, we don't understand how it works--

  • not to the level that we can have good proofs to know that the transformation that we're doing to this model, to this program, will actually work and not result in a catastrophic error, right.

  • You don't want to take a model, you quantize it,

  • and then this model suddenly starts

  • giving you some weird results.

  • So it makes it much more complicated

  • to define these transformations.

  • And then finally, I will say that this

  • is a little more complicated, because the model is not

  • enough.

  • The program is not enough.

  • Depending on how the quantization is defined,

  • you might also need some extra data.

  • So in the example of the matrix multiply, we needed to compute the minimum and maximum values of the dynamic activations,

  • and that can only be done if we run inferences through the model, which means that you need to provide some representative data.

  • So basically, it's just another hurdle

  • that you have to account for when quantizing these programs.

  • So, you know, basically, this means

  • that, when we talk about quantization,

  • we're really talking about rewriting,

  • transforming these machine learning programs

  • to an approximate representation based on the operations

  • that you have available.

  • So now how are we addressing this in our toolkit?

  • Well, the first thing that we decided to do

  • was to try to scope down the problem and say,

  • OK, we're going to define the specifications

  • for common operations--

  • like, in this case, the diagram shown for convolution -- to have a well-defined quantization behavior.

  • So we know that now, with this information,

  • this low-level information that is relevant to quantization,

  • then hardware can target those specifications.

  • And then our tools can target that specification,

  • and then we can all work at this level.

  • And then we also get the other benefit

  • that, from the user point of view, you can quantize a model,

  • and then this model can run in different hardware

  • without any change.

  • So right now, we give you support for three different quantization types.

  • I'm including, here, reduced float as a quantization type.

  • It's just a much simpler thing where we just

  • typically go from float 32 to float 16 parameters

  • and computations, so that's pretty straightforward.

  • The next one is our hybrid quantization

  • which basically makes use of 8-bit for the parameters.

  • Biases and activations, we leave at 32-bit floats.

  • And then we try to be as smart as possible

  • for how we execute this program.

  • So the goal being that, for example,

  • heavy operations like big matrix multipliers

  • are left in the integer domain, and then we

  • use floating point for things like activation functions.

  • So it's a nice trade-off between accuracy, and performance,

  • and optimizations.

  • Then the third one is integer quantization.

  • This means everything is integers.

  • All the parameters are integers, and all the operations

  • are integers.

  • This is obviously the more complicated one.

  • So the benefits of the reduced float is--

  • well, your models are now half size.

  • And then depending on the hardware support,

  • you may get some speed-ups, and then the accuracy losses

  • tend to be very minimal.

  • It pretty much always works.

  • I haven't seen, myself at least, an actual model trained in float 32 that doesn't work in float 16.

  • Hybrid quantization then pushes it further.

  • You now get a 4x reduction in size.

  • And then depending on the computations of the operations

  • that you're using, you may get different performance

  • improvements.

  • It tends to be larger for fully-connected models or RNNs.

  • And then the third one is the integer-only quantization, so it has the same benefits as hybrid in terms of memory size.

  • But it's faster to execute, and it has the great advantage that it has more hardware coverage.

  • So for example, typical NPUs -- some of them are only integer-based, like our Edge TPU.

  • Now let's talk about the tools to actually quantize

  • the models based on those quantization types.

  • So we have two types of tools--

  • one that works post training.

  • So it works directly on the trained model.

  • And the other one that is a work-in-progress

  • is during training.

  • So let's talk about the post training.

  • The process is very simple.

  • You basically assume that you just have a trained model. Well, it doesn't really matter how you trained it.

  • You just have a TensorFlow model.

  • Then currently, via the TensorFlow Lite converter,

  • you just convert this model to TensorFlow Lite

  • and quantize it on the fly.

  • And then you just have a model that you

  • can execute on whatever hardware is supported

  • in that quantization type.

  • So now let's look at the specific quantization types.

  • So the first one is reduced float.

  • You just add a couple of flags.

  • You just set the optimizations to the default, and then the type that you're targeting is float 16.
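
Concretely, the float 16 conversion described here looks roughly like this with the TensorFlow Lite converter; the saved-model path is a placeholder.

```python
import tensorflow as tf

# Post-training float16 quantization: parameters stored in half precision.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]

tflite_fp16_model = converter.convert()
with open("model_fp16.tflite", "wb") as f:
    f.write(tflite_fp16_model)
```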

  • And then basically, this will take

  • care of changing all the parameters and the computation,

  • and again, depending on the hardware that you're

  • running this model on, you might get a speed-up right now.

  • For example, GPUs support float 16 natively,

  • so you might get some speed-up there

  • either because of the computation

  • or even just because the bandwidth in your chip

  • will be reduced.

  • Like I said, benefits--

  • all the size goes to half.

  • And, you know, the accuracy drop is very minimal.

  • I will say within the noise.

  • Then the next one is our hybrid quantization.

  • So again, this is very easy.

  • You just set the flag now.

  • This is the default for TensorFlow Lite converter.

  • You set it to default. And then again, it will make sure to quantize all the parameters.

  • And then operations that don't yet have a defined specification for their quantized form will be kept in their original precision.

  • And then you will get some speed-ups,

  • and you will be able to execute on whatever hardware complies

  • with the specification.

  • So typically, this one works pretty well for CPUs.
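
The hybrid path is the same converter call with just the default optimization flag set, roughly like this (again with a placeholder model path):

```python
import tensorflow as tf

# Hybrid (dynamic-range) quantization: 8-bit weights, float activations.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_hybrid_model = converter.convert()
```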

  • And again, benefits-- 4x compression for the models.

  • And then you get some speed-ups.

  • All these are convolution-based models,

  • so that's why the speed-up is not as big.

  • And I will say these are one-year-old numbers,

  • so probably right now it's faster.

  • And the same for accuracy, accuracy is pretty good.

  • And actually we're working on some changes

  • for convolution models.

  • It will even be a bit more accurate soon.

  • Then the third one is the integer quantization.

  • So this one is the one that is a bit more complex, because now

  • you need to provide some data.

  • So you say, OK, I want to optimize the model,

  • but I want to use the integer quantization.

  • So now you need to provide some data.

  • And by data, I mean unlabeled samples

  • of what your neural network will typically see in reality.

  • So if it's an image processing model,

  • you need to feed some pre-processed images.

  • And we're not talking about a lot of data.

  • For the results that I'm going to show next,

  • we're just talking about a hundred samples.

  • That works pretty well.

  • So it is a bit more complicated, but it's not very complicated.
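
A hedged sketch of the integer path: the representative dataset is just a generator over roughly a hundred unlabeled, preprocessed inputs. The `calibration_images` array and the model path are stand-ins for your own data and model.

```python
import numpy as np
import tensorflow as tf

calibration_images = np.load("calibration_images.npy")  # placeholder: ~100 preprocessed samples

def representative_dataset():
    for image in calibration_images[:100]:
        # Each yielded element is a list of input tensors for one inference.
        yield [image[np.newaxis, ...].astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Optionally force integer-only kernels for integer-only accelerators:
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

tflite_int8_model = converter.convert()
```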

  • So these are some results from post training quantization

  • across different models.

  • As you see for the majority of models,

  • the loss is not that big with respect to the full precision

  • trained baseline.

  • The only one I will say is the MobileSSD model.

  • So that has a bit more meaningful drop,

  • but again, a variety of models work pretty well

  • with post training quantization.

  • Now I'll talk about during training,

  • because like I showed in the previous results, you know,

  • there are still some models that will benefit from doing this quantization-aware training.

  • And by quantization-aware training, we mean we try to emulate the quantization

  • operations, the quantization losses,

  • during the forward pass of the neural network,

  • with the hope that the parameters will

  • be tuned to account for that.

  • So the process for doing quantization-aware training using our API is a little bit more involved.

  • We are, again, trying to make it very simple.

  • So we built this API in Keras, again,

  • to make it very easy to use.

  • So basically, we assume that you already have a Keras model,

  • and then you just need to call our API

  • to apply the quantization.

  • And this might change a little bit,

  • but it will look something like this.

  • So you just have a model that you already

  • built using Keras layers.

  • And why not?

  • And then the only thing that you need to do

  • is call our API on your model, then

  • you get now a model that is rewritten to have

  • all emulation of quantization.

  • And then you just call your fit function, and that's it.

  • So then you just train your model as usual.
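
Since the API may still change, this is only an approximate sketch of what the quantization-aware training call looks like with the tensorflow_model_optimization package; the model architecture and training data here are arbitrary examples.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Any existing Keras model built from supported layers.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Rewrite the model so the forward pass emulates quantization losses.
q_aware_model = tfmot.quantization.keras.quantize_model(model)
q_aware_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])

# Then just train as usual; x_train and y_train are placeholders for your data.
# q_aware_model.fit(x_train, y_train, epochs=3)
```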

  • And then you can go through the TensorFlow Lite converter,

  • and then it will take this model that

  • was trained with quantization.

  • It will have all the data necessary to quantize it,

  • and then it will produce a quantized model

  • that, just like the post-training model,

  • you will be able to execute in different hardware.

  • These are some preliminary numbers from quantization-aware training.

  • If you see the delta is a little bit better

  • than post-training quantization, it's

  • not a very big difference except for the MobileSSD.

  • So before it was 4% for post-training quantization.

  • In this case, it's 2.9%.

  • So quantization-aware training is still a useful tool.

  • That's why we're building it.

  • Now you may wonder -- those were a lot of quantization types and tools, so which one should I use?

  • So my recommendation is, if you are just starting, just start with the reduced float.

  • That's the first one to try.

  • It is just very easy to use.

  • It doesn't require any data.

  • The accuracy will probably be the same.

  • And then latency, depending on the hardware,

  • you might get some benefits--

  • reduced latency.

  • And then compatibility-- basically,

  • everywhere you can execute floating point operations,

  • you will be able to use it.

  • The next thing to try will be the hybrid quantization.

  • Again, there is no data requirements.

  • The accuracy will be still good, probably not as

  • good as float 16 in some cases, but it's still good.

  • It will be faster than the reduced float.

  • And basically, compatibility will be everywhere

  • that you have support for float and integer operations.

  • Then the third one to try is the integer quantization

  • with the post-training tool.

  • This one is a bit more complicated

  • just because you need to provide a little bit of data.

  • The accuracy will be worse or the same as hybrid,

  • but the latency of this will be the fastest.

  • And then it will also give you more hardware coverage.

  • And then the last thing to try will

  • be the integer quantization with quantization during training.

  • And basically, this is good.

  • This will be a little bit more involved, because now you're

  • doing training.

  • You're supposed to have now a training setup, a training

  • script.

  • But the accuracy will be better

  • than doing just the post-training version,

  • and again, you get the benefits of being

  • the fastest one and the one with more hardware coverage.

  • So that was quantization.

  • And again, all these tools, we're

  • trying to make it very easy to use,

  • so it will be great if you try them out

  • and give us some feedback.

  • Then, connection pruning.

  • So what is neural connection pruning?

  • Well, the way that we have implemented it so far, it is a training-time technique that, during the training process, will start dropping connections from the neural network.

  • And then these connections will--

  • the dropped connections basically just become

  • zeros in the tensors that you're training,

  • and then that means that you end up with sparse tensors.

  • Now sparse tensors are great, because you

  • can compress them and potentially

  • execute them faster.

  • So this is an example.

  • This is a tensor, how it starts randomly initialized.

  • The dark values are values that are non-zero,

  • and white means values that are zero.

  • And then as the training progresses,

  • then it starts becoming sparser and sparser.

  • And if you see this tensor, it's basically

  • removing most of the parameters there.
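
As a toy illustration of what dropping connections means numerically, here is a hedged NumPy sketch of magnitude pruning: the smallest-magnitude weights in a tensor are zeroed out until a target sparsity is reached.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.8):
    """Zero out the smallest-magnitude entries so `sparsity` fraction become zero."""
    k = int(weights.size * sparsity)                 # how many weights to drop
    threshold = np.sort(np.abs(weights).ravel())[k]
    mask = np.abs(weights) >= threshold              # keep only the largest weights
    return weights * mask, mask

w = np.random.randn(8, 8)
w_pruned, mask = magnitude_prune(w, sparsity=0.8)
print(f"zeros: {np.mean(w_pruned == 0):.0%}")        # roughly 80% of entries are zero
```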

  • The process for the API is very similar to the quantization-aware training API.

  • Again, we're trying to bring some consistency to our APIs.

  • So it's built on Keras, so it assumes

  • that you have a model that is trainable in Keras.

  • And then you're going to call our API

  • to apply the pruning logic.

  • And this again, we are trying to make this as simple

  • as possible.

  • So the only thing that you need to define

  • is a pruning schedule-- basically,

  • when you want to start dropping these connections,

  • and until when, and how fast, how aggressive

  • you want these prunings to be.

  • And then you just call our prune function,

  • which again will modify your graph to add all the pruning

  • operations internally.

  • And then you just call your fit function,

  • and you train as usual.
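
A hedged sketch of that flow with the Keras pruning API; the schedule parameters (start and end steps, target sparsity) and the model are arbitrary examples, not recommended settings.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# When to start dropping connections, how fast, and how far to go.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8,
    begin_step=1000, end_step=5000)

pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])

# The pruning-step callback keeps the sparsity masks updated during fit().
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
# pruned_model.fit(x_train, y_train, epochs=5, callbacks=callbacks)
```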

  • So basically, you train as usual, and then once you train,

  • you have two options now.

  • Or soon, you will have two options.

  • You can just take the same model,

  • the TensorFlow saved model.

  • You can just compress it, gzip, and then the model

  • will be smaller.

  • And soon, you will be able to convert it via TensorFlow Lite,

  • and you will get also a reduction in size

  • and potentially some speed-ups depending

  • on what prune configuration you're using

  • and the hardware that you're targeting.

  • So this should be done pretty soon.
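
For the first option -- compressing the saved model -- a hedged sketch looks like this, assuming the `pruned_model` from the earlier sketch: stripping the pruning wrappers leaves plain layers whose runs of zeroed weights compress well under gzip.

```python
import gzip
import tensorflow_model_optimization as tfmot

# Remove the pruning wrappers, leaving ordinary layers with sparse weights.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
final_model.save("pruned_model.h5")

# The long runs of zeros make the file shrink substantially under gzip.
with open("pruned_model.h5", "rb") as f, gzip.open("pruned_model.h5.gz", "wb") as g:
    g.write(f.read())
```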

  • Now what are the benefits of pruning?

  • We've tried it in a lot of tasks,

  • like really a lot of tasks-- on image, speech, audio.

  • And it worked pretty well.

  • And, you know, a lot of techniques like this require hyperparameter tuning and carefully restarting your models, and things like that.

  • But pruning has worked pretty well

  • without a lot of babysitting.

  • Then it has potential for speed-ups

  • depending on hardware support.

  • And we also have pretty good results.

  • Like, we can make a lot of the parameters basically go away.

  • We see 50% to 90% with negligible accuracy loss.

  • And the other great thing is that it works well also

  • with quantization.

  • So a typical setup that we've tried is pruning during training,

  • and then we use post-training quantization.

  • And basically, the accuracy is pretty good,

  • and you get the compound benefits of all techniques.

  • These are some results, older now, from when we launched this.

  • So I mean, this is InceptionV3.

  • We see we can get all the way almost to 90%

  • sparsity with relatively small accuracy losses.

  • And the other--

  • GNMT's neural machine translation, where again,

  • we can take it to almost 90% pruning and also small accuracy

  • losses.

  • And we've done this in, for example, speech recognition.

  • We actually had, recently, the Google Pixel event,

  • where the speech recognition models

  • used pruning and quantization and were

  • able to have a model with server-side quality running

  • on a phone, which is pretty good.

  • OK, so now I'll finally cover, really quick, our roadmap.

  • Like I mentioned, quantization-- we're

  • working on a quantization-aware training API,

  • so that should be ready soon.

  • And we are also working on our specs

  • for quantizing RNNs, which are typically trickier to quantize,

  • like LSTMs.

  • Then I didn't include it there, but we're

  • making some improvements to the hybrid quantization

  • to be more accurate, particularly

  • for convolution layers.

  • And then for sparsity, we're adding support

  • for sparse computation in TensorFlow Lite runtime.

  • Longer term, I don't know if you have heard about MLIR, but it's a state-of-the-art compiler infrastructure, and it is particularly interesting to us because it's a better way for us to write these transformations.

  • And at the end, like I said at the beginning of the talk,

  • we're taking a model.

  • We're transforming one program into another representation

  • of that program.

  • And some of the things that we want to enable are better hardware targeting -- so our specifications are great because users can target our specification and execute on different hardware.

  • But some users just want to [INAUDIBLE] hardware and get

  • the best out of it.

  • So we're hoping that, with the new infrastructure that we're

  • building on top of MLIR, it should be possible.

  • And finally, I really just want to encourage you to try it

  • out and give us feedback--

  • what techniques you would like to see.

  • You know, research-wise, there are techniques popping up all over the place.

  • And a lot of the work that we have to go through

  • is culling what's useful and what's not-- what is general

  • and what is very specific.

  • So we would love to hear your feedback about that

  • and also about the tools that we already have in the toolkit.

  • We're trying to make them as easy as possible to use.

  • We know that we still have a long way to go,

  • but any feedback that you can provide

  • will be really, really appreciated.

  • And I think there is a little bit of time for questions

  • if any of you have questions.

  • [CLAPPING]

  • Thanks.

  • AUDIENCE: Hi.

  • I have a question regarding the [INAUDIBLE]..

  • Hi.

  • Thank you for the presentation.

  • I have a question regarding the training

  • with integer quantizations.

  • In the pipeline, is that going to be true quantization

  • during training?

  • RAZIEL ALVEREZ: No.

  • So right now, by true, I mean that you

  • expect that all the operations happen in the integer domain?

  • AUDIENCE: Yes.

  • RAZIEL ALVEREZ: Not right now.

  • That's something I really want enabled,

  • because I want to make training faster as well.

  • But right now, the way that we are targeting is--

  • I don't know if you're familiar with TensorFlow APIs,

  • but we have this low-level API, unfortunately called

  • fake quantization, that basically just emulates

  • these losses.

  • And that one is still-- basically,

  • what we do there is we quantize parameters,

  • and then we de-quantize them, and then

  • we do the float operation.

  • So that's what we're using right now.

  • But yeah, longer term, we want to do true integer forward

  • passes.
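
A minimal NumPy sketch of that fake-quantization idea -- quantize the weights, immediately de-quantize them, and run the float operation on the result so the quantization error is visible in the forward pass; the helper names are illustrative, not the actual TensorFlow op.

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Quantize then de-quantize, so values stay float but carry quantization error."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (float(w.max()) - float(w.min())) / (qmax - qmin) or 1.0
    zero_point = round(-float(w.min()) / scale)
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

# The forward pass still runs in float, just through quantized-looking weights.
def forward(x, w):
    return x @ fake_quantize(w)
```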

  • AUDIENCE: Thank you.

  • AUDIENCE: Hi.

  • [INAUDIBLE]

  • Oh, I had just one question.

  • So after you do the quantization,

  • is there a way that you can also visualize the finished quantized

  • model?

  • Yeah, that was one question, and I had another question.

  • Let me think about it.

  • But is there a way that you can also.

  • Oh, the other question was, what sort of tools

  • are you going to provide as far as to sort of do

  • model correctness and--

  • I mean, at least evaluate, you know,

  • whether this quantized model is sort of functionally

  • correct in a sense?

  • RAZIEL ALVEREZ: Yes.

  • Visualization, again, it depends where.

  • But for TensorFlow Lite, you have a visualizer,

  • so you can see the quantized model.

  • I don't know if it will give you a lot of information, depending

  • on what you're looking for.

  • We also want to make our tooling a bit better, because perhaps,

  • for whatever reason, you want to go all research-y on it and start

  • looking at the activations, and how they change, and all that.

  • AUDIENCE: Sure, yeah.

  • There's like inserted ops and so forth.

  • RAZIEL ALVEREZ: Yeah.

  • AUDIENCE: [INAUDIBLE]

  • RAZIEL ALVEREZ: So for sure with the TF Lite visualizer,

  • you can see how the graph changes.

  • So the second question about correctness, correctness

  • is really tricky.

  • Because in my experience, the only thing that really works

  • is to really evaluate on the real data

  • that you care to run your model on.

  • AUDIENCE: Yeah, that's right.

  • RAZIEL ALVEREZ: You know, like, we

  • tried to do things like error norms to approximate -- OK, the full precision one versus the quantized one.

  • And then it gives you a sense of maybe some really catastrophic

  • numerical errors, but otherwise, it's really just a guess,

  • right.

  • AUDIENCE: That's right.

  • RAZIEL ALVEREZ: Particularly, depending on the output layers,

  • you know, categories are easier to quantize, because, you know,

  • the error is not very meaningful as long

  • as you get the right category.

  • Regressions are much harder because now you really care

  • about the actual values.

  • Yeah, it's an open problem.

  • AUDIENCE: Yeah, it's a tough problem.

  • Thank you.

  • AUDIENCE: I have a question about the results

  • from the GNMT training with induced sparsity.

  • I was wondering if you had any insights on why

  • the training with 80% sparsity would perform better

  • than the original version?

  • Like, if you looked at the results.

  • RAZIEL ALVEREZ: You know, the hand-waving thing,

  • that we always say in these cases,

  • is some regularization happens.

  • [LAUGHTER]

  • Yeah.

  • And you know, I've seen the same with some quantized models.

  • I've never had the gear to really sit

  • down and try to understand what their reasons are for all this.

  • Sometimes it's just because it's within the noise, right?

  • It all depends on your evaluation set, right.

  • If it's really not that big or not that meaningful,

  • then these jumps are all possible.

  • Like, I've seen some models where, oh, it looks great

  • after you quantize it.

  • Then you throw in a new data set, say from speech

  • recognition and noisier utterances,

  • and then you clearly see the difference

  • between one and the other.

  • So a lot of it can be just noise.

  • AUDIENCE: Hi.

  • You mentioned explainability.

  • And a technique could be like saliency maps.

  • Do you have any insights on how these techniques affect

  • the ability to calculate the gradients to calculate

  • the saliency maps, for example?

  • RAZIEL ALVEREZ: You know, like, that's something

  • that we want to invest more, and we haven't had that much time

  • to do it.

  • And I would love for the researchers who are trying to understand neural networks to also get excited about understanding neural networks that have been approximated, but so far, I haven't had any luck trying to get the people on that side excited about it.

  • But yeah, I really don't have any meaningful thing

  • to say because I haven't run many experiments over on it.

  • AUDIENCE: Thank you.

  • RAZIEL ALVEREZ: [INAUDIBLE].

  • AUDIENCE: Hey.

  • So what is the best way to handle

  • fragmentation of hardware?

  • So like, quantization is dependent on the target hardware.

  • And more often than not, mobile phones like Android,

  • you have so much [INAUDIBLE] hardware,

  • so what are the best practices there?

  • RAZIEL ALVEREZ: So one way that we

  • tried to do it was again with these specifications.

  • And like, I don't know to what extent

  • it makes our hardware partners happy, because we would like

  • to be able to target their hardware in the most

  • precise and efficient way.

  • But that's one way that we try to address it.

  • You know, with our knowledge of what hardware is there

  • and what is supported, we tried to create these specifications

  • that tried to accommodate for everybody, which again, is good

  • and at the same time is bad.

  • Then longer term, again, I don't want

  • to say too much, because I really don't have

  • a very concrete plan to share.

  • But part of the way we're building

  • with the MLIR infrastructure is we

  • want to be able to better target that hardware-- to better

  • partner with hardware vendors to understand

  • what are their hardware capabilities

  • and better create these transformations that

  • target that hardware.

  • But we were really trying to make it much better.

  • AUDIENCE: So for now, does it mean, like,

  • you go with the lowest common denominator

  • to maybe like a [INAUDIBLE]?

  • Like, imagine the Android app that you

  • have to apply in a lot of things too?

  • RAZIEL ALVEREZ: And that's why we have, like,

  • all these different quantization types.

  • Like, we have three types, right.

  • And soon, hopefully, we'll be able to even just mix and match

  • those different types, because at the end of the day,

  • it's a very arbitrary boundary.

  • Then we say, oh, this is all integer quantized,

  • and this one is hybrid.

  • And the reality is we should be able to take advantage

  • of mixing and matching up precisions

  • to get something better.

  • Thank you.

  • AUDIENCE: I have a question about pruning.

  • As a general rule in layers, operations

  • are converted to matrix multiply because of their efficiency.

  • With pruning, you're now passing in individual multiply

  • operations one by one.

  • There must be some crossover point

  • at which you need to prune by 10%, 15%, 20% before you're

  • crossing over and actually get an improvement.

  • Thoughts on where that is?

  • RAZIEL ALVEREZ: And I don't know if this

  • is exactly what you're asking.

  • So for example, our pruning API supports you specifying

  • what the pruning structure is.

  • So for example, we know that for CPUs [INAUDIBLE]

  • the instructions will typically have registers

  • that can accommodate 16 values.

  • So we know that if we want to speed up on CPU,

  • we expect you to set the setting to say,

  • oh, I want to prune in blocks of, say, 1 by 16.

  • And that's how we can get the speed-ups on CPU, for example.

  • And unfortunately right now, probably it's

  • going to be hardware dependent, but that's one thing

  • that you can do right now.
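
For reference, the structured-pruning setting mentioned here is exposed as a block size on the pruning wrapper; a hedged sketch with 1 x 16 blocks (matching 16-wide registers) and an arbitrary 80% sparsity target might look like this.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

layer = tf.keras.layers.Dense(1024)

# Prune in 1x16 blocks so whole register-width groups of weights are zeroed together.
pruned_layer = tfmot.sparsity.keras.prune_low_magnitude(
    layer,
    pruning_schedule=tfmot.sparsity.keras.ConstantSparsity(0.8, begin_step=0),
    block_size=(1, 16))
```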
