TensorFlow model optimization: Quantization and pruning (TF World '19)

  • RAZIEL ALVEREZ: Hi, my name is Raziel.

  • I lead TensorFlow model optimization.

  • And today I will talk about our toolkit and, in particular,

  • about techniques around quantization

  • and neural connection pruning.

  • So first I'll introduce, what is model optimization?

  • What is our toolkit?

  • And then the reason about why we think it's important, why

  • we're investing in this area.

  • Then I'll cover the tools that we have available.

  • And at the end, I will give a quick overview of the roadmap for the short term and the longer term.

  • And hopefully at the end of the presentation,

  • we still have some minutes to go over Q&A.

  • So our toolkit implements techniques

  • that should allow you to optimize machine learning

  • models for deployment and execution.

  • We think this is important because machine learning is

  • everywhere, right.

  • It's a very important field, and we

  • think that there is a lot of room to make it more efficient.

  • And this has some implications, starting with economic ones-- you can make all these applications better in quality, or cheaper to execute.

  • But also we can enable some new models, and new deployments,

  • and new products that otherwise are not possible

  • even if you just tried to execute these machine learning

  • models on servers.

  • So currently machine learning runs either on the server

  • or on the Edge.

  • On the server, you may think that there

  • is a lot of capacity, that there's

  • a lot of compute and memories.

  • What is the benefit of optimizing these models?

  • Well, applications are still bound by latency.

  • There is still a very important metric

  • for a lot of applications, or you

  • want to improve the throughput-- how many tasks

  • can run on your server.

  • And these two are also directly correlated to money, right.

  • So everybody will want to save money,

  • and potentially, we're talking a lot of money.

  • Now on the Edge it is a little bit more obvious why we need optimization.

  • These are very resource-constrained environments, even for applications in general.

  • We need to deal with reduced memory and compute, and power consumption is typically an issue.

  • Bandwidth is too, both for downloading models from the Cloud and, even within the chips, for transferring parameters from memory to the processor, which can be a problem if the model is too large.

  • Plus, we have a wide variety of hardware,

  • more than in the server, and we need

  • to make sure that these models run

  • efficiently on all these different types of hardware.

  • So it follows that if we are optimizing these models, and we have better models, eventually it starts translating into new products that otherwise couldn't exist if we were just running these models on a server.

  • And these opportunities are larger than just

  • on smartphones.

  • Machine learning is trickling down into more environments.

  • We have machine learning models, for example,

  • that are used to detect failures in machinery in factories,

  • or we use it in self-driving cars.

  • We use it in the office to scan documents and try

  • to understand them.

  • And just to give you some numbers,

  • right-- like, the size of the smartphone market

  • is really a fraction of the potential for the Edge devices,

  • in general.

  • So basically those are the two reasons we want to make machine learning models efficient.

  • It's already very important for the servers,

  • but it is pretty crucial for embedded devices.

  • So we started this toolkit about a year ago.

  • We initially launched post-training quantization

  • with this hybrid type of quantization,

  • and I'll go in more detail later in the presentation.

  • Then earlier this year, we launched the API

  • for neural connection pruning, then

  • we created this specification of quantized operations,

  • integer quantized operations for TensorFlow Lite.

  • And we also launched post-training quantization targeting this specification.

  • More recently, we added support for reduced float precision, and hopefully soon we're going to be launching the quantization-aware training API and also adding support to TF Lite for sparse computation.

  • So now I'll go into these techniques and these tools

  • in a little bit more detail.

  • So let's start with quantization.

  • But first I think it's important we have at least a basic understanding of what quantization is, why it's hard, and why we are approaching our tools the way that we are.

  • So let's start with a simple example.

  • You know, matrix multiply, it's a basic operation

  • for machine learning models.

  • You have two matrices, two tensors A and B,

  • and then you do some multiplication accumulation,

  • and then you get a third tensor, C. So each tensor is just a bunch of values, and the first two produce the third.

  • Then, just as a little reminder about how matrix multiply works, each one of these entries of the tensor C is computed as multiplications and accumulations.

  • So if we look at one of them, and then

  • if we think how we're training these models, typically

  • in a higher precision--

  • let's say float 32--

  • then it follows that the operations

  • of the multiplications are float 32 in precision.

  • And then the product will be also a float 32, right.

  • And then the accumulation will be also float 32.

  • So this is fairly straightforward.

  • There is some loss in precision, but machine learning

  • is pretty good at dealing with it at least

  • at this level of precision.

  • So, no problem.

  • Now what does this have to do with quantization?

  • Well, let's go back to what our goals for quantization are.

  • We want to be able to address all these restrictions, and also we want to be able to deploy to as much hardware as possible.

  • So a common thing that we do is, let's reduce the precision

  • that we operate in.

  • Let's say, for example, go from the 32-bit floats

  • to 8-bit integers, and then let's operate,

  • let's say, entirely within integer operations.

  • And this will be good because then we are going from 32 bits

  • to 8 bits, so we reduce the memory.

  • You know, the models are four times smaller.

  • Then integer operations are typically faster to execute.

  • They also consume less power.

  • And then because the parameters and also the dynamic values-- the activations-- are smaller, we reduce bandwidth pressure.

  • It means that in the pipes in the chips,

  • there is more room for things to flow around, which

  • can also translate into faster compute and reduced power.

  • And then integer operations seem to be a fairly common

  • denominator across hardware.

  • CPUs, DSPs, different NPUs-- they all support integer operations.

  • So OK, we are going to reduce the precision,

  • so how do we convert the 32-bit float to 8-bit integers?

  • Well, right now we do something very simple.

  • So we have this linear mapping, where we say,

  • OK, we take the values from a tensor.

  • We compute the minimum and the maximum value,

  • and then based on that, we spread them evenly

  • on the 8-bit range.

  • This basically is very simple, right.
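
To make that mapping concrete, here is a minimal NumPy sketch of an asymmetric, per-tensor affine quantizer; the helper names are just for illustration, not the actual TensorFlow implementation:

```python
import numpy as np

def quantize_tensor(x, num_bits=8):
    """Minimal affine (asymmetric) per-tensor quantization sketch."""
    qmin, qmax = 0, 2 ** num_bits - 1               # e.g. 0..255 for 8 bits
    x_min = min(float(x.min()), 0.0)                # keep 0.0 exactly representable
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0  # guard against an all-zero tensor
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_tensor(q, scale, zero_point):
    # Approximate reconstruction of the original float values.
    return (q.astype(np.float32) - zero_point) * scale
```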

  • So is that all that we need to do?

  • Well, I wish.

  • It's not that simple, so let's go back to the example.

  • So we have the matrix multiply.

  • And now let's say that we quantized the values,

  • and they are already 8-bit integers.

  • So the operands and the multipliers are 8-bit integers.

  • And then the multiplication, the products, now you

  • need 16 bits to represent it, and then you

  • need to accumulate.

  • And you probably want 32 bits to accumulate on that, right.

  • So what is the problem?

  • Well, the problem is that now your output tensor C is full of 32-bit values.

  • And that is not great when you want to feed that into another matrix multiply that you really want to execute as 8-bit integers, because, as we already said, those are more resource efficient, right.

  • So what do you do?

  • So you scale them back down, right.

  • So we just sort of quantize them on the fly now back down

  • to 8-bit integers.

  • So then you can feed them to your next 8-bit matrix multiply, and now it's all good.
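
A rough sketch of that flow, assuming symmetric quantization and illustrative scale names (real TensorFlow Lite kernels do the rescaling with integer-only arithmetic, but the idea is the same):

```python
import numpy as np

def quantized_matmul(a_q, b_q, a_scale, b_scale, out_scale):
    """int8 x int8 -> int32 accumulate, then requantize back down to int8.
    Assumes symmetric quantization (zero points of 0) to keep the sketch short."""
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)   # 32-bit accumulator
    # Each accumulated value is in units of (a_scale * b_scale); rescale it
    # into the output tensor's own 8-bit grid so the next op stays in int8.
    requantized = np.round(acc * (a_scale * b_scale) / out_scale)
    return np.clip(requantized, -128, 127).astype(np.int8)
```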

  • But what are the implications of all this process?

  • Well, so it means that we're changing the static values,

  • the parameters, the weights.

  • We are changing also the dynamic values,

  • the activations, because we're, you know,

  • scaling them-- quantizing them on the fly.

  • And it also means that we are changing

  • the computation, right.

  • In this case, it's a very simple example.

  • We just added a scaling operation,

  • but it can be a bit more involved.

  • So you could say, OK, that doesn't seem that hard, right.

  • Like, we just added another scaling operation, and it's easy, right.

  • Well, some math is a little bit more complicated than that.

  • This example actually is from layer normalization of an LSTM.

  • And this one, aside from looking a bit more complex,

  • it's an example of-- where if you just apply these naive

  • rules of operate, rescale, operate, rescale--

  • you actually end up in sort of a numeric black hole

  • where the scales cancel each other,

  • and basically just things don't work if you just

  • go naively about it.

  • And then it's more complicated because in your decisions

  • about how you're going to represent this computation

  • in integer form--

  • you know, quantization, we want to be efficient,

  • so lower precision is good.

  • But we also want to be accurate, which

  • means lower precision is bad.

  • So there are a lot of trade-offs that you have to make.

  • Then further complicating things,

  • we have heterogeneous hardware.

  • There are all different types of hardware with different capabilities, very different operations that each hardware supports, and also different preferences.

  • Some hardware is, you know, better at executing operations in reduced floats, you know, with different bit widths or different restrictions.

  • And we want to account for all these things when we are creating our quantization recipe, or quantized program.

  • Then there is the fact that machine learning

  • is hard to interpret.

  • We don't understand it.

  • Like, we don't understand how it works--

  • not to the level where we can have good proofs that the transformation we're doing to this model, to this program, will actually work and not result in a catastrophic error, right.

  • You don't want to take a model, you quantize it,

  • and then this model suddenly starts

  • giving you some weird results.

  • So it makes it much more complicated

  • to define these transformations.

  • And then finally, I will say that this

  • is a little more complicated, because the model is not

  • enough.

  • The program is not enough.

  • Depending on how the quantization is defined,

  • you might also need some extra data.

  • So in the example of the matrix multiply, we needed to compute the minimum and maximum values of the dynamic activations, and that can only be done if we run inferences through the model, which means that you need to provide some representative data.

  • So basically, it's just another hurdle

  • that you have to account for when quantizing these programs.

  • So, you know, basically, this means

  • that, when we talk about quantization,

  • we're really talking about rewriting,

  • transforming this machine learning program, into an approximate representation based on the operations

  • that you have available.

  • So now how are we addressing this in our toolkit?

  • Well, the first thing that we decided to do

  • was to try to scope down the problem and say,

  • OK, we're going to define the specifications

  • for common operations--

  • like, in this case, a diagram for convolution-- so that they have a well-defined quantization behavior.

  • So we know that now, with this information,

  • this low-level information that is relevant to quantization,

  • then hardware can target those specifications.

  • And then our tools can target that specification,

  • and then we can all work at this level.

  • And then we also get the other benefit

  • that, from the user point of view, you can quantize a model,

  • and then this model can run in different hardware

  • without any change.

  • So right now we support three different quantization types.

  • I'm including, here, reduced float as a quantization type.

  • It's just a much simpler thing where we just

  • typically go from float 32 to float 16 parameters

  • and computations, so that's pretty straightforward.

  • The next one is our hybrid quantization, which basically uses 8-bit integers for the parameters; biases and activations we leave as 32-bit floats.

  • And then we try to be as smart as possible

  • for how we execute this program.

  • So the goal being that, for example,

  • heavy operations like big matrix multiplies are executed in the integer domain, and then we

  • use floating point for things like activation functions.

  • So it's a nice trade-off between accuracy, and performance,

  • and optimizations.

  • Then the third one is integer quantization.

  • This means everything is integers.

  • All the parameters are integers, and all the operations

  • are integers.

  • This is obviously the more complicated one.

  • So the benefits of the reduced float are--

  • well, your models are now half size.

  • And then depending on the hardware support,

  • you may get some speed-ups, and then the accuracy losses

  • tend to be very minimal.

  • It pretty much always works.

  • I haven't seen, myself at least, an actual model trained in float 32 that doesn't work in float 16.

  • Hybrid quantization then pushes it further.

  • You now get a 4x reduction in size.

  • And then depending on the computations of the operations

  • that you're using, you may get different performance

  • improvements.

  • It tends to be larger for fully-connected models or RNNs.

  • And then the third one is the integer-only quantization,

  • so it has the same benefits as hybrid in terms of memory size.

  • But it's faster to execute, and it

  • has a great advantage that it has more hardware coverage.

  • So for example, typical NPUs-- some of them are integer-only, like our Edge TPU.

  • Now let's talk about the tools to actually quantize

  • the models based on those quantization types.

  • So we have two types of tools--

  • one that works post training.

  • So it works directly on the trained model.

  • And the other one that is a work-in-progress

  • is during training.

  • So let's talk about the post training.

  • The process is very simple.

  • You basically assume that you just have a trained model.

  • It doesn't really matter how you trained it-- you just have a TensorFlow model.

  • Then currently, via the TensorFlow Lite converter,

  • you just convert this model to TensorFlow Lite

  • and quantize it on the fly.

  • And then you just have a model that you

  • can execute on whatever hardware is supported

  • in that quantization type.

  • So now let's look at the specific quantization types.

  • So the first one is reduced float.

  • You just add a couple of flags: you set the optimizations to the default, and then the type that you're targeting is float 16.
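
With the current TF Lite converter API that looks roughly like this; saved_model_dir is a placeholder for wherever your trained model lives:

```python
import tensorflow as tf

# Post-training float16 quantization sketch; saved_model_dir is a placeholder path.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16_model = converter.convert()

with open("model_fp16.tflite", "wb") as f:
    f.write(tflite_fp16_model)
```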

  • And then basically, this will take

  • care of changing all the parameters and the computation,

  • and again, depending on the hardware that you're running this model on, you might get a speed-up right now.

  • For example, GPUs support float 16 natively,

  • so you might get some speed-up there

  • either because of the computation

  • or even just because the bandwidth in your chip

  • will be reduced.

  • Like I said, benefits-- the model size goes down to half.

  • And, you know, the accuracy drop is very minimal.

  • I will say within the noise.

  • Then the next one is our hybrid quantization.

  • So again, this is very easy.

  • You just set the flag-- this is the default for the TensorFlow Lite converter. You set the optimization to default, and then again, it will make sure to quantize all the parameters.
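
Roughly, the only flag needed is the optimizations default (again, saved_model_dir is a placeholder):

```python
import tensorflow as tf

# Post-training hybrid (dynamic-range) quantization sketch: weights become int8,
# biases and activations stay as floats.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_hybrid_model = converter.convert()
```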

  • And then operations that don't yet have a specification for their quantized form will be kept in their original precision.

  • And then you will get some speed-ups,

  • and you will be able to execute on whatever hardware complies with the specification.

  • So typically, this one works pretty well for CPUs.

  • And again, benefits-- 4x compression for the models.

  • And then you get some speed-ups.

  • All these are convolution-based models,

  • so that's why the speed-up is not as big.

  • And I will say these are one-year-old numbers,

  • so probably right now it's faster.

  • And the same for accuracy, accuracy is pretty good.

  • And actually we're working on some changes

  • for convolution models.

  • It will even be a bit more accurate soon.

  • Then the third one is the integer quantization.

  • So this one is the one that is a bit more complex, because now

  • you need to provide some data.

  • So you say, OK, I want to optimize the model, but I want to use the integer quantization.

  • So now you need to provide some data.

  • And by data, I mean unlabeled samples of what your neural network will typically see in reality.

  • So if it's an image processing model,

  • you need to feed some pre-processed images.

  • And we're not talking about a lot of data.

  • For the results that I'm going to show next,

  • we're just talking about a hundred samples.

  • That works pretty well.

  • So it is a bit more complicated, but it's not very complicated.
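
A sketch of what that looks like with the converter; representative_images and saved_model_dir are placeholders:

```python
import tensorflow as tf

def representative_dataset():
    # representative_images is a placeholder: ~100 unlabeled, preprocessed
    # samples of what the model will typically see in reality.
    for sample in representative_images[:100]:
        yield [tf.cast(sample[None, ...], tf.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Optionally insist that every op has an integer implementation:
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_int8_model = converter.convert()
```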

  • So these are some results from post training quantization

  • across different models.

  • As you see for the majority of models,

  • the loss is not that big with respect to the full-precision trained baseline.

  • The only one I will say is the MobileSSD model.

  • So that has a bit more meaningful drop,

  • but again, a variety of models work pretty well

  • with post training quantization.

  • Now I'll talk about during training,

  • because like I showed in the previous results, you know,

  • there are still some models that will benefit from doing quantization-aware training.

  • And by quantization-aware training, we mean we try to emulate the quantization

  • operations, the quantization losses,

  • during the forward pass of the neural network,

  • with the hope that the parameters will

  • be tuned to account for that.

  • So the process for doing quantization-aware training using our API is a little bit more involved.

  • We are, again, trying to make it very simple.

  • So we built this API in Keras, again,

  • to make it very easy to use.

  • So basically, we assume that you already have a Keras model,

  • and then you just need to call our API

  • to apply the quantization.

  • And this might change a little bit,

  • but it will look something like this.

  • So you just have a model that you already

  • built using Keras layers.

  • And why not?

  • And then the only thing that you need to do

  • is call our API on your model, and then you get a model that is rewritten to have all the emulation of quantization.

  • And then you just call your fit function, and that's it.

  • So then you just train your model as usual.
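
With the TensorFlow Model Optimization Keras API, the flow looks roughly like this (the exact call may still change, as noted above; model, train_images, and train_labels are placeholders):

```python
import tensorflow_model_optimization as tfmot

# 'model' is assumed to be an already-built Keras model.
qat_model = tfmot.quantization.keras.quantize_model(model)

# Train as usual; quantization losses are emulated in the forward pass.
qat_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
qat_model.fit(train_images, train_labels, epochs=5, validation_split=0.1)
```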

  • And then you can go through the TensorFlow Lite converter,

  • and then it will take this model that

  • was trained with quantization.

  • It will have all the data necessary to quantize it,

  • and then it will produce a quantized model

  • that, just like the post-training model,

  • you will be able to execute in different hardware.

  • These are some preliminary numbers from quantization-aware training.

  • As you can see, the delta is a little bit better than post-training quantization; it's not a very big difference except for the MobileSSD.

  • So before it was 4% for post-training quantization.

  • In this case, it's 2.9%.

  • So quantization-aware training is still a useful tool.

  • That's why we're building it.

  • Now you may wonder-- those were a lot of quantization types and tools, so which one should I use?

  • So my recommendation is, if you are just starting, start with the reduced float.

  • That's the first one to try.

  • It is just very easy to use.

  • It doesn't require any data.

  • The accuracy will probably be the same.

  • And then latency, depending on the hardware,

  • you might get some benefits--

  • reduced latency.

  • And then compatibility-- basically,

  • everywhere you can execute floating point operations,

  • you will be able to use it.

  • The next thing to try will be the hybrid quantization.

  • Again, there is no data requirements.

  • The accuracy will be still good, probably not as

  • good as float 16 in some cases, but it's still good.

  • It will be faster than the reduced float.

  • And basically, compatibility will be everywhere

  • that you have support for float and integer operations.

  • Then the third one to try is the integer quantization

  • with the post-training tool.

  • This one is a bit more complicated

  • just because you need to provide a little bit of data.

  • The accuracy will be worse or the same as hybrid,

  • but the latency of this will be the fastest.

  • And then it will also give you more hardware coverage.

  • And then the last thing to try will

  • be the integer quantization with quantization during training.

  • And basically, this is good.

  • This will be a little bit more involved, because now you're

  • doing training.

  • You're supposed to have now a training setup, a training

  • script.

  • But the accuracy will be better than doing just the post-training version,

  • and again, you get the benefits of being

  • the fastest one and the one with more hardware coverage.

  • So that was quantization.

  • And again, all these tools, we're

  • trying to make it very easy to use,

  • so it will be great if you try them out

  • and give us some feedback.

  • Then, connection pruning.

  • So what is neural connection pruning?

  • Well, the way that we have implemented it so far,

  • it is a training-time technique that, during the training process, will start dropping connections from the neural network.

  • And then these connections will--

  • the dropped connections basically just become

  • zeros in the tensors that you're training,

  • and then that means that you end up with sparse tensors.

  • Now sparse tensors are great, because you

  • can compress them and potentially

  • execute them faster.

  • So this is an example.

  • This is a tensor as it starts out, randomly initialized. Dark means values that are non-zero,

  • and white means values that are zero.

  • And then as the training progresses,

  • then it starts becoming sparser and sparser.

  • And if you see this tensor, it's basically

  • removing most of the parameters there.

  • The process for this API is very similar to the quantization-aware training API.

  • Again, we're trying to bring some consistency to our APIs.

  • So it's built on Keras, so it assumes

  • that you have a model that is trainable in Keras.

  • And then you're going to call our API

  • to apply the pruning logic.

  • And this again, we are trying to make this as simple

  • as possible.

  • So the only thing that you need to define

  • is a pruning schedule-- basically,

  • when you want to start dropping these connections,

  • and until when, and how fast, how aggressive

  • you want the pruning to be.

  • And then you just call our prune function,

  • which again will modify your graph to add all the pruning

  • operations internally.

  • And then you just call your fit function,

  • and you train as usual.
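
A sketch with the Keras pruning API; the schedule values here are arbitrary placeholders, not recommendations:

```python
import tensorflow_model_optimization as tfmot

# Ramp sparsity from 0% up to 80% between two training steps; the schedule
# numbers and the Keras 'model' are placeholders.
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.80,
    begin_step=1000, end_step=5000)

pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=pruning_schedule)

pruned_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
# UpdatePruningStep keeps the pruning logic in sync with the training step.
pruned_model.fit(train_images, train_labels, epochs=5,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
```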

  • So basically, you train as usual, and then once you've trained, you have two options now-- or soon, you will have two options.

  • You can just take the same model,

  • the TensorFlow saved model.

  • You can just compress it, gzip, and then the model

  • will be smaller.
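
For the first option, something like this works today (strip_pruning removes the training-time pruning wrappers before you save and compress; file names are placeholders):

```python
import gzip
import tensorflow_model_optimization as tfmot

# Remove the training-only pruning wrappers, leaving plain (now sparse) weights.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
final_model.save("pruned_model.h5")

# The zeroed-out weights compress very well even with a generic codec like gzip.
with open("pruned_model.h5", "rb") as f_in, gzip.open("pruned_model.h5.gz", "wb") as f_out:
    f_out.write(f_in.read())
```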

  • And soon, you will be able to convert it via TensorFlow Lite,

  • and you will get also a reduction in size

  • and potentially some speed-ups depending

  • on what prune configuration you're using

  • and the hardware that you're targeting.

  • So this should be done pretty soon.

  • Now what are the benefits of pruning?

  • We've tried it in a lot of tasks,

  • like really a lot of tasks-- on image, speech, audio.

  • And it worked pretty well.

  • A lot of techniques require hyperparameter tuning and, you know, careful restarting of your models, and things like that.

  • But pruning has worked pretty well

  • without a lot of babysitting.

  • Then it has potential for speed-ups

  • depending on hardware support.

  • And we also have pretty good results.

  • Like, we can make a lot of the parameters basically go away.

  • We see 50% to 90% with negligible accuracy loss.

  • And the other great thing is that it works well also

  • with quantization.

  • So a typical setup that we've tried is training with pruning, and then we use post-training quantization.

  • And basically, the accuracy is pretty good,

  • and you get the compound benefits of all techniques.
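
That combined recipe, sketched end to end with the placeholder names from the earlier snippets:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# 1. Train with pruning as above, then strip the pruning wrappers.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)

# 2. Apply post-training quantization on top of the sparse model.
converter = tf.lite.TFLiteConverter.from_keras_model(final_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
sparse_quantized_model = converter.convert()
```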

  • These are some results, older now, that we had when we launched this.

  • So this is InceptionV3.

  • We see we can get all the way almost to 90%

  • sparsity with relatively small accuracy losses.

  • And the other--

  • GNMT's neural machine translation, where again,

  • we can take it to almost 90% pruning and also small accuracy

  • losses.

  • And we've done this, for example, for speech recognition.

  • We actually had, recently, the Google Pixel event,

  • where the speech recognition models

  • used pruning and quantization and were

  • able to have a model with server-side quality running

  • on a phone, which is pretty good.

  • OK, so now I'll finally cover, really quick, our roadmap.

  • Like I mentioned, quantization-- we're

  • working on a quantization training API,

  • so that should be ready soon.

  • And we are also working on our specs

  • for quantizing RNNs, which are typically trickier to quantize,

  • like LSTMs.

  • Then I didn't include it there, but we're

  • making some improvements to the hybrid quantization

  • to be more accurate, particularly

  • for convolution layers.

  • And then for sparsity, we're adding support

  • for sparse computation in TensorFlow Lite runtime.

  • Longer term, I don't know if you have heard about MLIR,

  • but it's a state-of-the-art compiler infrastructure, and this is particularly interesting to us

  • because it's a better way for us to write these transformations.

  • And at the end, like I said at the beginning of the talk,

  • we're taking a model.

  • We're transforming one program into another representation

  • of that program.

  • And some of the things that we want to enable

  • is better targeted hardware, so our specifications

  • are great because users can target

  • our specification and execute on different hardware.

  • But some users just want to [INAUDIBLE] hardware and get

  • the best out of it.

  • So we're hoping that, with the new infrastructure that we're

  • building on top of MLIR, it should be possible.

  • And finally, I really just want to encourage you to try it

  • out and give us feedback--

  • what techniques you would like to see.

  • You know, research-wise, there are techniques popping up all over the place.

  • And a lot of the work that we have to go through

  • is culling what's useful and what's not-- what is general

  • and what is very specific.

  • So we would love to hear your feedback about that

  • and also about the tools that we already have in the toolkit.

  • We're trying to make them as easy as possible to use.

  • We know that we still have a long way to go,

  • but any feedback that you can provide

  • will be really, really appreciated.

  • And I think there is a little bit of time for questions

  • if any of you have questions.

  • [CLAPPING]

  • Thanks.

  • AUDIENCE: Hi.

  • I have a question regarding the [INAUDIBLE]..

  • Hi.

  • Thank you for the presentation.

  • I have a question regarding the training

  • with integer quantizations.

  • In the pipeline, is that going to be true quantization

  • during training?

  • RAZIEL ALVEREZ: No.

  • So right now-- by true, you mean that you expect that all the operations happen in the integer domain?

  • AUDIENCE: Yes.

  • RAZIEL ALVEREZ: Not right now.

  • That's something I really want enabled,

  • because I want to make training faster as well.

  • But right now, the way that we are targeting is--

  • I don't know if you're familiar with TensorFlow APIs,

  • but we have this low-level API, unfortunately called

  • fake quantization, that basically just emulates

  • these losses.

  • And that one is still-- basically,

  • what we do there is we quantize parameters,

  • and then we de-quantize them, and then

  • we do the float operation.

  • So that's what we're using right now.

  • But yeah, longer term, we want to do true integer forward

  • passes.

  • AUDIENCE: Thank you.

  • AUDIENCE: Hi.

  • [INAUDIBLE]

  • Oh, I had just one question.

  • So after you do the quantization,

  • is there a way that you can also visualize the finished quantized model?

  • Yeah, that was one question, and I had another question.

  • Let me think about it.

  • But is there a way that you can also--

  • Oh, the other question was, what sort of tools

  • are you going to provide as far as to sort of do

  • model correctness and--

  • I mean, at least evaluate, you know,

  • whether this quantized model is sort of functionally

  • correct in a sense?

  • RAZIEL ALVEREZ: Yes.

  • Visualization, again, it depends where.

  • But for TensorFlow Lite, you have a visualizer,

  • so you can see the quantized model.

  • I don't know if it will give you a lot of information, depending

  • what you're looking for.

  • We also want to make our tooling a bit better, because perhaps,

  • for whatever reason, you want to get old research in and start

  • looking at the activations, and how they change, and all that.

  • AUDIENCE: Sure, yeah.

  • There's like inserted ops and so forth.

  • RAZIEL ALVEREZ: Yeah.

  • AUDIENCE: [INAUDIBLE]

  • RAZIEL ALVEREZ: So for sure with the TF Lite visualizer,

  • you can see how the graph changes.

  • So the second question about correctness, correctness

  • is really tricky.

  • Because in my experience, the only thing that really works

  • is to really evaluate on the real data

  • that you care to run your model on.

  • AUDIENCE: Yeah, that's right.

  • RAZIEL ALVEREZ: You know, like, we

  • tried to do things like ultra norms to approximate--

  • OK, versus the full precision one versus the quantized one.

  • And then it gives you a sense of maybe some really catastrophic

  • numerical errors, but otherwise, it's really just a guess,

  • right.

  • AUDIENCE: That's right.

  • RAZIEL ALVEREZ: Particularly, depending on the output layers,

  • you know, categories are easier to quantize, because, you know,

  • the error is not very meaningful as long

  • as you get the right category.

  • Regressions are much harder because now you really care

  • about the actual values.

  • Yeah, it's an open problem.

  • AUDIENCE: Yeah, it's a tough problem.

  • Thank you.

  • AUDIENCE: I have a question about the results

  • from the GNMT training with induced sparsity.

  • I was wondering if you had any insights on why

  • the training with 80% sparsity would perform better

  • than the original version?

  • Like, if you looked at the results.

  • RAZIEL ALVEREZ: You know, the hand-waving thing,

  • that we always say in these cases,

  • is some regularization happens.

  • [LAUGHTER]

  • Yeah.

  • And you know, I've seen the same with some quantized models.

  • I've never really had the chance to sit down and try to understand what the reasons are for all this.

  • Sometimes it's just because it's within the noise, right?

  • It all depends on your evaluation set, right.

  • If it's really not that big or not that meaningful,

  • then these jumps are all possible.

  • Like, I've seen some models where, oh, it looks great

  • after you quantize it.

  • Then you throw in a new data set, say from speech

  • recognition and noisier utterances,

  • and then you clearly see the difference

  • between one and the other.

  • So a lot of it can be just noise.

  • AUDIENCE: Hi.

  • You mentioned explainability.

  • And a technique could be like saliency maps.

  • Do you have any insights on how these techniques affect

  • the ability to calculate the gradients to calculate

  • the saliency maps, for example?

  • RAZIEL ALVEREZ: You know, like, that's something

  • that we want to invest more, and we haven't had that much time

  • to do it.

  • And I would love for the researchers who are trying to understand neural networks to get more excited about understanding neural networks that have been approximated,

  • but so far, I haven't gotten any luck

  • trying to get the people on that side excited about it.

  • But yeah, I really don't have any meaningful thing

  • to say because I haven't run many experiments over on it.

  • AUDIENCE: Thank you.

  • RAZIEL ALVEREZ: [INAUDIBLE].

  • AUDIENCE: Hey.

  • So what is the best way to handle

  • fragmentation of hardware?

  • So like, quantization depends on the target hardware.

  • And more often than not, mobile phones like Android,

  • you have so much [INAUDIBLE] hardware,

  • so what are the best practices there?

  • RAZIEL ALVEREZ: So one way that we

  • tried to do it was again with these specifications.

  • And like, I don't know to what extent

  • it makes our hardware partners happy, because we would like

  • to be able to target their hardware in the most

  • precise and efficient way.

  • But that's one way that we try to address it.

  • You know, with our knowledge of what hardware is there

  • and what is supported, we tried to create these specifications

  • that tried to accommodate for everybody, which again, is good

  • and at the same time is bad.

  • Then longer term, again, I don't want

  • to say too much, because I really don't have

  • a very concrete plan to share.

  • But part of the way we're building

  • with the MLIR infrastructure is we

  • want to be able to better target that hardware-- to better

  • partner with hardware vendors to understand

  • what are their hardware capabilities

  • and better create these transformations that

  • target that hardware.

  • But we were really trying to make it much better.

  • AUDIENCE: So for now, does it mean, like,

  • you go with the lowest common denominator

  • to maybe like a [INAUDIBLE]?

  • Like, imagine the Android app that you

  • have to apply in a lot of things too?

  • RAZIEL ALVEREZ: And that's why we have, like,

  • all these different quantization types.

  • Like, we have three types, right.

  • And soon, hopefully, we'll be able to even just mix and match

  • those different types, because at the end of the day,

  • it's a very arbitrary boundary.

  • Then we say, oh, this is all integer quantized,

  • and this one is hybrid.

  • And the reality is we should be able to take advantage

  • of mixing and matching up precisions

  • to get something better.

  • Thank you.

  • AUDIENCE: I have a question about pruning.

  • As a general rule in layers, operations

  • are converted to matrix multiply because of their efficiency.

  • With pruning, you're now passing in individual multiply

  • operations one by one.

  • There must be some crossover point

  • at which you need to prune by 10%, 15%, 20% before you're

  • crossing over and actually get an improvement.

  • Thoughts on where that is?

  • RAZIEL ALVEREZ: And I don't know if this

  • is exactly what you're asking.

  • So for example, our pruning API lets you specify what the pruning structure is.

  • So for example, we know that for CPUs [INAUDIBLE]

  • the instructions will typically have registers

  • that can accommodate 16 values.

  • So we know that if we want to speed up on CPU,

  • we expect you to set the setting to say,

  • oh, I want to prune in blocks of, say, 1 by 16.

  • And that's how we can get the speed-ups on CPU, for example.
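
In the Keras pruning API, that block structure is expressed through the block_size argument; the sparsity target and names here are just for illustration:

```python
import tensorflow_model_optimization as tfmot

# Prune in 1x16 blocks so the zeros line up with register-width loads on CPU;
# the 75% target and the Keras 'model' are placeholders.
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model,
    pruning_schedule=tfmot.sparsity.keras.ConstantSparsity(0.75, begin_step=0),
    block_size=(1, 16),
    block_pooling_type="AVG")
```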

  • And unfortunately right now, probably it's

  • going to be hardware dependent, but that's one thing

  • that you can do right now.
