
  • [MUSIC PLAYING]

  • JOSH DILLON: Hello, everyone.

  • I'm Josh Dillon, and I'm a lead on the TensorFlow Probability

  • team.

  • And today, I'm going to talk to you about probability stuff

  • and how it relates to TensorFlow stuff.

  • So let's find out what that means.

  • OK, so these days, machine learning

  • often means specifying deep model architectures

  • and then fitting them under some loss.

  • And happily, Keras makes specifying model architecture

  • relatively easy.

  • But what about the loss?

  • Choosing the right loss is tough.

  • Improving one-- even a reasonable one--

  • can be even tougher.

  • And once you fit your model, how do you know it's good?

  • Does accuracy tell the full picture?

  • Why not use mean, entropy, mode?

  • Wouldn't it be great if there existed

  • some mathematical framework that unified these ideas?

  • Better still, wouldn't it be nice

  • if it was plug and play with Keras and the rest of TF?

  • This would make comparing models easier by simply maximizing

  • likelihood and having readily available

  • evaluative statistics.

  • We can rapidly prototype different generating

  • assumptions and quickly reject the bad ones.

  • In short, wouldn't it be great if we could do this--

  • just say I want to maximize the log likelihood

  • and then summarize what I learned easily

  • and in a unified way?

  • So let's play with that idea.

  • Here, we have a data set-- these blue dots.

  • And our task-- our pretend task--

  • is to predict the y-coordinate from the x-coordinate.

  • And the way you might do this is specify some deep model.

  • And of course, you might choose the mean squared error

  • as your loss function.

  • OK.

  • But our wish here is to think probabilistically.

  • And so that means maximizing the log likelihood,

  • as indicated here with this lambda function--

  • the negative of the random variable's log_prob evaluated at y.
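
For readers following along in code, that loss is just the negative log-likelihood of the observed label under whatever distribution the model outputs. A minimal sketch (the name negloglik is illustrative, not from the slides):

```python
# Negative log-likelihood: y is the observed label, rv_y is the distribution
# instance the model produces for the corresponding input.
negloglik = lambda y, rv_y: -rv_y.log_prob(y)
```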

  • And what we want, in addition to that,

  • is to get back a distribution-- a thing

  • that has attached to it statistics

  • that we can use to evaluate what we just learned.

  • If only such a thing were possible.

  • Of course, it is, and you can do this now.

  • Using TensorFlow Probability distribution layers,

  • you can specify the model as part of your deep net.

  • And the loss now is actually part

  • of the model, sort of the way it used to be--

  • the way it's meant to be.

  • And so let's unpack what's happening here.

  • So we have two dense layers.

  • That's sort of business as usual.

  • The second one outputs one float,

  • and that one float is parameterizing

  • a normal distribution's mean.

  • And that's being done through this distribution lambda layer.
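
A minimal sketch of the model being described, using the standard TFP Keras pattern; the hidden width, optimizer, and learning rate are assumptions, not values from the talk:

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Two dense layers; the second emits one float, which DistributionLambda
# turns into the mean of a Normal with a fixed scale of 1.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1),
    tfp.layers.DistributionLambda(lambda t: tfd.Normal(loc=t, scale=1.)),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss=lambda y, rv_y: -rv_y.log_prob(y))  # the negloglik loss from above
# model.fit(x, y, epochs=500, verbose=False)  # x, y: the blue dots
```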

  • In so doing, we're able to find this line.

  • That looks great.

  • And the best part is, once we instantiate

  • this model with test points, we have back a distribution

  • instance--

  • for which you get not just the mean, which is what you'd get today,

  • but also entropy, variance, standard deviation, all of these things.

  • And you can even compare between this and other distributions,

  • as we'll see later.

  • But if we look at this data, something's

  • still a little fishy here, right?

  • Notice that as the magnitude of x increases, the variance of y

  • also seems to increase.

  • So that means that maybe our model's a little suspicious.

  • So since we're in this probabilistic framework

  • and we're no longer doing loss hacking--

  • we're actually building a model--

  • what can we do to fix this?

  • Answer-- learn the variance too.

  • It's actually pretty obvious.

  • If we're fitting a normal, why on earth

  • do we think that the variance would just be 1?

  • And by the way, that's what you're

  • doing when you use mean squared error.

  • And so now, to achieve this, all I had to do

  • is make my previous layer output two floats.

  • I pass one in as the mean to the normal, one

  • in as the standard deviation of the normal.

  • And presto chango, now I've learned the standard deviation

  • from the data itself.

  • That's what the green lines are.
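
A sketch of that change: the last dense layer now emits two floats, one for the mean and one, pushed through a softplus so it stays positive, for the standard deviation. The softplus choice is an assumption here, not something stated in the talk. The model is compiled and fit exactly as before, with the same negative log-likelihood loss.

```python
# Same model, but the Normal's scale is now learned from the data too.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(2),  # two floats per example: mean and (pre-)scale
    tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=tf.math.softplus(t[..., 1:]))),
])
```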

  • So this is really cool, because now, if you're a statistician,

  • you would say, hey, I'm able to handle heteroscedasticity.

  • If you want a $10 word, you can call

  • this aleatoric uncertainty.

  • And what this really means is that you're

  • learning known unknowns.

  • It means that the data itself had variance,

  • and you learned it.

  • And it cost you basically nothing

  • to do but a few keystrokes.

  • And furthermore, the way in which you saw how to do this

  • was self-evident from the very fact

  • that you were using a normal distribution which

  • had this curious constant just sitting there.

  • So this is good.

  • But, hm, I don't know.

  • Is there enough data for which we can reliably

  • claim that this red line is actually the mean,

  • and these green lines are actually

  • the standard deviation?

  • How would we know if we have enough data?

  • Is there anything else we can do?

  • Of course, there is.

  • Why learn just a single set of weights?

  • A Keras dense layer has two components-- a kernel matrix

  • and a bias vector.

  • What makes you think that those point estimates are

  • the best, especially given that your data set itself is random

  • and possibly inadequate to meaningfully and reliably learn

  • those point estimates?

  • Instead, if you use a TensorFlow Probability

  • dense variational layer, you can actually learn a distribution

  • over weights.

  • This is the same as learning an ensemble that's

  • infinitely large.

  • But luckily, it doesn't take infinitely

  • long to train this ensemble.

  • In fact, it takes just a little bit longer

  • than what it took to train on the previous slides.

  • And as you can see here, all I had to do

  • is replace Keras's Dense layer with TFP's DenseVariational layer,

  • and in so doing, achieve this kind

  • of Bayesian weight uncertainty.
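
A sketch of what that swap can look like with tfp.layers.DenseVariational. The layer needs a surrogate posterior and a prior over its weights; the mean-field forms below follow the pattern used in TFP's regression examples and should be read as one reasonable choice under those assumptions, not the only one:

```python
# Mean-field surrogate posterior: an independent trainable Normal per weight.
def posterior_mean_field(kernel_size, bias_size=0, dtype=None):
  n = kernel_size + bias_size
  return tf.keras.Sequential([
      tfp.layers.VariableLayer(2 * n, dtype=dtype),
      tfp.layers.DistributionLambda(lambda t: tfd.Independent(
          tfd.Normal(loc=t[..., :n],
                     scale=1e-5 + tf.nn.softplus(t[..., n:])),
          reinterpreted_batch_ndims=1)),
  ])

# Prior: a Normal(loc, 1) per weight, with a trainable loc.
def prior_trainable(kernel_size, bias_size=0, dtype=None):
  n = kernel_size + bias_size
  return tf.keras.Sequential([
      tfp.layers.VariableLayer(n, dtype=dtype),
      tfp.layers.DistributionLambda(lambda t: tfd.Independent(
          tfd.Normal(loc=t, scale=1.),
          reinterpreted_batch_ndims=1)),
  ])

num_examples = 150  # size of the training set (assumed)

# Dense -> DenseVariational; the output still parameterizes a Normal mean.
model = tf.keras.Sequential([
    tfp.layers.DenseVariational(1, posterior_mean_field, prior_trainable,
                                kl_weight=1. / num_examples),
    tfp.layers.DistributionLambda(lambda t: tfd.Normal(loc=t, scale=1.)),
])
```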

  • The $10 word here is epistemic uncertainty.

  • But again, I like to think of it as unknown unknowns.

  • I'm not sure what my data is not telling me,

  • so I'm going to be careful in the bookkeeping I

  • make when tracking the weights that I learn.

  • As a consequence, of course, this

  • means that any model you make, any instantiation of this model

  • is now actually a random variable because the weights

  • are random variables.

  • And that's why you see here all of the lines.

  • There are many lines.

  • We have an ensemble of them.

  • But if you were to average those and take the sample

  • standard deviation, say, then that

  • would give you an estimate of credible intervals

  • over your prediction.

  • So now you can go to your customer

  • and say, look, here's what I think would happen,

  • and here's how much you should trust me.

  • So this is great, right?

  • But we seem to have lost the heteroscedastic part.

  • Notice that the blue dots are still more

  • dispersed on the right-hand side.

  • So can we do both?

  • Of course, we can.

  • It's all modular.

  • I just have my dense variational layer output two floats, instead

  • of one, like we did before.

  • Feed that into my output layer, which is a normal distribution.

  • And presto chango, I'm learning both known and unknown

  • unknowns, and all it cost me was a few keystrokes.
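
Combining the two previous sketches under the same assumptions: the variational layer now emits two floats, and the output distribution learns its scale as well as its mean.

```python
# Weight uncertainty (DenseVariational) plus learned observation noise.
model = tf.keras.Sequential([
    tfp.layers.DenseVariational(2, posterior_mean_field, prior_trainable,
                                kl_weight=1. / num_examples),
    tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=tf.math.softplus(t[..., 1:]))),
])
```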

  • And so what you see here now is an ensemble of standard

  • deviations associated with the known unknown parts--

  • the variance present or observable in the y-axis--

  • as well as a number or an ensemble

  • of these mean regressions.

  • OK, that's cool.

  • So I like where this is going.

  • But I have to ask, what makes you think a line is even

  • the right thing to fit here?

  • Is there another distribution we could choose, a richer

  • distribution, that would actually find

  • the right form of the data?

  • And of course, the answer is yes.

  • It's a Gaussian process.

  • By tossing in this fancy distribution,

  • it turns out that the data wasn't linear at all.

  • No wonder we had such a hard time fitting it.

  • It was sinusoidal, and the Gaussian process can see this.

  • How can the Gaussian process see this?

  • Because it treats the loss itself as a random variable.

  • Now, how could you do that, if you're just specifying mean

  • squared error as your loss?

  • You can't.

  • It has to be part of your model, and that's the power

  • of probabilistic modeling.

  • When you bake in these ideas into one model,

  • you get to move things around fluidly between weight

  • uncertainty and variance in the data,

  • and even uncertainty in the loss function you're fitting itself.
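
The talk swaps in a (variational) Gaussian process for this step. As a rough stand-in at the distributions level, here is a Gaussian process regression posterior conditioned on the same kind of data; the kernel, its hyperparameters, the noise level, and the placeholder arrays x, y, and x_test are all assumptions for illustration:

```python
tfk = tfp.math.psd_kernels

# Posterior predictive GP over a grid of test inputs, conditioned on the
# observed (x, y) pairs.
gp_posterior = tfd.GaussianProcessRegressionModel(
    kernel=tfk.ExponentiatedQuadratic(amplitude=1., length_scale=0.5),
    index_points=x_test[..., tf.newaxis],          # test inputs, shape [M, 1]
    observation_index_points=x[..., tf.newaxis],   # training inputs, shape [N, 1]
    observations=y,                                # training targets, shape [N]
    observation_noise_variance=0.1)

mean_curve = gp_posterior.mean()   # the nonlinear (sinusoidal) trend it recovers
band = gp_posterior.stddev()       # pointwise uncertainty around that trend
```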

  • And so the question is, how can this all be so easy?

  • How did it all fit together?

  • It's TensorFlow Probability.

  • So TensorFlow Probability is a collection

  • of tools designed to make probabilistic reasoning

  • in TensorFlow easier.

  • It is not going to make your job easy.

  • It's just going to give you the tools you need

  • to express the ideas you have.

  • You still have to have domain knowledge and expertise.

  • But you can encode that domain knowledge and expertise

  • in a probabilistic formalism, and TFP

  • has the tools to do that.

  • Statisticians and data scientists

  • will be able to write and launch the same model.

  • Gone are the days of hacking your model in R and importing

  • it over to a faster language, like C++, or even TensorFlow.

  • You can do it all in the same framework.

  • ML researchers and practitioners will

  • be able to make predictions with uncertainty.

  • If you predict the light is green,

  • you'd better be pretty confident that you should go.

  • You can do that with probabilistic modeling

  • and TensorFlow Probability.

  • So we saw one small part of TFP.

  • Broadly speaking, the tools are broken into two components--

  • those tools useful for building models and those tools

  • useful for doing inference on those models.

  • On the model building side, you saw the normal distribution

  • and the variational Gaussian process distribution.

  • A distribution is just a collection of simple summary

  • statistics, exactly like it is in every other library.

  • There are a few differences.

  • Our distributions support this concept

  • of batch shape, which automatically takes advantage

  • of vector processing hardware.

  • But for the most part, they should be

  • pretty natural and easy to use.
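
For example, a single Normal object can hold a whole batch of distributions, and the summary statistics mentioned above are just methods on it:

```python
# One object, three Normals: the batch shape comes from the parameter vectors.
d = tfd.Normal(loc=[0., 1., 2.], scale=[1., 0.5, 2.])
d.batch_shape     # [3]
d.mean()          # [0., 1., 2.]
d.stddev()        # [1., 0.5, 2.]
d.entropy()       # one entropy per batch member
d.log_prob(0.)    # broadcasts 0. against all three distributions
d.sample(5)       # shape [5, 3]
```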

  • We also have something called bijectors,

  • which is a library for transforming random variables.

  • In the simplest case, this can be

  • like taking the exp of a normal, and now you have a lognormal.
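
That simplest case looks like this with a TransformedDistribution and the Exp bijector:

```python
tfb = tfp.bijectors

# Pushing a standard Normal through exp yields a LogNormal(0, 1).
log_normal = tfd.TransformedDistribution(
    distribution=tfd.Normal(loc=0., scale=1.),
    bijector=tfb.Exp())

log_normal.sample(3)      # strictly positive samples
log_normal.log_prob(1.5)  # agrees with tfd.LogNormal(0., 1.).log_prob(1.5)
```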

  • In more complicated cases, it can

  • involve transforming a random variable with a neural network.

  • This includes things like masked autoregressive flows,

  • if you've heard of them, real NVPs,

  • and other sophisticated probabilistic models.

  • You saw layers.

  • We also have some losses that help you build Monte Carlo

  • approximations to otherwise intractable calculations.

  • Edward2 is our probabilistic programming language

  • that helps you combine different random variables into one model.

  • On the inference side, no Bayesian library

  • would be complete without Markov chain Monte Carlo

  • tools, within which we have several transition kernels.

  • One of them is called Hamiltonian Monte Carlo,

  • which naturally takes advantage of TensorFlow's

  • automatic differentiation capability.
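
A minimal, self-contained sketch of that kernel on a toy target density; the step size, leapfrog steps, and chain lengths are arbitrary choices for illustration:

```python
# Unnormalized log-density of a standard Normal in two dimensions; its
# gradients come from TensorFlow's automatic differentiation.
target_log_prob = lambda x: -0.5 * tf.reduce_sum(x ** 2)

hmc = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=target_log_prob,
    step_size=0.1,
    num_leapfrog_steps=3)

samples, is_accepted = tfp.mcmc.sample_chain(
    num_results=1000,
    num_burnin_steps=500,
    current_state=tf.zeros([2]),
    kernel=hmc,
    trace_fn=lambda _, results: results.is_accepted)
```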

  • We also have tools for performing

  • variational inference-- again, taking

  • advantage of TF's automatic differentiation and optimizer

  • toolbox.
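
In recent TFP versions that workflow is packaged as fit_surrogate_posterior; a small sketch against the same toy target, where the surrogate family and optimizer settings are assumptions:

```python
# A trainable Normal surrogate; its scale stays positive via a Softplus bijector.
surrogate = tfd.Normal(
    loc=tf.Variable(0., name='loc'),
    scale=tfp.util.TransformedVariable(1., bijector=tfb.Softplus(), name='scale'))

# Maximize the ELBO (minimize the negative ELBO) with a stochastic optimizer.
losses = tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn=lambda x: -0.5 * x ** 2,
    surrogate_posterior=surrogate,
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
    num_steps=200)
```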

  • And of course, we have our own optimizers

  • that often come up in probabilistic modeling

  • problems, such as Nelder-Mead, BFGS, things like that.
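
For instance, BFGS expects an objective that returns both the value and its gradient, which tfp.math.value_and_gradient provides; a toy quadratic makes the shape of the API clear:

```python
# Minimize sum((x - 2)^2) starting from the origin.
def value_and_grad(x):
  return tfp.math.value_and_gradient(
      lambda z: tf.reduce_sum((z - 2.) ** 2), x)

result = tfp.optimizer.bfgs_minimize(value_and_grad,
                                     initial_position=tf.zeros([3]))
result.position    # approximately [2., 2., 2.]
result.converged   # True
```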

  • The point is, this toolbox has maybe not everything,

  • but certainly, it has most of what

  • you might need to do fancier modeling to actually get

  • more out of your machine learning model.

  • And it doesn't have to be hard.

  • You saw the Keras examples were just

  • a sequence of one-line changes.

  • So of course, TensorFlow Probability

  • is used widely around Alphabet.

  • DeepMind uses it extensively.

  • Google Brain uses it.

  • Google Accelerated Science, product areas-- infrastructure

  • areas even use it for planning purposes.

  • But it's also used outside of Google.

  • So Baker Hughes GE is one of our early adopters of TensorFlow

  • Probability, and they use it to build

  • models to detect anomalies.

  • Anomaly detection is a very hard problem because, hopefully,

  • your data set never has the anomaly

  • you're trying to detect.

  • For example, anyone who flew out here

  • would be happy to know that Baker Hughes GE uses

  • its anomaly detection to predict the lifespan of jet engines.

  • And if we had a data set that had a failing jet engine,

  • that would be a tragedy.

  • And so using math, we can get around this by actually--

  • or they get around this--

  • by modeling models and then trying to, in the abstract,

  • figure out if those are going to be good models.

  • So what you see is their data processing pipeline.

  • The orange boxes use TensorFlow Probability extensively.

  • The orange-bordered box is where they use TensorFlow.

  • And the basic flow is to try to treat the model itself

  • as a random variable, and then determine

  • if it's going to be a good model on an otherwise

  • incomplete data set.

  • And from this, they're able to do--

  • they get remarkable results, dramatic decreases

  • in false positives and false negatives over very large data

  • sets in complicated systems.

  • So the question is, who will be the next success story?

  • Try it out-- it's an open source Python package

  • built using TensorFlow that makes

  • it easy to combine deep learning with probabilistic models.

  • You can pip install it.

  • Check out tensorflow.org/probability.

  • And if you're interested in learning more

  • about Bayesian approaches, check out this book,

  • which we rewrote using TensorFlow Probability,

  • within which you can learn, like I said, Bayesian methods,

  • but also just how to use TensorFlow Probability.

  • If you're not a Bayesian, that's fine too.

  • We have numerous tools for frequentists.

  • We have a second-order generalized

  • linear model solver, which--

  • you should care, because if you're doing linear regression,

  • it can solve that problem in on the order of 30 iterations,

  • which definitely cannot be said of standard gradient descent.
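
The solver being described is, as far as I can tell, tfp.glm.fit, which uses second-order (Fisher scoring) updates; a hedged sketch, where model_matrix and response stand in for your design matrix and targets:

```python
# Second-order fit of ordinary linear regression as a GLM with a Normal family.
coefficients, linear_response, is_converged, num_iters = tfp.glm.fit(
    model_matrix=model_matrix,   # [N, k] design matrix (assumed given)
    response=response,           # [N] targets (assumed given)
    model=tfp.glm.Normal())
```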

  • And if you want to find out more about this example,

  • you can check out our GitHub repository,

  • where you'll find several Jupyter notebooks.

  • Thanks.

  • [MUSIC PLAYING]
