
  • Welcome to this next class in Pattern Recognition. We have been looking at density estimation.

  • So, let us briefly recall what we have been doing in the last couple of classes.

  • We have been looking at how to estimate a density, given iid samples from that density.

  • We first looked at the maximum likelihood estimation method. For the last couple of classes, we have been considering the Bayesian estimation of a density function.

  • So, given a density function f x given theta, where theta is the parameter, we are considering the Bayesian estimate for the parameter theta.

  • As I have already told you, the main difference between maximum likelihood estimation and Bayesian estimation is that in Bayesian estimation the parameter theta, which may be a vector or a scalar, is itself viewed as a random variable.

  • Because we view it as a random variable, it has a prior density, which captures our initial uncertainty. Our knowledge, or lack of knowledge, about the specific value of the parameter is captured through a prior density f theta.

  • So, f theta gives us some idea about what we think are the possible values for the parameter. Given the prior density, we use the data likelihood, that is, f D given theta, to calculate the posterior density f theta given D.

  • Once again, I would like to draw your attention to a caution on the notation. For simplicity, we are using the same symbol f for the densities of all kinds of random variables.

  • So, f of theta, f of D given theta, f of theta given D are all densities; purely as mathematical notation they look the same because f is the same symbol. But we are using f only as a notation to denote a density, and which random variable it is the density of is clear from the context.

  • Thus, f x given theta is the density of x conditioned on the parameter theta, f of theta is the density of the parameter theta, and f theta given D is the conditional density of theta conditioned on the data D.

  • So, even though we use the same symbol f for all densities, I hope you understand that the f used at different times is a different function; it refers to densities of different random variables, and just to keep the notation uncluttered we call all of them f.

  • So, once again, essentially we start with a prior density f of theta for the parameter theta, then use the data likelihood f of D given theta to calculate the posterior f theta given D. We have seen a couple of examples of this process earlier.
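
To make the prior-times-likelihood recipe concrete, here is a minimal numerical sketch (my own illustration with made-up data and a flat prior, not part of the lecture) that computes a posterior over a Bernoulli parameter theta on a discrete grid:

    import numpy as np

    # hypothetical illustration: posterior over a Bernoulli parameter theta on a grid
    theta = np.linspace(0.001, 0.999, 999)     # candidate parameter values
    prior = np.ones_like(theta)                # f(theta): flat prior over (0, 1)
    data = np.array([1, 0, 1, 1, 0, 1])        # made-up iid 0/1 samples D

    # data likelihood f(D | theta) = prod_i theta^(x_i) * (1 - theta)^(1 - x_i)
    likelihood = theta ** data.sum() * (1 - theta) ** (len(data) - data.sum())

    # posterior f(theta | D) is proportional to likelihood times prior
    posterior = likelihood * prior
    posterior /= posterior.sum()               # normalize over the grid

    print("posterior mean estimate:", (theta * posterior).sum())

With a conjugate prior, as discussed next, this numerical normalization is unnecessary because the posterior has a known closed form.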

  • A key idea in Bayesian estimation, as you would have seen by now, is to choose the right kind of prior, and the right kind of prior for us is what is called the conjugate prior.

  • The conjugate prior is that prior density which results in the posterior density belonging to the same class of densities.

  • For example, as we saw when estimating the mean of a Gaussian density where the variance is assumed known, if we choose a Gaussian density for the prior, then the posterior is also a Gaussian density. So, for that particular estimation problem, the conjugate prior happens to be Gaussian.

  • Similarly, for the Bernoulli problem, where we have to estimate the parameter p, namely the probability of the random variable taking the value 1, the appropriate prior density turns out to be the beta density, appropriate in the sense that if I take the prior to be beta, then the posterior also becomes beta. Such a prior is called a conjugate prior.

  • The use of a conjugate prior makes the process of calculating the posterior density easier.

  • As we have seen earlier in the case of the Bernoulli parameter, if we start with a beta density with parameters a 0, b 0 for the prior, then the posterior is also a beta density with possibly some other parameters a n and b n.
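
For concreteness, here is a minimal sketch (my own illustration with made-up data and prior parameters, not part of the lecture) of that beta parameter update and the resulting posterior-mean estimate:

    import numpy as np

    # hypothetical Beta(a0, b0) prior and made-up Bernoulli data
    a0, b0 = 2.0, 2.0
    data = np.array([1, 0, 1, 1, 1, 0, 1, 0])   # iid 0/1 samples

    # conjugate update: the posterior is Beta(a_n, b_n)
    a_n = a0 + data.sum()
    b_n = b0 + len(data) - data.sum()

    posterior_mean = a_n / (a_n + b_n)          # Bayesian estimate using the posterior mean
    ml_estimate = data.mean()                   # maximum likelihood estimate, for comparison
    print(posterior_mean, ml_estimate)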

  • So, calculation of the posterior density is simply a matter of parameter updating: given the parameters of the prior, we update them into the parameters of the posterior.

  • Having obtained the posterior density, we can finally use either the mean or the mode of the posterior density as the final estimate; we have seen examples of both. Or we can calculate f of x given D, the actual class conditional density conditioned on the data, by integrating over the posterior density; we have seen an example of that also.

  • So, in this class we will look at a couple more examples of Bayesian estimation and then close the discussion of Bayesian estimation. As you would have seen by now, Bayesian estimation is a little more complicated, mainly because you have to choose the right kind of prior, and different kinds of estimation problems have different conjugate priors.

  • So, we will start with another example, the multinomial example. We consider estimating the mass function of a discrete random variable Z which takes one of M possible values, say a 1 to a M, where p i is the probability that Z takes the value a i.

  • Essentially, Z takes value a 1 with probability p 1, a 2 with probability p 2, and so on up to a M with probability p M, and we want to estimate p 1, p 2, ..., p M given a sample of n iid realizations of Z.

  • We already considered this problem in the maximum likelihood case, and there we told you that it is particularly important in a certain class of pattern recognition problems, especially those to do with text classification and the like, where discrete features, that is, features that take only finitely many values, are important.

  • We have seen how to do maximum likelihood estimation for obtaining the p 1, p 2, ..., p M that characterize the mass function of this discrete random variable Z. Now we will look at how to do the same thing using Bayesian estimation. I hope the problem is clear; since it was already set up earlier, we will quickly review the notation that we used.

  • As earlier, we represent any realization of Z by an M-dimensional Boolean vector x; x has M components x superscript 1 through x superscript M. The transpose there is because, as I said, all vectors for us are column vectors.

  • Each component x superscript i of this vector is either 0 or 1, and the sum of all of them is 1. That means x takes only the unit vector values 1 0 0 ... 0, 0 1 0 ... 0, and so on; the idea is that if Z takes the i th value a i, we represent it by the vector whose i th component is 1 and all others are 0.

  • We have already seen that this is an interesting and useful representation for maximum likelihood, and we use the same representation here.

  • Now, p i turns out to be the probability that the i th component of this vector is 1; that is what I wrote there. This is because p i is the probability that Z takes the i th value a i, and when Z takes the i th value a i, the i th component of x becomes 1.

  • Also, as I told you last time, the reason why we use superscripts to denote the components of x is that subscripts of x are used to denote the different data items; the data is x 1 to x n. That is why we use superscripts for the components of a particular data item.

  • We also saw last time that for this x, the mass function with the single vector parameter p is the product over i equal to 1 to M of p i to the power x superscript i, because in any given x only one component is 1, and that is the one that survives this product.

  • So, if x is 1 0 0 ... 0, then for that x, f of x given p becomes p 1 to the power 1 times all the other p j to the power 0, that is, p 1. Thus, this correctly represents the mass function that we are interested in, and p is the parameter of the mass function that we need to estimate.

  • As usual, our data has n samples x 1 to x n, and each sample x i is a vector of M components, the components being shown by superscripts.

  • Each component is either 0 or 1, and for each data item, that is, each M-vector x i, if I sum all the components I get 1. Since each component is 0 or 1 and the sum is 1, exactly one component is 1 and all others are 0. This is the nature of our representation, and we have n such data items.

  • So, given such data, we want to estimate p 1, p 2, ..., p M.

  • Thus, essentially what we are doing is estimating the parameters of a multinomial distribution. As you know, the binomial arises when a random experiment that takes only two values, success or failure, is repeated; in the multinomial case it is the same thing with independent realizations of a random experiment that takes more than two values, say M values.

  • If a random variable takes M different values, I can think of it as a random experiment which can result in one of M possible outcomes. So, many samples from that random variable are like a multinomial distribution: I repeat a random experiment that can take one of M possible outcomes n times, some repetitions resulting in the first outcome, some in the second outcome, and so on.

  • I know, for each repetition, what outcome has occurred, and given those, we want to estimate the multinomial parameters p 1, p 2, ..., p M.

  • Now, because we are in the Bayesian context, the first question we have to answer is: what is the conjugate prior in this case? As we have already seen from our earlier examples, for this we should examine the form of the data likelihood. We already have our model, namely the mass function for x that we have seen earlier; given this mass function, the data likelihood is easy to calculate.

  • So, f of D given p is the product over i equal to 1 to n of f of x i given p. If I substitute for f of x i, it is once again the product over j of p j to the power x i superscript j.

  • Now, if I interchange the i and j products, then the product over i of p j to the power x i superscript j can be written as the product over j of p j to the power n j, where n j is the sum over i of x i superscript j.

  • What does that mean? In any given x i, the j th component is 1 if that particular realization of Z is the j th value of Z. So, n j tells you, out of the n samples, how many times Z has taken the j th value; for example, n 1 plus n 2 plus ... plus n M will be equal to n, the total number of samples.

  • So, out of n samples, n 1 times the first value of Z has occurred, n 2 times the second value has occurred, and so on; or, viewing it as a multinomial distribution, n 1 times the first outcome has occurred, n 2 times the second outcome has occurred, and so on.

  • Now, in terms of these n j's, the data likelihood is given by the product over j of p j to the power n j. If this is the data likelihood, we multiply it with a prior and we should get another expression of the same form as the prior; so, what should our prior be?
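
Before turning to the prior, here is a small sketch (my own illustration with made-up one-hot data, not from the lecture) of how the counts n j are obtained from the data and how they determine the likelihood:

    import numpy as np

    # made-up data: n = 5 one-hot vectors for an M = 3 valued random variable Z
    X = np.array([[1, 0, 0],
                  [0, 0, 1],
                  [1, 0, 0],
                  [0, 1, 0],
                  [1, 0, 0]])

    n_counts = X.sum(axis=0)          # n_j = sum over i of x_i^j: how often each value occurred
    p = np.array([0.5, 0.2, 0.3])     # some candidate parameter vector p

    # data likelihood f(D | p) = prod_j p_j^(n_j)
    likelihood = np.prod(p ** n_counts)
    print(n_counts, likelihood)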

  • Given this data likelihood, we expect the prior to be a density proportional to a product of p j to the power a j. If the prior is proportional to a product of p j to the power a j and I multiply it with the data likelihood, I get another product of p j to the power some a j prime, so the posterior will belong to the same class of densities as the prior.

  • Let us also remember that p is a vector parameter: p has M components, all of them probabilities, so they are greater than or equal to 0, and the sum of the p i is equal to 1, because p 1 is the probability that Z takes the first value and so on; this is needed for it to be a mass function for Z.

  • So, we need a density defined over all p that satisfy these constraints, and it should have the form of a product of p j to the power a j.

  • Such a density is what is known as a Dirichlet density. If you remember, when there are only two outcomes, in the Bernoulli case, the prior happened to be what is called the beta density; we will see that the Dirichlet density is a generalization of the beta density.

  • The Dirichlet density is given by f of p equal to gamma of a 1 plus a 2 plus ... plus a M, divided by gamma of a 1 into gamma of a 2 into ... into gamma of a M, times the product over j equal to 1 to M of p j to the power a j minus 1.

  • Here gamma is the gamma function, which we have already seen when we discussed the beta density: gamma of a is the integral from 0 to infinity of x to the power a minus 1 times e to the power minus x, d x. The parameters of this Dirichlet density, the a j's, that is, a 1, a 2, ..., a M, are all assumed to be greater than or equal to 1.

  • Also, the density takes this value only for those p that satisfy all components greater than or equal to 0 and sum of the components equal to 1; outside of those p's the density is 0.

  • That means the density is concentrated on that subset of R to the power M which satisfies p i greater than or equal to 0 and summation of p i equal to 1, which is called a simplex. Those of you who know what a simplex is will recognize it; but even if you do not, the point is simply that this density is non-zero only for those p's that satisfy p i greater than or equal to 0 and summation p i equal to 1, and otherwise the density value is 0.

  • So, that is the Dirichlet density. If M is equal to 2, it becomes gamma of a 1 plus a 2, divided by gamma of a 1 into gamma of a 2, times p 1 to the power a 1 minus 1 times p 2 to the power a 2 minus 1, and that is the beta density. So, when M is equal to 2 this density becomes the beta density, and the Dirichlet density happens to be the conjugate prior here.

  • As you can see, if I am estimating the parameter of a Bernoulli density, then my conjugate prior happens to be beta, whereas if I am estimating the parameters of a multinomial distribution rather than a binomial one, the prior happens to be Dirichlet, which is a neat generalization of the beta density to the more-than-two-outcomes case.

  • Of course, this is an involved expression, and we must first show that it is indeed a density on that particular set of p that we mentioned. Using methods similar to those used for the beta density, we can show that it is a density.

  • The other thing that we want, just as in the beta density case, is the moments of the Dirichlet density, because the posterior will be a Dirichlet density and we need the moments to correctly use the posterior to get our estimates.

  • Once again without proofs, I just put down these moments. If p 1, p 2, ..., p M have a joint Dirichlet density with parameters a 1, a 2, ..., a M, then the expected value of any component, say p j, is a j by a 0, where a 0 is a 1 plus a 2 plus ... plus a M.

  • The variance of p j happens to be a j into a 0 minus a j, divided by a 0 squared into a 0 plus 1, and similarly there is an expression for the covariance. Together, these give us all the moments up to order 2.

  • So, for example, if we are going to use the mean of the posterior as our final estimate, we would need this formula.
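
As a quick check of these formulas (my own sketch with arbitrary parameters, not from the lecture), the stated mean and variance can be compared with sample moments of Dirichlet draws:

    import numpy as np

    a = np.array([2.0, 3.0, 5.0])     # arbitrary Dirichlet parameters a_1, ..., a_M
    a0 = a.sum()

    mean_formula = a / a0                                 # E[p_j] = a_j / a_0
    var_formula = a * (a0 - a) / (a0 ** 2 * (a0 + 1))     # Var(p_j) = a_j (a_0 - a_j) / (a_0^2 (a_0 + 1))

    samples = np.random.default_rng(0).dirichlet(a, size=200000)
    print(mean_formula, samples.mean(axis=0))             # the pairs should roughly agree
    print(var_formula, samples.var(axis=0))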

  • With this, let us now compute the posterior density, taking the prior to be Dirichlet. As we know, the posterior f of p given D is proportional to the product of f of D given p and f of p, where f of D given p is the data likelihood and f of p is the prior, which we have taken to be Dirichlet.

  • We already have an expression for the likelihood, the product of p j to the power n j. Let us say we have taken the prior to be Dirichlet with parameters a 1, a 2, ..., a M; then, substituting these two, the product can be written as the product over j of p j to the power n j plus a j minus 1.

  • Obviously, the reason why we chose this particular prior is that the posterior belongs to the same class, and indeed it does: the product is proportional to a product of p j to the power something.

  • So, if my prior is Dirichlet with parameters a 1, a 2, ..., a M, then the posterior is Dirichlet with parameters n j plus a j, where the n j's come from the data. This is once again very similar to what happened in the Bernoulli case.

  • Thus, the posterior is also Dirichlet, with parameters n j plus a j. If we take, for example, the mean of the posterior as our final Bayesian estimate, we have already seen what the mean of a Dirichlet is: each component is its parameter divided by the sum of the parameters.

  • So, the estimate of p j will be n j plus a j divided by the summation over j of n j plus a j; the summation over j of n j is n, as we have already seen, and a 0 is the notation we have given for the summation of the a j's.

  • So, the Bayesian estimate, taken as the mean of the posterior, turns out to be n j plus a j divided by n plus a 0. Let us recall that the ML estimate for this was n j by n.

  • The ML estimate is very easy to see: we are asking what the probability is that Z takes the j th value, or that the j th outcome occurs, and that is estimated as the number of times the j th outcome occurred divided by the total number of samples; n j, the summation over i of x i superscript j, is the number of times the j th value has occurred.

  • So, n j by n was the ML estimate, as we have already derived. Here, instead of n j by n, the estimate becomes n j plus a j divided by n plus a 0, where a j and a 0, which is a 1 plus a 2 plus ... plus a M, are determined by our choice of the prior; our choice of prior determines what the values a j are.

  • Just as in the case of the Bernoulli parameter, the nature of the Bayesian estimate is the same. You can think of the prior as saying that, before I collect data, I have some idea of what the values of the p j's are.

  • I have encoded that idea by saying that I have done a 0 fictitious repetitions of Z, of which a 1 gave me the first value, a 2 gave me the second value, and so on up to a M giving me the M th value. I can choose a 0 as well as a 1 to a M based on my idea of what the numbers p 1 to p M are.

  • The final estimate is then the number of times the j th value has occurred in the actual data plus the number of times it has occurred in the fictitious trials, divided by the total number of actual trials plus fictitious trials.

  • So, as in the Bernoulli case, if the data is small we do not go very wrong, because our prior beliefs will ensure that the p j's do not take unnatural values, for example p j becoming 1 or 0 when the data is very small. But as n increases, for any fixed a j and a 0, this ultimately becomes n j by n.

  • So, asymptotically the Bayesian estimate will be the same as the maximum likelihood estimate, and hence it will be consistent. But once again, as in the Bernoulli case, the prior allows my initial beliefs to properly moderate the data, especially when the data is not very large.
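
Here is a minimal numerical sketch (my own illustration with made-up counts and hypothetical prior parameters, not from the lecture) of the multinomial Bayesian estimate n j plus a j over n plus a 0, compared with the ML estimate n j over n:

    import numpy as np

    # made-up counts n_j for an M = 3 valued variable, and a hypothetical Dirichlet prior a_j
    n_counts = np.array([4, 1, 0])          # how often each value occurred in the data
    a = np.array([2.0, 2.0, 2.0])           # prior parameters a_1, ..., a_M

    n = n_counts.sum()                      # total number of samples
    a0 = a.sum()                            # a_0 = a_1 + ... + a_M

    p_ml = n_counts / n                     # ML estimate n_j / n (can be 0 for unseen values)
    p_bayes = (n_counts + a) / (n + a0)     # posterior mean (n_j + a_j) / (n + a_0)

    print("ML   :", p_ml)
    print("Bayes:", p_bayes)

Note how the unseen third value gets probability 0 under ML but a small positive probability under the Bayesian estimate, which is exactly the small-sample moderation described above.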

  • Now, let us look at another example. Last class we considered estimating the mean of a normal distribution where we assumed the variance to be known. Now let us do it the other way round: we want to estimate the variance of a normal distribution, assuming the mean to be known.

  • It might look a little strange to you; when we did the ML estimate we did not have to take so much trouble, we directly did one example estimating both the mean and the variance of a one-dimensional Gaussian. In the ML case it is very straightforward, but here, for each kind of parameter, the corresponding conjugate prior is different.

  • For example, when we wanted to estimate the mean, the conjugate prior was Gaussian. Some of you may be thinking that because we were estimating a Gaussian density, the conjugate prior happened to be Gaussian; that is not true.

  • If you want to estimate the variance of a Gaussian, where I assume the mean is known, the conjugate prior cannot be Gaussian, because the variance as a parameter can take only positive values, so its density cannot be Gaussian. We might then jump to the conclusion that maybe it is the exponential density, which is non-zero only when the random variable takes positive values.

  • But the exponential is only a special case; we will see that in general the prior is not exponential, the exponential is only a very special case of the prior. Also, it turns out that for this case, rather than working with the variance itself, it is better to take 1 by the variance as the parameter; it is often denoted by nu and is called the precision.

  • While I have not done the vector case, there the inverse of the covariance matrix sigma is called the lambda matrix, the precision matrix. In the scalar case, of course, we simply take 1 by the variance as the precision, which is often denoted by nu.

  • In terms of the parameter nu, the normal density model is as follows: normally it is 1 by sigma root 2 pi, and 1 by sigma is root nu, so this becomes root nu by root 2 pi, times the exponential of what is normally minus half of x minus mu whole square by sigma square; since 1 by sigma square is nu, it is written as exponential of minus nu by 2 into x minus mu whole square.

  • Note that we are assuming mu is known; nu is the unknown parameter. We need to estimate nu keeping in mind that nu is always positive, so, for example, when we choose a prior density we choose one that is 0 for negative nu. To find what the conjugate prior density should be, we have to look at the data likelihood.

  • So, let us look at the data likelihood in terms of nu: it is the product over i equal to 1 to n of f of x i given nu, where f of x given nu is the density above.

  • Taking the product, the root nu factors give me root nu to the power n, that is, nu to the power n by 2; the 1 by root 2 pi factors give 2 pi to the power minus n by 2. So I have a 2 pi to the power minus n by 2 term and a nu to the power n by 2 term, and when I take the product of the exponentials over the x i, it becomes exponential of minus nu by 2 into the sum over i of x i minus mu whole square.

  • Now, to ask what the right prior should be, we should ask what kind of function of nu this is. Viewed as a function of nu, we essentially have an exponential of nu times something, and a nu to the power something term.

  • So, the conjugate prior should be proportional to a product of a power of nu and an exponential of a linear function of nu.

  • This is because the data likelihood is some constant into nu to the power something into an exponential of minus some constant times nu. If the prior also has a nu to the power something and an exponential of a constant into nu, then the product will once again be nu to the power something into an exponential of a constant into nu.

  • So, the prior should be proportional to a product of a power of nu and an exponential of a linear function in nu, and such a density turns out to be what is known as a gamma density.

  • So, let us look at the gamma density. The gamma density is given by f of nu equal to 1 by gamma of a, times b to the power a, times nu to the power a minus 1, times e to the power minus b nu, where a and b are the parameters.

  • Here gamma is once again the gamma function; as a matter of fact, the gamma function arises precisely from making this a density. By a simple integration we can show that this is a density, because the relevant integral turns out to be the gamma function.

  • The gamma density has two parameters a and b: a comes in through nu to the power a minus 1 and b comes in through e to the power minus b nu; the b to the power a factor is needed so that the density integrates to 1. So, the power of nu is controlled by a and the exponential of the linear function in nu is controlled by b; these two are the parameters, and the mean of the gamma density is a by b and the mode is a minus 1 by b.

  • If I choose a to be 1, then the density turns out to be b e to the power minus b nu; when a is 1, gamma of 1 is 1 by a straightforward integration. So, when a is equal to 1 it is simply b e to the power minus b nu, which is nothing but the exponential density; the exponential density is a special case of the gamma density with a equal to 1.

  • So, let us take the prior to be gamma with parameters a 0 and b 0; that means it becomes nu to the power a 0 minus 1 times e to the power minus b 0 nu, those being the two important terms, the rest being constant.

  • So, the posterior density f of nu given D is proportional to f of D given nu into f of nu, where f of D given nu is the data likelihood above and f of nu is the gamma density with a as a 0 and b as b 0.

  • Forgetting the constants and keeping only the nu terms, f of D given nu contributes nu to the power n by 2 times exponential of minus nu by 2 into the summation of x i minus mu whole square, and from f of nu I get nu to the power a 0 minus 1 times exponential of minus b 0 nu.

  • Because of the way we chose the prior, the two nu-power terms combine into nu to the power something and the exponential terms combine into an exponential of something into nu. Doing that, it becomes nu to the power a 0 plus n by 2 minus 1, into exponential of minus b 0 nu minus nu by 2 into this sum.

  • So, we once again have nu to the power something times an exponential of minus a linear function of nu; the posterior, as expected, is once again a gamma density. What kind of gamma density is it? The gamma density with parameters a and b is, forgetting the constants, proportional to nu to the power a minus 1 times e to the power minus b nu. So, for the posterior density, the a parameter is a 0 plus n by 2 and the b parameter is b 0 plus half into this sum.

  • So, if we say the posterior is a gamma density with parameters a n and b n, then a n will be a 0 plus n by 2, and b n will be b 0 plus half into this summation.

  • I can rewrite this summation: 1 by n times the summation over i equal to 1 to n of x i minus mu whole square is the maximum likelihood estimate for the variance, since mu is known; call it sigma square hat ML. So this summation is n times sigma square hat ML, and I can write b n as b 0 plus n by 2 times sigma square hat ML.

  • So, if I choose the prior to be gamma with parameters a 0 and b 0, then the posterior becomes gamma with parameters a n and b n, where a n turns out to be a 0 plus n by 2 and b n turns out to be b 0 plus n by 2 times sigma square hat ML, with sigma square hat ML being the maximum likelihood estimate of the variance in this case.

  • Now, if I want to take the mean of the posterior as the final estimate: as we know, for a gamma density with parameters a and b the mean is a by b. Here the posterior density is gamma with parameters a n and b n, so the mean will be a n by b n.

  • So, our Bayesian estimate nu hat would be a n by b n, that is, a 0 plus n by 2 divided by b 0 plus n by 2 times sigma square hat ML.

  • Remember that nu is 1 by sigma square; if I did not have the a 0 and b 0, nu hat would be 1 by sigma square hat ML, which is the same as the maximum likelihood estimate, because nu is 1 by sigma square and the ML estimate of 1 by sigma square is 1 by sigma square hat ML.

  • The a 0 and b 0 are determined by our choice of prior; we are choosing a gamma density as the prior, and the kind of gamma density we want is what determines the values of a 0 and b 0.
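
As a quick illustration, here is a minimal sketch (my own, with made-up data and hypothetical prior parameters, not part of the lecture) of this gamma posterior update for the precision and the resulting estimate:

    import numpy as np

    # made-up iid samples from a Gaussian whose mean mu is assumed known
    mu = 0.0
    x = np.array([0.3, -1.2, 0.8, 0.1, -0.5, 1.4, -0.9, 0.2])
    n = len(x)

    sigma2_ml = np.mean((x - mu) ** 2)      # ML estimate of the variance (mean known)

    # hypothetical Gamma(a0, b0) prior on the precision nu = 1 / sigma^2
    a0, b0 = 2.0, 1.0
    a_n = a0 + n / 2.0                      # posterior parameter a_n = a_0 + n/2
    b_n = b0 + (n / 2.0) * sigma2_ml        # posterior parameter b_n = b_0 + (n/2) * sigma^2_ML

    nu_hat = a_n / b_n                      # posterior mean, the Bayesian estimate of nu
    print(nu_hat, 1.0 / sigma2_ml)          # compare with the ML estimate 1 / sigma^2_ML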

  • Now, what can we say about this estimate? Once again, as n tends to infinity, nu hat converges to 1 by sigma square hat ML, because as n tends to infinity, n by 2 dominates both a 0 and b 0, so the fraction essentially becomes n by 2 divided by n by 2 times sigma square hat ML, which is 1 by sigma square hat ML.

  • Also note the variance of the posterior: for a gamma density with parameters a n and b n, the variance is a n by b n squared. The numerator grows as n whereas the denominator grows as n squared, so the variance goes to 0 as n tends to infinity.

  • So, as n tends to infinity, the posterior essentially concentrates at its mean, and the mean is 1 by sigma square hat ML; once again, just as we expect, the Bayesian estimate is consistent. But at any small sample size it is determined not only by the data, that is, 1 by sigma square hat ML, but also by the values a 0 and b 0 that we choose for the prior density, which is gamma with parameters a 0 and b 0.

  • So, we have seen Bayesian estimation for either the mean or the variance of the Gaussian: the mean we saw last time, the variance we are seeing just now. When I want to estimate only the mean, assuming the variance is known, the conjugate prior turns out to be Gaussian; when I want to estimate only the variance, assuming the mean is known, the prior turns out to be gamma.

  • If I want to estimate both the mean and the variance, I now have two parameters. Once again, from our experience in estimating the variance, we will choose nu, the precision, as one of the parameters in the density model.

  • So, my density model is now f of x given mu, nu; remember that nu is 1 by sigma square. If both mu and nu are unknown, we need a prior that is a joint density on mu and nu.

  • We already know that if nu is known and only mu is unknown, the prior density is Gaussian, and if mu is known and nu is unknown, the prior density is gamma. So, the joint prior should be some combination of Gaussian and gamma; the algebra turns out to be a little cumbersome, so I will not give you all the algebra, I will just give you the final expression.

  • The conjugate prior would then be what is called a Gaussian-gamma density. Any joint density of two random variables, mu and nu here, can be written as the product of the marginal of nu and the conditional of mu given nu; that is true of any joint density, and that is how we will model this one.

  • The Gaussian-gamma density model is given here. As you can see, f of nu is the first part, the gamma density we have already seen: f of nu is gamma with parameters a 0 and b 0. The density of mu given nu is essentially a Gaussian density with nu as its precision, or 1 by nu as its variance.

  • So, the conditional of mu given nu is a Gaussian in the variable mu, with its own mean mu 0 and with a variance that is a function of the conditioning random variable; it is not directly 1 by nu but 1 by c 0 nu. That is, the marginal for nu is a gamma density and the conditional density of mu, conditioned on nu, is a Gaussian.

  • Actually, by looking at the data likelihood we can see that this is the kind of dependence we need: the nu dependence can always be expressed in terms of nu to the power something and an exponential linear in nu.

  • The mu dependence, on the other hand, can only be expressed through something that is a function of both nu and mu; that is why we have to model the prior with this kind of factorization.

  • I have put an equality here, but by now we know that we do not need the actual constants, we are only looking at the form of these densities. We have seen enough examples by now, so I have started abusing the notation and simply put an equality even though it is not really an equality.

  • This expression is not really a density as written; there will be some normalizing constant, one constant to make the gamma part a density and another to make the normal part a proper density. But by now we know these constants do not matter, so we abuse notation by not writing them.

  • Also, strictly there would be some relation between c 0 and b 0 and a 0, but it does not really matter; we can choose a slightly bigger class of densities as the conjugate prior.

  • Of course, this prior density is quite involved, and doing the Bayesian estimation with it is not easy, so I will skip the details; you can sit and do the algebra, which is more complicated. But ultimately you get similar-looking final expressions: essentially, for the estimate of mu we get a convex combination of the sample mean and something that depends on the prior parameters.

  • Similarly, for the estimate of nu, it will be some factor involving 1 by sigma square hat ML and something that depends on your a 0 and b 0, in such a way that as n tends to infinity the ML estimates and the Bayesian estimates once again become the same.

  • So, I have just given you the prior for this case, but we will not actually derive the final estimates. I have also not done any multidimensional examples; they are not conceptually more difficult than the one-dimensional normal we did, but obviously, compared to maximum likelihood, obtaining Bayesian estimates involves a lot more algebra and the notation becomes more cumbersome, so we will skip that.

  • So, we will simply say that we can obtain Bayesian estimates like this for many standard densities. But one point is by now evident: obtaining maximum likelihood estimates and Bayesian estimates is not the same kind of exercise.

  • For maximum likelihood estimates, almost mechanically, I can write down the likelihood function, differentiate, equate to zero, find the maximum, and I get the estimates. For the Bayesian estimate, I have to properly choose the right kind of prior, the conjugate prior for that particular problem, and only then are the expressions amenable to simplification; then I have to look at the parameters of the posterior density and, based on that, obtain my Bayesian estimates.

  • As we have seen, the conjugate prior depends on the form of f of x given theta. As a matter of fact, for the same density, say the Gaussian, depending on what we take the unknown parameters to be, the conjugate prior changes.

  • For example, if we think only the mean of the Gaussian is unknown, the conjugate prior is Gaussian; if we think only the variance is unknown and we parameterize it through the precision, the conjugate prior happens to be gamma; and if we think both the mean and the precision are unknown, the conjugate prior turns out to be the Gaussian-gamma density I told you about.

  • So, the conjugate prior depends very much on the form of f of x given theta, and the procedure is certainly a little more involved than maximum likelihood estimation.

  • So, what do we gain with all this; why is the maximum likelihood estimate not sufficient? We have already seen, when we started on Bayesian estimation, that the reason we came to it is that the maximum likelihood estimate blindly believes the data.

  • If we have some prior information about the kind of values the parameter can take, and our first few data points happen to be bad, or we have very little data, there is no way in maximum likelihood to bring that partial information about the parameter to bear on the final estimate.

  • The prior density gives us this flexibility: essentially, it allows us to incorporate whatever knowledge we may have about the parameter, in the form of a conjugate prior.

  • And as we have seen in the final expressions, it always produces an estimate in which the small-sample problems are automatically handled by combining the part that comes only from the data with the part that comes from the prior. So, for small samples, our beliefs act as a moderator, preventing us from jumping to too drastic conclusions based on the data; that is the essence of Bayesian estimation.

  • Now we will move on to a few more related issues in estimation. We have seen two specific methods of estimation and how to apply them to different densities; for example, we have derived ML and Bayesian estimates for a few standard densities. Let us now look at a few more generic issues about estimation.

  • The first thing we will do is look at a generic representation that covers many densities, one common form for a density. By now I suppose you have become familiar with the convention, stated right at the beginning when we started on estimation, that we use the word density to mean either a density or a mass function, depending on whether the random variable is continuous or discrete; we use density in this generic sense.

  • So, we will first look at a representation of a density function in terms of some parameters that captures most of the standard densities; such a form is called the exponential family of densities.

  • It is a very important idea because, as we shall see later on, for the exponential family of densities ML estimates become very straightforward: we get generic ML estimates for all densities within the exponential family, one generic set of equations to solve, without having to treat each density individually. Equally importantly, looking at this generic form of a density allows us to introduce an important notion in estimation called the sufficient statistic.

  • First, let us look at what we call the exponential family of densities. Suppose you have a density model for a random variable x with parameters eta; eta could be a single parameter or many parameters, and with many parameters we think of it as a parameter vector.

  • We write the density model f x given eta as h of x, some function only of x, multiplied by g of eta, some function only of the parameter eta, multiplied by the exponential of eta transpose u x, where u x is a vector of functions of x.

  • That is, from x I can make some functions u 1 x, u 2 x and so on; for example, u 1 x could be x, u 2 x could be x square, and so on. So u x is some vector of given functions of x, and the density belongs to the exponential family if it can be written as a product of a term involving only x, a term involving only eta, and a term that involves both eta and x in this very special way, exponential of eta transpose u x.

  • The reason it is called the exponential family is that I can always write the whole thing as a single exponential: the h x factor can be brought inside the exponential by writing it as l n h x, and similarly g eta as l n g eta, because the exponential of l n h x is h x and the exponential of l n g eta is g eta. Since the density can be written as one exponential like this, it is called the exponential family.

  • The important thing is that many standard densities, Bernoulli, binomial, Poisson, gamma, beta, Gaussian, exponential, can all be put in this form. Among the standard densities, the notable one that cannot be written in this form is the uniform density; except for the uniform density, almost all the standard densities can be put in this form.

  • Let us look at a simple example of how to put a density in this standard exponential form; consider the Bernoulli distribution. The mass function with parameter p is p to the power x into 1 minus p to the power 1 minus x.

  • This is obviously not yet in the form of a factor involving only x, multiplied by a factor involving only the parameter, multiplied by a factor of the special exponential form. But the point is that by thinking of something other than p as the parameter, we will be able to put it in this form.

  • So, let us do that. Starting with this, I can write p to the power x into 1 minus p to the power 1 minus x as the exponential of its logarithm. If I take the l n and put it inside an exponential, the l n of this becomes x l n p plus 1 minus x times l n of 1 minus p; so we have exponential of x l n p plus 1 minus x l n 1 minus p.

  • Now, there is a 1 into l n 1 minus p term, and the exponential of l n 1 minus p is 1 minus p, so let me take that factor out as 1 minus p. What remains is x l n p minus x l n 1 minus p, which I can write as x into l n of p by 1 minus p; so the whole thing is 1 minus p times the exponential of x l n p by 1 minus p.

  • The factor 1 minus p I can further write as 1 by 1 plus p by 1 minus p. One can now immediately see that if I call l n of p by 1 minus p my parameter eta, then the exponential term is eta times a function of x, namely x itself.

  • I still have to write the first factor in terms of eta as well; this is easy, because p by 1 minus p is the exponential of l n p by 1 minus p, that is, exponential of eta. So I can write the Bernoulli mass function as 1 by 1 plus exponential eta, into exponential of eta x, where eta is l n of p by 1 minus p.

  • What does this mean? This is exactly in the form h x into g eta into exponential of eta transpose u x: h x is 1, since there is no factor that depends only on x; g eta is the factor 1 by 1 plus exponential eta; and for the exponential eta transpose u x term, I have eta times x, and since eta is a scalar here I can simply take u x to be x.

  • So, if I take eta as l n p by 1 minus p, h x as 1, g eta as 1 by 1 plus exponential eta, and u x equal to x, then the Bernoulli mass function is in this form; the transpose is of course redundant here because eta happens to be one-dimensional.
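
As a quick numerical check (my own sketch, not part of the lecture), the natural-parameter form 1 by 1 plus e to the eta, times e to the eta x, with eta equal to l n of p by 1 minus p, reproduces the usual Bernoulli mass function:

    import numpy as np

    p = 0.3                                   # some Bernoulli parameter
    eta = np.log(p / (1 - p))                 # natural parameter eta = ln(p / (1 - p))

    for x in (0, 1):
        standard = p ** x * (1 - p) ** (1 - x)            # p^x (1 - p)^(1 - x)
        exp_family = np.exp(eta * x) / (1 + np.exp(eta))  # h(x) = 1, g(eta) = 1 / (1 + e^eta)
        print(x, standard, exp_family)                    # the two forms agree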

  • This means that the Bernoulli density belongs to the exponential family, and in the same way all the standard densities can be put in this general framework.

  • Note that if I represent the Bernoulli mass function with p as the parameter, it is not in the generic form h x into g eta into the exponential term; but if, instead of using p as the parameter, I use l n p by 1 minus p as the parameter, then I can put it in this form.

  • After all, if you give me eta, that is, l n p by 1 minus p, I can calculate p, and if you give me p, I can calculate l n p by 1 minus p, so the transformation between eta and p is one-to-one and invertible.

  • So, whether I think of eta or of p as the parameter does not matter; but if I think of eta as the parameter, the mass function comes out in a very standard form, and hence this particular eta is sometimes called the natural parameter of the Bernoulli.

  • In the same way, many other densities can be put in the exponential form. For example, for the Gaussian, normally I would write it with mu and sigma square as the two parameters.

  • But we can also write it as a product of a function of x alone, which is essentially a constant here, a factor that depends only on some parameters, which I call eta 1 and eta 2, and an exponential of eta 1 times some function of x plus eta 2 times some function of x, where eta 1 happens to be mu by sigma square, eta 2 happens to be minus 1 by 2 sigma square, u 1 x happens to be x and u 2 x happens to be x square.

  • The algebra involved is a little more than in the Bernoulli case, but once again it is just algebra. Starting from the usual expression, one can show that it is the same as this exponential-family expression with the identifications eta 1 equal to mu by sigma square, eta 2 equal to minus 1 by 2 sigma square, u 1 x equal to x, and u 2 x equal to x square.
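
Here is a small numerical check (my own sketch with one arbitrary choice of mu, sigma square and x, not from the lecture) that the Gaussian density equals h x times g eta times the exponential of eta 1 u 1 x plus eta 2 u 2 x under these identifications:

    import numpy as np

    mu, sigma2 = 1.0, 2.0                          # arbitrary Gaussian parameters
    eta1, eta2 = mu / sigma2, -1.0 / (2 * sigma2)  # natural parameters

    x = 0.7                                        # any point at which to evaluate the density

    standard = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

    h = 1.0 / np.sqrt(2 * np.pi)                              # h(x): the factor free of parameters
    g = np.sqrt(-2 * eta2) * np.exp(eta1 ** 2 / (4 * eta2))   # g(eta), written purely in eta1, eta2
    exp_family = h * g * np.exp(eta1 * x + eta2 * x ** 2)

    print(standard, exp_family)                    # the two forms agree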

  • With these identifications, this density once again takes the form h x times g eta times exponential of eta transpose u x: h x can be thought of as a constant function, there is the g eta factor, and there is the exponential of eta 1 u 1 x plus eta 2 u 2 x, where eta 1, eta 2, u 1, u 2 are as given.

  • So, the Gaussian is also in the exponential family, and in the same way we can show that almost all standard densities belong to the exponential class of densities.

  • Now, what is the use of showing that many of these densities belong to the exponential family? The main utility, as I said, is that we get a very standard generic form for the maximum likelihood estimate.

  • It also means the following. When a density is in this form and I take the data likelihood, it will be the n-fold product of this, that is, the product of the h x i times g eta to the power n; and when I multiply exponential of eta transpose u x 1 into exponential of eta transpose u x 2 and so on, I ultimately get exponential of eta transpose times the summation of u x i.

  • So, these functions u x, or rather the quantity summation of u x i, are the only way the data affects the data likelihood. This form thus gives us a very interesting generic description of how the data enters the likelihood, and hence a standard method for calculating maximum likelihood estimates for all such densities.
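
To illustrate the point that the data enters the likelihood only through the summation of u x i, here is a small sketch (my own illustration with made-up Bernoulli data, not from the lecture): two different data sets with the same value of that summation give identical likelihoods:

    import numpy as np

    def bernoulli_log_likelihood(data, p):
        # log f(D | p) = sum_i [ x_i ln p + (1 - x_i) ln(1 - p) ]
        data = np.asarray(data)
        return np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

    # two made-up data sets of the same size with the same statistic sum(x_i) = 3
    D1 = [1, 1, 1, 0, 0, 0]
    D2 = [0, 1, 0, 1, 0, 1]

    for p in (0.2, 0.5, 0.8):
        print(p, bernoulli_log_likelihood(D1, p), bernoulli_log_likelihood(D2, p))  # identical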

  • So, in the next class we will look at a few more examples of the exponential family of densities, see how viewing them all as members of the exponential family allows us to obtain maximum likelihood estimates in a generic fashion, and then introduce the notion of what is called a sufficient statistic.

  • Thank you.
