Subtitles section Play video Print subtitles Welcome to this next class in Pattern Recognition, we have been looking at density estimation. So, let us briefly recall, what we have been doing in the last couple of classes. We have been looking at, how to estimate densities for given IID samples, for a particular density. We first looked at the maximum likelihood estimation method. For the last couple of classes, we have been considering the Bayesian estimation of density function. So, given a density function f x given theta where, theta is the parameter, we are considering Bayesian estimate for the parameter theta. As I have already told you, the main difference between the maximum likelihood estimation and the Bayesian estimation is that, in the Bayesian estimation, we look at the parameter theta, which may be a vector or a scalar, the parameter theta itself is viewed as a random variable. And because, we view it as a random variable, it has a prior density, which captures our initial uncertainty. Our knowledge or lack of knowledge about the specific value of the parameter is captured through a prior density f theta. So, f theta gives us some idea about, what we think or the possible values for the parameter. Given the given the prior density, we are going to use the data likelihood that is, f D given theta to calculate the posterior density f theta given D. Once again, I would like to drive your attention to a caution on the notation, for simplicity, the densities of all kind of random variables we are using the same symbol f. So, f of theta, f of D given theta, f of theta given D, all these are densities, purely as mathematical notation looks same because, f is the same function. But, we are we are using f as a notation to denote density and density of, which random variable it is, is clear from context. Thus, f x given theta is the density of x conditioned on theta, which is the parameter f of theta, is the density of the parameter theta and so on, f theta given D is the conditional density of theta, conditioned on data D. So, even though we are using the same symbol f, for all densities so, I hope you understand that, the f used in different times is a different function. It refers to densities of different random variables and just to keep the notation uncluttered, we are calling it as the f. So, once again, essentially we start with a prior density f of theta, for the parameter theta then, use the data likelihood of D given theta to calculate the posterior f theta given D, we have seen a couple of examples of this process earlier. And the idea of Bayesian estimation by now, you would have seen is, to choose a right kind of prior, the right kind of prior for us is, what is called the conjugate prior. The conjugate prior is that prior density, which results in the posterior density belonging to the same class of densities. For example, as we saw when, we are estimating the mean of a Gaussian random variable or mean of a Gaussian density where, the variance is assumed known, we choose Gaussian density for the prior then, the posterior is also Gaussian density. So, for that particular estimation problem, the prior density happens to be Gaussian similarly, for a Bernoulli problem where, we have to estimate the parameter p namely, the probability of the random variability taken value 1. For parameter p, the appropriate prior density turns out to be beta appropriate in the sense that, if I take prior density to be beta then, the posterior also becomes beta, so such a prior is called a conjugate prior right. The conjugate prior is that prior density, which results in the posterior density also, to be of the same class of densities, the use is of conjugate prior makes the process of calculation of posterior density easier. As we have seen let us say, in the case of the Bernoulli parameter, that we have seen earlier, if we start with some beta a 0, for the beta a 0 b 0 for the prior then, the posterior is also beta density with possibly some other parameters a and b n. So, calculation of the posterior density is simply a matter of parameter updation, given the parameters of the prior now, we update them into parameters of the posterior. Having obtained the posterior density, we can finally use either the mean or the mode of the posterior density at the final estimate. We have seen examples of both or we can also calculate f of x given D, that is the actual class conditional density, conditioned on data by integrating the posterior density and we have seen example of that also. So, this class, we will look at a couple of more examples of Bayesian estimation and then, closed Bayesian estimation. As you would have by now seen, Bayesian estimation is a little more complicated mainly because, you have to choose the right kind of prior and different kind of estimation problems make different priors of the conjugate. So, we will start with another example, this is the multinomial example that is, we consider estimating the mass function of a discrete random variable, which takes one of M possible values say, a 1 to a M where, p i is probability Z takes the value a i. So, essentially Z takes value a 1 with probability p 1, a 2 with probability p 2, a M with probability p M and we want to estimate this p 1, p 2, p M, given a sample of n iid realizations of Z. We already considered this problem in in the maximum likelihood case and there we told you, this is particularly important in certain class of pattern recognition problem. Especially those to do with say, the text classification and so on where, the discrete random variables for feature that is, features that take only finitely many values are important. We have seen, how to do the maximum likelihood estimation for obtaining this p 1, p 2’s that, p 1, p 2, p M that characterize the mass function of this discrete random variable Z. Now, we will look at, how to do the same thing using Bayesian estimation, I hope the problem is clear, this already is done earlier so, we will we will quickly review the notation that we used earlier. So, as earlier, we represent any realization of Z by M-dimensional Boolean vector x, x has M components x superscript 1, x superscript M. The transpose there is because, as I said, all vectors for us are column vectors, each of these components of this vector x, x superscript i is either 0 or 1, and summation of all of them is 1. That means, essentially x takes only the unit vectors 1 0 0 0, 0 1 0 0’s and so on, the idea is that, if z takes the i th value a i then, we represent it by a vector where, i th component is 1 and all others are 0. We have already seen that, this is a interesting and useful representation for maximum likelihood with this same, we use the same representation here. And now, p i turn out to be, the probability that the i th component of this vector is 1 that is what, I wrote there p x subscript i is equal to 1. Because, p i is the probability with where, Z takes the i th value a i and when Z takes the i th value a i, the i th component of x i becomes 1 where, i th component of x becomes 1. Also, as I told you last time, the reason, why we use the superscripts to denote the components of x is because, subscripts of x are used as to denote different data or data is x 1 to x n that is why, we are using superscripts to denote the components of particular data. So, we also seen last time that, for this x, the mass function with the single vector parameter p, is product i is equal to 1 to M p i x i because, in any given x, only one component of x is 1 and that is the one that, survives this product. So, if x is 1 0 0 0 then, for that x, f of x given p will become p 1 to the power 1 and p to the power 0 and so on that is, p 1. So thus, this correctly represents the mass function, that we are interested in and p is the parameter of the mass , that we need to estimate. As usual, our data has n samples x 1 to x n, and each sample x i is a vector of M components, the components are shown by superscripts. And each component is either 0 or 1, and in each data it is M x i that is, each M vector x i, if I sum all the components, it becomes 1. That simply means, because, each component is on 0 1 and sum is 1 means, exactly one component is 1 and all others are 0 so, this is the nature of our representation and we have n such data items. So, given such data, we want to estimate p 1, p 2, p M thus essentially, what we are doing is, we are estimating the parameters of a multinomial distribution. As you know, a multinomial binomial takes only binomial is important, when there is a random experiment that is repeated, which takes only two values success or failure. In the multinomial case, it is the same thing independent realizations of a random experiment that takes more than two values say, M values. If a random variable takes M different values, I can think of it as a random experiment, which can result in one of M possible outcomes right. So, many samples from that random variable are like, I have a multinomial distribution that is, I repeat a random experiment, that can take one of M possible outcomes, n number of times. So, some of which will be well result in first outcome and so on, some of which will results in second outcome and so on. So, I know for each repetition, what outcome has come and given those things, we want to estimate the multinomial parameters p 1, p 2, p M. Now because, we are in the Bayesian context, the first question we have to answer is, what is the conjugate prior in this case right. Now, as we have already seen from our earlier examples, for this we should examine the form of the data likelihood. So, we already have our model, we for this x, we have the mass function, that we that we have seen earlier that is the mass function so, given this mass function, what is the data likelihood, that is easy to calculate. So, f D given p, p is product of this over i is equal to 1 to n now, we can substitute for f of x i, if I substitute for f of x i, f of x i is once again product p j x i j. Now, if I interchange i and j summation then, p j to the power x i j product over i can be written as product over j p j to the power n j where, n j is summation over i x i j. What does that mean, in any given x i, the j th component is 1, if that particular outcome of Z represents the j th value of Z. So, this n j tells you, out of the n, how many times Z taken the j th value so, for example, n 1 plus n 2 plus n M will be equal to n, the total number of samples. So, out of n samples, n 1 times the first value of Z has come, n 2 times the second value of Z has come or looking at as a multinomial distribution, n 1 times the first outcome has occurred, n 2 times the second outcome has occurred and so on. So, now, in terms of this n’s, the data likelihood is given by product over j, p j to the power n j. Now, if this is the data likelihood, we multiply this with a prior and we should get another expression of the same form, as the prior so, what should be our prior. So, this is the data likelihood so, we expect the prior to have a some density, this proportional to a product of p j to the power a j. If the prior is proportional to product of p j to the power a j and then, I multiply with data likelihood, I get another product of p j to the power some a j prime so, once again that posterior will belong to the same density of the prior right. Let us still remember that, p is a vector parameter, p has M components with all of them are probabilities so, they are greater than or equal to 0. And sum of p a is equal to 1 because, p 1 is the probability of that Z takes first value and so on so, this is needed for the mass function of Z. So, we need a density, which which is a density defined over all p, that satisfy this and that should have a form, which is product p j to the power a j. Now, such a density is, what is known as a Dirichlet density, if you remember when there are only two outcomes when you looking at Bernoulli, the the prior happen to be what is called the beta density. So, we will see that, the Dirichlet density is a kind of generalization of the beta density so, the Dirichlet density is given by f of p, is gamma of a 1 plus a 2 plus a M by gamma of a 1 into gamma of a 2 into gamma of a M, product j is equal to 1 to M p j to the power a j minus 1. Where, this gamma is the gamma function, that we already seen when we discuss the beta function the beta density, gamma function gamma of a is integral 0 to infinity x to the power a minus 1, e power minus x d x. In this, the the parameters of this Dirichlet density or this a j’s that is that is, a 1 a 2 a M, all of them are assumed to be greater than equal to 1. Also, this density has this value, only for those p that satisfy all components greater than or equal to 0 or some of the components is 1, outside of those p’s the density is 0. That means, the density is concentrated on that subset of power M, which satisfies p i greater than or equal to 0 and summation p equal to 1, which is essentially called as simplex. Those you know what is simplex is, they are simplex but, anyway even, if you do not know what is simplex is, this density is non-zero only for those p’s that satisfy p i greater than or equal to 0, summation p i is equal to 1 otherwise, the density value is 0. So, that is the Dirichlet density so, if M is equal to 2, this becomes gamma a 1 plus a 2 by gamma a 1 into gamma a 2, and p 1 to the power a 1 minus 1 and p 2 to the power a 2 minus 1 and that is the beta density. So, when M is equal to 2, this density becomes the beta density and this Dirichlet density happens to be the conjugate prior here. So, as you can see, if I am estimating the parameter of a Bernoulli density then, my conjugate prior happens to be beta. Whereas, if Iam estimating parameters for a multinomial distribution rather than binomial one then, the prior happens to be Dirichlet, which is a kind of a neat generalization of the beta density to, a to more than two case. Of course, this is a strange expression and we have first show that, this is a density on that particular set of p, that we we mentioned. Using the using similar methods, as in the case of beta density we can show that, this is a density, the other thing that we want is, just like in the beta density case, ultimately because, my posterior will be a Dirichlet density. We need to know the movements of the Dirichlet density so that, I I can correctly use my posterior to get my estimates. Once again without proofs, I I just put down these movements so, if p 1, p 2, p M have joint densities as Dirichlet, with parameters a 1, a 2, a M then, expected value of any component say, p j is a j by a 0 where, a 0 is a 1 plus a 2 plus a M. Variance of p j happens to be a 0 into a j into a 0 minus a j, by a 0 square into a 0 plus 1 and similarly, this is the co variance. We do not know the co variance but, this kind of gives us all the movements upto the up to order 2. So, for example, if you are going to use the use the mean of the posterior rather or final estimate, we would need this formula. So, with this let us now go on in this case, compute the posterior density, if by taking prior as Dirichlet, the posterior density we have, to for the posterior density, as we know is f p given D is proportional to the product of f D given p, into f p f D given p is the data likelihood, f p is the prior we have taken prior to be Dirichlet. We already have an expression for the likelihood so, if you substitute those two so, this is the expression for the likelihood, product p j to the power n j. And let us say, we have taken the we have taken the prior to be Dirichlet with parameters a 1, a 2, a M then, this becomes the prior so, this product can now be written as, product over p j of n j plus a j minus 1. So, obviously the reason, why we chosen this particular prior is that, the posterior will belong to same class. So, indeed posterior belongs to the same class right, this is product is proportional to product of p j to the power something. So, if my prior is Dirichlet with parameters a 1, a 2, a M then, the posterior is Dirichlet with parameters n j plus a j where, the n j’s come from the data, This is once again very similar to, what happened in the Bernoulli case. Thus, the posterior is also Dirich let with parameters n j plus a j so, if we take for example, the mean of the posterior as our final Bayesian estimate, we already seen, what the mean is a j by sum so, we know summation n j. So, it will be n j plus a j by summation over j n j plus a j, summation over j n j is n, that we have already seen and a 0 is the notation we have given for summation a j’s. So, the the bayesian density, which is taken as the mean of the posterior turns out to be n j plus a j by n plus a 0. Let us recall, that the ML estimate for this was n j by n right. And the ML estimate is very easy to see, we are asking, what is the probability that Z takes the j th value or what is the probability that, the j th outcome occurs that is, equal to the number of times j th outcome occurred by the total number of samples right, n j means summation over i, x i j is the number of times the j th value has occurred. So, n j by n was, as we have already derived was the ML estimate here, instead of it being n j by n, it becomes n j plus a j by n plus a 0 where, a j and a 0, which is a 1 plus a 2 plus a M are determined by our choice of the prior right, our choice of prior determines, what the value a j are. So, just like in the case of the Bernoulli parameters, the nature of the Bayesian estimate is same so, you can think of the prior as saying, that before I collect data in my mind because, I have some idea of, what the values of p j’s are. I have coded them to say that, if I have done a zero repetitions of Z, they are fictitious repetitions a 1 of them would give me first value, a 2 of them will give me second value, a M of them will give me third value. So, I can choose the a 0 as well as a 1 to a M based on my idea of, what these numbers p 1 to p M are. So then, the final estimate is the actual in the data, how many times j has occurred plus how many time j has occurred in the fictitious trials, divided by the total number of actual trials and the fictitious trials. So, when data when the like in the Bernoulli case, if the data is small then, we do not go very wrong because, our paired beliefs we will ensure that, p j’s do not go into unnatural values. For example, p j breaking 1 or 0, when data is very small but, as n increases for any fixed a j and a 0, as n increases ultimately, this becomes n j by n. So, asymptotically the Bayesian estimate will be same as the maximum likelihood estimate and hence, it will be consistent. But, once again like in the Bernoulli case, the the prior allows me, to allow my initial beliefs