Subtitles section Play video Print subtitles The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: So today, we're going to talk about the probability that a random variable deviates by a certain amount from its expectation. Now, we've seen examples where a random variable is very unlikely to deviate much from its expectation. For example, if you flip 100 mutually independent fair coins, you're very likely to wind up with close to 50 heads, very unlikely to wind up with 25 or fewer heads, for example. We've also seen examples of distributions where you are very likely to be far from your expectation, for example, that problem when we had the communications channel, and we were measuring the latency of a packet crossing the channel. There, most of the time, your latency would be 10 milliseconds. But the expected latency was infinite. So you're very likely to deviate a lot from your expectation in that case. Last time, we looked at the variance. And we saw how that gave us some feel for the likelihood of being far from the expectation-- high variance meaning you're more likely to deviate from the expectation. Today, we're going to develop specific tools for bounding or limiting the probability you deviate by a specified amount from the expectation. And the first tool is known as Markov's theorem. Markov's theorem says that if the random variable is always non-negative, then it is unlikely to greatly exceed its expectation. In particular, if R is a non-negative random variable, then for all x bigger than 0, the probability that R is at least x is at most the expected value of R, the mean, divided by x. So in other words, if R is never negative-- for example, say the expected value is smaller. Then the probability R is large will be a small number. Because I'll have a small number over a big number. So it says that you are unlikely to greatly exceed the expected value. So let's prove that. Now, from the theorem of total expectation that you did in recitation last week, we can compute the expected value of R by looking at two cases-- the case when R is at least x, and the case when R is less than x. That's from the theorem of total expectation. I look at two cases. R is bigger than x. Take the expected value there times the probability of this case happening plus the case when R is less than x. OK, now since R is non-negative, this is at least 0. R can't ever be negative. So the expectation can't be negative. A probability can't be negative. So this is at least 0. And this is trivially at least x. Because I'm taking the expected value of R in the case when R is at least x. So R is always at least x in this case. So its expected value is at least x. So that means that the expected value of R is at least x times the probability R is greater than x, R is greater or equal to x. And now I can get the theorem by just dividing by x. I'm less than or equal to the expected value of R divided by x. So it's a very easy theorem to prove. But it's going to have amazing consequences that we're going to build up through a series of results today. Any questions about Markov's theorem and the proof? All right, there's a simple corollary, which is useful. Again, if R is a non-negative random variable, then for all c bigger than 0, the probability that R is at least c times its expected value is at most 1 and c. So the probability you're twice your expected value is at most 1/2. And the proof is very easy. We just set x to be equal to c times the expected value of R in the theorem. So I just plug in x is c times the expected value of R. And I get expected value of R over c times the expected value of R, which is 1/c. So you just plug in that value in Markov's theorem, and it comes out. All right, let's do some examples. Let's let R be the weight of a random person uniformly selected. And I don't know what the distribution of weights is in the country. But suppose that the expected value of R, which is the average weight, is 100 pounds. So if I average over all people, their weight is 100 pounds. And suppose I want to know the probability that the random person weighs at least 200 pounds. What can I say about that probability? Do I know it exactly? I don't think so. Because I don't know what the distribution of weights is. But I can still get an upper bound on this probability. What bound can I get on the probability that a random person has a weight of 200 given the facts here? Yeah. AUDIENCE: [INAUDIBLE] PROFESSOR: Yes, well, it's 100 over 200, right. It's at most the expected value, which is 100, over the x, which is 200. And that's equal to 1/2. So the probability that a random person weighs 200 pounds or more is at most 1/2. Or I could plug it in here. The expected value is 100. 200 is twice that. So c would be 2 here. So the probability of being twice the expectation is at most 1/2. Now of course, I'm using the fact that weight is never negative. That's obviously true. But it is implicitly being used here. So what fraction of the population now can weigh at least 200 pounds? Slightly different question. Before I asked you, if I take a random person, what's the probability they weigh at least 200 pounds? Now I'm asking, what fraction of the population can weigh at least 200 pounds if the average is 100? What is it? Yeah? AUDIENCE: At most 1/2. PROFESSOR: At most 1/2. In fact, it's the same answer. And why? Why can't everybody weigh 200 pounds, so it would be all the population weighs 200 pounds at least? AUDIENCE: [INAUDIBLE] PROFESSOR: Probability would be 1, and that can't happen. And in fact, intuitively, if everybody weighs at least 200 pounds, the average is going to be at least 200 pounds. And we said the average was 100. And this is illustrating this interesting thing that probability implies things about averages and fractions. Because it's really the same thing in disguise. The connection is, if I've got a bunch of people, say, in the country, I can convert a fraction that have some property into a probability by just selecting a random person. Yeah. AUDIENCE: [INAUDIBLE] PROFESSOR: No, the variance could be very big. Because I might have a person that weighs a million pounds, say. So you have to get into that. But it gets a little bit more complicated. Yeah. AUDIENCE: [INAUDIBLE] PROFESSOR: No, there's nothing being assumed about the distribution, nothing at all, OK? So that's the beauty of Markov's theorem. Well, I've assumed one thing. I assume that there is no negative values. That's it. AUDIENCE: [INAUDIBLE] PROFESSOR: That's correct. They can distribute it any way with positive values. But we have a fact here we've used, that the average was 100. So that does limit your distribution. In other words, you couldn't have a distribution where everybody weighs 200 pounds. Because then the average would be 200, not 100. But anything else where they're all positive and they average 100, you know that at most half can be 200. Because if you pick a random one, the probability of getting one that's 200 is at most 1/2, which follows from Markov's theorem. And that's partly why it's so powerful. You didn't know anything about the distribution, really, except its expectation and that it was non-negative. Any other questions about this? I'll give you some more examples. All right, here's another example. Is it possible on the final exam for everybody in the class to do better than the mean score? No, of course not. Because if they did, the mean would be higher. Because the mean is the average. OK, let's do another example. Remember the Chinese appetizer problem? You're at the restaurant, big circular table. There's n people at the table. Everybody has one appetizer in front of them. And then the joker spins the thing in the middle of the table. So it goes around and around. And it stops in a random uniform position. And we wanted to know, what's the expected number of people to get the right appetizer back? What was the answer? Does anybody remember? One. So you expect one person to get the right appetizer back. Well, say I want to know the probability that all n people got the right appetizer back. What does Markov tell you about the probability that all n people get the right appetizer back? 1/n. The expected value is 1. And now you're asking the probability that you get R is at least n. So x is n. So it's 1 in n. And what was the probability, or what is the actual probability? In this case, you know the distribution, that everybody gets the right appetizer back, all n. 1 in n. So in the case of the Chinese appetizer problem, Markov's bound is actually the right answer, right on target, which gives you an example where you can't improve it. By itself, if you just know the expected value, there's no stronger theorem that way. Because Chinese appetizer is an example where the bound you get, 1/n, of n people getting the right appetizer is in fact the true probability. OK, what about the hat check problem? Remember that? So there's n men put the hats in the coat closet. They get uniformly randomly scrambled. So it's a random permutation applied to the hats. Now each man gets a hat back. What's the expected number of men to get the right hat back? One, same as the other one. Because you've got n men each with a 1 in n chance, so it's 1. Markov says the probability that n men get the right hat back is at most 1 in n, same as before. What's the actual probability that all n men get the right hat back? AUDIENCE: [INAUDIBLE] PROFESSOR: 1 in n factorial. So in this case, Markov is way off the mark. It says 1 in n. But in fact the real bound is much smaller. So Markov is not always tight. It's always an upper bound. But it sometimes is not the right answer. And to get the right answer, often you need to know more about the distribution. OK, what if R can be negative? Is it possible that Markov's theorem holds there? Because I use the assumption in the theorem. Can anybody give me an example where it doesn't