PATRICK WINSTON: We've now almost completed our journey. This will be it for talking about several kinds of learning. The venerable kind, that's the nearest neighbors and identification tree types of learning. Still useful, still the right thing to do if there's no reason not to do the simple thing. Then we have the biologically inspired approaches. Neural nets. All kinds of problems with local maxima and overfitting and oscillation, if you get the rate constant too big. Genetic algorithms. Like neural nets, both are very naive in their attempt to mimic nature. So maybe they work on a class of problems. They surely do each have a class of problems for which they're good. But as a general-purpose first resort, I don't recommend them.

But now the theorists have come out and done some things that are very remarkable. And in the end, you have to say, wow, these are such powerful ideas. I wonder if nature has discovered them, too? Is there good engineering in the brain, based on good science? Or, given the nature of evolution, is it just random junk that happens to be the best way of doing anything? Who knows?

But today, we're going to talk about an idea that I'll bet is in there somewhere, because it's easy to implement, it's extremely powerful in what it does, and it's the essential item in anybody's repertoire of learning mechanisms. It's also a mechanism which, if you understand it only by formula, you will never be able to work the problems on the quiz, that's for sure. Because on the surface, it looks like it would be very complicated to simulate this approach. But once you understand how it works and look at a little bit of the math and let it sing songs to you, it turns out to be extremely easy. So it's about letting multiple methods work on your behalf. So far, we've been talking about using just one method to do something.
And what we're going to do now is we're looking to see if a crowd can be smarter than the individuals in the crowd. But before we get too far down that abstract path, let me just say that the whole works has to do with classification, and in particular binary classification. Am I holding a piece of chalk in my hand, or a hand grenade? Is that a cup of coffee or tea? Those are binary classification problems. And so we're going to be talking today strictly about binary classification. We're not going to be talking about finding the right letter of the alphabet that's written on the page. That's a 26-way choice. We're talking about binary choices.

So we assume that there's a set of classifiers that we can draw on. Here's one: h. And it produces either a minus 1 or a plus 1. So that's how the classification is done. If it's coffee, plus 1. If it's tea, minus 1. If it's chalk, plus 1. If it's a hand grenade, minus 1. So that's how the classification works.

Now, too bad for us, normally the world doesn't give us very good classifiers. So if we look at the error rate of this classifier, or any other classifier, that error rate will range from 0 to 1, measured as the fraction of the cases it got wrong on a sample set. So you'd like your error rate to be way down here. You're dead if it's over there. But what about in the middle? What if it's, say, right there, just a little bit better than flipping a coin? If it's just a little bit better than flipping a coin, that's a weak classifier. And the question is, can you make a classifier that's way over here, like there, a strong classifier, by combining several of these weak classifiers and letting them vote?

So how would you do that? You might say, well, let us make a big classifier, capital H, that works on some sample x, and produces an output that depends on the sum of the outputs of the individual classifiers. So we have h1 working on x. We have h2 working on x. And we have h3 also working on x.
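As a minimal sketch of that error-rate idea, here is a toy classifier run over a toy sample set. The threshold rule and the numbers are made up purely for illustration; they are not from the lecture:

```python
# A classifier maps a sample to plus 1 or minus 1.
# Here the samples are numbers and the (hypothetical) rule is a threshold test.
def h(x):
    return +1 if x > 0.5 else -1

# Hypothetical labeled sample set: (sample, true label) pairs.
samples = [(0.9, +1), (0.7, +1), (0.4, -1), (0.6, -1), (0.1, -1)]

# Error rate: the fraction of the samples the classifier got wrong.
wrong = sum(1 for x, y in samples if h(x) != y)
error_rate = wrong / len(samples)
print(error_rate)  # gets 1 of 5 wrong -> 0.2
```

An error rate near 0 is a strong classifier; anything just under 0.5 is a weak one, barely better than flipping a coin.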
Let's say three of them, just to start us off. And now let's add those guys up, and take the sign of the output. So if two out of the three of those guys agree, then we'll get either plus 1 or minus 1. If all three agree, we'll get plus 1 or minus 1. Because we're just taking the sign. We're just taking the sign of the sum of these guys. So this means that one guy can be wrong, as long as the other two guys are right.

But I think it's easier to see how this all works if you think of some space of samples. You say, well, let's let that area here be where h1 is wrong, and this area over here is where h2 is wrong. And then this area over here is where h3 is wrong. So if the situation is like that, then this formula always gives you the right answers on the samples. I'm going to stop saying "on the sample set" right now, because I want that to be kind of a background thing. We're talking about wrapping this stuff over the sample set. Later on, we'll ask, OK, given that you trained this thing on a sample set, how well does it do on some new examples? Because we want to ask ourselves about overfitting questions.

But for now, we just want to look and see if we believe that this arrangement, where each of these h's is producing plus 1 or minus 1, and we're adding them up and taking the sign, is going to give us a better result than the tests individually. And if they look like this when draped over a sample set, then it's clear that we're going to get the right answer every time, because there's no area here where any two of those tests are giving us the wrong answer. In this little circle here for h1, the other two are getting the right answer. So they'll outvote it, and you'll get the right answer every time.

But it doesn't have to be that simple. It could look like this. There could be a situation where this is h1, wrong answer. This is h2, wrong answer. And this is h3, wrong answer.
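The sign-of-the-sum vote described above can be sketched in a few lines. The three weak classifiers here are hypothetical threshold tests, invented just to make the majority vote concrete:

```python
# Three hypothetical weak classifiers, each answering +1 or -1.
def h1(x): return +1 if x > 0.3 else -1
def h2(x): return +1 if x > 0.5 else -1
def h3(x): return +1 if x > 0.7 else -1

def sign(v):
    return +1 if v >= 0 else -1

# The big classifier H: add up the three votes and take the sign.
# With an odd number of +/-1 votes the sum is never zero, so this is
# exactly a majority vote: one classifier can be wrong as long as
# the other two are right.
def H(x):
    return sign(h1(x) + h2(x) + h3(x))

print(H(0.6))  # h1 and h2 say +1, h3 says -1 -> majority gives +1
print(H(0.2))  # all three say -1 -> -1
```

At x = 0.6, h3 is outvoted by the other two, which is the whole point of letting the classifiers vote.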
And now the situation gets a little bit more murky, because we have to ask ourselves whether those areas where two or three of the tests get it wrong are sufficiently big so as to be worse than one of the individual tests. So if you look at that Venn diagram, and stare at it long enough, and try some things, you can say, well, there is no case where this will give a worse answer. Or, you might end up with the conclusion that there are cases where we can arrange those circles such that the voting scheme will give an answer that's worse than an individual test. But I'm not going to tell you the answer, because I think we'll make that a quiz question. Good idea? OK. So we'll make that a quiz question.

So that looks like a good idea. And we can construct a little algorithm that will help us pick the particular weak classifiers to plug in here. We've got a whole bag of classifiers. We've got h1, we've got h2, we've got h55. We've got a lot of them we can choose from. So what we're going to do is we're going to use the data, undisturbed, to produce h1. We're just going to try all the tests on the data and see which one gives us the smallest error rate. And that's the good guy, so we're going to use that. Then we're going to use the data with an exaggeration of h1 errors. In other words, and this is a critical idea, what we're going to do is we're going to run this algorithm again, but instead of just looking at the number of samples that it got wrong, what we're going to do is