ANNOUNCER: The following program is brought to you by Caltech. YASER ABU-MOSTAFA: Welcome back. Last time, we introduced the learning problem. And if you have an application in your domain and you wonder whether machine learning is the right technique for it, we found that there are three criteria that you should check. You should ask yourself: is there a pattern to begin with that we can learn? And we realize that this condition can be intuitively met in many applications, even if we don't know mathematically what the pattern is. The example we gave was the credit card approval. There is clearly a pattern-- if someone has a particular salary, has been in a residence for so long, has that much debt, and so on, then this is somewhat correlated to their credit behavior. And therefore, we know that the pattern exists in spite of the fact that we don't know exactly what the pattern is. The second item is that we cannot pin down the pattern mathematically, like the example I just gave. And this is why we resort to machine learning. The third one is that we have data that represents that pattern. In the case of the credit application, for example, there are historical records of previous customers, and we have the data they wrote in their application when they applied, and we have some years' worth of record of their credit behavior. So we have data that are going to enable us to correlate what they wrote in the application to their eventual credit behavior, and that is what we are going to learn from. Now, if you look at the three criteria, basically there are two that you can do without, and one that is absolutely essential. What do I mean? Let's say that you don't have a pattern. Well, if you don't have a pattern, then you can try learning. And the only problem is that you will fail. That doesn't sound very encouraging.
But the idea here is that, when we develop the theory of learning, we will realize that you can apply the technique regardless of whether there is a pattern or not. And you are going to determine whether there's a pattern or not. So you are not going to be fooled and think, I learned, and then give the system to your customer, and the customer will be disappointed. There is something you can actually measure that will tell you whether you learned or not. So if there's no pattern, there is no harm done in trying machine learning. The other one, also, you can do without. Let's say that we can pin the thing down mathematically. Well, in that case, machine learning is not the recommended technique. It will still work, but it may not be the optimal technique. If you can outright program it, and find the result perfectly, then why bother generating examples, and trying to learn, and going through all of that? But machine learning is not going to refuse. It is going to learn, and it is going to give you a system. It may not be the best system in this case, but it's a system nonetheless. The third one, I'm afraid you cannot do without. You have to have data. Machine learning is about learning from data. And if you don't have data, there is absolutely nothing you can do. So this is basically the picture about the context of machine learning. Now, we went on to focus on one type, which is supervised learning. And in the case of supervised learning, we have a target function. The target function we are going to call f. That is our standard notation. And this corresponds, for example, to the credit application. x is your application, and f of x is whether you are a good credit risk or not, for the bank. So if you look at the target function, the main criterion about the target function is that it's unknown. This is a property that we are going to insist on.
And obviously, unknown is a very generous assumption, which means that you don't have to worry about what pattern you are trying to learn. It could be anything, and you will learn it-- if we manage to do that. There's still a question mark about that. But it's a good assumption to have, or lack of assumption, if you will, because then you know that you don't worry about the environment that generated the examples. You only worry about the system that you use to implement machine learning. Now, you are going to be given data. And the reason it's called supervised learning is that you are not only given the input x's, as you can see here. You're also given the output-- the target outputs. So in spite of the fact that the target function is generally unknown, it is known on the data that I give you. This is the data that you are going to use as training examples, and that you are going to use to figure out what the target function is. So in the case of supervised learning, you have the targets explicitly. In the other cases, you have less information than the target, and we talked about it-- like unsupervised learning, where you don't have anything, and reinforcement learning, where you have partial information, which is just a reward or punishment for a choice of a value of y that may or may not be the target. Finally, you have the solution tools. These are the things that you're going to choose in order to solve the problem, and they are called the learning model, as we discussed. They are the learning algorithm and the hypothesis set. And the learning algorithm will produce a hypothesis-- the final hypothesis, the one that you are going to give your customer, and we give the symbol g for that. And hopefully g approximates f, the actual target function, which remains unknown. And g is picked from a hypothesis set, and in general, the symbol for a member of the hypothesis set is h. So h is a generic hypothesis. The one you happen to pick, you are going to call g.
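To keep the notation straight, here is a minimal sketch of that setup in code. The toy target f, the data points, and the threshold-style hypothesis set are all hypothetical stand-ins chosen for illustration; in a real problem f is unknown, and only its values on the training data are available.

```python
# Toy stand-in for the unknown target function f.
# (Hypothetical: in practice we never get to see f itself.)
def f(x):
    return 1 if sum(x) > 0 else -1

# Training data: input points x_n together with target outputs y_n = f(x_n).
# This is what makes the setting "supervised" -- the targets are given.
X = [(1.0, 2.0), (-1.0, 0.5), (0.5, -2.0)]
data = [(x, f(x)) for x in X]

# Hypothesis set H: a family of candidate functions h (here, thresholds).
def make_h(threshold):
    return lambda x: 1 if sum(x) > threshold else -1

H = [make_h(t) for t in (-1.0, 0.0, 1.0)]

# A (deliberately trivial) learning algorithm: pick the h in H with the
# fewest errors on the training data. The chosen hypothesis is called g,
# and the hope is that g approximates f outside the data as well.
g = min(H, key=lambda h: sum(h(x) != y for x, y in data))
```

The point of the sketch is only the cast of characters: f is unknown, h is a generic member of the hypothesis set, and g is the particular h the algorithm reports.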
Now, we looked at an example of a learning algorithm. First, the learning model-- the perceptron itself, which is a linear function, thresholded. That happens to be the hypothesis set. And then, there is an algorithm that goes with it that chooses which hypothesis to report based on the data. And the hypothesis in this case is represented by the purple line. Different hypotheses in the set H will result in different lines. Some of them are good and some of them are bad, in terms of separating correctly the examples which are the pluses and minuses. And we found that there's a very simple rule to adjust the current hypothesis, while the algorithm is still running, in order to get a better hypothesis. And once you have all the points classified correctly, which is guaranteed in the case of the perceptron learning algorithm if the data was linearly separable in the first place, then you will get there, and that will be the g that you are going to report. Now, we ended the lecture on sort of a sad note, because after all of this encouragement about learning, we asked ourselves: well, can we actually learn? So we said it's an unknown function. Unknown function is an attractive assumption, as I said. But can we learn an unknown function, really? And then we realized that if you look at it, it's really impossible. Why is it impossible? Because I'm going to give you a finite data set, and I'm going to give you the value of the function on this set. Good. Now, I'm going to ask you what is the function outside that set? How in the world are you going to tell what the function is outside, if the function is genuinely unknown? Couldn't it assume any value it wants? Yes, it can. I can give you 1000 points, a million points, and on the million-and-first point, still the function can behave any way it wants. So it doesn't look like the statement we made is feasible in terms of learning, and therefore we have to do something about it. 
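The simple update rule described above can be sketched as follows. This is a minimal perceptron learning algorithm, assuming 2-D inputs with a constant bias coordinate x0 = 1 and labels in {-1, +1}; the particular data points are made up for the demo.

```python
import numpy as np

def pla(X, y, max_iters=10000):
    """Perceptron learning algorithm: return weights w that separate
    (X, y), assuming the data are linearly separable."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:
            return w  # all points classified correctly: this is g
        i = misclassified[0]
        w = w + y[i] * X[i]  # adjust the current hypothesis using one point
    return w

# Tiny linearly separable data set (first column is the bias coordinate).
X = np.array([[1.0,  2.0,  1.0],
              [1.0,  1.0,  2.0],
              [1.0, -1.0, -2.0],
              [1.0, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = pla(X, y)
print(np.sign(X @ w))  # final hypothesis classifies every point correctly
```

Each update nudges the separating line toward one misclassified point; as stated in the lecture, if the data are linearly separable, this loop is guaranteed to terminate with all points classified correctly.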
And what we are going to do about it is the subject of this lecture. Now, the lecture is called Is Learning Feasible? And I am going to address this question in extreme detail from beginning to end. This is the only topic for this lecture. Now, if you want an outline-- it's really a logical flow. But if you want to cluster it into points-- we are going to start with a probabilistic situation, that is a very simple probabilistic situation. It doesn't seem to relate to learning. But it will capture the idea-- can we say something outside the sample data that we have? So we're going to answer it in a way that is concrete, and where the mathematics is very friendly. And then after that, I'm going to be able to relate that probabilistic situation to learning as we stated. It will take two stages. First, I will just translate the expressions into something that relates to learning, and then we will move forward and make it correspond to real learning. That's the last one. And then after we do that, and we think we are done, we find that there is a serious dilemma that we have. And we will find a solution to that dilemma, and then declare victory-- that indeed, learning is feasible in a very particular sense.
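As a preview of the kind of simple probabilistic situation the lecture builds toward, here is a small simulation. The specific setup is my assumption, not yet stated in the lecture: an unknown probability mu, and the question of what a finite sample's frequency nu can tell us about it.

```python
import random

random.seed(0)

mu = 0.7      # the "unknown" quantity (hypothetical value, for the demo only)
N = 10000     # sample size

# Draw a finite sample and compute the in-sample frequency nu.
sample = [random.random() < mu for _ in range(N)]
nu = sum(sample) / N

# With high probability, nu lands close to mu -- this is the sense in
# which a finite sample can say something about what lies outside it.
print(abs(nu - mu))
```

The mathematics that makes "with high probability" and "close" precise is what the rest of the lecture develops, and it is what eventually connects this situation back to learning.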