Anant has already talked to you about recommender systems, and so far we have learned how to build collaborative filtering recommender systems and profile-based recommender systems. In today's lecture we will look into latent factor recommender models. The basic idea is that we will adopt a machine learning perspective, where we use optimization in order to build very accurate recommender systems. In particular, our lecture today will be motivated by the Netflix Prize. Netflix is a company that rents out DVDs, and in order to improve the customer experience, Netflix wants to be able to recommend which customers will like which movies; this is called a movie recommender system. A few years ago, Netflix prepared some data and created a grand challenge, where the idea was: whoever can build the best recommender system gets a one-million-dollar prize. Here is how the competition was structured. First, scientists would get training data: a set of 100 million ratings from 480,000 users on around 17,000 movies. This is six years' worth of data, from 2000 to 2005. Based on this data, the idea is that we build a recommender system and then send our predicted recommendations, basically how much a given user will like a given movie, to Netflix, which would check them against a test data set, a set of ratings that only Netflix has. This way we can verify that the predicted ratings correspond to the actual, true ratings of the people who watched those movies. So that's the idea: we are given a hundred million movie ratings, and we want to build a recommender system out of this.
The way we can think of our ratings data is in terms of the utility matrix R. This matrix is composed so that rows correspond to movies and columns correspond to users. Most of the matrix is empty, in the sense that most users haven't seen most of the movies. But whenever a user sees a movie and rates it, that rating becomes an entry in the matrix. So, for example, this particular element of the matrix, the number 5, would mean that the user in the fourth column liked the third movie a lot and rated it five stars. Now, given such a matrix, we want to formulate the recommendation problem. The way we think of it is that we want to predict how much a given user will like some movie that they will see in the future. In this sense, we can use the historical rating data to build our model; this is what we'll call the training data. Then, using this trained model, we want to predict how much users will like movies they haven't yet seen; we will call this the test data. And when I say we want to predict how much a user will like a given movie, the way we quantify how accurate our prediction is is by what is called the root-mean-square error. The idea is that our goal is to make predictions about how much users are going to like the movies; then we go and ask these users, hey, how much do you really like that movie? And we want these two values, the prediction and the true value, to be as close as possible. The root-mean-square error formula works as follows. For a given user x and movie i, I want to predict what the rating of user x for movie i will be. Of course, I also know the true rating of user x for movie i. And what we want is for the discrepancy between the predicted rating and the true rating
to be small. We take the difference and square it; the reason we square is that the difference could be either positive or negative, and squaring makes everything positive. We then sum these squared errors across all the users and all the movies, divide by the total number of ratings, and take the square root; this is called the root-mean-square error. So in some sense our goal is to complete this matrix R wherever the values for ratings are unknown, and to complete it as accurately as possible: we want to predict how much a given user is going to like a given movie. So what is the input and what is the output? What Netflix gives us is the training data of 100 million ratings: half a million people, around 20,000 movies, and their ratings. Our goal is to predict a few ratings for every user, where the evaluation criterion, the way we evaluate how good our predictions are, is the root-mean-square error. What we also know is that the root-mean-square error of the in-house system Netflix was using at that point in time was 0.95. So in some sense, on average, this in-house recommender system was about one star away from the true rating. This competition was taking place in 2009 and 2010, and more than 2,700 teams joined. What Netflix also said was that whoever is able to improve their in-house recommender system by 10% gets the $1 million prize. So the idea is: how can we beat the in-house recommender system and get the root-mean-square error from 0.95 down to about 0.85? That's the whole idea. Now, the structure of how to do this is the following. A modern recommender system today is composed of several different components.
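As a rough sketch, the error measure just described could be computed like this; the rating pairs below are made up purely for illustration:

```python
import math

def rmse(pairs):
    """Root-mean-square error over (predicted, actual) rating pairs.

    Squaring each difference keeps positive and negative errors from
    cancelling; we average over all ratings, then take the square root.
    """
    return math.sqrt(sum((p - t) ** 2 for p, t in pairs) / len(pairs))

# Hypothetical (predicted, true) ratings for four user-movie pairs
pairs = [(4.0, 5.0), (3.5, 3.0), (2.0, 2.0), (4.5, 4.0)]
print(round(rmse(pairs), 2))  # 0.61
```

A perfect predictor would score 0.0; the Netflix Prize asked teams to push this number from 0.95 down toward 0.85 on the held-out test ratings.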
The idea is that we want to adopt a multi-scale modeling approach to the rating data: we want to combine several different views of the data, from a global, top-level view, through a regional modeling of the data, all the way down to very local interactions between movies and users. The goal of today's lecture is to focus on the regional, or middle, level of modeling the data. The methodology we'll use to achieve this is matrix factorization, which relates very nicely to the singular value decomposition and dimensionality reduction lecture that we talked about previously. But before we get to the main body of today's lecture, let me talk about global modeling, and also about collaborative filtering, which is a very local way of modeling the data. So what do I mean by global modeling of the data? Global modeling basically means computing some average rating over the movies and also over the users. If we used a global model to predict how much a given user will like a given movie, we could say the mean movie rating in our Netflix data set is 3.7 stars. Then we could say, for example, that the movie The Sixth Sense is rated half a star above the average rating. And we could say that this given user is very critical, so they tend to rate things about 0.2 stars below the average rating. So now we can say: 3.7 is our baseline; this movie is half a star above average, so that is 4.2; but this person, Joe, is a critical person, so they tend to be 0.2 stars below the average. So, 4.2 minus 0.2 equals 4, right?
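This global-effects baseline, the overall mean plus a per-movie deviation plus a per-user deviation, is simple enough to write down directly. The names and numbers below just follow the Joe / Sixth Sense example from the lecture:

```python
def baseline_rating(mu, movie_bias, user_bias):
    """Global-effects baseline: overall mean plus per-movie
    and per-user deviations from that mean."""
    return mu + movie_bias + user_bias

# The lecture's example: mean rating 3.7, The Sixth Sense is rated
# 0.5 above average, and Joe is critical, rating 0.2 below average.
prediction = baseline_rating(3.7, 0.5, -0.2)
print(round(prediction, 1))  # 4.0
```

In a real system the biases would be estimated from the training ratings rather than stated by hand.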
So, in some sense, our prediction of how much Joe will like The Sixth Sense would be 4 stars: as I said, 3.7 plus 0.5 minus 0.2 equals 4.0. Just based on the average rating of the user, the average rating of the movie, and the overall average rating throughout the data set, we can already make a prediction of how much a user will like a given movie. Of course, this is a very crude, very global prediction, but it turns out to be very useful. So this is one end of the spectrum. On the other end of the spectrum, we can use very local, neighborhood properties of a given user and a given movie to make recommendations, and this is what collaborative filtering does. The idea would be the following: I would estimate the ratings Joe gave to other movies that are similar to The Sixth Sense, and based on these kinds of signals I would then make a final estimate that Joe is going to rate our movie with, say, 3.8 stars. Okay? So how does collaborative filtering do what I just said? The idea is basically that we want to derive unknown ratings from the ratings of similar movies; this is item-item based collaborative filtering. Now, the big question with collaborative filtering is how we operationalize the notion of "similar." I basically need to define a similarity measure that tells me how similar movies i and j are. And once I am able to measure the similarity for every pair of movies, the idea is the following: given that I want to make a recommendation
for movie i, the way I will do this is to go find the k nearest neighbors, basically the k movies most similar to movie i, and then take their ratings and add them together in a weighted combination, where movies that are more similar to movie i have a higher weight. This is the formula I have at the bottom, so let me explain it. What we want to do is predict how much user x will like movie i. The way we do this is that, for the given movie i, we go and find the k nearest neighbors, meaning the k movies j most similar to the query movie i. Then we take the rating that user x gave to each such movie j, and multiply it by the similarity between movies i and j. The idea is basically that if movies i and j are very similar, then s_ij will be high, so the rating that user x gave to movie j will have a high weight; if movie j has low similarity to movie i, then the rating of user x for movie j will have little weight in this summation. And in the denominator, we just normalize by the sum of the similarities. This is one way we can compute the rating based on collaborative filtering. Now, how do we put together the local effects and the global effects? How do we combine collaborative filtering with the global effects model that I described two slides ago? The way we do this is basically to add the two together. So here's the formula, which I will explain again. My prediction of how much user x will like movie i is the baseline predictor, which is what we talked about before, plus a correction term. The baseline predictor for user x and movie i is simply the average movie rating plus the average deviation for user x plus the average deviation for movie i. In our previous case, b_x was minus 0.2 because our user, Joe, was
very critical; in our case The Sixth Sense was a good movie, so the bias of that movie was 0.5, and the overall average rating was 3.7. So our baseline recommendation for Joe at that point was 4.0. Now, on top of this baseline recommendation, we can add the part from collaborative filtering: we take the k movies j closest to the given movie i, ask what the deviation of the user's rating is with respect to their baseline for each such movie, and take a weighted combination of these deviations to compute the recommendation. So here, the first part of the formula is the global part of the model, and the second part is the collaborative filtering part. Of course, there are a few caveats, a few things to be careful about. The first is that coming up with a good similarity measure, a good s_ij, is very hard; sometimes it's somewhat arbitrary, and one has to try many different metrics before one starts to work. The second problem with pairwise similarities is that they consider the similarities between the movies but ignore the similarities between the users. And the third thing we could say is that taking this weighted average can sometimes be too restrictive. So there are some issues with this model; these are the three to keep in mind. But in practice, this already works quite well. So let me show you how well these different models work, okay? What I'm showing you with this yellow arrow is the performance of various methods on the Netflix Prize. For example, if for a given user we would just go and predict the global average, so we would just compute the global average star rating over the whole Netflix data set, and whenever a user comes we would just say 3.7, that's our prediction. If we do that, our root-mean-square error would be 1.12. Right, so if we are kind of very naive and
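Putting the two pieces together, here is a minimal sketch of the combined model, baseline plus item-item collaborative filtering over the k most similar rated movies. The movie names, similarity values, and baselines below are all made up for illustration; a real system would estimate them from the training data:

```python
def predict(x, i, ratings, sim, bl, k=2):
    """Baseline plus item-item CF:
    r_hat(x,i) = b(x,i) + sum_j s(i,j)*(r(x,j) - b(x,j)) / sum_j s(i,j),
    where j runs over the k movies most similar to i that user x rated.
    """
    # Movies user x has rated, excluding the query movie itself
    rated = [j for (u, j) in ratings if u == x and j != i]
    # The neighborhood: k most similar rated movies
    neighbors = sorted(rated, key=lambda j: sim[(i, j)], reverse=True)[:k]
    num = sum(sim[(i, j)] * (ratings[(x, j)] - bl[(x, j)]) for j in neighbors)
    den = sum(sim[(i, j)] for j in neighbors)
    return bl[(x, i)] + (num / den if den else 0.0)

# Toy data (hypothetical): Joe's known ratings, similarities of other
# movies to "sixth_sense", and a flat 4.0 baseline for every pair.
ratings = {("joe", "signs"): 4.0, ("joe", "unbreakable"): 5.0,
           ("joe", "titanic"): 3.0}
sim = {("sixth_sense", "signs"): 0.8,
       ("sixth_sense", "unbreakable"): 0.9,
       ("sixth_sense", "titanic"): 0.1}
bl = {pair: 4.0 for pair in list(ratings) + [("joe", "sixth_sense")]}
print(round(predict("joe", "sixth_sense", ratings, sim, bl), 2))  # 4.53
```

With k=2 the neighborhood is "unbreakable" (similarity 0.9) and "signs" (0.8); only "unbreakable" deviates from its baseline, so the prediction is 4.0 + 0.9/1.7 ≈ 4.53.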