Subtitles section Play video Print subtitles Welcome back to Mining of Massive Datasets. We're going to continue our lesson on recommender systems by looking at content-based recommendation systems. The main idea behind content-based recommendation systems is to recommend items to a customer x similar to previous items rated highly by the same customer. For example, in, in example of movies you might recommend movies with the same actor of actors, director, genre and so on. In the case of websites, blogs or news, we might recommend articles with similar content, similar topics. In the case of people recommendations, we might recommendation people with many common friends to each other. So here's our plan of action. We're going to start with a user and find out a set of items the user likes using both explicit and implicit data. For example, we might look at the items that the user has rated highly and the set of items the user has purchased. And for eat of, each of those items, we're going to build an item profile. An item profile is a description of the item. For example, in this case, we are dealing with geometric shapes. And let's say the user likes, they, a red circle and, a red triangle. We might build item profiles that say that the user likes red items. Right? Or they, or the user likes circles, for instance. And from these item from these item profiles, you're going to infer a user profile. The user profile infers the likes of the user from the profile of items the user likes. Because the user here likes a red circle and a red triangle we would infer that the user likes the color red. They like circles. And they like triangles. Now once we have a profile of the user we can then match that against a catalog and recommend our items to the user. So, let's say the catalog has a bunch of items in it. Some of those items are red, so we can recommend those to the user. So, let's, let's look at how to build these item profiles. For each item, you want to create an item profile, which we can then use to build user profile. So, the profile is a set of features about the item. In the case of movies for instance. The item profile might include author title actor director and so on. In the case of images and videos. We might use metadata and tags. In the case of people the item profile might be a set of friends of the user. Given that the item profile is a set of features, it's often convenient to think of it as a vector. The vector could be either Boolean or real valued, and there's one entry per feature. For example, in the case of movies, the vector might be, the item profile might be a Boolean vector. And there's a 0 or a 1. For each actor, director and so on, depending on whether that actor or that director actually participated in that movie. We'll look at the special case of text. For example, we might be recommending news articles. Now, what's the item profile in this case. The simplest item profile in this case is to pick the set of important words in the document or the item. How do you pick the important words in the item? The usual heuristic that we get from text mining is the technique called TF-IDF, or term frequency inverse document frequency. Many of you may have come across TF-IDF in the context of information retrieval but for those of you who have not here's a quick refresher. Let's say we are looking at a document or item j, and is computing the score for term or feature i. The term frequency TFIJ for feature I in document J is just the number of times the feature J, the feature I appears in the document J divided by the maximum number of time that same feature appears in any document. For example, let's say the feature is a certain word the word apple. And, int he document that we're looking at, the word apple appears five times. But there's another document where the word apple appears 23 times. And this is the maximum number of times the word apple appears in any document at all. Then the term frequency, TF ij is, it's five divided by 23. Now, I'm glossing over the fact that we need to normalize TF to account for the fact that document lengths are different. Let's just ignore that for the moment. Now, the, term frequency captures the number of times, A term of years in a document. Intuitively, the more often, term of years in a document, the more important a feature it is. For example, if a document mentions about Apple five times, we weight Apple as more important in that document than in other documents that just mentions it once. But how do you compare the weight of different terms? For example, you know a, a rare word appearing just a couple of times might more important than a more common word like the appearing thousands of times. This is where the document frequency comes in. Let n i be the number of documents that mention the term i. And let n be the total number of documents, in the whole system. The inverse document frequency for the term i is obtained by dividing, N by n i, the number of documents that mention the term i and then taking the logarithm of that, of that, fraction. Notice that the more common a term, the larger an i. And the larger an i, the lower the IDF. The IDF function ensures, you know, gives a lower weight to more common words and a higher weight to rarer words. So if you put these two pieces together. The TF-IDF score of feature i for document j is obtained by multiplying the Term Frequency and the IDF. So given a document you compute the TF-IDF scores. For every term in the document. And then you sort all the terms in the document by their TF-IDF scores. And then you have some kind of threshold or you might pick the set of words with the highest TF-IDF scores in the document together with their scores and that would be the top profile. So in this case, a doc profile is a real value vector as opposed to a boolean vector. Now that we have item profiles, our next task is to construct user profiles. Let's say we have a user who has rated items with profiles i1 through i n. Now remember, I want to i-n our vectors of, of entries. Let's say this is i-1, this i-2, i-3 and so on. And here is i-n. These are each is a vector in a high dimensional space, with many many Now the simplest way to construct a user profile from a set of item profiles is just to average the item profiles. [SOUND] Where N is the total number of item profiles. So if I take all the item profiles in the users you know, of, of all the item the user has has rated and then take the average, that would be the simplest way of constructing a user profile. Now this doesn't take into account that the user liked certain items more than others. So in that case we might want to use a weighted average, where the weight is equal to the rating given by the user for for each item. Then you'd have a weighted average item profile. A variant of this is to normalize these weights using the average rating of the user. And you've seen example that makes this idea clear. And of course, much more sophisticated aggregations are possible. Here we're only looking at some very simple examples. Let's look at an example that you know? That'll clarify weighted average item profiles. And how to normalize weights. Let's start with an example of a Boolean Utility Matrix. What's a Boolean Utility Matrix? All we have is information of whether the user purchased an item or not. For example. So each entry is either a zero or a one. Let's say the items are movies and the only feature is actor. The item profile in this case is a vector for zero or one for each actor. Zero if that actor did not appear in that movie. And one if that actor appeared in that movie. Suppose user x has watched five movies and two of those movies feature actor a and three of those movies feature actor b. Now the simplest user profile is just the mean of the item profiles. Remember there are 5 vectors, and 2 of those have a 1 for feature A. And so the data feature A is going to be 2 divided by the total number of item profiles, which is 5, which is 0.4. And the weight of feature B, correspondingly, is going to be 3/5. Let's look at the more complex example of its star ratings. Suppose we have star ratings in the range of one to five. And the user has once again watched five movies. And there are two movies starring actor A and three movies starring actor B. The movies that actor A starred in, the user rated three and five. But with the movie that that actor B acted in, the user rated one, two, and four. So since we have five star ratings and the user gives lower ratings for movies they didn't like and higher ratings for movies they liked. It's somewhat apparent from these ratings that the user liked at least one of the movies from from Actor A and one of the movies from Actor B. But didn't he, but they really didn't like two of Actor B's movies, the ones that were rated 1 and 2. 1 and 2 are in fact, negative ratings. Not positive ratings. And we try to capture this fact. The idea of normalizing ratings helps us capture the idea that some ratings are actually negative ratings and some are positive ratings. But the baseline, you know, users are very different from each other. Some users are just more generous in their ratings than others. So, for, user a, for instance. A four might be a widely positive rating. But if for another, four might just be an average rating. To sort of capture this idea, we want to baseline each user's ratings by their average rating. So in this case, the, this user's average rating is a three. If you, average all the five ratings that the user, has provided, the average rating, is a three. And so what we're going to do is just subtract the average rating from each of the individual movie ratings. So in this case the movies with actor A the normalized ratings in that case a three and five, become zero and plus two. And for actor B, the normalized ratings become minus two, minus one, and plus one. Notice that this captures intuition that the user did not like, the, the first two movies with actor B, whereas he really liked, the, the, the second movie with, with, with actor A. Where the first movie with actor A was, you know, was kind of an average movie. Once you do this normalization, then you can compute the profile, the profile weights. But in this case we divide not by the total number of movies. But by the total number of movies with a specific feature. So in this case there are two movies with actor A. And profile weight for actor A the feature with actor A is zero plus two divided by two which is one. And similarly the feature actor B has a profile weight of -2/3. This indicates a mild positive preference for, for actor A. And a mild negative preference for actor B. Now that we have user profiles and actor profiles, the next task is to recommend certain items to the user. The key step in this is to take a pair of user profile and item profile, and figure out what the rating for that user and item pair is likely to be. Remember that both the user profile and the item profile are vectors in high-dimensional space. In this case I've shown them in a two-dimensional space, when the reality of course, they're embedded in a much higher dimensional space. You might recall from a prior lecture that when you have vectors in higher dimensional space a good distance metric between the pair of vectors is the angle theta between the pair of vectors. In particular, you can estimate the angle using the cosine formula. The cosine of Theta, Theta, the angle between the two vectors is given by the dot product of the two vectors, divided by the product of the magnitudes. And this distance, in, in this case, we'll call this cosine similarity between, the user x and the item i. Now technically the cosine distance is actually the angle theta and not the cosine of the angle. Right? The cosine distance, as we studied in an earlier lecture, is the angle theta and the cosine similarity is the angle 180 minus theta. Now the smaller the angle, the more similar the item x and the the, the more similar the user x and the item i r. And therefore the similarity 180 minus data is going to be larger. But for convenience, were going to actually use the cosine of theta as, as a similarity measure. Notice that as the angle of theta becomes smaller, cost theta becomes larger. And as it angle theta becomes larger and larger, the cosine becomes smaller and smaller in fact, as theta becomes greater than 90, the cosine of theta becomes negative. And, so this captures intuition, that, as the angle becomes smaller and smaller, x and i are more and more similar to each other, and the, and it's more likely that x will give a high rating to item i. So the way we make predictions is as follows. Given the user x, we compute the cosine similarity between that user and all the items in the catalog. And then you pick the items with the highest cosine similarity and recommend those to the user. And that's a theory of content-based recommendations. Now let's look at some of the pros and cons of the content-based recommendation approach. The biggest pro of the content-based recommendation approach is that you don't need data about other users in order to make recommendations to a specific user. This turns out to be a very, very good thing because you know,