Placeholder Image

Subtitles section Play video

  • Welcome back to Mining of Massive Datasets.

  • We're going to continue our lesson on recommender systems by looking at

  • content-based recommendation systems.

  • The main idea behind content-based recommendation systems is to recommend

  • items to a customer x similar to previous items rated highly by the same customer.

  • For example, in, in example of movies you might recommend movies

  • with the same actor of actors, director, genre and so on.

  • In the case of websites, blogs or

  • news, we might recommend articles with similar content, similar topics.

  • In the case of people recommendations,

  • we might recommendation people with many common friends to each other.

  • So here's our plan of action.

  • We're going to start with a user and

  • find out a set of items the user likes using both explicit and implicit data.

  • For example, we might look at the items that the user has rated highly and

  • the set of items the user has purchased.

  • And for eat of, each of those items, we're going to build an item profile.

  • An item profile is a description of the item.

  • For example, in this case, we are dealing with geometric shapes.

  • And let's say the user likes, they, a red circle and, a red triangle.

  • We might build item profiles that say that the user likes red items.

  • Right?

  • Or they, or the user likes circles, for instance.

  • And from these item from these item profiles,

  • you're going to infer a user profile.

  • The user profile infers the likes of

  • the user from the profile of items the user likes.

  • Because the user here likes a red circle and

  • a red triangle we would infer that the user likes the color red.

  • They like circles.

  • And they like triangles.

  • Now once we have a profile of the user we can then match that against a catalog and

  • recommend our items to the user.

  • So, let's say the catalog has a bunch of items in it.

  • Some of those items are red, so we can recommend those to the user.

  • So, let's, let's look at how to build these item profiles.

  • For each item,

  • you want to create an item profile, which we can then use to build user profile.

  • So, the profile is a set of features about the item.

  • In the case of movies for instance.

  • The item profile might include author title actor director and so on.

  • In the case of images and videos.

  • We might use metadata and tags.

  • In the case of people the item profile might be a set of friends of the user.

  • Given that the item profile is a set of features,

  • it's often convenient to think of it as a vector.

  • The vector could be either Boolean or real valued, and there's one entry per feature.

  • For example, in the case of movies, the vector might be,

  • the item profile might be a Boolean vector.

  • And there's a 0 or a 1.

  • For each actor, director and so on, depending on whether that actor or

  • that director actually participated in that movie.

  • We'll look at the special case of text.

  • For example, we might be recommending news articles.

  • Now, what's the item profile in this case.

  • The simplest item profile in this case is to pick the set of important words in

  • the document or the item.

  • How do you pick the important words in the item?

  • The usual heuristic that we get from text mining is the technique called TF-IDF,

  • or term frequency inverse document frequency.

  • Many of you may have come across TF-IDF in the context of information retrieval but

  • for those of you who have not here's a quick refresher.

  • Let's say we are looking at a document or

  • item j, and is computing the score for term or feature i.

  • The term frequency TFIJ for feature I in document J

  • is just the number of times the feature J, the feature I appears in the document J

  • divided by the maximum number of time that same feature appears in any document.

  • For example, let's say the feature is a certain word the word apple.

  • And, int he document that we're looking at, the word apple appears five times.

  • But there's another document where the word apple appears 23 times.

  • And this is the maximum number of times the word apple appears in

  • any document at all.

  • Then the term frequency, TF ij is, it's five divided by 23.

  • Now, I'm glossing over the fact that we need to normalize TF to account for

  • the fact that document lengths are different.

  • Let's just ignore that for the moment.

  • Now, the,

  • term frequency captures the number of times, A term of years in a document.

  • Intuitively, the more often, term of years in a document,

  • the more important a feature it is.

  • For example, if a document mentions about Apple five times, we weight Apple as more

  • important in that document than in other documents that just mentions it once.

  • But how do you compare the weight of different terms?

  • For example, you know a, a rare word appearing just a couple of times might

  • more important than a more common word like the appearing thousands of times.

  • This is where the document frequency comes in.

  • Let n i be the number of documents that mention the term i.

  • And let n be the total number of documents, in the whole system.

  • The inverse document frequency for the term i is obtained by dividing,

  • N by n i, the number of documents that mention the term i and

  • then taking the logarithm of that, of that, fraction.

  • Notice that the more common a term, the larger an i.

  • And the larger an i, the lower the IDF.

  • The IDF function ensures, you know,

  • gives a lower weight to more common words and a higher weight to rarer words.

  • So if you put these two pieces together.

  • The TF-IDF score of feature i for

  • document j is obtained by multiplying the Term Frequency and the IDF.

  • So given a document you compute the TF-IDF scores.

  • For every term in the document.

  • And then you sort all the terms in the document by their TF-IDF scores.

  • And then you have some kind of threshold or

  • you might pick the set of words with the highest TF-IDF scores in

  • the document together with their scores and that would be the top profile.

  • So in this case,

  • a doc profile is a real value vector as opposed to a boolean vector.

  • Now that we have item profiles, our next task is to construct user profiles.

  • Let's say we have a user who has rated items with profiles i1 through i n.

  • Now remember, I want to i-n our vectors of, of entries.

  • Let's say this is i-1, this i-2, i-3 and so on.

  • And here is i-n.

  • These are each is a vector in a high dimensional space,

  • with many many Now the simplest way to construct a user

  • profile from a set of item profiles is just to average the item profiles.

  • [SOUND] Where N is the total number of item profiles.

  • So if I take all the item profiles in the users you know, of,

  • of all the item the user has has rated and then take the average,

  • that would be the simplest way of constructing a user profile.

  • Now this doesn't take into account that the user liked certain items

  • more than others.

  • So in that case we might want to use a weighted average,

  • where the weight is equal to the rating given by the user for for each item.

  • Then you'd have a weighted average item profile.

  • A variant of this is to normalize these weights

  • using the average rating of the user.

  • And you've seen example that makes this idea clear.

  • And of course, much more sophisticated aggregations are possible.

  • Here we're only looking at some very simple examples.

  • Let's look at an example that you know?

  • That'll clarify weighted average item profiles.

  • And how to normalize weights.

  • Let's start with an example of a Boolean Utility Matrix.

  • What's a Boolean Utility Matrix?

  • All we have is information of whether the user purchased an item or not.

  • For example.

  • So each entry is either a zero or a one.

  • Let's say the items are movies and the only feature is actor.

  • The item profile in this case is a vector for zero or one for each actor.

  • Zero if that actor did not appear in that movie.

  • And one if that actor appeared in that movie.

  • Suppose user x has watched five movies and two of those movies feature actor a and

  • three of those movies feature actor b.

  • Now the simplest user profile is just the mean of the item profiles.

  • Remember there are 5 vectors, and 2 of those have a 1 for feature A.

  • And so the data feature A is going to be 2 divided by the total number of

  • item profiles, which is 5, which is 0.4.

  • And the weight of feature B, correspondingly, is going to be 3/5.

  • Let's look at the more complex example of its star ratings.

  • Suppose we have star ratings in the range of one to five.

  • And the user has once again watched five movies.

  • And there are two movies starring actor A and three movies starring actor B.

  • The movies that actor A starred in, the user rated three and five.

  • But with the movie that that actor B acted in, the user rated one, two, and four.

  • So since we have five star ratings and the user gives lower ratings for

  • movies they didn't like and higher ratings for movies they liked.

  • It's somewhat apparent from these ratings that the user liked

  • at least one of the movies from from Actor A and one of the movies from Actor B.

  • But didn't he, but

  • they really didn't like two of Actor B's movies, the ones that were rated 1 and 2.

  • 1 and 2 are in fact, negative ratings.

  • Not positive ratings.

  • And we try to capture this fact.

  • The idea of normalizing ratings helps us capture the idea that

  • some ratings are actually negative ratings and some are positive ratings.

  • But the baseline, you know, users are very different from each other.

  • Some users are just more generous in their ratings than others.

  • So, for, user a, for instance.

  • A four might be a widely positive rating.

  • But if for another, four might just be an average rating.

  • To sort of capture this idea,

  • we want to baseline each user's ratings by their average rating.

  • So in this case, the, this user's average rating is a three.

  • If you, average all the five ratings that the user, has provided,

  • the average rating, is a three.

  • And so what we're going to do is just subtract the average rating from each of

  • the individual movie ratings.

  • So in this case the movies with actor A the normalized ratings in

  • that case a three and five, become zero and plus two.

  • And for actor B, the normalized ratings become minus two, minus one, and plus one.

  • Notice that this captures intuition that the user did not like, the,

  • the first two movies with actor B, whereas he really liked, the, the,

  • the second movie with, with, with actor A.

  • Where the first movie with actor A was, you know, was kind of an average movie.

  • Once you do this normalization, then you can compute the profile,

  • the profile weights.

  • But in this case we divide not by the total number of movies.

  • But by the total number of movies with a specific feature.

  • So in this case there are two movies with actor A.

  • And profile weight for

  • actor A the feature with actor A is zero plus two divided by two which is one.

  • And similarly the feature actor B has a profile weight of -2/3.

  • This indicates a mild positive preference for, for actor A.

  • And a mild negative preference for actor B.

  • Now that we have user profiles and

  • actor profiles, the next task is to recommend certain items to the user.

  • The key step in this is to take a pair of user profile and item profile, and

  • figure out what the rating for that user and item pair is likely to be.

  • Remember that both the user profile and

  • the item profile are vectors in high-dimensional space.

  • In this case I've shown them in a two-dimensional space, when the reality of

  • course, they're embedded in a much higher dimensional space.

  • You might recall from a prior lecture that when you have vectors in

  • higher dimensional space a good distance metric between the pair of vectors is

  • the angle theta between the pair of vectors.

  • In particular, you can estimate the angle using the cosine formula.

  • The cosine of Theta, Theta, the angle between the two vectors is given by

  • the dot product of the two vectors, divided by the product of the magnitudes.

  • And this distance, in, in this case,

  • we'll call this cosine similarity between, the user x and the item i.

  • Now technically the cosine distance is actually the angle theta and

  • not the cosine of the angle.

  • Right?

  • The cosine distance, as we studied in an earlier lecture, is the angle theta and

  • the cosine similarity is the angle 180 minus theta.

  • Now the smaller the angle, the more similar the item x and

  • the the, the more similar the user x and the item i r.

  • And therefore the similarity 180 minus data is going to be larger.

  • But for convenience, were going to actually use the cosine of theta as,

  • as a similarity measure.

  • Notice that as the angle of theta becomes smaller, cost theta becomes larger.

  • And as it angle theta becomes larger and larger, the cosine becomes smaller and

  • smaller in fact, as theta becomes greater than 90,

  • the cosine of theta becomes negative.

  • And, so this captures intuition, that, as the angle becomes smaller and

  • smaller, x and i are more and more similar to each other, and the, and

  • it's more likely that x will give a high rating to item i.

  • So the way we make predictions is as follows.

  • Given the user x, we compute the cosine similarity between that user and

  • all the items in the catalog.

  • And then you pick the items with the highest cosine similarity and

  • recommend those to the user.

  • And that's a theory of content-based recommendations.

  • Now let's look at some of the pros and

  • cons of the content-based recommendation approach.

  • The biggest pro of the content-based recommendation approach is that you don't

  • need data about other users in order to make recommendations to a specific user.

  • This turns out to be a very, very good thing because you know,