Welcome back to Mining of Massive Datasets. We're going to continue our lesson on recommender systems by looking at content-based recommendation systems.

The main idea behind content-based recommendation is to recommend to a customer x items similar to previous items that the same customer rated highly. For example, in the case of movies, you might recommend movies with the same actor or actors, director, genre, and so on. In the case of websites, blogs, or news, we might recommend articles with similar content or topics. In the case of people recommendations, we might recommend people who have many friends in common.
So here's our plan of action. We're going to start with a user and find the set of items the user likes, using both explicit and implicit data. For example, we might look at the items the user has rated highly and the items the user has purchased. For each of those items, we're going to build an item profile, which is a description of the item. For example, suppose we are dealing with geometric shapes, and the user likes a red circle and a red triangle. We might build item profiles that record, for instance, that an item is red or that it is a circle.

From these item profiles, we're going to infer a user profile. The user profile captures the likes of the user, inferred from the profiles of the items the user likes. Because the user here likes a red circle and a red triangle, we would infer that the user likes the color red, likes circles, and likes triangles.

Once we have a profile of the user, we can match it against a catalog and recommend items to the user. So, if the catalog has a bunch of items in it and some of those items are red, we can recommend those to the user.
So, let's look at how to build these item profiles. For each item, we want to create an item profile, which we can then use to build user profiles. The profile is a set of features about the item. In the case of movies, for instance, the item profile might include the author, title, actors, director, and so on. In the case of images and videos, we might use metadata and tags. In the case of people, the item profile might be the set of friends of that person.

Given that the item profile is a set of features, it's often convenient to think of it as a vector. The vector could be either Boolean or real-valued, with one entry per feature. For example, in the case of movies, the item profile might be a Boolean vector with a 0 or a 1 for each actor, director, and so on, depending on whether that actor or director actually participated in that movie.
We'll look at the special case of text; for example, we might be recommending news articles. What's the item profile in this case? The simplest item profile is the set of important words in the document. How do we pick the important words? The usual heuristic from text mining is a technique called TF-IDF, or term frequency times inverse document frequency. Many of you may have come across TF-IDF in the context of information retrieval, but for those who have not, here's a quick refresher.
Let's say we are looking at a document or item j, and we are computing the score for term or feature i. The term frequency TF_ij for feature i in document j is just the number of times feature i appears in document j, divided by the maximum number of times that same feature appears in any document. For example, suppose the feature is the word "apple". In the document we're looking at, the word apple appears 5 times, but there's another document where it appears 23 times, and that is the maximum number of times the word apple appears in any document at all. Then the term frequency TF_ij is 5/23. Now, I'm glossing over the fact that we need to normalize TF to account for the fact that document lengths differ; let's ignore that for the moment.
The term frequency captures the number of times a term appears in a document. Intuitively, the more often a term appears in a document, the more important a feature it is. For example, if a document mentions apple five times, we weight apple as more important in that document than in other documents that mention it just once. But how do we compare the weights of different terms? For example, a rare word appearing just a couple of times might be more important than a common word like "the" appearing thousands of times.
This is where the document frequency comes in. Let n_i be the number of documents that mention term i, and let N be the total number of documents in the whole system. The inverse document frequency for term i is obtained by dividing N by n_i and then taking the logarithm of that fraction. Notice that the more common a term, the larger n_i, and the larger n_i, the lower the IDF. The IDF function thus gives a lower weight to more common words and a higher weight to rarer words.
Putting these two pieces together, the TF-IDF score of feature i for document j is obtained by multiplying the term frequency and the IDF.
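Written out as formulas, this is a compact restatement of the spoken definitions; the symbols f_ij and w_ij are just notation introduced here:

```latex
% f_{ij}: number of occurrences of feature i in document j (notation assumed here)
TF_{ij} = \frac{f_{ij}}{\max_{k} f_{ik}}, \qquad
IDF_{i} = \log \frac{N}{n_{i}}, \qquad
w_{ij} = TF_{ij} \times IDF_{i}
```

Here N is the total number of documents, n_i is the number of documents mentioning feature i, and the maximum in the TF denominator is taken over all documents, matching the apple example above.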
So, given a document, you compute the TF-IDF score for every term in the document, and then sort all the terms by their TF-IDF scores. You can then apply some threshold, or simply pick the set of words with the highest TF-IDF scores, together with their scores, and that becomes the document profile. In this case, the doc profile is a real-valued vector as opposed to a Boolean vector.
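As a rough sketch of this procedure in Python (a toy corpus of word lists; the function name and the top_k cutoff are illustrative assumptions, not from the lecture):

```python
import math
from collections import Counter

def doc_profiles(corpus, top_k=5):
    """Build TF-IDF document profiles for a toy corpus (list of word lists)."""
    counts = [Counter(doc) for doc in corpus]          # f_ij: occurrences of word i in doc j
    n_docs = len(corpus)
    # n_i: number of documents that mention word i
    n_i = Counter(w for c in counts for w in c)
    # max_k f_ik: maximum occurrences of word i in any document (this lecture's normalization)
    max_f = {w: max(c.get(w, 0) for c in counts) for w in n_i}

    profiles = []
    for c in counts:
        scores = {w: (f / max_f[w]) * math.log(n_docs / n_i[w]) for w, f in c.items()}
        # keep the top_k highest-scoring words, with their scores, as the real-valued profile
        top = dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k])
        profiles.append(top)
    return profiles

corpus = [["apple", "pie", "apple"], ["apple", "banana"], ["car", "engine", "car"]]
print(doc_profiles(corpus, top_k=2))
```

The resulting profile keeps only the highest-scoring words and their scores, which is exactly the sparse, real-valued vector described above.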
Now that we have item profiles, our next task is to construct user profiles. Let's say we have a user who has rated items with profiles i_1 through i_n. Remember, i_1 through i_n are vectors of feature entries; each is a vector in a high-dimensional space with many, many features. The simplest way to construct a user profile from a set of item profiles is just to average the item profiles, where n is the total number of item profiles. So if we take the profiles of all the items the user has rated and take their average, that is the simplest way of constructing a user profile.
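In symbols, this simple averaging presumably looks like the following (the exact formula was on the slide rather than in the audio, so this is a reconstruction):

```latex
% Unweighted user profile: the average of the n item-profile vectors
U = \frac{1}{n} \sum_{k=1}^{n} i_{k}
```

For the weighted variant described next, each i_k would instead be multiplied by the user's rating for item k before averaging.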
Now, this doesn't take into account that the user liked certain items more than others. In that case, we might want to use a weighted average, where the weight for each item is the rating the user gave it; then we'd have a weighted-average item profile. A variant of this is to normalize the weights using the user's average rating, and we'll see an example that makes this idea clear. Of course, much more sophisticated aggregations are possible; here we're only looking at some very simple ones. Let's look at an example that will clarify weighted-average item profiles and how to normalize weights.
Let's start with an example of a Boolean utility matrix. What's a Boolean utility matrix? It's one where all we have is information about whether the user purchased an item or not, so each entry is either a 0 or a 1. Let's say the items are movies and the only features are actors. The item profile in this case is a vector with a 0 or 1 for each actor: 0 if that actor did not appear in the movie, and 1 if that actor did. Suppose user x has watched five movies, two of which feature actor A and three of which feature actor B. The simplest user profile is just the mean of the item profiles. There are 5 vectors, and 2 of those have a 1 for feature A, so the weight of feature A is 2 divided by the total number of item profiles, which is 5, giving 0.4. The weight of feature B, correspondingly, is 3/5, or 0.6.
Let's look at the more complex example of star ratings. Suppose we have star ratings in the range of one to five, and the user has once again watched five movies, two starring actor A and three starring actor B. The user rated the movies with actor A three and five, and rated the movies with actor B one, two, and four. Since we have five-star ratings, and the user gives lower ratings to movies they didn't like and higher ratings to movies they liked, it's apparent from these ratings that the user liked at least one of actor A's movies and one of actor B's movies, but really didn't like two of actor B's movies, the ones rated 1 and 2. Ratings of 1 and 2 are, in fact, negative ratings, not positive ratings, and we want to capture this fact. Normalizing ratings helps us capture the idea that some ratings are actually negative and some are positive.
But the baselines of different users vary: some users are just more generous in their ratings than others. For one user, a four might be a wildly positive rating, while for another, a four might just be an average rating. To capture this, we baseline each user's ratings by their average rating. In this case, the user's average rating is a three; if you average all five ratings the user has provided, you get three. So what we do is subtract the average rating from each individual movie rating. The movies with actor A, rated three and five, get normalized ratings of 0 and +2, and the movies with actor B get normalized ratings of -2, -1, and +1. Notice that this captures the intuition that the user did not like the first two movies with actor B, really liked the second movie with actor A, and found the first movie with actor A about average.

Once you do this normalization, you can compute the profile weights, but in this case we divide not by the total number of movies, but by the number of movies with the specific feature. There are two movies with actor A, so the profile weight for the actor-A feature is (0 + 2)/2, which is 1. Similarly, the actor-B feature has a profile weight of (-2 - 1 + 1)/3 = -2/3. This indicates a mild positive preference for actor A and a mild negative preference for actor B.
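Here is a small sketch of that weight computation; representing the ratings as a dict from feature name to the list of ratings of movies with that feature is an assumption made just for illustration:

```python
def normalized_profile(ratings_by_feature):
    """Normalized user-profile weights from star ratings grouped by feature."""
    all_ratings = [r for rs in ratings_by_feature.values() for r in rs]
    avg = sum(all_ratings) / len(all_ratings)            # the user's average rating (3 here)
    return {
        feature: sum(r - avg for r in rs) / len(rs)      # divide by #movies with that feature
        for feature, rs in ratings_by_feature.items()
    }

# Movies with actor A were rated 3 and 5; movies with actor B were rated 1, 2, and 4.
print(normalized_profile({"actor_A": [3, 5], "actor_B": [1, 2, 4]}))
# {'actor_A': 1.0, 'actor_B': -0.666...}
```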
Now that we have user profiles and item profiles, the next task is to recommend items to the user. The key step is to take a pair consisting of a user profile and an item profile and figure out what the rating for that user-item pair is likely to be. Remember that both the user profile and the item profile are vectors in a high-dimensional space. Here I've drawn them in two dimensions, but in reality, of course, they're embedded in a much higher-dimensional space. You might recall from a prior lecture that when you have vectors in a high-dimensional space, a good distance measure between a pair of vectors is the angle theta between them. In particular, you can estimate the angle using the cosine formula: the cosine of theta, the angle between the two vectors, is the dot product of the two vectors divided by the product of their magnitudes. We'll call this the cosine similarity between the user x and the item i.
Now, technically, the cosine distance is actually the angle theta and not the cosine of the angle. The cosine distance, as we studied in an earlier lecture, is the angle theta, and the cosine similarity is the angle 180 minus theta. The smaller the angle, the more similar the user x and the item i are, and therefore the similarity 180 minus theta is larger. But for convenience, we're going to use the cosine of theta itself as the similarity measure. Notice that as the angle theta becomes smaller, cos(theta) becomes larger, and as theta becomes larger, the cosine becomes smaller; in fact, as theta becomes greater than 90 degrees, the cosine of theta becomes negative. So this captures the intuition that as the angle becomes smaller, x and i are more and more similar to each other, and it's more likely that x will give a high rating to item i.

The way we make predictions is as follows: given the user x, we compute the cosine similarity between that user's profile and every item in the catalog, then pick the items with the highest cosine similarity and recommend those to the user.
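As a sketch, here is how the prediction step might look if profiles are stored as plain Python dicts mapping feature names to weights; the catalog, item names, and helper names are illustrative assumptions, not part of the lecture:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (dicts: feature -> weight)."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(user_profile, catalog, k=3):
    """Rank catalog items by cosine similarity to the user profile; return the top k."""
    scored = [(name, cosine_similarity(user_profile, item)) for name, item in catalog.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

# User profile from the normalized-ratings example above.
user = {"actor_A": 1.0, "actor_B": -2 / 3}
catalog = {
    "movie1": {"actor_A": 1},
    "movie2": {"actor_B": 1},
    "movie3": {"actor_A": 1, "actor_B": 1},
}
print(recommend(user, catalog, k=2))   # movies featuring actor A rank highest
```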
And that's the theory of content-based recommendations. Now let's look at some of the pros and cons of the content-based recommendation approach. The biggest pro is that you don't need data about other users in order to make recommendations to a specific user. This turns out to be a very good thing because