  • Welcome back to Mining of Massive Datasets.

  • Today's topic is Recommender Systems.

  • We're going to start with an overview of recommendation systems, and

  • why they are necessary.

  • Today, we are going to look at the two most common types of Recommender Systems.

  • Content-Based Systems and Collaborative Filtering.

  • And finally we're going to look at how to evaluate Recommender Systems,

  • to make sure they're doing a good job.

  • Let's start with an overview.

  • Imagine any situation where a user interacts with

  • a really large catalog of items.

  • Now these items could be products at Amazon.

  • They could be movies at Netflix.

  • They could be music from Pandora's catalog.

  • Or they could be the news items on Google News.

  • What really matters, is that there are tens of thousands, or

  • hundreds of thousands, or millions of items.

  • A really large catalog.

  • And the user is interacting with this catalog.

  • There are two ways in which a user can interact with a large catalog of items.

  • The first is search: the user knows what they're looking for, and they go and

  • they search the catalog for the precise item that they're looking for.

  • Now when you have a very large catalog of items,

  • very often the user doesn't know exactly what they're looking for.

  • And this is where recommendations come in.

  • The system recommends to the user certain items that it thinks the user will be

  • interested in, based on what it knows about the user.

  • Now why do we really need such recommendations?

  • The key thing that made recommendations so important, and

  • why recommendation systems developed so much in the last ten or 20 years,

  • is that we moved from an era of scarcity to an era of abundance.

  • What do I mean by this?

  • Imagine that you were out shopping 20 years ago,

  • and you'd go to a local retailer, and

  • you'll find a certain number of products on the shelves of the local retailer.

  • Now, even in a really large retailer like a Wal-Mart, for instance,

  • shelf space is a scarce commodity.

  • It limits the number of items that a retailer can carry.

  • Shelf space is expensive, because it involves real estate costs.

  • And, therefore, a retailer can carry only a certain number of products.

  • Now, a similar situation applies in the case of, for example, TV networks.

  • A TV network can carry only so many shows, because there's only so

  • many hours in a day.

  • And there are only so

  • many movie theaters, so they can only screen a certain number of movies.

  • Now once the internet was developed, things changed.

  • The web enables near-zero-cost dissemination of information about products.

  • And what this means, is that we can have many more products than ever before.

  • There is no shelf space limitation on the number of products.

  • That's why the number of products on Amazon is much,

  • much more than the number of products available at any physical retailer.

  • The number of movies available on Netflix is more than the number of

  • movies that were ever available at Blockbuster, and so on.

  • This near-zero-cost dissemination of information gives rise to

  • a phenomenon that's called the long tail phenomenon.

  • Let's examine what this is.

  • Now imagine a graph, where on the X axis we have the items in the catalog.

  • Remember, items might be books, or music, or video, or news articles.

  • And we've ranked these items by popularity.

  • So the most popular items are on the left, and

  • as you move towards the right, the items become less and less popular.

  • What do I mean by popular?

  • Well, I mean the number of times the item is purchased in a week.

  • Or the number of times a movie is viewed in a week, or a month, or

  • some fixed time period.

  • Now, on the Y axis, you have the actual popularity, which in this case I've

  • shown as the number of purchases per week. It could be the number of views per week, or

  • it could be the number of plays per month, for music, and so on.

  • So in general you have items ranked by popularity along the X axis,

  • and the popularity itself along the Y axis.
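
The ranked-popularity curve described above can be sketched in a few lines of Python. This is only an illustration: the purchase log and the function name popularity_curve are made up for the example, not taken from the lecture.

```python
from collections import Counter

# Hypothetical purchase log: one entry per purchase, naming the item bought.
purchases = ["A", "B", "A", "C", "A", "B", "D", "A", "B", "C"]

def popularity_curve(log):
    """Return (item, count) pairs sorted from most to least popular."""
    counts = Counter(log)
    # Sort by count, descending; break ties by item name so the order is stable.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))

curve = popularity_curve(purchases)
for rank, (item, count) in enumerate(curve, start=1):
    # rank is the X-axis position, count is the Y-axis value.
    print(rank, item, count)
```

Each printed row is one point on the graph: the rank goes on the X axis and the purchase count on the Y axis.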

  • Now when you take the items in a large catalog,

  • and you rank them,

  • and you plot them on this graph, you get a curve that looks like this.

  • You can see that the curve has a very steep fall initially.

  • You have a few really,

  • really popular items.

  • And then as you move towards the right,

  • as the item rank becomes greater, the popularity falls off very steeply.

  • But at a certain point, you can see that the popularity

  • falls off less and less steeply.

  • And it never quite reaches the X axis.

  • The interesting thing here is that there is a cutoff point.

  • Items that are less popular than this cutoff point

  • might be purchased perhaps just once a week,

  • or maybe once a month.

  • If you're a physical retailer like a Wal-Mart, it's not economical to

  • stock such an item, because the rent cost of stocking the item is more than you make

  • when you sell the item.

  • And therefore a retailer,

  • any right-thinking retailer, doesn't stock items that are unpopular.

  • They only stock the head of the distribution.

  • So there's this cutoff point that I show on this graph here, and the items that are more

  • popular than this, the more popular items, are available at a retail store.

  • But the less popular items, the items that are to the right of the cut off point,

  • are not available at any retail store.

  • They're only available online.
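
As a rough sketch of the cutoff just described, the function below stocks an item only while what it earns covers its shelf cost. The sales figures, the profit per sale, the shelf cost, and the name cutoff_rank are all hypothetical, chosen just to make the arithmetic visible.

```python
def cutoff_rank(sales_by_rank, profit_per_sale, shelf_cost):
    """Count how many top-ranked items earn at least their shelf cost.

    Assumes sales_by_rank is already sorted from most to least popular,
    and that every item has the same profit per sale and shelf cost.
    """
    stocked = 0
    for sales in sales_by_rank:
        if sales * profit_per_sale < shelf_cost:
            break  # this item and everything to its right is not worth stocking
        stocked += 1
    return stocked

# Hypothetical weekly sales, most popular item first.
weekly_sales = [400, 120, 30, 8, 2, 1]
stocked = cutoff_rank(weekly_sales, profit_per_sale=2.0, shelf_cost=50.0)
print(stocked, "items in the store,", len(weekly_sales) - stocked, "online only")
```

With these made-up numbers the physical store carries the first three items, and the remaining three fall to the right of the cutoff.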

  • Now this phenomenon applies to books, to music, to movies,

  • to videos, and to news articles. For example, there are only so

  • many news articles in a newspaper, but when you go online you can see the rest of

  • the news articles, the less popular news articles that are off to the right.

  • The piece of the curve here that is to

  • the right of this dividing line is called the long tail.

  • These are the items that are available only online.

  • The interesting thing is this area under the curve here.

  • And you can see the area under the curve here is quite significant.

  • In fact, in some cases the area under the curve on the right is about as large as, or

  • could be even larger than, the area

  • under the curve on the left.
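
To make the area comparison concrete, here is a small sketch that sums the sales on each side of a cutoff. The numbers are invented so that the tail comes out slightly larger than the head; real catalogs will differ, and the function name head_and_tail_volume is just for this example.

```python
def head_and_tail_volume(counts_by_rank, cutoff):
    """Split total sales into the head (first `cutoff` items) and the tail (the rest).

    Assumes counts_by_rank is sorted from most to least popular.
    """
    head = sum(counts_by_rank[:cutoff])
    tail = sum(counts_by_rank[cutoff:])
    return head, tail

# Hypothetical weekly purchase counts: three hits, then 900 items selling twice a week.
weekly = [1000, 500, 250] + [2] * 900
head, tail = head_and_tail_volume(weekly, cutoff=3)
print("head:", head, "tail:", tail)
```

Here three very popular items account for 1750 purchases, while 900 rarely bought items together account for 1800.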

  • So you have all these items that could never be found in a physical store, but

  • that can only be found online.

  • But there are so many of them

  • that it's very hard for any user to find all these items.

  • Right, so when you have this era of abundance, and

  • you have so many items, and many of them are only found online,

  • how do you introduce a user to all these new

  • items they may not otherwise find?

  • When you have more choice like this, when you have these millions and

  • millions of items that are only available online, you need a better way for

  • the user to find all these items.

  • The user doesn't even know where to start looking, and

  • that's where recommendation engines come in.

  • So recommendation engines work in the case of many,

  • many kinds of items: books, music, movies, news articles.

  • Interestingly, they even work in the case of people.

  • For example, when you go to Facebook, or LinkedIn, or

  • Twitter, there are so many people that you don't know who to follow or who to friend.

  • And so Facebook, or LinkedIn, or

  • Twitter make recommendations to you, on the people you could follow or friend.

  • I'd like to illustrate this point with an interesting anecdote that shows you

  • the power of a recommendation engine.

  • Several years ago a book was published called Touching the Void.

  • It's a book about mountaineering.

  • It's a very, very good book.

  • When the book came out, it didn't make much of a ripple.

  • A few people bought the book.

  • It got some decent reviews, but it never became a bestseller.

  • And then a few years after Touching the Void,

  • a new book was published on mountaineering called Into Thin Air.

  • Now Into Thin Air picked up traction,

  • and lots of people started buying Into Thin Air.

  • Amazon noticed that a few of the people who bought Into Thin Air

  • had also bought Touching the Void.

  • So they started recommending Touching the Void, to people who bought Into Thin Air.

  • And lo and behold, those people started buying Touching the Void as well.

  • The interesting point is, this made Touching the Void a bestseller.

  • In fact, it became a bigger bestseller even than Into Thin Air,

  • even though a few years earlier, it had sunk without a trace.

  • So this example should show you the power of recommendation systems.

  • There are these items, these gems like Touching the Void,

  • that people don't know about because they don't know to look for them.

  • But a good recommendation system can expose people to these hidden gems,

  • that they wouldn't have known about otherwise.
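
The co-purchase observation in the anecdote above can be sketched as a simple count of what else buyers of a given title bought. This is an illustration under made-up data, not a description of how Amazon actually computes its recommendations; the customer IDs, the extra title "Some Cookbook", and the function name also_bought are hypothetical.

```python
from collections import Counter

# Hypothetical purchase histories: customer -> set of titles bought.
histories = {
    "u1": {"Into Thin Air", "Touching the Void"},
    "u2": {"Into Thin Air", "Touching the Void", "Some Cookbook"},
    "u3": {"Into Thin Air"},
    "u4": {"Some Cookbook"},
}

def also_bought(histories, title):
    """Rank other titles by how many buyers of `title` also bought them."""
    counts = Counter()
    for basket in histories.values():
        if title in basket:
            # Count every other title this buyer purchased.
            counts.update(basket - {title})
    return counts.most_common()

suggestions = also_bought(histories, "Into Thin Air")
print(suggestions)
```

In this toy data, Touching the Void is the title most often bought alongside Into Thin Air, so it would be the first thing suggested to someone who buys Into Thin Air.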

  • So let's look at types of recommendation systems.

  • The simplest and the oldest kind of recommendation is editorial or

  • hand curated.

  • You might find a list of favorites. For example, when you go into your

  • favorite neighborhood book store, you might find staff picks.

  • Certain items are marked off as staff picks, right?

  • And these are editorially curated, and on certain websites you'll

  • see a list of staff favorites or a list of essential items.

  • These are essentially built by hand.

  • And another place where you'll see these editorial recommendations is often on

  • the homepages of websites.

  • For example, if you go to the homepage of most popular

  • websites including product websites you'll see editorial picks.

  • These are products that have been picked by the editorial staff to feature