Welcome back to Mining of Massive Datasets. Today's topic is Recommender Systems. We're going to start with an overview of recommendation systems and why they are necessary. Then we are going to look at the two most common types of Recommender Systems: Content-Based Systems and Collaborative Filtering. And finally we're going to look at how to evaluate Recommender Systems, to make sure they're doing a good job.

Let's start with an overview. Imagine any situation where a user interacts with a really large catalog of items. These items could be products at Amazon. They could be movies at Netflix. They could be music from Pandora's catalog. Or they could be the news items on Google News. What really matters is that there are tens of thousands, or hundreds of thousands, or millions of items: a really large catalog. And the user is interacting with this catalog.

There are two ways in which a user can interact with a large catalog of items. The first is search: the user knows what they're looking for, and they search the catalog for that precise item. The second applies when the user doesn't know exactly what they're looking for, which is very often the case with a very large catalog. This is where recommendations come in. The system recommends to the user certain items that it thinks the user will be interested in, based on what it knows about the user.

Now why do we really need such recommendations? The key development that made recommendations so important, and the reason recommendation systems have developed so much in the last ten or 20 years, is that we moved from an era of scarcity to an era of abundance. What do I mean by this? Imagine that you were out shopping 20 years ago. You'd go to a local retailer, and you'd find a certain number of products on the shelves of the local retailer. Now, even in a really large retailer like a Wal-Mart, for instance, shelf space is a scarce commodity.
It limits the number of items that a retailer can carry. Shelf space is expensive, because it involves real estate costs, and therefore a retailer can carry only a certain number of products. A similar situation applies in the case of, for example, TV networks. A TV network can carry only so many shows, because there are only so many hours in a day. And there are only so many movie theaters, so they can only screen a certain number of movies.

Once the internet was developed, things changed. The web enables near-zero-cost dissemination of information about products. What this means is that we can have many more products than ever before. There is no shelf space limitation on the number of products. That's why the number of products on Amazon is much, much larger than the number of products available at any physical retailer. The number of movies available on Netflix is more than the number of movies that were ever available at Blockbuster, and so on.

This near-zero-cost dissemination of information gives rise to a phenomenon that's called the long tail phenomenon. Let's examine what this is. Imagine a graph where on the X axis we've taken the items in the catalog. Remember, items might be books, or music, or videos, or news articles. And we've ranked these items by popularity. So the most popular items are on the left, and as we move towards the right, the items become less and less popular. What do I mean by popular? Well, I mean the number of times the item is purchased in a week, or the number of times a movie is viewed in a week, or a month, or some fixed time period. On the Y axis, you have the actual popularity, which in this case I've shown as the number of purchases per week. It could be the number of views per week, or the number of plays per month for some music, and so on. So in general you have items ranked by popularity along the X axis, and the popularity itself along the Y axis.
Now when you take the items in a large catalog, rank them, and plot them this way, you get a curve that looks like this. You can see that the curve has a very steep fall initially. You have a few really, really popular items, and then as you move towards the right, as the item rank becomes greater, the popularity falls off very steeply. But at a certain point, you can see that the popularity falls off less and less steeply, and it never quite reaches the X axis.

The interesting thing here is that there is a cutoff point. Items that are less popular than this cutoff point might be purchased perhaps just once a week, or maybe once a month. If you're a physical retailer like a Wal-Mart, it's not economical to stock such an item, because the rent cost of stocking the item is more than you make when you sell it. And therefore any right-thinking retailer doesn't stock items that are unpopular. They only stock the head of the distribution. So there's this cutoff point that I show on the graph here: the items that are more popular than this are available at a retail store, but the less popular items, the items to the right of the cutoff point, are not available at any retail store. They're only available online.

This phenomenon applies to books, to music, to movies, to videos, and to news articles. For example, there are only so many news articles in a newspaper, but when you go online you can see the rest of the news articles, the less popular ones that are off to the right. The piece of the curve that is to the right of this dividing line is called the long tail. These are the items that are available only online. The interesting thing is the area under the curve here.
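
To make the head and the long tail of this curve concrete, here is a minimal Python sketch. It is an illustration only, not something from the lecture: it assumes a Zipf-like popularity curve, where the item at rank r sells top_sales / r**exponent units per week. The function names, the catalog size, and the cutoff rank are all hypothetical choices.

```python
from __future__ import annotations


def zipf_popularity(num_items: int, exponent: float = 1.0,
                    top_sales: float = 10_000.0) -> list[float]:
    """Hypothetical popularity curve (an assumption, not real sales data).

    The item at rank r (1 = most popular) sells top_sales / r**exponent
    units per week, so the curve drops steeply at first and then flattens
    without ever reaching zero.
    """
    return [top_sales / (rank ** exponent) for rank in range(1, num_items + 1)]


def head_tail_split(popularity: list[float], cutoff_rank: int) -> tuple[float, float]:
    """Return (head_share, tail_share) of total weekly sales.

    Items with rank <= cutoff_rank form the head (what a physical store
    would stock); every item past the cutoff belongs to the long tail.
    """
    if not 0 <= cutoff_rank <= len(popularity):
        raise ValueError("cutoff_rank must lie between 0 and the catalog size")
    total = sum(popularity)
    head = sum(popularity[:cutoff_rank])
    return head / total, (total - head) / total


if __name__ == "__main__":
    # Assumed numbers: a catalog of one million items, of which a physical
    # store can afford to stock only the 1,000 most popular.
    catalog = zipf_popularity(num_items=1_000_000)
    head_share, tail_share = head_tail_split(catalog, cutoff_rank=1_000)
    print(f"head share: {head_share:.2f}, tail share: {tail_share:.2f}")
```

Under these assumed numbers, the 1,000 head items account for roughly 52% of all sales and the remaining 999,000 tail items for roughly 48%. That is, the area under the tail comes out about as large as the area under the head, even though each tail item sells very little on its own.
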
And you can see the area under the curve here is quite significant. In fact, in some cases the area under the curve on the right is about as large as, or could be even larger than, the area under the curve on the left. So you have all these items that could never be found in a physical store and can only be found online. But there are so many of them that it's very hard for any user to find them all. So when you have this kind of abundance, with so many items, many of them found only online, how do you introduce a user to all these new items that they might not otherwise find? When you have more choice like this, when you have these millions and millions of items that are only available online, you need a better way for the user to find them. The user doesn't even know where to start looking, and that's where recommendation engines come in.

Recommendation engines work for many, many kinds of items: books, music, movies, news articles. Interestingly, they even work in the case of people. For example, when you go to Facebook, or LinkedIn, or Twitter, there are so many people that you don't know who to follow or who to friend. And so Facebook, or LinkedIn, or Twitter make recommendations to you on the people you could follow or friend.

Let me illustrate this point with an interesting anecdote that shows you the power of a recommendation engine. Several years ago a book was published called Touching the Void. It's a book about mountaineering. It's a very, very good book. When the book came out, it didn't make much of a ripple. Few people bought it. It got some decent reviews, but it never became a bestseller. And then a few years after Touching the Void, a new book on mountaineering was published, called Into Thin Air. Into Thin Air picked up traction, and lots of people started buying it.
Amazon noticed that a few people who bought Into Thin Air had also bought Touching the Void. So they started recommending Touching the Void to people who bought Into Thin Air. And lo and behold, those people started buying Touching the Void as well. The interesting point is, this made Touching the Void a bestseller. In fact, it became an even bigger bestseller than Into Thin Air, even though a few years earlier it had sunk without a trace. So this example should show you the power of recommendation systems. There are these items, these hidden gems like Touching the Void, that people don't know about because they don't know to look for them. But a good recommendation system can expose people to these hidden gems that they wouldn't have known about otherwise.

So let's look at types of recommendation systems. The simplest and the oldest kind of recommendation is editorial, or hand curated. For example, when you go into your favorite neighborhood bookstore, you might find certain books marked off as staff picks, right? These are editorially curated by hand. On certain websites you'll see a list of staff favorites or a list of essential items. These are essentially built by hand. Another place where you'll often see these editorial recommendations is on the homepages of websites. For example, if you go to the homepage of most popular websites, including product websites, you'll see editorial picks. These are products that have been picked by the editorial staff to feature