Subtitles section Play video Print subtitles Welcome back to Mining of Massive Datasets. We're going to continue or discussion of clustering by looking at the CURE Algorithm. CURE is an acronym that stands for Clustering Using Representatives. But before we get to the algorithm itself, let's see why we need it. Remember, we've looked at the, BFR algorithm, or the Bradley-Fayyad-Reina algorithm, in the last lecture, for clustering very large datasets that don't fit in memory. The BFR algorithm is great, because you can scan the data in one pass, and obtain clusters. The problem, though, is that the BFR algorithm makes very strong assumptions about the data, and about what the clusters look like. The first assumption that the BFR Algorithm makes is that the clusters are normally distributed in each dimension. That in each dimension there is a fixed centroid and a fixed standard deviation that the, that each cluster follows along each dimension. The second strong assumption that the BFR Algorithm makes is that the axes are fixed. So the clusters then, if you follow both these assumptions, the, that the clusters are normally distributed in each dimension and the axes are fixed. The clusters that are discovered by the BFR Algorithm had this the cigar kind of shape that, that you see here on the left. it, it, it could either be a horizontal cigar shape or a vertical cigar shape. Or a circle, which is kind of a limiting case of, of an ellipse. But if your clusters actually are not oriented along the x or the y axis in this case, or along the axis in general in the multi-dimensional case, but are at an angle to the axis as, as I show in the, in the fig, in, in the second picture here. that's, that's not okay. The BFR Algorithm will not find a cluster that looks like a tilted ellipse. It can only find clusters that look like either upright or horizontal ellipses. And clusters actually look very, very different, like the picture on the extreme right here where there are two clusters and the clusters look kind of like crescent moons except in, in opposite directions, those would definitely not be found by the BFR Algorithm because they don't look like cigars at all. They don't look like ellipses at all or in any dimension. So, that's the kind of cluster that will never be found by the BFR Algorithm. So the BFR Algorithm, even though it's super efficient makes the strong assumptions the clusters are going to look like the, the pictures on the extreme left, and not like the other two. And we'd like to avoid this assumption, and try to find clusters regardless of what they actually look like, because we don't control what the clusters look like in the, in the, in the data. The CURE Algorithm tries to fix this problem with the with the BFR Algorithm. The CURE Algorithm assumes a Euclidean distance. I remember a Euclidean distance metric means between any two points, we can always find a mid point of two points by taking the average of those two points. However, the CURE Algorithm, unlike the BFR Algorithm, allows clusters to assume any shape whatsoever. There is no restriction on the, on the shape of the clusters. So in the CURE Algorithm any of these clusters, the the, the first, the second, or the third are perfectly fine. The CURE Algorithm works, can find clusters of, of those shapes. Now difference between the CURE Algorithm and the BFR Algorithm is that the BFR we represent each cluster using its centroid. Whereas in the CURE Algorithm, we're going to use instead of a centroid, we're going to represent each cluster by a collection of representative points. So, instead of representing a cluster by one point, we're going to represent it by many points. Here's an example of a dataset where the clusters don't look anything at all like ellipses or cigars. So, this data, on x axis we have the age of faculty members at a university like Stanford and on the y axis we have their salaries. Now these are, these are, this is not the actual data, but more a representation of what the data might look like, although it's based on, on real world experience. Now, the, this, the data points marked by h, are salaries of faculty members in the humanities. Where the, data points marked with an e are salaries of faculty members in, in engineering, departments. And as you can see, it's apparent from the, from the graph here, that in the humanities the, the starting salary is, is somewhat lower than in engineering. A humanities, faculty member starts at a much lower salary than an engineering faculty member. But, as, over time, as their tenure increases, the salary of a humanities, faculty member, keeps increasing and eventually overtakes the salary of a an engineering faculty member. But in the salary of engineering faculty members increases a little bit with their tenure, but then kind of flattens out, it doesn't increase beyond that. And this is just a phenomenon that has been observed in, in terms of salaries at, at most universities and presumably this is because, in, in the engineering departments the fields keep changing so much that there's a lot of value in bringing in new faculty with with new interests and and new you know, new expertise. Whereas in the humanities I guess you age better as you age. So if you sort of look at the look at these two sets of salaries and you try to cluster them, what you really want in an ideal world is, is, is two, is two clusters. One that you know looks at the engineering salaries, and one that looks at the humanities salaries and see it, and cleanly separate out these two data points into, into two separate clusters. Now when you're looking at the data, remember you do know that some of these of salaries corresponding to engineering faculty members and some to humanities facilities members so so the clustering algorithm doesn't have access to this information but you'd yet like it to find these find these clusters in the data. Now it's too much to hope that a clustering algorithm can actually find these exact clusters because these are overlapping clusters, and most clustering algorithm cannot find cluster that actually overlap with with each other where a single data point is in two clusters. But at the very least, we can hope that the clustering algorithm finds some approximation for these clusters. For example we might want it to discover one cluster there. Another cluster there. And a third cluster there. So that would be a nice, nice outcome for from from any clustering algorithm. And the CURE Algorithm can indeed find clusters of this nature. The CURE Algorithm is a two pass algorithm. And let's look at the, the first pass of the two pass algorithm. In the first pass, we sample a random set of points from the, dataset that's given to us. Remember the dataset is really large and doesn't fit in memory at all. It's sitting on disk somewhere. But we're going to randomly sample a point from this very large dataset. And we're only going to sample enough points, that, that fit in memory. we'll, we've covered techniques for, sampling, in another lecture, so you know how to do this already. Now, once we have a lot of, sample points that, you know, that you've randomly sampled, we're going to use any main memory clustering algorithm. So, for example, you can use a hierarchical clustering algorithm that we covered in our previous lecture. And you can cluster the the sample points that are in memory using the hierarchical clustering algorithm to create an initial set of clusters. Right? And because hierarchical clustering is is, is, is, is a complex algorithm, it can actually find clusters of the nature you know, it doesn't have any kind of restriction on the kind of clusters that it can find. So it can find clusters of any shape. Now, once we've actually clustered the sample points, and fi, figured out initial sort of clusters, for each of those clusters, we're going to pick representative points to represent them. So what we're going to do, is we're going to pick a number, k, and let's say 4, and we're going to find pick k representative points for each cluster. Now our goal is to, you know, take a cluster like this. Let's say, here's a cluster that, that we found using hierarchical clustering. And we want to find a representative point, that are, you know, as far away from each other to sort of get a good coverage of the cluster as possible. For example, we might want to, if, if we find the first representative point there, you might want to find the second there, the third there, and the fourth there. So we have four points that are sort of well distributed in the cluster and if they, they cover as much of the cluster's, you know, volume as possible. And the, you know, we've sort of in our previous lecture, we've discussed how to pick, points that are as dispersed as possible in a cluster. The technique is to, for each cluster, first pick, the first point at random, and then you pick the second point to be within the cluster but be as far away from the first point as possible. And you pick the third point to be still within the cluster but as far away from points one and two as possible, and so on. And we've covered this technique in a, in a previous previous lecture. So once we've picked these representative points you know, what we're going to do is that we're going to create the synthetic points. And the synthetic points are obtained by moving each, representative point a certain, distance towards the centroid of the cluster. Right? So, we have the cluster. We know its centroid. And we have these, these k representative points. Now, we're going to take each representative point and we're going to create a synthetic point, that is obtained by moving the representative point a fixed fraction, maybe 20%, towards the centroid of the cluster. And the 20% here is a, is a parameter to the algorithm. So let's, let's look at an example, that should make this clear. So, here are faculty salary data. And let's say when you run a hierarchical clustering on a sample of the data, let's say that this is in fact a sample of the data and not the actual data. When you run a hierarchical clustering on this, let's say the, the here is the first cluster that's found. Here is the second cluster and here's the third cluster. So we end up with these three clusters that are found by the hierarchical clustering algorithm. Now, once we've found these clustering algorithms, let's take one of these clusters and and let's say we want to find four remote, or representative point for each cluster. And let's start at the cluster to the right. And there, there's the first representative point, let's say it's the one that there that we've ended up picking. Now we want to pick the second representative point to be in this cluster, but as far away from the first representative point as possible. So that's the point we end up picking. The third representative point is going to be, still in the cluster but as far away from these two points as possible. So we end up with that point there. And the fourth representative point similarly ends up being that, that point there. Now, once you have picked these four remote points for the cluster we're going to move these points towards a centroid of the cluster. No, we, we don't actually going to affect the data itself. But we're going to you know, pick synthetic points that are closer to the centroid of a cluster than the remote points that we've actually picked. Now the centroid of the, of the cluster is somewhere in the middle of the cluster there so each of these points is move, going to move towards the center of the circle here. So when you move the points 20% towards the centroid, we end up with these synthetic points and these synthetic points are manufactured points, are going to be the representative points for this cluster. And we repeat this process for the other clusters as well. So for each cluster we end up with k, in this case 4, represented points representing that cluster. So that's the first pass of the CURE Algorithm. In the, in the second pass, of the, of the algorithm, we scan the whole dataset. Remember, so far we've been working with, a sample of the dataset that fits in memory. Now we go back to the whole dataset and, which is sitting on disc. And we re-scan the whole dataset, and visit each point p. And, and we're going to take the point p and we're going to place it in the cluster that's closest to it. And the definition of closest is very simple. We're going to find the point you know, the, we're going to take p, and we're going to find the, the, among the set of representative points of all the clusters. We're going to find the representative point, that's closest to p, and we're going to assign p to the cluster, belonging to that representative point. And that's it. It's a very, very simple algorithm, where we just scan the data in one pass, and we place, each point, in, in its closest cluster, which is the cluster, with the closest representative point. Now the CURE is a, it's a very, very simple algorithm as as we saw. It just requires a main memory sample and clustered cluster, finding some representative points, and then one scan of the entire dataset to place each data point into you know, in, into the closest cluster, but surprisingly CURE really works well for a large number of used cases in finding, complicated clusters of the, of the kind that we saw with the faculty's salaries and so on. Our discussion of the CURE Algorithm wraps up our discussion of clustering, which we've covered over the past several lectures. Remember that the clustering point, problem is one where we're given a large set of points with a notion of distance between the points, usually in a very high-dimensional data space. And we, and our go, goal is to group these points into some number of clusters, with the points that are closer together being in the same cluster. We'll look at a variety of algorithms to do clustering.