Placeholder Image

Subtitles section Play video

  • Welcome back to Mining of Massive Datasets.

  • We're going to continue or

  • discussion of clustering by looking at the CURE Algorithm.

  • CURE is an acronym that stands for Clustering Using Representatives.

  • But before we get to the algorithm itself, let's see why we need it.

  • Remember, we've looked at the, BFR algorithm, or

  • the Bradley-Fayyad-Reina algorithm, in the last lecture, for

  • clustering very large datasets that don't fit in memory.

  • The BFR algorithm is great, because you can scan the data in one pass, and

  • obtain clusters.

  • The problem, though, is that the BFR algorithm makes very

  • strong assumptions about the data, and about what the clusters look like.

  • The first assumption that the BFR Algorithm makes is that the clusters

  • are normally distributed in each dimension.

  • That in each dimension there is a fixed centroid and

  • a fixed standard deviation that the, that each cluster follows along each dimension.

  • The second strong assumption that the BFR Algorithm makes is

  • that the axes are fixed.

  • So the clusters then, if you follow both these assumptions, the,

  • that the clusters are normally distributed in each dimension and the axes are fixed.

  • The clusters that are discovered by the BFR Algorithm had this

  • the cigar kind of shape that, that you see here on the left.

  • it, it, it could either be a horizontal cigar shape or a vertical cigar shape.

  • Or a circle, which is kind of a limiting case of, of an ellipse.

  • But if your clusters actually are not oriented along the x or

  • the y axis in this case, or along the axis in general in the multi-dimensional case,

  • but are at an angle to the axis as, as I show in the, in the fig,

  • in, in the second picture here.

  • that's, that's not okay.

  • The BFR Algorithm will not find a cluster that looks like a tilted ellipse.

  • It can only find clusters that look like either upright or horizontal ellipses.

  • And clusters actually look very, very different, like the picture on

  • the extreme right here where there are two clusters and the clusters look kind of

  • like crescent moons except in, in opposite directions, those would definitely not be

  • found by the BFR Algorithm because they don't look like cigars at all.

  • They don't look like ellipses at all or in any dimension.

  • So, that's the kind of cluster that will never be found by the BFR Algorithm.

  • So the BFR Algorithm,

  • even though it's super efficient makes the strong assumptions the clusters are going

  • to look like the, the pictures on the extreme left, and not like the other two.

  • And we'd like to avoid this assumption, and

  • try to find clusters regardless of what they actually look like,

  • because we don't control what the clusters look like in the, in the, in the data.

  • The CURE Algorithm tries to fix this problem with the with the BFR Algorithm.

  • The CURE Algorithm assumes a Euclidean distance.

  • I remember a Euclidean distance metric means between any two points, we can

  • always find a mid point of two points by taking the average of those two points.

  • However, the CURE Algorithm,

  • unlike the BFR Algorithm, allows clusters to assume any shape whatsoever.

  • There is no restriction on the, on the shape of the clusters.

  • So in the CURE Algorithm any of these clusters, the the, the first,

  • the second, or the third are perfectly fine.

  • The CURE Algorithm works, can find clusters of, of those shapes.

  • Now difference between the CURE Algorithm and

  • the BFR Algorithm is that the BFR we represent each cluster using its centroid.

  • Whereas in the CURE Algorithm, we're going to use instead of a centroid,

  • we're going to represent each cluster by a collection of representative points.

  • So, instead of representing a cluster by one point,

  • we're going to represent it by many points.

  • Here's an example of a dataset where the clusters don't look anything at

  • all like ellipses or cigars.

  • So, this data, on x axis we have the age of faculty members at

  • a university like Stanford and on the y axis we have their salaries.

  • Now these are, these are, this is not the actual data, but more a representation of

  • what the data might look like, although it's based on, on real world experience.

  • Now, the, this, the data points marked by h,

  • are salaries of faculty members in the humanities.

  • Where the, data points marked with an e are salaries of faculty members in,

  • in engineering, departments.

  • And as you can see, it's apparent from the, from the graph here, that in

  • the humanities the, the starting salary is, is somewhat lower than in engineering.

  • A humanities, faculty member starts at a much lower salary than

  • an engineering faculty member.

  • But, as, over time, as their tenure increases,

  • the salary of a humanities, faculty member, keeps increasing and

  • eventually overtakes the salary of a an engineering faculty member.

  • But in the salary of engineering faculty members increases a little bit with their

  • tenure, but then kind of flattens out, it doesn't increase beyond that.

  • And this is just a phenomenon that has been observed in, in terms of salaries at,

  • at most universities and presumably this is because, in,

  • in the engineering departments the fields keep changing so

  • much that there's a lot of value in bringing in new

  • faculty with with new interests and and new you know, new expertise.

  • Whereas in the humanities I guess you age better as you age.

  • So if you sort of look at the look at these two sets of salaries and

  • you try to cluster them, what you really want in an ideal world is, is, is two,

  • is two clusters.

  • One that you know looks at the engineering salaries,

  • and one that looks at the humanities salaries and see it,

  • and cleanly separate out these two data points into, into two separate clusters.

  • Now when you're looking at the data, remember you do know that some of

  • these of salaries corresponding to engineering faculty members and

  • some to humanities facilities members so so

  • the clustering algorithm doesn't have access to this information but you'd yet

  • like it to find these find these clusters in the data.

  • Now it's too much to hope that a clustering algorithm can actually

  • find these exact clusters because these are overlapping clusters, and

  • most clustering algorithm cannot find cluster that actually overlap with

  • with each other where a single data point is in two clusters.

  • But at the very least, we can hope that the clustering algorithm finds some

  • approximation for these clusters.

  • For example we might want it to discover one cluster there.

  • Another cluster there.

  • And a third cluster there.

  • So that would be a nice, nice outcome for from from any clustering algorithm.

  • And the CURE Algorithm can indeed find clusters of this nature.

  • The CURE Algorithm is a two pass algorithm.

  • And let's look at the, the first pass of the two pass algorithm.

  • In the first pass,

  • we sample a random set of points from the, dataset that's given to us.

  • Remember the dataset is really large and doesn't fit in memory at all.

  • It's sitting on disk somewhere.

  • But we're going to randomly sample a point from this very large dataset.

  • And we're only going to sample enough points, that, that fit in memory.

  • we'll, we've covered techniques for, sampling, in another lecture, so

  • you know how to do this already.

  • Now, once we have a lot of, sample points that, you know, that

  • you've randomly sampled, we're going to use any main memory clustering algorithm.

  • So, for example, you can use a hierarchical clustering algorithm that

  • we covered in our previous lecture.

  • And you can cluster the the sample points that are in memory using the hierarchical

  • clustering algorithm to create an initial set of clusters.

  • Right?

  • And because hierarchical clustering is is, is, is,

  • is a complex algorithm, it can actually find clusters of the nature you know,

  • it doesn't have any kind of restriction on the kind of clusters that it can find.

  • So it can find clusters of any shape.

  • Now, once we've actually clustered the sample points, and fi,

  • figured out initial sort of clusters, for each of those clusters,

  • we're going to pick representative points to represent them.

  • So what we're going to do, is we're going to pick a number, k, and

  • let's say 4, and we're going to find pick k representative points for each cluster.

  • Now our goal is to, you know, take a cluster like this.

  • Let's say, here's a cluster that, that we found using hierarchical clustering.

  • And we want to find a representative point, that are, you know, as far away

  • from each other to sort of get a good coverage of the cluster as possible.

  • For example, we might want to, if, if we find the first representative point there,

  • you might want to find the second there, the third there, and the fourth there.

  • So we have four points that are sort of well distributed in the cluster and

  • if they, they cover as much of the cluster's, you know, volume as possible.

  • And the, you know, we've sort of in our previous lecture,

  • we've discussed how to pick, points that are as dispersed as possible in a cluster.

  • The technique is to, for each cluster, first pick, the first point at random, and

  • then you pick the second point to be within the cluster but

  • be as far away from the first point as possible.

  • And you pick the third point to be still within the cluster but

  • as far away from points one and two as possible, and so on.

  • And we've covered this technique in a, in a previous previous lecture.

  • So once we've picked these representative points you know,

  • what we're going to do is that we're going to create the synthetic points.

  • And the synthetic points are obtained by moving each,

  • representative point a certain, distance towards the centroid of the cluster.

  • Right? So, we have the cluster.

  • We know its centroid.

  • And we have these, these k representative points.

  • Now, we're going to take each representative point and we're going to

  • create a synthetic point, that is obtained by moving the representative point

  • a fixed fraction, maybe 20%, towards the centroid of the cluster.

  • And the 20% here is a, is a parameter to the algorithm.

  • So let's, let's look at an example, that should make this clear.

  • So, here are faculty salary data.

  • And let's say when you run a hierarchical clustering on a sample of the data,

  • let's say that this is in fact a sample of the data and not the actual data.

  • When you run a hierarchical clustering on this, let's say the,

  • the here is the first cluster that's found.

  • Here is the second cluster and here's the third cluster.

  • So we end up with these three clusters that are found by

  • the hierarchical clustering algorithm.

  • Now, once we've found these clustering algorithms,

  • let's take one of these clusters and and

  • let's say we want to find four remote, or representative point for each cluster.

  • And let's start at the cluster to the right.

  • And there, there's the first representative point,

  • let's say it's the one that there that we've ended up picking.

  • Now we want to pick the second representative point to be in this

  • cluster, but as far away from the first representative point as possible.

  • So that's the point we end up picking.

  • The third representative point is going to be,

  • still in the cluster but as far away from these two points as possible.

  • So we end up with that point there.

  • And the fourth representative point similarly ends up being that,

  • that point there.

  • Now, once you have picked these four remote points for

  • the cluster we're going to move these points towards a centroid of the cluster.

  • No, we, we don't actually going to affect the data itself.

  • But we're going to you know, pick synthetic points that are closer to

  • the centroid of a cluster than the remote points that we've actually picked.

  • Now the centroid of the,

  • of the cluster is somewhere in the middle of the cluster there so each of

  • these points is move, going to move towards the center of the circle here.

  • So when you move the points 20% towards the centroid, we end up

  • with these synthetic points and these synthetic points are manufactured points,

  • are going to be the representative points for this cluster.

  • And we repeat this process for the other clusters as well.

  • So for each cluster we end up with k,

  • in this case 4, represented points representing that cluster.

  • So that's the first pass of the CURE Algorithm.

  • In the, in the second pass, of the, of the algorithm, we scan the whole dataset.

  • Remember, so

  • far we've been working with, a sample of the dataset that fits in memory.

  • Now we go back to the whole dataset and, which is sitting on disc.

  • And we re-scan the whole dataset, and visit each point p.

  • And, and we're going to take the point p and

  • we're going to place it in the cluster that's closest to it.

  • And the definition of closest is very simple.

  • We're going to find the point you know, the, we're going to take p, and we're

  • going to find the, the, among the set of representative points of all the clusters.

  • We're going to find the representative point, that's closest to p, and

  • we're going to assign p to the cluster, belonging to that representative point.

  • And that's it.

  • It's a very, very simple algorithm, where we just scan the data in one pass, and

  • we place, each point, in, in its closest cluster, which is the cluster,

  • with the closest representative point.

  • Now the CURE is a, it's a very, very simple algorithm as as we saw.

  • It just requires a main memory sample and

  • clustered cluster, finding some representative points, and

  • then one scan of the entire dataset to place each data point into you know,

  • in, into the closest cluster, but surprisingly CURE really works well for

  • a large number of used cases in finding, complicated clusters of the,

  • of the kind that we saw with the faculty's salaries and so on.

  • Our discussion of the CURE Algorithm wraps up our discussion of clustering,

  • which we've covered over the past several lectures.

  • Remember that the clustering point, problem is one where we're

  • given a large set of points with a notion of distance between the points,

  • usually in a very high-dimensional data space.

  • And we, and our go, goal is to group these points into some number of clusters,

  • with the points that are closer together being in the same cluster.

  • We'll look at a variety of algorithms to do clustering.