Name: 6 5 The CURE Algorithm 15 13 Advanced
Uploaded: 2017-08-24T17:09:35.000Z
Duration: 15 min 14 s

Welcome back to Mining of Massive Datasets.

We're going to continue our discussion of clustering by looking at the CURE Algorithm.

CURE is an acronym that stands for Clustering Using Representatives.

But before we get to the algorithm itself, let's see why we need it.

Remember, we looked at the BFR algorithm, or

the Bradley-Fayyad-Reina algorithm, in the last lecture, for

clustering very large datasets that don't fit in memory.

The BFR algorithm is great, because you can scan the data in one pass.

The problem, though, is that the BFR algorithm makes very

strong assumptions about the data, and about what the clusters look like.

The first assumption that the BFR Algorithm makes is that the clusters

are normally distributed in each dimension.

That is, in each dimension there is a fixed centroid and

a fixed standard deviation that each cluster follows along that dimension.

The second strong assumption that the BFR Algorithm makes is that the axes are fixed.

So if you follow both these assumptions,

that the clusters are normally distributed in each dimension and the axes are fixed,

the clusters that are discovered by the BFR Algorithm have this

cigar kind of shape that you see here on the left.

It could be either a horizontal cigar shape or a vertical cigar shape,

or a circle, which is kind of a limiting case of an ellipse.

But if your clusters are actually not oriented along the x or

the y axis in this case, or along the axes in general in the multi-dimensional case,

but are at an angle to the axes, as I show in the figure,

the BFR Algorithm will not find a cluster that looks like a tilted ellipse.

It can only find clusters that look like either upright or horizontal ellipses.
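
To make the axis-aligned assumption concrete, here is a minimal Python sketch; it is not code from the lecture. It summarizes a cluster the way the lecture describes, with one centroid coordinate and one standard deviation per dimension, and applies that summary to a made-up cluster stretched along the diagonal. The function name `per_dimension_summary` and the sample points are assumptions chosen for illustration.

```python
import math

def per_dimension_summary(points):
    """Return (centroid, stds): the mean and standard deviation along each axis."""
    n = len(points)
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((p[d] - centroid[d]) ** 2 for p in points) / n)
        for d in range(dims)
    ]
    return centroid, stds

# Hypothetical cluster: a thin cigar tilted at 45 degrees (points near the line y = x).
tilted = []
for t in range(-5, 6):
    tilted.append((t + 0.1, t - 0.1))
    tilted.append((t - 0.1, t + 0.1))

centroid, stds = per_dimension_summary(tilted)
print("centroid:", centroid)
print("per-axis standard deviations:", stds)
```

The two standard deviations come out essentially equal, so a per-axis summary describes this thin tilted cigar as a round blob and keeps no trace of the tilt.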

And if the clusters actually look very, very different, like the picture on

the extreme right here, where there are two clusters that look kind of

like crescent moons facing in opposite directions, those would definitely not be

found by the BFR Algorithm because they don't look like cigars at all.

They don't look like ellipses at all, in any dimension.

So, that's the kind of cluster that will never be found by the BFR Algorithm.

So the BFR Algorithm, even though it's super efficient, makes the strong assumption that the clusters are going

to look like the picture on the extreme left, and not like the other two.

And we'd like to avoid this assumption, and

try to find clusters regardless of what they actually look like,

because we don't control what the clusters look like in the data.

The CURE Algorithm tries to fix this problem with the BFR Algorithm.

The CURE Algorithm assumes a Euclidean distance.

Remember, a Euclidean distance metric means that between any two points, we can

always find a midpoint by taking the average of those two points.
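
As a small illustration of that averaging step, here is a Python sketch. The function name `midpoint` and the two sample points are made up for this example.

```python
def midpoint(p, q):
    """Coordinate-wise average of two points of equal dimension."""
    return tuple((a + b) / 2 for a, b in zip(p, q))

# Two hypothetical points in the plane.
print(midpoint((0.0, 0.0), (4.0, 2.0)))  # (2.0, 1.0)
```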

The CURE Algorithm, unlike the BFR Algorithm, allows clusters to assume any shape whatsoever.

There is no restriction on the shape of the clusters.

So in the CURE Algorithm, any of these clusters, the first,

the second, or the third, are perfectly fine.

The CURE Algorithm can find clusters of those shapes.

Now, the difference between the CURE Algorithm and

the BFR Algorithm is that in BFR we represent each cluster using its centroid,

whereas in the CURE Algorithm, instead of a centroid,

we're going to represent each cluster by a collection of representative points.

So, instead of representing a cluster by one point,

we're going to represent it by many points.
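
The contrast between one centroid and many representative points can be sketched in Python. This part of the lecture does not say how the representative points are chosen, so the farthest-first selection below is only an assumption picked for illustration, and the names `centroid`, `pick_representatives`, and the crescent-shaped sample data are hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def pick_representatives(points, k):
    """Farthest-first selection: an assumed, illustrative way to get k scattered points."""
    c = centroid(points)
    # Start from the point farthest from the centroid.
    reps = [max(points, key=lambda p: math.dist(p, c))]
    while len(reps) < k:
        # Add the point whose nearest chosen representative is farthest away.
        reps.append(max(points, key=lambda p: min(math.dist(p, r) for r in reps)))
    return reps

# Hypothetical crescent-shaped cluster: points on the upper half of a unit circle.
crescent = [(math.cos(math.pi * i / 20), math.sin(math.pi * i / 20)) for i in range(21)]

c = centroid(crescent)
reps = pick_representatives(crescent, 4)
print("centroid:", c)
print("distance from centroid to nearest cluster point:",
      min(math.dist(c, p) for p in crescent))
print("representatives:", reps)
```

In this made-up crescent the centroid lies in the hollow of the arc, away from every data point, while the four representatives are actual members of the cluster spread along its length.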

Here's an example of a dataset where the clusters don't look anything at

In this data, on the x axis we have the age of faculty members at

a university like Stanford, and on the y axis we have their salaries.

Now, this is not the actual data, but more a representation of

what the data might look like, although it's based on real-world experience.

Now, the data points marked by h

are salaries of faculty members in the humanities,

whereas the data points marked with an e are salaries of faculty members in engineering.

And as you can see, it's apparent from the graph here that in

the humanities the starting salary is somewhat lower than in engineering.

A humanities faculty member starts at a much lower salary than an engineering faculty member.

But over time, as their tenure increases,

the salary of a humanities faculty member keeps increasing and

eventually overtakes the salary of an engineering faculty member.

The salary of engineering faculty members increases a little bit with their

tenure, but then kind of flattens out; it doesn't increase beyond that.

And this is just a phenomenon that has been observed in terms of salaries at

most universities, and presumably this is because, in

the engineering departments, the fields keep changing so

much that there's a lot of value in bringing in new

faculty with new interests and new expertise.

Whereas in the humanities I guess you age better as you age.

So if you look at these two sets of salaries and

you try to cluster them, what you really want in an ideal world is two clusters:

one that looks at the engineering salaries,

and one that looks at the humanities salaries,

and to cleanly separate out these data points into two separate clusters.

Now, when you're looking at the data, remember you do know that some of

these salaries correspond to engineering faculty members and

some to humanities faculty members, but

the clustering algorithm doesn't have access to this information, and yet you'd

like it to find these clusters in the data.

Now it's too much to hope that a clustering algorithm can actually

find these exact clusters because these are overlapping clusters, and

most clustering algorithms cannot find clusters that actually overlap with

each other, where a single data point is in two clusters.

But at the very least, we can hope that the clustering algorithm finds some
