
  • Hi, I'm Adriene Hill, and welcome back to Crash Course Statistics.

  • In the last episode, we talked about using Machine Learning with data that already has

  • categories that we want to predict.

  • Like teaching a computer to tell whether an image contains a hotdog or not.

  • Or using health information to predict whether someone has diabetes.

  • But sometimes we don't have labels.

  • Sometimes we want to create labels that don't exist yet.

  • Like if you wanted to use test and homework grades to create 3 different groups of students

  • in your Stats course.

  • If you group similar students together, you can target each group with a specific review

  • session that addresses its unique needs.

  • Hopefully leading to better grades!

  • Because the groups don't already exist, we call this Unsupervised Machine Learning

  • since we can't give our models feedback on whether they're right or not.

  • There are no “True” categories to compare our groups with.

  • Putting data into groups that don't already exist might seem kinda weird, but today we'll

  • explore two types of Clustering--the main type of Unsupervised Machine Learning: k-means

  • and Hierarchical clustering.

  • And we'll see how creating new groups can actually help us a lot.

  • INTRO

  • Let's say you own a pizza restaurant.

  • You've been collecting data on your customers' pizza eating habits.

  • Like how many pizzas a person orders a week.

  • And the average number of toppings they get on their pizzas.

  • You're rolling out a new coupon program and you want to create 3 groups of customers

  • and make custom coupons to target their needs.

  • Maybe 2-for-1 five-topping medium pizzas.

  • Or 20% off all plain cheese pizza.

  • Or free pineapple topping!

  • So let's use k-means to create 3 customer groups.

  • First, we plot our data:

  • All we know right now is that we want 3 separate groups.

  • So, what the k-means algorithm does is select 3 random points on your graph.

  • Usually these are data points from your set, but they don't have to be.

  • Then, we treat these random points as the centers of our 3 groups.

  • So we call them “centroids”.

  • We assign each data point (the points in black) to the group of the centroid that it's closest to.

  • This point here is closest to the Green center.

  • So we'll assign it to the green group.

  • Once we assign each point to the group it's closest to, we now have three groups, or clusters.

  • Now that each group has some members, we recalculate the centroid of each group: the mean position of all its members.

  • And now that we have the new centroids we'll repeat this process of assigning every point

  • to the closest centroid and then recalculating the new centroids.

  • The computer will do this over and over again until the centroids “converge”.

  • And here, converge means that the centroids and groups stop changing, even as you keep

  • repeating these steps.

  • Once it converges, you have your 3 groups, or clusters.
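
To make those steps concrete, here's a minimal NumPy sketch of the algorithm as just described. The function name and sample data are made up for illustration, and it assumes no cluster ever ends up empty:

```python
import numpy as np

def k_means(points, k=3, seed=0):
    """A bare-bones k-means, following the steps described above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points to act as the starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # Step 2: assign every point to the group of its closest centroid.
        distances = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recalculate each centroid as the mean of its group's members.
        # (Assumes no group goes empty; a real implementation would handle that.)
        new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
        # Step 4: stop once the centroids "converge" -- that is, stop changing.
        if np.allclose(new_centroids, centroids):
            return labels, centroids
        centroids = new_centroids

# Hypothetical customers: (pizzas ordered per week, average number of toppings)
customers = np.array([[1., 8.], [2., 7.], [1., 9.], [5., 1.], [6., 2.], [4., 5.]])
labels, centroids = k_means(customers, k=3)
```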

  • We can then look at the clusters and decide which coupons to send.

  • For example, this group doesn't order many pizzas each week but when they do, they order

  • a LOT of toppings.

  • So they might like the “Buy 3 toppings get 2 free” coupon.

  • Whereas this group, who orders a lot of simple pizzas, might like the “20% off Medium

  • 2-topping Pizzas” coupon.

  • (This is probably also the pineapple group since really, there aren't that many things

  • that pair well with pineapple and cheese.)

  • If you were a scientist, you might want to look at the differences in health outcomes

  • between the three pizza ordering groups.

  • Like whether the group that orders a lot of pizza has higher cholesterol.

  • You may even want to look at the data in 5 clusters instead of 3.

  • And k-means will help you do that.

  • It will even allow you to create 5 clusters of Crash Course Viewers based on how many

  • Raccoons they think they can fight off, and the number of Pieces of Pizza they claim to

  • eat a week.

  • This is actual survey data from you all.

  • k-means clustering created these 5 groups.

  • We can see that this green group is PRETTY confident that they could fight off a lot

  • of raccoons.

  • But 100 raccoons?

  • No.

  • On the other hand, we also see the light blue group.

  • They have perhaps more reasonable expectations about their raccoon fighting abilities, and

  • they also eat a lot of pizza each week.

  • Which makes me wonder--could they get the pizza delivery folks to help out if we go

  • to war with the raccoons?
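
If you want to try this yourself, scikit-learn's KMeans runs the whole algorithm for you. The numbers below are made-up stand-ins, since the real survey responses aren't reproduced here:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-ins for the survey: columns are
# (raccoons you think you could fight off, pizza slices eaten per week).
survey = np.array([[2, 4], [3, 10], [25, 3], [30, 5], [1, 12],
                   [0, 2], [28, 8], [4, 9], [50, 1], [27, 4]], dtype=float)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(survey)
print(kmeans.labels_)           # which of the 5 groups each viewer landed in
print(kmeans.cluster_centers_)  # the final centroid of each group
```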

  • Unlike the Supervised Machine Learning we looked at last time, you can't calculate

  • the “accuracy” of your results because there are no true groups or labels to compare.

  • However, we're not totally lost.

  • There's one method, called the silhouette score, that can help us determine how well our

  • clusters fit, even without existing labels.

  • Roughly speaking, the silhouette score measures cluster “cohesion and separation”, which

  • is just a fancy way of saying that the data points in that cluster are close to each other,

  • but far away from points in other clusters.

  • Here's an example of clusters that have HIGH silhouette scores.

  • And here's an example of clusters that have LOW silhouette scores.

  • In an ideal world, we prefer HIGH silhouette scores, because that means that there are

  • clear differences between the groups.

  • For example, if you clustered data from lollipops and Filet Mignon based on sugar, fat, and

  • protein content the two groups would be VERY far apart from each other, with very little

  • overlap--leading to high silhouette scores.

  • But if you clustered data from Filet Mignon and a New York Strip steak, the data would

  • probably have lower silhouette scores, because the two groups would be closer together--

  • there'd probably be more overlap.
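
scikit-learn can compute silhouette scores too (they range from -1 to 1, with higher meaning better-separated clusters). Here's a sketch with two invented datasets standing in for the lollipop/steak and steak/steak comparisons:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two far-apart blobs (think lollipops vs. Filet Mignon)...
far_apart = np.vstack([rng.normal([0, 0], 0.5, (30, 2)),
                       rng.normal([10, 10], 0.5, (30, 2))])
# ...and two overlapping blobs (Filet Mignon vs. New York Strip).
overlapping = np.vstack([rng.normal([0, 0], 2.0, (30, 2)),
                         rng.normal([1, 1], 2.0, (30, 2))])

for data in (far_apart, overlapping):
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    # Near 1 for the far-apart groups, much lower for the overlapping ones.
    print(silhouette_score(data, labels))
```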

  • Putting data into groups is useful, but sometimes, we want to know more about the structure of

  • our clusters.

  • Like whether there are subgroups--or subclusters.

  • Like in real life when we could look at two groups: people who eat meat and those who

  • don't.

  • The differences between the groups' health or beliefs might be interesting, but we also

  • know that people who eat meat could be broken up into even smaller groups like people who

  • do and don't eat red meat.

  • These subgroups can be pretty interesting too.

  • A different type of clustering called Hierarchical Clustering allows you to look at the hierarchical

  • structure of these groups and subgroups.

  • For example, look at these ADORABLE dogs.

  • We could use hierarchical clustering to cluster these dogs into groups.

  • First, each dog starts off as its own group.

  • Then, we start merging clusters together based on how similar they are.

  • For example, we'll put these two dogs together to form one cluster, and these two dogs together

  • to form another.

  • Each of these clusters--we could call this one “Retrievers” and this one “Terriers”--

  • is made up of smaller clusters.

  • Now that we have 2 clusters, we can merge them together, so that all the dogs are in

  • one cluster.

  • Again, this cluster is made up of a bunch of sub clusters which are themselves made

  • up of even smaller sub clusters.

  • It's turtles--I mean clusters--all the way down.

  • This graph of how the clusters are related to each other is called a dendrogram.

  • The further up the dendrogram that two clusters join, the less similar they are.

  • Golden and Curly Coated Retrievers connect lower down than Golden Retrievers and Cairn

  • Terriers.
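
SciPy can do this kind of agglomerative clustering and draw the dendrogram for us. The dog measurements and names below are invented for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical dogs, measured as (height in cm, weight in kg).
dogs = np.array([[56., 30.], [58., 32.], [55., 28.],   # Retriever-sized
                 [25., 6.], [28., 7.], [26., 5.]])     # Terrier-sized
names = ["Golden", "Curly Coated", "Labrador", "Cairn", "West Highland", "Norwich"]

# Agglomerative ("bottom-up") clustering: every dog starts as its own
# cluster, and the closest pair of clusters is merged, over and over,
# until only one cluster remains.
merges = linkage(dogs, method="ward")
dendrogram(merges, labels=names)
plt.ylabel("distance at which clusters merge")  # a higher join = less similar
plt.show()
```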

  • One compelling application of hierarchical clustering is to look for subgroups of people

  • with Autism Spectrum Disorder--or ASD.

  • Previously, disorders like Autism, Asperger's, and Childhood Disintegrative Disorder (CDD)

  • were considered separate diagnoses, even though they share some common traits.

  • But, in the latest version of the Diagnostic and Statistical Manual of Mental Disorders--or

  • DSM--these disorders are now classified as a single disorder that has various levels

  • of severity, hence the Spectrum part of Autism Spectrum Disorder.

  • ASD now applies to a large range of traits.

  • Since ASD covers such a large range, it can be useful to

  • create clusters of similar people in order to better understand Autism and provide more

  • targeted and effective treatments.

  • Not everyone with an ASD diagnosis is going to benefit from the same kinds and intensities

  • of therapy.

  • A group at Chapman University set out to look more closely at groups of people with ASD.

  • They started with 16 profiles representing different groups of people with an ASD diagnosis.

  • Each profile has a score between 0 and 1 on 8 different developmental domains.

  • A low score in one of these domains means that domain might need improvement.

  • Unlike our pizza example which had only 2 measurements--# of pizza toppings and # of

  • pizzas ordered per week--this time we have 8 measurements.

  • This can make it tough to visually represent the distance between clusters.

  • But the ideas are the same.

  • Just like two points can be close together in 1 or 2 dimensions, they can be close together

  • in 8 dimensions.
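
The Euclidean distance formula generalizes directly, too. A quick sketch with two invented 8-domain profiles:

```python
import numpy as np

# Two hypothetical profiles, each scored between 0 and 1 on 8 domains.
profile_a = np.array([0.9, 0.8, 0.7, 0.9, 0.8, 0.9, 0.7, 0.8])
profile_b = np.array([0.2, 0.3, 0.4, 0.2, 0.3, 0.2, 0.4, 0.3])

# Same formula as in 2 dimensions: the square root of the sum of squared differences.
print(np.linalg.norm(profile_a - profile_b))
```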

  • When the researchers looked at the 16 profiles, they grouped them together based on their

  • 8 developmental domain scores.

  • In this case, we take all 16 profiles and put each one in its own “cluster”, so

  • we have 16 clusters, each with one profile in it.

  • Then, we start combining clusters that are close together.

  • And then we combine those, and we keep going until every profile is in one big cluster.

  • Here's the dendrogram.

  • We can see that there are 5 major clusters, each made up of smaller clusters.

  • The research team used radar graphs, which look like this, to display each cluster's

  • 8 domain scores on a circle.

  • Low scores are near the center, high scores near the edge of the circle.
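
Radar graphs like these are easy to draw on matplotlib's polar axes. The domain labels and scores here are invented, since the study's exact values aren't reproduced:

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented labels and scores for one profile's 8 developmental domains.
domains = ["Domain 1", "Domain 2", "Domain 3", "Domain 4",
           "Domain 5", "Domain 6", "Domain 7", "Domain 8"]
scores = [0.9, 0.8, 0.7, 0.9, 0.8, 0.9, 0.7, 0.8]

# One angle per domain, then repeat the first point to close the polygon.
angles = np.linspace(0, 2 * np.pi, len(domains), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

ax = plt.subplot(polar=True)
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(domains)
ax.set_ylim(0, 1)  # low scores sit near the center, high scores near the edge
plt.show()
```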

  • This main cluster, which they called Cluster E, has scores consistent with someone who

  • is considered high functioning.

  • Before the change to the DSM, individuals in the cluster might have been diagnosed with

  • Asperger's.

  • The radar graph here shows the scores for the 6 original data points that were put in

  • Cluster E. While there are some small differences, we can see that overall the patterns look

  • similar.

  • So Cluster E might benefit from a less intense therapy plan, while other Clusters with lower

  • scores--like Cluster D--may benefit from more intensive therapy.

  • Creating profiles of similar cases might allow care providers to create more effective, targeted

  • therapies that can more efficiently help people with an ASD diagnosis.

  • If an individual's insurance only covers, say, 7 hours of therapy a week, we want to

  • make sure it's as effective as possible.

  • It can also help researchers and therapists determine why some people respond well to

  • treatments, and others don't.

  • The type of hierarchical clustering that we've been doing so far is called Agglomerative,

  • or bottom-up clustering.

  • That's because all the data points start off as their own cluster, and are merged together

  • until there's only one.

  • Often, we don't have structured groups as a part of our data, but still want to create

  • profiles of people or data points that are similar.

  • Unsupervised Machine Learning can do that.

  • It allows us to use things that we've observed--like the tiny stature of Terriers, or raccoon-fighting

  • confidence --and create groups of dogs, or people that are similar to each other.

  • While we don't always want to categorize people, putting them into groups can help give them

  • better deals on pizza, or better suggestions for books, or even better medical interventions.

  • And for the record, I am always happy to help moderately confident, raccoon-fighting pizza

  • eaters fight raccoons.

  • Just call me. Thanks for watching. I'll see you next time.
