It's becoming increasingly common to use machine learning or AI-driven techniques to make decisions the world over: for example, credit checks and health checks, and these can be life-changing. So it's really important we get this right. You could find yourself turned down for a mortgage on your dream house because, quite literally, the computer says no.
Let's talk a little bit about classification. Now we have a data set where we've got labels. We've got some input features, or input attributes or dimensions, lots of instances, and we've got labels for these instances. So, for example, books and the type of book, or music and the genre of the music: things that we want to start to try and classify.
Supervised learning is the idea that we've got labels for our data. We're still going to have instances, and we're going to have attributes or dimensions for our instances, but we've also now got labels for our data. Classification is the process of learning how to correctly assign these labels to these instances. Before we start talking about classifiers, let's talk a little bit about the machine learning process we're going to use.
It's not enough to say "I've got my data set and I can correctly predict all of the classes", because then someone will ask: well, what happens if we have new data that we haven't seen before? Maybe you've got some medical data and you can correctly diagnose all of the diseases, but a new patient comes along and you incorrectly diagnose their disease. That's not helped anyone. What we need is a regimented way of training and testing these approaches, so that we know how well they apply in the real world.
So what we're going to do is this: we've got some data set, just like before, where we've got some instances and we've got some attributes this way. We might have a lot of attributes or a few; it doesn't really matter. And we also now have our labels, which we often call y. This is going to be a vector of all of the labels for the data, so this could be label 1, there could be a few 2s down here, and there could be a few 3s. This is a bit like our tennis example, where we had the weather outlook and the question "are we going to play tennis today, yes or no?". So you could have multiple labels, or just two for binary classification.
It's not enough just to train a classifier over all this data; we want to make sure that this classifier will work properly when we apply new data to it. So what we're going to do is separate this data into training, validation and testing sets. We're going to train on the training set, then we're going to test as we go on the validation set, and then right at the end, when we're finished, we're going to do a final test on our test set.
The reason we do this is that it's a very safe way to make sure we don't accidentally game the system: we don't accidentally report incredibly good results just because we've already shown the machine those examples. We hold out the validation and test sets for later, to make sure the model will generalize. Now, exactly how much of your data goes into the training, validation and testing sets is really up to you. Typically, you might use something like 70% for training, 15% for validation and 15% for testing; that would be quite a reasonable way of doing it.
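As a sketch, a 70/15/15 random split can be written in a few lines of plain Python (the instance data here is made up):

```python
import random

def split_dataset(instances, train=0.70, val=0.15, seed=42):
    """Shuffle the instances, then cut them into train/validation/test."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = instances[:]        # copy so we don't disturb the original
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],                     # training set
            shuffled[n_train:n_train + n_val],      # validation set
            shuffled[n_train + n_val:])             # test set (the remainder)

instances = list(range(100))       # stand-in for 100 labelled instances
train_set, val_set, test_set = split_dataset(instances)
print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```

The shuffle matters: if the file happens to be sorted by label, an unshuffled split would put one class entirely in the training set.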
So, given that we've done this, what are some good classifiers we could use? Let's imagine we've got our instances and our attributes, and we've split them up, probably randomly, into training, validation and testing sets. We want to train our classifier on the training set and then test it on the validation and testing sets to see how we're getting on. So what algorithms could we use? Let's start with the simplest one of all: ZeroR. In ZeroR, we just take the most common label, and that's what we predict, every time. It's the "you've got five minutes until the deadline, just hand something in" approach to machine learning. In the case of playing tennis or not playing tennis, we could say: well, I played tennis more often than I didn't, so we'll just assume I'm going to play tennis and predict yes all the time, regardless of what the weather is. This is not a good way to perform machine learning, but it does give you a baseline accuracy. If your baseline of just saying yes to everything is 60% accuracy, and your machine learning doesn't perform at least at 60%, then we know we've got a real problem.
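ZeroR is short enough to write out in full; here is a sketch in plain Python, with hypothetical play-tennis labels:

```python
from collections import Counter

def zero_r(train_labels):
    """ZeroR: find the most common label and predict it unconditionally."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda instance: majority   # ignores the instance entirely

# Hypothetical tennis data: we played on 6 days out of 10.
labels = ["yes"] * 6 + ["no"] * 4
classify = zero_r(labels)
print(classify({"outlook": "rainy"}))  # "yes", regardless of the weather
```

Its training accuracy (60% here) is exactly the baseline figure mentioned above.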
We can go one better than that: we can use OneR. OneR is where we pick one of our attributes, make a classification based only on that, and then keep whichever single attribute does best. It's slightly better than ZeroR, but not a lot. You'll find references to these in the literature a little bit, but not very much.
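A sketch of OneR in plain Python, again with made-up play-tennis data: for each attribute, build a rule that predicts the majority label per attribute value, then keep the attribute whose rule gets the most training instances right.

```python
from collections import Counter, defaultdict

def one_r(instances, labels):
    """OneR: one rule per attribute value, keep the best single attribute."""
    best = None
    for attr in instances[0]:
        # Majority label for each value this attribute takes.
        by_value = defaultdict(Counter)
        for inst, lab in zip(instances, labels):
            by_value[inst[attr]][lab] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        correct = sum(rule[i[attr]] == l for i, l in zip(instances, labels))
        if best is None or correct > best[0]:
            best = (correct, attr, rule)
    _, attr, rule = best
    return lambda inst: rule[inst[attr]]

# Hypothetical data: outlook separates the labels perfectly, windy doesn't.
X = [{"outlook": "sunny", "windy": "no"}, {"outlook": "rain", "windy": "yes"},
     {"outlook": "sunny", "windy": "yes"}, {"outlook": "rain", "windy": "no"}]
y = ["yes", "no", "yes", "no"]
classify = one_r(X, y)
print(classify({"outlook": "sunny", "windy": "yes"}))  # "yes"
```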
We use much more powerful approaches than this now. So let's talk about one example classifier that's very popular: KNN, or k-nearest neighbours. Let's imagine we've got a two-attribute data set; I like to draw in two dimensions, it's just a little easier for me. So we've got attribute 1 and attribute 2, and we've got some different data points in here. Don't forget that each of these also has a label, so this one, let's say, has the label "did play tennis". When we want to test a new, unseen data point, so a new person comes along who may or may not play tennis, they're going to appear over here. We measure them and we find the k nearest neighbours to this point. So that's this one, this one, this one, this one, this one and this one: 1, 2, 3, 4, 5, 6, so this would be k = 6. Then we take the majority vote, or the average, of these responses: if four out of six of these people play tennis, this new point is assigned "play tennis". So the output is: what have we already seen nearby in the existing data set, and can we use that to make a prediction?
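The whole procedure fits in a few lines. A sketch with made-up 2-D points, using Euclidean distance and a majority vote:

```python
import math
from collections import Counter

def knn_classify(train, query, k=6):
    """k-nearest neighbours: find the k closest training points to the
    query point and take a majority vote over their labels."""
    # train is a list of ((attr1, attr2), label) pairs
    by_distance = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical points: players cluster near (1, 1), non-players near (5, 5).
train = [((1, 1), "plays"), ((1, 2), "plays"), ((2, 1), "plays"),
         ((2, 2), "plays"), ((5, 5), "doesn't"), ((5, 6), "doesn't")]
print(knn_classify(train, (1.5, 1.5), k=3))  # "plays"
```

Note the full sort of the training set on every query: that is exactly the cost problem described next.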
This is quite a good approach. Obviously, choosing k is a little bit difficult to do, and this starts to get very, very slow when you've got hundreds and hundreds of dimensions: finding the k nearest points to a given point when you've got tens of thousands of dimensions, or tens of thousands of instances, is not easy to do even with good data structures, which is why it gets slow quite quickly. Nevertheless, it's an effective and popular approach.
Are there any alternatives? There is one: decision trees. Now, I like decision trees. They have the nice benefit that once we've created a decision tree, which is just a series of decisions on the data ("is it this? yes; is it this? no"), we can actually look at the rules and say: OK, that's how the decision was made, and that's quite a good rule set. So it's kind of a way of writing a sort of if-else program, but you're doing it automatically. Let's draw out another data set.
We've got our instances down here and our attributes here, and remember, for each of our instances we're going to have some label that we're trying to output. So here: 1, 2, 3, 4, 5, 6 and so on. Let's imagine this is a credit score, or a credit check. You've got attributes based on how much money you've got, how much you spend, and maybe whether you already have other loans, and what we want to do is make a decision as to whether you should be allowed more credit or not. The answer is quite simply yes or no. A decision tree is going to partition the data up based on the attributes.
So let's say the first rule is: credit rating greater than or equal to 5? If the answer is yes, we continue; if the answer is no, then we output a leaf node here which says "credit denied". On the yes branch we say: OK, the credit rating is at least 5, so it's not a no yet. Now we ask: do they earn more than, let's say, 10,000 a year, or something like that? If the answer is yes, we proceed to the next stage; if it's no, then they don't earn enough: credit denied.
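Those two rules really are just a pair of nested if statements; a sketch with the thresholds from the description (the thresholds themselves are illustrative):

```python
def credit_decision(credit_rating, income):
    """The hand-drawn tree above: two tests, three leaf nodes."""
    if credit_rating >= 5:           # first internal node
        if income > 10_000:          # second internal node
            return "continue to next stage"
        return "credit denied"       # doesn't earn enough
    return "credit denied"           # credit rating too low

print(credit_decision(credit_rating=7, income=25_000))  # "continue to next stage"
print(credit_decision(credit_rating=3, income=25_000))  # "credit denied"
```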
This is what a decision tree does. Now, you don't have to design this yourself; there are algorithms to produce decision trees for you. The way they work is that, at each level, they pick the attribute that best separates the data out. So, for example, given all the instances of yes and no decisions in your training set: is credit rating the best way of separating out the yeses and the nos? One attribute is going to be best at each individual step, and we can use all of them in a tree structure like this, until we get to a series of leaf nodes which end up with only yeses or only nos. Then it's very simple to apply this when new data comes along: we apply these rules and we get to a decision. A decision tree is equivalent to programming a bunch of carefully chosen if statements, but of course the benefit is that you can do this over a huge number of attributes very, very quickly, without having to do it all yourself.
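One common way to score "best separates the data out" is information gain, the entropy-based measure used by C4.5-style learners (Weka's J48 uses a gain-ratio variant). A minimal sketch for one categorical attribute, with made-up counts:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(instances, labels, attr):
    """Entropy before the split minus the weighted entropy after it."""
    before = entropy(labels)
    after = 0.0
    for value in set(inst[attr] for inst in instances):
        subset = [l for inst, l in zip(instances, labels) if inst[attr] == value]
        after += len(subset) / len(labels) * entropy(subset)
    return before - after

# Made-up training set: "defaulted" separates the labels perfectly,
# "employed" doesn't help at all, so the tree splits on "defaulted" first.
X = [{"defaulted": "yes", "employed": "no"}, {"defaulted": "yes", "employed": "yes"},
     {"defaulted": "no", "employed": "yes"}, {"defaulted": "no", "employed": "no"}]
y = ["deny", "deny", "approve", "approve"]
print(information_gain(X, y, "defaulted"))  # 1.0 (a perfect split)
print(information_gain(X, y, "employed"))   # 0.0 (no help)
```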
So the result is not necessarily much better than doing it yourself, but it's much quicker. Let's have a look at this in some code. We're going to change and use a different piece of software today, because for things like classification and prediction we're going to use Weka. It's a very simple tool that makes applying things like decision trees very, very easy, and it has some of the same data-cleaning processes as R does, but in a graphical form. We've already prepared our credit report, so we've got credit data with a number of inputs: things like how much money people make, and whether they've defaulted on any credit before. We have these in a file, so I'm going to go in here and find my file. You can load up various file types, JSON files for example; we're going to load a CSV. It's our credit data, so we have about 600 rows of whether or not people (I think this data originally came from Japan) were given credit or not.
We have things like age, debt, marital status, whether they're already a customer at the bank, whether they've got a driving licence, and their current credit score. You can see that Weka has loaded all of these and worked out whether they're nominal or numerical values: for example, credit score is a numerical value. And you can see here a quick histogram that shows the different types and whether they've been approved for credit. "Approved", at the bottom, Weka has interpreted as the output, the classification that we're trying to achieve. So in this data set we have (you can just about see that font) 307 approved and 383 denied credit. Let's train up a decision tree and see how it does. We're going to go to Classify.
We're going to select a decision tree. We could choose ZeroR, but we're not going to; we're going to go down to "trees" and J48, which is your standard decision tree. We're going to use a percentage split and select 70% for our training set (this setup doesn't have a validation set). We're going to be predicting whether or not they were approved, and then we're going to train it up. What happens is that Weka trains the decision tree and then produces for us some measurements of its accuracy. You can see it has correctly classified 85% of the testing set, which is good. I mean, it means a lot to these people, so maybe those 15% could be a bit aggrieved. Then we get a confusion matrix down here.
It says that, of the yeses, 76 were correctly allowed credit and 22 were incorrectly denied; and of the noes, 100 were correctly denied and 9 were accidentally allowed. So those are the errors we can see here.
Now, the nice thing about decision trees is that we can look at these rules and see what they are, so we can go into "Visualize tree". You can see that the most important attribute decided on is whether or not they defaulted on a loan prior to this: anyone that has defaulted on a loan before is immediately denied credit. If they haven't defaulted on a loan, it starts to look at whether they were employed, and if they are, it's going to give them credit. It's a simple rule system, and it's the best it can do given the amount of data we've got. If they aren't employed, it's going to look at their income (maybe they're self-employed), and then make a decision based on whether they're married, where they live, and their income again. So you can use attributes multiple times to build complex decision-making processes. This is a very simple tree which has actually performed pretty well on this data set, and it's not a huge data set, so 85% is not too bad.
Once you've used a classifier, KNN or a decision tree say, to classify your data, what you really want to know is how well it performs on your testing set. You could quite simply calculate accuracy: what percentage of the time were we correct? Obviously that's going to be harder to achieve with many classes, but for credit, yes or no, 85% is not bad; if guessing at random would average 50%, it's quite a lot better than that.
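Using the counts read off the confusion matrix earlier (76 and 100 correct, 22 and 9 wrong), accuracy is just correct over total:

```python
# Counts from the Weka confusion matrix above.
tp, fn = 76, 22    # should-be-approved: correctly approved / wrongly denied
tn, fp = 100, 9    # should-be-denied:  correctly denied  / wrongly approved

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 176 correct out of 207
print(f"{accuracy:.0%}")  # 85%
```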
There's another type of classifier that's perhaps a little bit more common these days, and a little bit more powerful than decision trees, and that's the support vector machine. So what is a support vector machine? Well, what we're going to try and do is separate our classes based on a line, or a plane, or some separation in the attributes that we have, and we're going to try and maximize the separation between the two classes to make our decision more effective.
So let's imagine we have two attributes, just like before, so this axis is attribute 1 and this is attribute 2. Don't forget, this is labelled training data, so we already know which class each point is in; this is not like clustering. So maybe we have some data over here, and maybe some data over here. Now, obviously this is quite an easy one. We're going to try to find a decision boundary between these two classes that maximizes the separation. For example, one decision boundary we could pick would be this one here, but it's not perfect, because it's very close to this point here and very close to this point here, so the points on the fringes risk being misclassified. And you've got to remember that this is just a training set: if we start to bring in testing data, which may appear around here, or around here, maybe that's the stuff that gets misclassified.
So what a support vector machine will do is pick a line between these data points where the distance to the nearest points is maximized. These nearest points are called support vectors, and this margin here is going to be as big as we can get it: you can imagine that as we move the line around, the margin gets bigger and smaller. Now, the nice thing about support vector machines, in a kind of almost reverse-PCA approach, is that you can convert the data into a higher-dimensional space and perform quite complicated separation of things that aren't really obviously separable like this, things where essentially we have to make a nonlinear decision.
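To make the margin idea concrete: for a 2-D linear boundary ax + by + c = 0, the margin of that boundary is the perpendicular distance to its nearest training point(s), and those nearest points are the support vectors. A sketch with made-up points and a made-up boundary (this only measures a given line's margin; it doesn't do the SVM's optimization to find the best line):

```python
import math

def distance_to_line(point, a, b, c):
    """Perpendicular distance from a 2-D point to the line ax + by + c = 0."""
    x, y = point
    return abs(a * x + b * y + c) / math.hypot(a, b)

def margin(points, a, b, c):
    """Distance from the boundary to its nearest point(s): the points
    achieving this minimum are the support vectors."""
    return min(distance_to_line(p, a, b, c) for p in points)

# Two made-up classes either side of the line x + y - 6 = 0.
points = [(1, 1), (2, 1), (1, 2),    # class A
          (5, 5), (6, 5), (5, 6)]    # class B
print(margin(points, 1, 1, -6))      # ~2.12: (2,1) and (1,2) are the support vectors
```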
So not a simple line, but something more complex, like a curve. A lot of the time we're going to look at precision and recall. Recall is a measure of: for all the positive things, all the people that should have been granted credit, how many did we actually find? So, of the people we should have said yes to, how many times did we actually say yes? That's a measure of how good our algorithm is at spotting that class. Precision is: of the ones it spotted, what percentage were correct? You can imagine a situation where your recall is very high because you've just said yes to everyone: you've spotted every single person that should have got credit, but your precision is low, because you were also giving credit to loads of people who shouldn't have had it. So a really good algorithm is one that has both a very high precision and a very high recall.
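Using the same counts as the earlier confusion matrix (76 true positives, 22 false negatives, 9 false positives), precision and recall, plus the F1 score that combines them as their harmonic mean, work out as:

```python
tp, fn, fp = 76, 22, 9   # counts from the credit example above

precision = tp / (tp + fp)    # of those we approved, how many deserved it
recall = tp / (tp + fn)       # of those who deserved it, how many we approved
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.89 recall=0.78 f1=0.83
```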
And we combine these measures into one score, the F1 score or F-score, which is going to be a value between nought and one, where one is absolutely perfect and zero means it doesn't work at all. So where did our training data come from in this case?
We got our training data off the Internet. But if you're a credit agency, then what you're going to do is use humans to make these initial decisions, then train a machine, and then test to see whether it can do as well as the people can. Maybe there's nuance there that this decision tree couldn't capture: those 15% of people that were misclassified, is there something we could have done better to help them?
What you'll find happens in practice is that you train a system but maybe don't rely on it entirely. For the very obvious yeses, we can use a decision tree or some other classifier to just say: yes, those people are fine. For the obvious noes, we can say: no, they're not going to get credit. But for the edge cases, the people in the middle, maybe that's when we bring a human into the loop.
So in our data set, for our training examples, we're going to have all of the attributes and then, crucially, an already-known label for that data which says: yes, that person was denied credit, or they were allowed credit. We're going to use those training examples, input attributes and output yes-or-no decisions, to train our classifier, and then we're going to test the results to see whether or not it works when we apply it to unseen test data for unknown cases. Classifiers let us put our data into discrete labels: yes or no; A, B or C.