It's becoming increasingly common to use machine learning or AI-driven techniques to make decisions the world over: for example, credit checks and health checks, and these can be life-changing. So it's really important we get this right. You could find yourself turned down for a mortgage on your dream house because, quite literally, the computer says no.
Let's talk a little bit about classification. Now we have a data set where we've got labels. We've got some input features, or input attributes or dimensions, lots of instances, and we've got labels for these instances. So, for example, books and the type of book, or music and the genre of the music: things that we want to start to try and classify.
Supervised learning is the idea that we've got labels for our data. We're still going to have instances, and we're going to have attributes or dimensions for our instances, but we've also now got labels for our data. Classification is the process of learning how to correctly assign these labels to these instances. Before we start talking about classifiers, let's talk a little bit about the machine learning process we're going to use.
It's not enough to say "I've got my data set and I can correctly predict all of the classes", because then someone will ask: well, what happens if we have new data that we haven't seen before? Maybe you've got some medical data and you can correctly diagnose all of the diseases, but a new patient comes along and you incorrectly diagnose their disease. That's not helped anyone. What we need is a regimented way of training and testing these approaches, so that we know how well they apply in the real world.
So what we're going to do is this: we've got some data set, just like before, where we've got some instances and we've got some attributes this way. We might have a lot of attributes or a few; it doesn't really matter. And we also now have our labels, which we often call y. This is going to be a vector of all of the labels for the data, so this could be label 1, there could be a few 2s down here, and there could be a few 3s. This is a bit like our tennis example, where we had the weather outlook and the question "are we going to play tennis today, yes or no?". So you could have multiple labels, or just two for binary classification.
It's not enough just to train a classifier over all this data; we want to make sure that this classifier will work properly when we apply new data to it. So what we're going to do is separate this data into training, validation and testing sets. We're going to train on the training set, then we're going to test as we go on the validation set, and then right at the end, when we're finished, we're going to do a final test on our test set.
The reason we do this is that it's a very safe way to make sure we don't accidentally game the system: we don't accidentally report incredibly good results just because we've already shown the machine those examples. We hold out the validation and test sets for later, to make sure the model will generalize. Now, exactly how much of your data goes into the training, validation and testing sets is really up to you. Typically, you might use something like 70% for training, 15% for validation and 15% for testing; that would be quite a reasonable way of doing it.
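As a sketch, a 70/15/15 random split can be written in a few lines of plain Python (the instance data here is made up):

```python
import random

def split_dataset(instances, train=0.70, val=0.15, seed=42):
    """Shuffle the instances, then cut them into train/validation/test."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = instances[:]        # copy so we don't disturb the original
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],                     # training set
            shuffled[n_train:n_train + n_val],      # validation set
            shuffled[n_train + n_val:])             # test set (the remainder)

instances = list(range(100))       # stand-in for 100 labelled instances
train_set, val_set, test_set = split_dataset(instances)
print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```

The shuffle matters: if the file happens to be sorted by label, an unshuffled split would put one class entirely in the training set.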
So, given that we've done this, what are some good classifiers we could use? Let's imagine we've got our instances and our attributes, and we've split them up, probably randomly, into training, validation and testing sets. We want to train our classifier on the training set and then test it on the validation and testing sets to see how we're getting on. So what algorithms could we use? Let's start with the simplest one of all: ZeroR. In ZeroR, we just take the most common label, and that's what we predict, every time. It's the "you've got five minutes until the deadline, just hand something in" approach to machine learning. In the case of playing tennis or not playing tennis, we could say: well, I played tennis more often than I didn't, so we'll just assume I'm going to play tennis and predict yes all the time, regardless of what the weather is. This is not a good way to perform machine learning, but it does give you a baseline accuracy. If your baseline of just saying yes to everything is 60% accuracy, and your machine learning doesn't perform at least at 60%, then we know we've got a real problem.
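ZeroR is short enough to write out in full; here is a sketch in plain Python, with hypothetical play-tennis labels:

```python
from collections import Counter

def zero_r(train_labels):
    """ZeroR: find the most common label and predict it unconditionally."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda instance: majority   # ignores the instance entirely

# Hypothetical tennis data: we played on 6 days out of 10.
labels = ["yes"] * 6 + ["no"] * 4
classify = zero_r(labels)
print(classify({"outlook": "rainy"}))  # "yes", regardless of the weather
```

Its training accuracy (60% here) is exactly the baseline figure mentioned above.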
We can go one better than that: we can use OneR. OneR is where we pick one of our attributes, make a classification based only on that, and then keep whichever single attribute does best. It's slightly better than ZeroR, but not a lot. You'll find references to these in the literature a little bit, but not very much.
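A sketch of OneR in plain Python, again with made-up play-tennis data: for each attribute, build a rule that predicts the majority label per attribute value, then keep the attribute whose rule gets the most training instances right.

```python
from collections import Counter, defaultdict

def one_r(instances, labels):
    """OneR: one rule per attribute value, keep the best single attribute."""
    best = None
    for attr in instances[0]:
        # Majority label for each value this attribute takes.
        by_value = defaultdict(Counter)
        for inst, lab in zip(instances, labels):
            by_value[inst[attr]][lab] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        correct = sum(rule[i[attr]] == l for i, l in zip(instances, labels))
        if best is None or correct > best[0]:
            best = (correct, attr, rule)
    _, attr, rule = best
    return lambda inst: rule[inst[attr]]

# Hypothetical data: outlook separates the labels perfectly, windy doesn't.
X = [{"outlook": "sunny", "windy": "no"}, {"outlook": "rain", "windy": "yes"},
     {"outlook": "sunny", "windy": "yes"}, {"outlook": "rain", "windy": "no"}]
y = ["yes", "no", "yes", "no"]
classify = one_r(X, y)
print(classify({"outlook": "sunny", "windy": "yes"}))  # "yes"
```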
We use much more powerful approaches than this now. So let's talk about one example classifier that's very popular: KNN, or k-nearest neighbours. Let's imagine we've got a two-attribute data set; I like to draw in two dimensions, it's just a little easier for me. So we've got attribute 1 and attribute 2, and we've got some different data points in here. Don't forget that each of these also has a label, so this one, let's say, has the label "did play tennis". When we want to test a new, unseen data point, so a new person comes along who may or may not play tennis, they're going to appear over here. We measure them and we find the k nearest neighbours to this point. So that's this one, this one, this one, this one, this one and this one: 1, 2, 3, 4, 5, 6, so this would be k = 6. Then we take the majority vote, or the average, of these responses: if four out of six of these people play tennis, this new point is assigned "play tennis". So the output is: what have we already seen nearby in the existing data set, and can we use that to make a prediction?
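The whole procedure fits in a few lines. A sketch with made-up 2-D points, using Euclidean distance and a majority vote:

```python
import math
from collections import Counter

def knn_classify(train, query, k=6):
    """k-nearest neighbours: find the k closest training points to the
    query point and take a majority vote over their labels."""
    # train is a list of ((attr1, attr2), label) pairs
    by_distance = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical points: players cluster near (1, 1), non-players near (5, 5).
train = [((1, 1), "plays"), ((1, 2), "plays"), ((2, 1), "plays"),
         ((2, 2), "plays"), ((5, 5), "doesn't"), ((5, 6), "doesn't")]
print(knn_classify(train, (1.5, 1.5), k=3))  # "plays"
```

Note the full sort of the training set on every query: that is exactly the cost problem described next.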
This is quite a good approach. Obviously, choosing k is a little bit difficult to do, and this starts to get very, very slow when you've got hundreds and hundreds of dimensions: finding the k nearest points to a given point when you've got tens of thousands of dimensions, or tens of thousands of instances, is not easy to do even with good data structures, which is why it gets slow quite quickly. Nevertheless, it's an effective and popular approach.
Are there any alternatives? There is one: decision trees. Now, I like decision trees. They have the nice benefit that once we've created a decision tree, which is just a series of decisions on the data ("is it this? yes; is it this? no"), we can actually look at the rules and say: OK, that's how the decision was made, and that's quite a good rule set. So it's kind of a way of writing a sort of if-else program, but you're doing it automatically. Let's draw out another data set.
We've got our instances down here and our attributes here, and remember, for each of our instances we're going to have some label that we're trying to output. So here: 1, 2, 3, 4, 5, 6 and so on. Let's imagine this is a credit score, or a credit check. You've got attributes based on how much money you've got, how much you spend, and maybe whether you already have other loans, and what we want to do is make a decision as to whether you should be allowed more credit or not. The answer is quite simply yes or no. A decision tree is going to partition the data up based on the attributes.
So let's say the first rule is: credit rating greater than or equal to 5? If the answer is yes, we continue; if the answer is no, then we output a leaf node here which says "credit denied". On the yes branch we say: OK, the credit rating is at least 5, so it's not a no yet. Now we ask: do they earn more than, let's say, 10,000 a year, or something like that? If the answer is yes, we proceed to the next stage; if it's no, then they don't earn enough: credit denied.
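Those two rules really are just a pair of nested if statements; a sketch with the thresholds from the description (the thresholds themselves are illustrative):

```python
def credit_decision(credit_rating, income):
    """The hand-drawn tree above: two tests, three leaf nodes."""
    if credit_rating >= 5:           # first internal node
        if income > 10_000:          # second internal node
            return "continue to next stage"
        return "credit denied"       # doesn't earn enough
    return "credit denied"           # credit rating too low

print(credit_decision(credit_rating=7, income=25_000))  # "continue to next stage"
print(credit_decision(credit_rating=3, income=25_000))  # "credit denied"
```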
This is what a decision tree does. Now, you don't have to design this yourself; there are algorithms to produce decision trees for you. The way they work is that, at each level, they pick the attribute that best separates the data out. So, for example, given all the instances of yes and no decisions in your training set: is credit rating the best way of separating out the yeses and the nos? One attribute is going to be best at each individual step, and we can use all of them in a tree structure like this, until we get to a series of leaf nodes which end up with only yeses or only nos. Then it's very simple to apply this when new data comes along: we apply these rules and we get to a decision. A decision tree is equivalent to programming a bunch of carefully chosen if statements, but of course the benefit is that you can do this over a huge number of attributes very, very quickly, without having to do it all yourself.
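One common way to score "best separates the data out" is information gain, the entropy-based measure used by C4.5-style learners (Weka's J48 uses a gain-ratio variant). A minimal sketch for one categorical attribute, with made-up counts:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(instances, labels, attr):
    """Entropy before the split minus the weighted entropy after it."""
    before = entropy(labels)
    after = 0.0
    for value in set(inst[attr] for inst in instances):
        subset = [l for inst, l in zip(instances, labels) if inst[attr] == value]
        after += len(subset) / len(labels) * entropy(subset)
    return before - after

# Made-up training set: "defaulted" separates the labels perfectly,
# "employed" doesn't help at all, so the tree splits on "defaulted" first.
X = [{"defaulted": "yes", "employed": "no"}, {"defaulted": "yes", "employed": "yes"},
     {"defaulted": "no", "employed": "yes"}, {"defaulted": "no", "employed": "no"}]
y = ["deny", "deny", "approve", "approve"]
print(information_gain(X, y, "defaulted"))  # 1.0 (a perfect split)
print(information_gain(X, y, "employed"))   # 0.0 (no help)
```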
So the result is not necessarily much better than doing it yourself, but it's much quicker. Let's have a look at this in some code. We're going to change and use a different piece of software today, because for things like classification and prediction we're going to use Weka. It's a very simple tool that makes applying things like decision trees very, very easy, and it has some of the same data-cleaning processes as R does, but in a graphical form. We've already prepared our credit report, so we've got credit data with a number of inputs: things like how much money people make, and whether they've defaulted on any credit before. We have these in a file, so I'm going to go in here and find my file. You can load up various file types, JSON files for example; we're going to load a CSV. It's our credit data, so we have about 600 rows of whether or not people (I think this data originally came from Japan) were given credit or not.
We have things like age, debt, marital status, whether they're already a customer at the bank, whether they've got a driving licence, and their current credit score. You can see that Weka has loaded all of these and worked out whether they're nominal or numerical values: for example, credit score is a numerical value. And you can see here a quick histogram that shows the different types and whether they've been approved for credit. "Approved", at the bottom, Weka has interpreted as the output, the classification that we're trying to achieve. So in this data set we have (you can just about see that font) 307 approved and 383 denied credit. Let's train up a decision tree and see how it does. We're going to go to Classify.
We're going to select a decision tree. We could choose ZeroR, but we're not going to; we're going to go down to "trees" and J48, which is your standard decision tree. We're going to use a percentage split and select 70% for our training set (this setup doesn't have a validation set). We're going to be predicting whether or not they were approved, and then we're going to train it up. What happens is that Weka trains the decision tree and then produces for us some measurements of its accuracy. You can see it has correctly classified 85% of the testing set, which is good. I mean, it means a lot to these people, so maybe those 15% could be a bit aggrieved. Then we get a confusion matrix down here.
It says that, of the yeses, 76 were correctly allowed credit and 22 were incorrectly denied; and of the noes, 100 were correctly denied and 9 were accidentally allowed. So those are the errors we can see here.
Now, the nice thing about decision trees is that we can look at these rules and see what they are, so we can go into "Visualize tree". You can see that the most important attribute decided on is whether or not they defaulted on a loan prior to this: anyone that has defaulted on a loan before is immediately denied credit. If they haven't defaulted on a loan, it starts to look at whether they were employed, and if they are, it's going to give them credit. It's a simple rule system, and it's the best it can do given the amount of data we've got. If they aren't employed, it's going to look at their income (maybe they're self-employed), and then make a decision based on whether they're married, where they live, and their income again. So you can use attributes multiple times to build complex decision-making processes. This is a very simple tree which has actually performed pretty well on this data set, and it's not a huge data set, so 85% is not too bad.
Once you've used a classifier, KNN or a decision tree say, to classify your data, what you really want to know is how well it performs on your testing set. You could quite simply calculate accuracy: what percentage of the time were we correct? Obviously that's going to be harder to achieve with many classes, but for credit, yes or no, 85% is not bad; if guessing at random would average 50%, it's quite a lot better than that.
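Using the counts read off the confusion matrix earlier (76 and 100 correct, 22 and 9 wrong), accuracy is just correct over total:

```python
# Counts from the Weka confusion matrix above.
tp, fn = 76, 22    # should-be-approved: correctly approved / wrongly denied
tn, fp = 100, 9    # should-be-denied:  correctly denied  / wrongly approved

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 176 correct out of 207
print(f"{accuracy:.0%}")  # 85%
```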
There's another type of classifier that's perhaps a little bit more common these days, and a little bit more powerful than decision trees, and that's the support vector machine. So what is a support vector machine? Well, what we're going to try and do is separate our classes based on a line, or a plane, or some separation in the attributes that we have, and we're going to try and maximize the separation between the two classes to make our decision more effective.
So let's imagine we have two attributes, just like before, so this axis is attribute 1 and this is attribute 2. Don't forget, this is labelled training data, so we already know which class each point is in; this is not like clustering. So maybe we have some data over here, and maybe some data over here. Now, obviously this is quite an easy one. We're going to try to find a decision boundary between these two classes that maximizes the separation. For example, one decision boundary we could pick would be this one here, but it's not perfect, because it's very close to this point here and very close to this point here, so the points on the fringes risk being misclassified. And you've got to remember that this is just a training set: if we start to bring in testing data, which may appear around here, or around here, maybe that's the stuff that gets misclassified.
So what a support vector machine will do is pick a line between these data points where the distance to the nearest points is maximized. These nearest points are called support vectors, and this margin here is going to be as big as we can get it: you can imagine that as we move the line around, the margin gets bigger and smaller. Now, the nice thing about support vector machines, in a kind of almost reverse-PCA approach, is that you can convert the data into a higher-dimensional space and perform quite complicated separation of things that aren't really obviously separable like this, things where essentially we have to make a nonlinear decision.
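To make the margin idea concrete: for a 2-D linear boundary ax + by + c = 0, the margin of that boundary is the perpendicular distance to its nearest training point(s), and those nearest points are the support vectors. A sketch with made-up points and a made-up boundary (this only measures a given line's margin; it doesn't do the SVM's optimization to find the best line):

```python
import math

def distance_to_line(point, a, b, c):
    """Perpendicular distance from a 2-D point to the line ax + by + c = 0."""
    x, y = point
    return abs(a * x + b * y + c) / math.hypot(a, b)

def margin(points, a, b, c):
    """Distance from the boundary to its nearest point(s): the points
    achieving this minimum are the support vectors."""
    return min(distance_to_line(p, a, b, c) for p in points)

# Two made-up classes either side of the line x + y - 6 = 0.
points = [(1, 1), (2, 1), (1, 2),    # class A
          (5, 5), (6, 5), (5, 6)]    # class B
print(margin(points, 1, 1, -6))      # ~2.12: (2,1) and (1,2) are the support vectors
```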
So not a simple line, but something more complex, like a curve. A lot of the time we're going to look at precision and recall. Recall is a measure of: for all the positive things, all the people that should have been granted credit, how many did we actually find? So, of the people we should have said yes to, how many times did we actually say yes? That's a measure of how good our algorithm is at spotting that class. Precision is: of the ones it spotted, what percentage were correct? You can imagine a situation where your recall is very high because you've just said yes to everyone: you've spotted every single person that should have got credit, but your precision is low, because you were also giving credit to loads of people who shouldn't have had it. So a really good algorithm is one that has both a very high precision and a very high recall.
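Using the same counts as the earlier confusion matrix (76 true positives, 22 false negatives, 9 false positives), precision and recall, plus the F1 score that combines them as their harmonic mean, work out as:

```python
tp, fn, fp = 76, 22, 9   # counts from the credit example above

precision = tp / (tp + fp)    # of those we approved, how many deserved it
recall = tp / (tp + fn)       # of those who deserved it, how many we approved
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.89 recall=0.78 f1=0.83
```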
And we combine these measures into one score, the F1 score or F-score, which is going to be a value between nought and one, where one is absolutely perfect and zero means it doesn't work at all. So where did our training data come from in this case?
We got our training data off the Internet. But if you're a credit agency, then what you're going to do is use humans to make these initial decisions, then train a machine, and then test to see whether it can do as well as the people can. Maybe there's nuance there that this decision tree couldn't capture: those 15% of people that were misclassified, is there something we could have done better to help them?
What you'll find happens in practice is that you train a system but maybe don't rely on it entirely. For the very obvious yeses, we can use a decision tree or some other classifier to just say: yes, those people are fine. For the obvious noes, we can say: no, they're not going to get credit. But for the edge cases, the people in the middle, maybe that's when we bring a human into the loop.
So in our data set, for our training examples, we're going to have all of the attributes and then, crucially, an already-known label for that data which says: yes, that person was denied credit, or they were allowed credit. We're going to use those training examples, input attributes and output yes-or-no decisions, to train our classifier, and then we're going to test the results to see whether or not it works when we apply it to unseen test data for unknown cases. Classifiers let us put our data into discrete labels: yes or no; A, B or C.