  • It's becoming increasingly common to start using machine learning or AI-driven techniques to make decisions

  • the world over. So, for example, you know, credit checks, health checks, and these can be life-changing,

  • right? So it's really important we get this right. You could find yourself turned down for a mortgage on your dream house because, quite literally,

  • the computer says no.

  • Let's talk a little bit about classification. So now we have a data set where we've got labels

  • All right, so we've got some input features, or input

  • attributes or dimensions, lots of instances, and we've got some labels for these instances.

  • All right, and so we've got, for example, books and the type of book, or music and the genre of the music,

  • things that we want to start to try and classify.

  • So supervised learning is the idea that we've got labels for our data. So we're still gonna have instances

  • We're gonna have attributes or dimensions to our instances. But we've also now got labels for our data and so

  • classification is the process of learning how to correctly assign these labels to these instances. Before we start talking about classifiers,

  • let's talk a little bit about the machine learning process

  • we want to use. It's not enough to say, I've got my data set and I can correctly predict all of the classes, right? Because

  • then someone will ask, well, what happens if we have any new data that we haven't seen before, right?

  • Maybe you've got some medical data and you can correctly

  • diagnose all of the diseases, but a new patient comes along and you could incorrectly diagnose the disease, right? That's not helped

  • anyone.

  • What we need is a regimented way of training and testing these approaches so that we know how well they apply in the real world

  • So what we're going to do is we've got some data set just like before

  • Where we've got some instances and we've got some attributes this way and so, you know

  • We might have a lot of attributes a few it doesn't really matter

  • and we also now have our labels, which we often call Y, right? But this is going to be a vector of all of the

  • labels for our data. So this could be label one, these could be a few twos down here,

  • and this could be a few threes.

  • So this is a bit like our tennis example, where we had: this is the weather outlook, and are we going to play

  • tennis today, right? Yes or no. So you could have multiple labels, or just two for binary classification.

  • It's not enough just to train a classifier over all this data

  • We want to make sure that this classifier will work properly when we apply new data to it.

  • So what we're going to do is we're going to separate this data into training sets

  • And testing sets so we're going to train on the training set

  • Then we're going to test as we go on the validation set and then right at the end when we're finished we're going to do

  • a final test on our test set

  • The reason we do this is it's a very safe way to make sure that we don't accidentally game the system.

  • We don't accidentally report incredibly good results on the training set,

  • but that's because we've already shown the machine those things. So we hold out the validation and test sets for later to make

  • sure that it will generalize. Now, exactly how much of your data goes in the training, validation and testing sets is really up to you,

  • right

  • typically

  • You might use something like 70% for training

  • 15% for validation and 15% for testing. That would be quite a reasonable way of doing it.
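
A minimal sketch of the 70/15/15 split described above, in plain Python. The function name `train_val_test_split`, the fixed seed and the stand-in data are assumptions made for illustration; they are not from the video.

```python
import random

def train_val_test_split(rows, train=0.70, val=0.15, seed=0):
    """Shuffle the rows, then cut them into training, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # random split, reproducible via the seed
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])     # whatever is left becomes the test set

rows = list(range(100))  # stand-in for 100 labelled instances
train_set, val_set, test_set = train_val_test_split(rows)
print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```

Other proportions can be passed in; the point is only that the three sets never overlap.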

  • So what are some good classifiers we could use given that we've done this right? Let's imagine. We've got our instances

  • We've got our attributes and we split them up probably randomly into training validation and testing

  • What we want to do is train our classifier on the training set and then test it on the validation and testing sets to see

  • how we're getting on. So what algorithms could we use? Let's start with the simplest one of all: ZeroR. In ZeroR,

  • we just take the most common label and that's what we predict every time. It's the

  • you've-got-five-minutes-until-the-deadline, just-hand-something-in

  • approach to machine learning. In the case of playing tennis or not playing tennis, we could say, well, I played tennis more than I didn't,

  • So we'll just assume that I'm going to play tennis and predict. Yes all the time

  • All right, regardless of what the weather is this is not a good way to perform machine learning

  • But I suppose it does give you a baseline accuracy, right?

  • If your baseline of just saying yes to everything is sixty percent accuracy,

  • then if your machine learning doesn't perform at least at 60 percent, we know we've got a real problem.
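
The baseline idea can be sketched in a few lines. This is an illustrative version of ZeroR with made-up tennis labels; the names `zero_r`, `train_labels` and `test_labels` are hypothetical.

```python
from collections import Counter

def zero_r(train_labels):
    """Return a predictor that ignores its input and always gives the majority label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda instance: majority

# Made-up data: we played tennis on 6 of 10 training days.
train_labels = ["yes"] * 6 + ["no"] * 4
predict = zero_r(train_labels)

# Made-up held-out days: 3 "yes" and 2 "no".
test_labels = ["yes", "no", "yes", "yes", "no"]
hits = sum(predict(None) == label for label in test_labels)
baseline_accuracy = hits / len(test_labels)
print(predict(None), baseline_accuracy)  # yes 0.6
```

Any real classifier should beat `baseline_accuracy` on the same held-out data, otherwise it has learned nothing useful from the attributes.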

  • We can go one better than that

  • We can use OneR. OneR is where we pick one of our attributes,

  • we make a classification based only on that, and then we pick the best of those attributes.

  • I mean, it's slightly better than ZeroR, but not a lot.
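
A rough sketch of that one-attribute idea, assuming categorical attributes and scoring each attribute by how many training rows its rule gets wrong. The data and the name `one_r` are made up for illustration.

```python
from collections import Counter, defaultdict

def one_r(instances, labels):
    """Return (attribute, rule, errors) for the single attribute with fewest training errors."""
    best = None
    for attr in instances[0]:
        # For each value of this attribute, count how often each label occurs.
        counts = defaultdict(Counter)
        for inst, label in zip(instances, labels):
            counts[inst[attr]][label] += 1
        # The rule maps every attribute value to its most common label.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[inst[attr]] != label for inst, label in zip(instances, labels))
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

# Made-up tennis data with two attributes.
instances = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "overcast", "windy": "yes"},
]
labels = ["yes", "yes", "no", "no", "yes", "no"]

attr, rule, errors = one_r(instances, labels)
print(attr, errors)  # outlook 1
```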

  • So you will find references to these in the literature a little bit, but not very much,

  • because we use much more powerful approaches than this. So let's talk about one example classifier that's very popular, and that's

  • KNN, or k-nearest neighbor. Let's imagine we've got a two-attribute

  • data set. So I like to draw in two dimensions. It's just a little easier for me.

  • And so we've got attribute one and attribute two, and we've got some different data points in here now

  • Don't forget also that each of these is going to have a prediction as well

  • so this one

  • is going to have, let's say, a label of: we did play tennis. When we want to test a new data point, an unseen data point,

  • so a new person comes along who may or may not play tennis, they're going to appear over here.

  • We measure them and we find the K number of nearest neighbors to this point

  • So that's this one this one this one this one and this one

  • so this will be 1 2 3 4 5 6 this would be K of 6 and then we take the majority vote or the

  • Average of these responses so if four out of six of these people play tennis, this would be assigned to play tennis

  • So the output is what in the existing data set. Have we already seen nearby?

  • And can we use that to make a prediction?
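
The procedure just described, find the K nearest labelled points and take a majority vote, can be sketched as follows. The points, the Euclidean distance choice and the name `knn_predict` are assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Label the query by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Made-up two-attribute data: (attribute 1, attribute 2) and whether the person plays tennis.
train_points = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 2), (3, 3), (8, 8), (9, 8), (8, 9)]
train_labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "no"]

# Four of the six nearest neighbours say "yes", so the new point is assigned "yes".
prediction = knn_predict(train_points, train_labels, query=(2.2, 1.8), k=6)
print(prediction)  # yes
```

This brute-force version measures the distance to every training point for each query, which is exactly the part that becomes slow as the data grows.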

  • So this is quite a good approach. Obviously, choosing K is a little bit difficult to do,

  • right? And this starts to get very, very slow when you've got hundreds and hundreds of dimensions. Finding the K nearest points to a point

  • when you've got tens of thousands of dimensions or tens of thousands of instances is not easy to do, even with good data structures.

  • It starts to get slow quite quickly. Nevertheless, this is an effective and popular approach.

  • Are there any alternatives? There is one: decision trees. All right, now, I like decision trees.

  • They have a nice benefit that once we've created a decision tree,

  • which is just a series of decisions on: is the data this? Yes. Is it this?

  • No. Once we've done all that, we can actually look at the rules and say, OK, that's how a decision was made,

  • and that's quite a good rule set. So it's kind of a way of writing a sort of if-else

  • programming language, but you're doing it automatically. Let's draw out another data set.

  • So we've got our instances down here and we've got our attributes here and remember for each of our instances

  • We're going to have some label that we're trying to output

  • All right

  • So here well

  • You know 1 2 3 4 5 6 and so on

  • So let's imagine that this is a credit score for a credit check.

  • So you've got attributes based on how much money you've got, how much you spend, maybe whether you already have other loans,

  • and

  • What we want to do is make a decision as to whether you should be allowed more credit or not, right?

  • So the answer is yes or no quite simply so a decision tree is going to partition the data up based on the attributes

  • So let's say the first rule is credit rating credit rating

  • You know greater than or equal to 5 question mark and if the answer is yes

  • We continue if the answer is no

  • Then we actually output a leaf node here

  • which says credit denied. Here we say, OK, so the credit rating's above five. It's not a no yet.

  • Now we say okay do they earn?

  • More than let's say 10,000 a year or something like that

  • And if the answer is yes, we proceed to the next stage if it's no then they don't earn enough credit denied

  • This is what a decision tree does now you don't have to design this yourself. There are algorithms to produce decision trees for you
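
Written out by hand, the two rules from this example look like the sketch below. The field names `credit_rating` and `income` are hypothetical, and the later stages of the tree are not specified in the talk, so the function stops at a placeholder.

```python
def credit_decision(applicant):
    """Apply the first two hand-written rules of the example tree."""
    if applicant["credit_rating"] < 5:      # "credit rating >= 5?" answered no
        return "credit denied"
    if applicant["income"] <= 10_000:       # "earn more than 10,000 a year?" answered no
        return "credit denied"
    return "next check"                     # later stages are not given in the talk

print(credit_decision({"credit_rating": 3, "income": 40_000}))  # credit denied
print(credit_decision({"credit_rating": 7, "income": 8_000}))   # credit denied
print(credit_decision({"credit_rating": 7, "income": 40_000}))  # next check
```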

  • The way they work is they pick one of these attributes at each level that best separates the data out.

  • so for example

  • you've got a lot of different instances of yes and no decisions in your training set. Is

  • credit rating the best way of separating out the yeses and the nos?

  • One of them is going to be best for each individual step, and we can use all of them in a tree structure like this

  • until we get to a series of leaf nodes which end up with only yeses and

  • only nos. And then it's very simple to apply this: when new data comes along,

  • we apply these rules and we get to a decision. A decision tree is going to be

  • Equivalent to programming a bunch of carefully chosen if statements

  • but of course the benefit is that you can do this over a huge number of

  • Attributes very very quickly without having to do all this yourself, right?
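
How an algorithm might decide which attribute "best separates" the yeses from the nos is not spelled out here. One common criterion, used as an assumption in this sketch, is information gain: the drop in label entropy after splitting on an attribute. The data and function names are made up.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy of a list of labels, in bits (0 means the group is pure)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(instances, labels, attr):
    """Entropy before the split minus the size-weighted entropy after it."""
    groups = defaultdict(list)
    for inst, label in zip(instances, labels):
        groups[inst[attr]].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_attribute(instances, labels):
    """Pick the attribute whose split gives the largest information gain."""
    return max(instances[0], key=lambda a: information_gain(instances, labels, a))

# Made-up credit rows: was credit approved?
instances = [
    {"defaulted": "yes", "employed": "yes"},
    {"defaulted": "yes", "employed": "no"},
    {"defaulted": "no", "employed": "yes"},
    {"defaulted": "no", "employed": "yes"},
    {"defaulted": "no", "employed": "no"},
    {"defaulted": "no", "employed": "no"},
]
labels = ["no", "no", "yes", "yes", "yes", "no"]

print(best_attribute(instances, labels))  # defaulted
```

A full tree builder would repeat this choice on each resulting subset until the leaves contain only one label.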

  • So yes, it's not much better than doing it yourself, but it's much quicker. So let's have a look at this in some code.

  • We're going to change and use a different piece of software today, because for things like classification and

  • prediction we're going to use Weka. It's a very simple tool that makes applying things like decision trees very, very simple,

  • and it has some of the same data cleaning processes as R does, but in a graphical form. We've already prepared our credit report,

  • right

  • so we've got credit data where we have a number of inputs things like how much money do they make whether they've

  • defaulted on any credit before. We have these in a file, so I'm gonna go in here,

  • I'm gonna find my file. It's gonna be in here. Right, now, you can load up various file types, JSON files

  • for example. We're gonna load a CSV. It's our credit data. So we have about 600 rows of

  • whether or not people, I think it was Japan this data originally came from, were given credit or not.

  • So we have things like age debt

  • Marital status whether they're a customer at the bank already

  • whether they've got a driving license, what their current credit score is. And you can see that what Weka has done is load all these

  • and work out whether they're nominal values or numerical values already.

  • So for example credit score is a numerical value

  • And you can see here a quick histogram that shows the different types and whether they've been approved for credit

  • Approved at the bottom Weka has interpreted as the output or the classification that we're trying to achieve

  • Alright, so in this data set we have 307 you can almost see that font

  • 307 approved and

  • 383

  • denied credit. So let's train up a decision tree and see how it does. So we're going to go to Classify.

  • We're going to select a decision tree. So we're going to choose. We could choose ZeroR.

  • Let's not, so I'm gonna go down to trees and J48, which is your standard decision tree.

  • We're gonna use a percentage split and we're going to select 70% for our training set. This one doesn't have a validation set.

  • We're gonna be predicting whether or not they were approved, and then we're gonna train

  • up like this. What happens is Weka will train the decision tree and then it will produce for us some measurements of its accuracy. You

  • Can see it's correctly classified

  • 85% of the testing set which is good. I mean, it means a lot to these people

  • So maybe those 15% could be a bit aggrieved and then we get a confusion matrix down here

  • So we're saying that of the yeses, 76 were correctly

  • allowed credit and

  • 22 were denied incorrectly, and of the nos, a hundred were correctly denied and nine were accidentally allowed, right?
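
As a quick check, the four counts quoted above reproduce the reported accuracy. They sum to 207, which is 30% of the 690 rows (307 approved plus 383 denied), consistent with the 70% training split. The dictionary layout below is just one way to hold a confusion matrix.

```python
# Keys are (actual, predicted); the counts are the ones read out from Weka.
confusion = {
    ("yes", "yes"): 76,   # correctly allowed credit
    ("yes", "no"): 22,    # incorrectly denied
    ("no", "no"): 100,    # correctly denied
    ("no", "yes"): 9,     # accidentally allowed
}

total = sum(confusion.values())
correct = sum(n for (actual, predicted), n in confusion.items() if actual == predicted)
accuracy = correct / total
print(f"{correct}/{total} = {accuracy:.1%}")  # 176/207 = 85.0%
```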

  • So that's the error we can see here. Now,

  • the nice thing about decision trees is we can now look at these rules and see what they are.

  • So we can go into Visualize tree,

  • and so you can see that the most important attribute that is decided on is whether or not they defaulted on a loan

  • prior to this. So anyone that defaulted on a loan before is immediately denied credit. If they

  • haven't defaulted on a loan, then it starts to look at whether they were employed, and if they are,

  • It's going to give them credit

  • All right. It's a simple rule system and it's the best it can do given the amount of data

  • we've got. If they aren't employed, then it's going to look at their income.

  • Maybe they're self-employed. It's gonna make a decision then on whether they're married, where they live, and their income again.

  • Right, so you can use attributes multiple times to make complex decision making processes

  • So this is a very simple tree

  • which actually has performed pretty well on this data set, and it's not a huge data set. For 85%, that's not too bad.

  • Once you've used a classifier, so KNN or a decision tree, to classify your data,

  • what you want to know really is how well it performs on your testing set. So you could quite simply calculate accuracy.

  • So what is the percentage of the time that we were correct?

  • Obviously that's going to be hard to do for many classes, but for credit, yes or no, 85 percent is not bad.

  • Right, if our average was guessing at 50%, it's quite a lot better than that.

  • There's another type of classifier that's perhaps a little bit more common these days and a little bit more powerful than decision trees, and that's

  • the support vector machine. So what is a support vector machine?

  • well

  • what we're going to try and do is

  • separate our classes based on a line or plane or some separation in the attributes that we have.

  • But what we're going to do is try and maximize the separation between these two classes to make our decision more effective.

  • So let's imagine we have two attributes just like before, so this is attribute one and this is attribute two.

  • Don't forget this is labeled training data, so we know which classes these are in already. This is not like clustering.

  • So maybe we have some data over here and we have maybe some data over here.

  • Now obviously this is quite an easy one.

  • We're going to try to find a decision boundary between these two classes that maximizes the separation.

  • So for example, one decision boundary we could pick would be this one here.

  • Right, but it's not perfect, because it's very close to this point here and it's very close to this point here.

  • So these are on the fringe of being misclassified.

  • Right and you've got to think that this is just a training set if we start to bring in testing data that may appear around

  • Here or around here. Maybe that's the stuff that gets misclassified
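
One way to put a number on "very close" is the distance from a boundary line to its nearest training point (often called the margin). The sketch below only measures that distance for two hand-picked lines; it does not train a support vector machine, and the points and lines are made up.

```python
import math

def margin(w, b, points):
    """Smallest perpendicular distance from any point to the line w[0]*x + w[1]*y + b = 0."""
    norm = math.hypot(*w)
    return min(abs(w[0] * x + w[1] * y + b) / norm for x, y in points)

class_a = [(1, 1), (2, 1), (1, 2)]
class_b = [(5, 5), (4, 5), (5, 4)]
points = class_a + class_b

close_boundary = ((1, 1), -3.5)   # the line x + y = 3.5, hugging class_a
middle_boundary = ((1, 1), -6.0)  # the line x + y = 6, halfway between the classes

# Both lines separate the two classes, but the second leaves far more room.
print(round(margin(*close_boundary, points), 3))   # 0.354
print(round(margin(*middle_boundary, points), 3))  # 2.121
```

New points that land slightly off the training data are much less likely to cross the second line than the first.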

  • So what a support vector machine will do is pick a line between these data points

  • where the distance to the nearest point is maximized. These nearest points are called support vectors, right? So this

  • margin here is going to be as big as we can get it. So you can imagine if we move this around, the margin's going

  • to get bigger and smaller. Now, the nice thing about support vector machines is, in a kind of almost reverse-PCA approach,

  • you can convert this into a higher dimensional space and perform quite complicated

  • separation of things that aren't really obviously separable like this, things where essentially we have to have a nonlinear decision made, right?

  • So not a simple line, something more complex like a curve. A lot of the time we're going to look at precision and recall.

  • So recall is a measure of, for all the positive things,

  • so all the people that should have been granted credit, how many of them actually were? So where we should have said yes,

  • how many times did we actually say

  • yes, right?

  • And that's a measure of how good our algorithm is at spotting

  • that class. And precision is: of the ones it spotted, what percentage of them were correct?

  • You can imagine a situation where your recall might be very high because you've just said yes to everyone right?

  • So yes

  • You spotted every single person that should have got credit

  • But also your precision is low because you were giving it to loads of people who shouldn't have had it, right?

  • So a really good algorithm is going to be one that has a very high precision and a very high recall
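
A small worked example of precision and recall, using the counts from the decision tree run earlier (76 correctly allowed, 9 accidentally allowed, 22 incorrectly denied). It also computes their harmonic mean, the usual single-number summary known as F1. The function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """tp: correctly said yes, fp: wrongly said yes, fn: wrongly said no."""
    precision = tp / (tp + fp)   # of the ones we said yes to, how many were right
    recall = tp / (tp + fn)      # of the ones that should be yes, how many we found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# The decision tree from the Weka run.
p, r, f1 = precision_recall_f1(tp=76, fp=9, fn=22)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.894 0.776 0.831

# Saying yes to all 207 test applicants: every true yes is found, many wrong yeses.
p_all, r_all, f1_all = precision_recall_f1(tp=98, fp=109, fn=0)
print(round(p_all, 3), r_all)  # 0.473 1.0
```

The second call shows the trap described above: recall is perfect, but less than half of the approvals were deserved.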

  • Right, and we combine these measures into one score,

  • F1 or F-score, and this is going to be a value between nought and one, where one is

  • absolutely perfect and zero is doesn't work at all. Where did our training data come from in this case?

  • We got our training data off the Internet, right?

  • But if you're a credit agency

  • Then what you're going to do is you're going to use humans to make these initial decisions

  • Then you're going to train a machine and you're going to test to see whether it can do as well as people can do right

  • Maybe there's nuance there that this decision tree couldn't capture. Those 15 percent of people that were misclassified:

  • is there something we could have done better to help those people?

  • So what you'll find happens in practice is you train a system,

  • but maybe you don't rely on it entirely. Maybe for the very obvious

  • yeses, we can use a decision tree or some other classifier to just say, yeah,

  • those people are fine.

  • Maybe for the obvious nos, we can say no,

  • They're not going to get credit

  • But for the edge cases the people in the middle, maybe that's when we bring a human into the loop
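
One possible way to wire a human into the loop, assuming the classifier can report how confident it is in a "yes" as a number between 0 and 1. The thresholds 0.9 and 0.1 and the name `route` are purely illustrative; they are not from the talk.

```python
def route(p_yes, approve_above=0.9, deny_below=0.1):
    """Decide automatically only when the classifier is very sure either way."""
    if p_yes >= approve_above:
        return "approve"        # the obvious yeses
    if p_yes <= deny_below:
        return "deny"           # the obvious nos
    return "human review"       # edge cases go to a person

for p in (0.97, 0.03, 0.55):
    print(p, route(p))
```

Widening or narrowing the gap between the two thresholds trades automation against how many cases a person has to look at.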

  • So in our data set for our training examples

  • We're going to have all of the attributes, and then we're crucially gonna have an already known label for that data

  • that says yes, that person was denied credit, or they were allowed credit, right? So we're going to use those training examples of

  • input attributes and output yes-or-no decisions to train our

  • classifier, and then we're going to test the results and whether or not it'll work when we use our unseen test data for unknown cases.

  • Classifiers let us put groups into discrete labels: yes or no, A, B or C.