

  • People need to learn to use standardized measures for things. So take me, for example.

  • When I drive anywhere, I drive in miles, and I drive in miles per hour.

  • My fuel economy is measured in miles per gallon, but of course, I don't pump fuel in gallons.

  • I pump it in litres.

  • But when I run anywhere, so short distances, I run in kilometres, and I run in kilometres per hour.

  • So I'm using two different systems there, and any short distances I'm measuring are going to be in metres, not feet, right?

  • So if I'm measuring, let's say, around my house for painting, I'm going to measure in square metres so I know how much paint to buy.

  • But then if I'm selling a house, or I'm buying a house, I'm going to be looking at the size of the house in square feet. Again, who knows why? British people.

  • If I'm baking anything, it's going to be weight in grams or kilograms going into the recipe.

  • But if I'm weighing myself, it's going to be in stones and pounds. And of course a ton, for me, would be a metric tonne, not an imperial ton.

  • As I said, I measure fuel in litres, and most of my liquids are measured in litres, except of course for beer and milk, which are in pints.

  • So this is the kind of problem you're going to be dealing with when you're looking at data: you're trying to transform your data into a usable form.

  • Maybe the data is coming from different sources and none of it goes together. You need standardized units and standardized scales so we can go on and analyze it.

  • Let's think back: what we're doing is trying to prepare our data into its densest, cleanest format, so that we can apply modeling, or machine learning, or some kind of statistical test, to work out what's going on and draw knowledge from our data. So this is going to be an iterative process.

  • We're going to clean the data, we're going to transform the data, and then we're going to reduce the data, and transforming data is what we're going to do today.

  • So let's imagine that you've cleaned your data: we've got rid of as many missing values as possible, hopefully all of them, and deleted instances and attributes that just weren't going to work out for us.

  • Now what we're going to try and do is transform our data so that everything's on the same scale and everything makes sense together. And if we're bringing datasets from different places, we need to also make sure that the units are the same and everything makes sense.

  • There's no point in trying to use machine learning, or clustering, or any other mechanism to draw knowledge from our data if our data is all wrong.

  • So today we're going to be looking at census data. Now, census data is a classic example of the kind of data you might look at in data analysis: it's got lots of different kinds of attributes, things that are going to need cleaning up and transforming.

  • So we're back in R. We're going to read the census data using census <- read.csv. We've downloaded some census data that represents samples from the US population.

  • To begin with, we're going to read that in, and you can see that we've got 32,000 observations and 15 attributes, or variables.

  • So what are these attributes? Let's have a quick look at just a little bit of it, and we can see the kind of thing we're looking at. So we're going to say head(census), and that's just going to produce the first few rows.
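
In R, those steps might look something like this minimal sketch (the file name census.csv is an assumption):

    census <- read.csv("census.csv")  # assumed local file name
    dim(census)   # roughly 32,000 observations and 15 attributes
    head(census)  # peek at the first few rows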

  • So we can see the kind of data: you can see we've got age, we've got what working classification that person has, their educational level, a numerical representation of whether they're married or not, this kind of thing.

  • So there's a lot of different kinds of data here, and some of it's going to be nominal. So, for example, this work class, state government or private employee, that's a nominal value.

  • We might have ordinal values, or ratio values, or interval values.

  • All right

  • We're going to have to delve in a little bit closer to find out what these are. Now, what we do to transform this data into a usable format for clustering or machine learning is going to depend on exactly what the types of these columns are and what we want to do with them.

  • So let's look at just a couple of the attributes and see what we can do with them, right?

  • We're going to use a process called codification. The idea is that things like random forests, or multi-layer perceptrons, you know, neural networks, aren't going to be very amenable to text-based inputs, and what we want to do is try and replace these attributes with a numerical score.

  • All right

  • So let's look at, for example, the work class, and also, for example, the educational level, so education. Now, work class is the kind of class of worker that we're looking at here: for example a state worker, or someone in the private sector, or someone that works in a school, or something like this.

  • Now, this is a nominal value. That means there's no order to this data at all: we can't say that someone in state is higher or lower than someone in private, and we also can't say that, let's say, state is two times more or less than some other one. That makes no sense at all.

  • So what we can do is replace this with numbers. Let's say we could replace private with zero, and state with one, and, you know, self-employed with two, and so on, right? And that's a perfectly reasonable thing to do, but it's still nominal data.

  • So what we can't do is then calculate a mean and say that the mean is halfway between private and public. That doesn't make any sense. Just because something has been replaced by a numerical score doesn't mean that it actually represents something we can quantify in that way, right? It's still nominal data.
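
As a hedged sketch of that codification step in R (the column name workclass is an assumption), one simple route is via factor codes:

    # Replace the text-based work class with arbitrary integer codes.
    # The codes are labels only: the mode is meaningful, a mean is not.
    census$workclass <- as.integer(factor(census$workclass))
    table(census$workclass)  # the most frequent code is the mode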

  • Okay, so the best advice I can give is: feel free to codify your data into easy-to-read numbers, but just bear in mind that you can calculate the mode, you know, the most common value, but you can't calculate the median and you can't calculate the mean.

  • Another example would be something like the educational level.

  • Now, arguably this is ordinal data, so we could say that someone with an undergraduate degree is maybe slightly higher, in terms of the amount of time they've spent in education, than someone with a high school diploma. But we don't know exactly what the distance is.

  • And what's the distance between, let's say, a high school diploma and a degree, and then a PhD? And so on, an MD, and things like this.

  • We can represent these using numbers, and probably in order, right? So we could say that zero is no education, and one is sort of the end of primary school, and two is the end of high school, and so on and so forth.

  • But again, it's difficult to calculate distances between these things. We can't say that high school is two times more than primary school, or half of a degree, or something like that. That doesn't really make sense.

  • So again, you might be able to calculate a median on this, or a mode, but you can't calculate an average. You can't say the average level of education is halfway between high school and undergraduate. That doesn't make any sense either.
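
For ordinal data like this, R's ordered factors capture the ordering without pretending the gaps are equal. A minimal sketch, assuming a few made-up level names (the real data will have its own labels; values not listed would become NA):

    edu.levels <- c("Preschool", "HS-grad", "Some-college",
                    "Bachelors", "Masters", "Doctorate")   # assumed labels
    census$education <- factor(census$education,
                               levels = edu.levels, ordered = TRUE)
    # Median and mode make sense on the codes; a mean does not.
    median(as.integer(census$education), na.rm = TRUE)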

  • So for any kind of attribute that is nominal, or possibly ordinal, and is represented using text, we can codify it so that it's more amenable to things like decision trees, depending on the library you're using, right?

  • But you just have to be careful: all machine learning algorithms will take any number you give them, and you have to be sure that this makes sense to do.

  • So what you would do is go through your data and begin to systematically replace appropriate attributes with numerical versions of themselves, remembering all the time that they don't necessarily represent true numbers, you know, in a ratio or interval format.

  • So for any text-based value, we're going to start replacing these, possibly with numerical scores. What about the numerical values?

  • Well, they might be okay, but the issue is going to be one of scale.

  • You might find, for example, in this census data, that one of the dimensions, or one of the attributes, is much, much larger than another one. So, for example, this dataset has hours per week, which is obviously going to be somewhere between naught and maybe 60 or 70 hours for someone that's got, you know, a very strong work ethic,

  • and salary, right? Or salary, or income, or any other measure of, you know, monetary gain. Now, obviously, hours per week is going to be in the tens, and salary could be into the tens of thousands, maybe even the hundreds of thousands.

  • Those scales are not even close to being the same. That means, if you're doing clustering or machine learning on this kind of data, you're going to find that the salary kind of overpowers everything, right?

  • So it's going to be very easy for your clustering to find differences in salary, and it's harder for it to spot differences in hours, because they're so small in comparison.

  • Right. So we need to start to bring everything onto the same scale. The more attributes you have, which is another way of saying the more dimensions you have to your data, the further everything is going to be spread around.

  • If we can scale all of these values to between, let's say, around 0 and 1, then everything gets more tightly controlled in the middle, and so it gets much easier to do clustering, or machine learning, or any kind of analysis we want.
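
A minimal min-max scaling sketch in R, mapping a column onto the 0-to-1 range just described (the column name hours.per.week is an assumption):

    min.max <- function(x) {
      (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
    }
    census$hours.per.week <- min.max(census$hours.per.week)
    range(census$hours.per.week)   # now exactly 0 to 1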

  • So let's look back at our data and see what we can do to try and scale some of this into the right range. We're going to look back at the head of our data again.

  • So our numerical values are things like the capital gain and the capital loss, which, I guess, presumably is how much money they've made or lost that year, probably normalized on some scale,

  • and then things like the hours per week that they work, and their salary, which in this case is greater than or less than 50,000.

  • So let's have a quick look at the kind of range of values we're looking at here, so we can see if scaling's even necessary. Maybe we got lucky and the person did it before they sent us the data.

  • So we're going to apply a function across all the columns, and we're going to calculate the range of the data. So this is going to be apply, on the census data, dimension 2, so that's all of our columns, and we're going to use the range function for this.
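
That line is essentially the following sketch (MARGIN = 2 means apply over columns):

    # With mixed column types the data frame is coerced to a character
    # matrix, which is why the nominal columns get alphabetical "ranges" too.
    apply(census, 2, range)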

  • And this is going to tell us, okay, so for example, the age ranges from 17 to 90, and the educational level from 1 to 16.

  • It gives you the range for things like nominal values as well, but they don't really make any sense. I mean, working class ranging from question mark to "without pay", you know, is meaningless.

  • And then, for example, capital gain ranges from zero to nearly one hundred thousand, capital loss from zero to four thousand, and finally the hours per week range from 1 to 99.

  • So you can see that the capital gain is many orders of magnitude larger in scale than the hours per week. We're going to need to try and scale this data.

  • What we'll begin by doing, to make our lives a little bit easier, is just focus on the numerical attributes, right? So we don't have to worry about the nominal values, which we've not codified yet.

  • We're going to select all the columns from the data where they are numeric. So that's this line here, and then down here.

  • So we're going to sapply, which applies over each of the fields, is.numeric, and that's going to give us a logical list that says true or false, depending on whether those columns are numeric.

  • What we're doing here is selecting from this list any that are true, and then finding their names. So: what are the names of the columns that are numeric?
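
In R, that selection might look something like this sketch:

    numeric.cols <- sapply(census, is.numeric)  # TRUE/FALSE per column
    names(census)[numeric.cols]                 # names of the numeric attributes
    sapply(census[, numeric.cols], range)       # ranges of just those columns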

  • So let's have a look at just the range of these attributes, to make our lives a little bit easier. So I'm going to run this line,

  • and this is a simplified version of what I was just showing you. You can see that capital gain is massive compared to the hours per week, for example.

  • Let's have a look at the standard deviation. Recall that the standard deviation is the average distance from the mean, so it kind of gives us an idea of the spread of some data. Like, is it very tight, and everyone earns roughly the same, or is it very spread out, with huge deviations? And the answer is: there are pretty huge deviations.
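
Computing that per column might look something like this (numeric.cols is the assumed selection from the earlier sketch):

    sapply(census[, numeric.cols], sd)   # standard deviation of each numeric column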

  • So the age has a standard deviation of 13, and obviously that means that most people are going to be kind of in the middle: on average, they're going to be 13 years younger or older. But you can see that things like capital gain have a 7,000 standard deviation, which is a huge amount.

  • To give you some idea of what we're aiming for, it's very common to standardize this kind of data so that the standard deviation is 1, right? So 7,000 is much too big.
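
Standardizing might look something like this sketch, using R's built-in scale() on the numeric columns selected earlier (numeric.cols is again the assumed variable from the previous sketch):

    # scale() centres each column on its mean and divides by its standard
    # deviation, so every numeric column ends up with mean 0 and sd 1.
    census[, numeric.cols] <- scale(census[, numeric.cols])
    sapply(census[, numeric.cols], sd)   # each should now be 1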

  • Let's plot an example that gives you some idea of what the kind of problem is when we have these massive ranges. So I'm going to plot here a graph of age versus capital gain, right?
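
The plot call, on the raw unscaled data, might be something like this (column names are assumptions):

    plot(census$age, census$capital.gain,
         xlab = "Age", ylab = "Capital gain")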

  • We know age goes between about one and a hundred, and capital gain is much, much larger. So if I run this, basically the figure makes no sense at all, because the capital gain ranges from zero to one hundred thousand, and, as a few people earn right at the top of the scale, everything is sort of squished down at the bottom. We can't see anything that's going on.

  • There's no way of telling whether the capital gain of an individual is related to their age. I mean, it probably is, like, because retired people, or people who are very young, perhaps earn slightly less, but we can't really tell.