## Subtitles

• People need to learn to use standardized measures for things. So take me,

• for example: when I drive anywhere, I drive in miles and I drive in miles per hour

• My fuel economy is measured in miles per gallon, but of course, I don't pump fuel in gallons

• I pump it in litres

• And then when I run anywhere, so short distances, I run in kilometres and I run in kilometres per hour

• So I'm using two different systems there, and any short distances I'm measuring are going to be in metres, not feet, right

• so if I'm measuring, let's say,

• around my house for painting, I'm going to measure in square metres so I know how much paint to buy, but then

• If I'm selling a house, or I'm buying a house

• I'm going to be looking at the size of the house in square feet. Again, who knows why? British people

• If I'm baking anything, it's going to be weighed in grams or kilograms going into the recipe

• but if I'm weighing myself, it's going to be in stones and

• pounds, but of course a ton for me would be a metric tonne, not an imperial ton, and

• As I said, I measure fuel in litres, and most of my liquids are measured in litres, except of course for beer and milk

• which are in pints. So this is the kind of problem

• You're going to be dealing with when you're looking at data. You're trying to transform your data into a usable form

• Maybe the data is coming from different sources

• None of it goes together. We need standardized units and standardized scales so we can go on and analyze it

• Let's think back

• what we're doing is trying to prepare our data into its

• densest, cleanest format so that we can apply modeling or machine learning or some kind of statistical

• test to work out what's going on and draw knowledge from our data. So this is going to be an iterative process

• We're going to be cleaning the data

• We're going to transform the data, and then we're going to reduce the data, and transforming data is what we're going to do today

• So let's imagine that you've cleaned your data, so we've got rid of as many missing values as possible,

• hopefully all of them, and we've deleted instances and attributes that just weren't going to work out for us

• Now what we're going to try and do is we're going to try and transform our data so that everything's on the same scale

• Everything makes sense together and if we're bringing datasets from different places

• we also need to make sure that the units are the same and everything makes sense

• There's no point in trying to use machine learning or clustering or any other mechanism

• to draw knowledge from our data if our data is all wrong

• So today we're going to be looking at census data. Now, census data is kind of a classic example of the kind of data you

• might look at in data analysis. It's got lots of different kinds of attributes, things that are going to need cleaning up and transforming

• So we're back in R. We're going to read the census data using read.csv into a variable called census

• So we've downloaded some census data that represents samples from the US population

• To begin with we're going to read that in and you can see that we've got 32,000 observations and 15 attributes or variables

• So what do the first items look like? Let's have a quick look at just a little bit of it, and we can see the kind

• of thing we're looking at. So we're going to say head of census, and that's just going to produce the first few rows
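A minimal sketch of what's being run here, assuming the extract lives in a file called census.csv (the file name is an assumption):

```r
# Assumed file name; substitute the path to your own census extract.
census <- read.csv("census.csv")

dim(census)   # roughly 32,000 observations and 15 attributes, as described
head(census)  # the first six rows, to get a feel for the data
```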

• so we can see the kind of data. You can see we've got age,

• we've got what working classification that person has, their educational level, a

• numerical representation of whether they're married or not, this kind of thing

• So there's a lot of different kinds of data here; some of it's going to be nominal

• So, for example, this work class: state government, private employee. That's a nominal value

• We might have ordinal values or ratio values or interval values

• All right

• We're going to have to delve in a little bit closer to find out what these are. Now,

• what we do to transform this data into a usable format for clustering or machine learning

• is going to depend on exactly what the types of these columns are and what we want to do with them

• So let's look at just a couple of the attributes and see what we can do with them, right?

• We're going to use a process called codification. The idea is that maybe things like random forests or

• multi-layer perceptrons, you know, neural networks, aren't going to be very amenable to taking text-based inputs

• And what we want to do is try and replace these attributes with a numerical score

• All right

• So let's look at, just for example, the working class and also, for example,

• the educational level, so education. Now, work class is the kind of class of worker that we're looking at here

• So, for example, a state worker, or someone in the private sector, or someone that worked in a school, or something like this. Now,

• This is a nominal value. That means there's no order to this data at all

• we can't say that someone in state is higher or lower than someone in private, and we also can't say that, let's say,

• state is two times more or less than some other one. That makes no sense at all

• So what we can do is replace this with numbers

• So let's say we could replace private with zero and state with one and,

• you know, self-employed with two, and so on, right, and that's a perfectly reasonable thing to do, but it's still nominal data

• So what we can't do is then calculate a mean and

• say the mean is halfway between private and public; that doesn't make any sense. Just because something has been replaced by a numerical score

• doesn't mean that it actually represents something that we can quantify in that way, right? It's still nominal data
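As a hedged sketch, codification in R could look like this, assuming the column is called workclass (check names(census) for the real name); note that factor() assigns codes alphabetically, so the exact private = 0, state = 1 mapping above would need the levels set explicitly:

```r
# Codify a nominal text column into integer codes (0-based).
census$workclass.code <- as.integer(factor(census$workclass)) - 1

# See which text value ended up with which code.
levels(factor(census$workclass))

# Still nominal: the mode of these codes is fine, a mean is meaningless.
```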

• Okay, so the best advice I can give is: feel free to codify your data into easy-to-read numbers,

• but just bear in mind that you can calculate the mode,

• you know, the most common value, but you can't calculate the median and you can't calculate the mean. Another example would be something like the

• educational level. Now, thankfully, this is ordinal data, so we could say that someone with an undergraduate degree

• is maybe slightly higher, in terms of the amount of time they've spent in education, than someone with a high school diploma

• But we don't know exactly what the distance is

• And what's the distance between, let's say, high school and a degree, and then a PhD?

• And so on, an MD, and things like this

• We can represent these

• using numbers, and probably in order, right? So we could say that zero is no

• education, and one is sort of the end of primary school, and two is the end of high school, and so on and so forth
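In R, one way to sketch that is an ordered factor; the level names below are hypothetical and must be replaced with the values actually present in the education column:

```r
# Hypothetical level names, lowest education first; any value not listed
# here would come out as NA, so match them to your data exactly.
edu_order <- c("None", "Primary", "HS-grad", "Bachelors", "Masters", "Doctorate")

census$education.code <- as.integer(
  factor(census$education, levels = edu_order, ordered = TRUE)
) - 1  # 0 = no education, 1 = primary, 2 = high school, and so on
```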

• But again, it's difficult to calculate distances between these things

• We can't say that high school is two times more than primary school and half of a degree, or something like that

• That doesn't really make sense

• So again, you might be able to calculate a median on this or a mode, but you can't calculate an average

• You can't say the average level of education is halfway between high school and undergraduate; that doesn't make any sense either

• So for any kind of attribute that is nominal or possibly ordinal and it's sort of represented using text

• we can codify this so that it's more amenable to things like decision trees, depending on the library you're using, right?

• But you just have to be careful: all machine learning

• algorithms will take any number you give them, and you just have to be careful that this makes sense to do

• So what you would do is you would go through your data and you'd begin to systematically replace appropriate attributes with numerical versions of themselves

• remembering all the time that they don't necessarily represent true numbers, you know, in a ratio or interval format

• So any text-based values, we're going to start replacing, possibly with numerical scores. What about the numerical values?

• Well, they might be okay, but the issue is going to be one of scale

• you might find, for example, in this census data that one of the

• dimensions or one of the attributes is much, much larger than another one. So for example, this data set has hours per week,

• which is obviously going to be somewhere between nought and maybe 60 or 70 hours for someone that's got,

• you know, a very strong work ethic, and

• salary, right, or salary or income or any other measure of, you know,

• monetary gain. Now, obviously, hours per week is going to be in the tens, and

• salary could be into the tens of thousands, maybe even the hundreds of thousands

• Those scales are not even close to being the same. That means if you're doing clustering or machine learning on this kind of data

• you're going to find that the salary kind of overbears everything, right?

• So it's going to be very easy for your clustering to find differences in salary, and it's harder for it to spot differences in hours,

• because they're so small in comparison

• Right. So we need to start to bring everything onto the same scale. The more attributes you have, which is another way of saying the

• more dimensions you have to your data,

• the further everything is going to be spread around. If we can scale all of these values to between, sort of, let's say, around

• 0 and 1, then everything gets more tightly, sort of, controlled in the middle

• And so it gets much easier to do

• Clustering or machine learning or any kind of analysis we want
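A minimal min-max scaling sketch (the hours.per.week column name is an assumption from a typical census extract):

```r
# Min-max scaling: squash a numeric vector into the range [0, 1].
minmax <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

census$hours.scaled <- minmax(census$hours.per.week)
range(census$hours.scaled)  # now 0 to 1
```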

• So let's look back at our data and see what we can do to try and scale some of this into the right range

• So we're going to look back at the head of our data again

• So our numerical values are things like the capital gain and the capital loss, which I guess presumably is how much money they've made or

• lost that year, probably normalized on some scale,

• and then things like the hours per week that they work, and their salary, which in this case is greater than or less than

• 50,000. So let's have a quick look at the kind of range of values

• we're looking at here, so we can see if scaling's even necessary

• Maybe we got lucky and the person did it before they sent us the data

• So we're going to apply a function across all the columns and we're going to calculate the range of the data

• So this is going to be apply on the census data, dimension 2,

• so that's all of our columns, and we're going to use the range function for this. And this is going to tell us, okay,

• so for example the age ranges from 17 to 90, the educational level from 1 to 16

• It gives you the range for things like nominal values as well, but they don't really make any sense

• I mean, working class ranging from question mark to without pay, you know, is meaningless. And then, so for example, capital gain

• ranges from zero to nearly one hundred

• thousand, and capital loss from zero

• to four thousand, and finally the hours per week ranges from 1 to 99
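The line being run is essentially this; the 2 in apply() means "over columns". A data frame with mixed types gets coerced to character, which is why the nominal columns come back with lexicographic, meaningless "ranges":

```r
# Range (min and max) of every column in the data frame.
apply(census, 2, range)
```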

• So you can see that the capital gain is many orders of magnitude larger in scale than the hours per week

• We're going to need to try and scale this data. We'll begin, to make our lives a little bit easier, by

• just focusing on the numerical attributes, right, so we don't have to worry about the nominal values, which we've not codified yet

• We're going to select all the columns from the data where they are numeric. So that's this line here, and then down here,

• so we're going to sapply, that applies over each of the fields:

• is it numeric, and that's going to give us a

• logical list that says true or false depending on whether those columns are numeric

• What we're doing here is selecting from this list any that are true and then finding their names

• So what are the names of the columns that are numeric?
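Putting that description together, a sketch of the selection:

```r
# sapply() applies is.numeric to each column, giving a logical vector;
# names() then picks out the names of the TRUE (numeric) columns.
numeric_cols <- names(census)[sapply(census, is.numeric)]

census_num <- census[, numeric_cols]  # just the numeric attributes
sapply(census_num, range)             # the simplified range table
```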

• So let's have a look at just the range of these attributes, to make life a little bit easier

• So I'm gonna run this line,

• and so this is a simplified version of what I was just showing; you can see that capital gain is

• massive compared to the hours per week

• for example

• Let's have a look at the standard deviation

• Recall that the standard deviation

• is the average distance from the mean, so it kind of gives us an idea of the spread of some data

• Like, is it very tight and everyone earns roughly the same, or is it very spread out and there are huge

• deviations? And the answer is there are pretty huge deviations. So the age has a standard deviation of 13
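Something like this produces those figures, with census_num being the numeric-only frame from before:

```r
# Standard deviation of each numeric column: a rough measure of spread.
sapply(census_num, sd)  # e.g. age around 13, capital gain around 7,000
```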

• So obviously

• that means that most people are going to be kind of in the middle, and on average

• they're going to be 13 years younger or older. But you can see that things like capital gain have a

• 7,000 standard deviation, which is a huge amount. To give you some idea of what we're aiming for,

• it's very common to standardize this kind of data, so that the standard deviation is 1, right? So

• 7,000 is much too big.
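A sketch of that standardization (scale() is one way to do it, not necessarily the call used in the video): it centers each column on 0 and divides by its standard deviation:

```r
# Standardize (z-score) every numeric column: mean 0, standard deviation 1.
census_std <- as.data.frame(scale(census_num))

sapply(census_std, sd)  # all 1 after standardizing
```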

• Let's plot an example that gives you some idea of what kind of problem these massive ranges cause. So I'm going to plot here a graph of age vs. capital gain, right?

• We know age goes between about one and a hundred and capital gain is much much larger

• So if I run this, basically the figure makes no sense at all, because the capital gain ranges from zero to one hundred

• thousand, and as a few people earn right at the top of the scale, everything is sort of squished down the bottom. We can't see anything

• that's going on. There's no way of telling whether the capital gain of an individual is related to their age

• I mean, it probably is, like, because retired people and people who are very young perhaps earn slightly less,

• but we can't really tell from this figure
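Roughly the plot being described, with the column names assumed:

```r
# Age against capital gain on raw scales: the few very large gains
# squash everything else down near zero.
plot(census$age, census$capital.gain,
     xlab = "Age", ylab = "Capital gain")
```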