People need to learn to use standardized measures for things. Take me, for example. When I drive anywhere, I drive in miles, I drive in miles per hour, and my fuel economy is measured in miles per gallon. But of course I don't pump fuel in gallons, I pump it in litres. When I run anywhere, so short distances, I run in kilometres and in kilometres per hour, so I'm using two different systems there. Any short distances I'm measuring are going to be in metres, not feet. So if I'm measuring around my house for painting, I'm going to measure in square metres so I know how much paint to buy. But if I'm selling or buying a house, I'm going to be looking at the size of the house in square feet again. Who knows why? British people. If I'm baking anything, the weights going into the recipe are in grams or kilograms, but if I'm weighing myself it's going to be in stones and pounds. And of course a ton, for me, would be a metric tonne, not an imperial ton. As I said, I measure fuel in litres, and most of my liquids are measured in litres, except of course for beer and milk, which are in pints.

So this is the kind of problem you're going to be dealing with when you're looking at data. You're trying to transform your data into a usable form. Maybe the data is coming from different sources and none of it goes together. You need standardized units and standardized scales so we can go on and analyze it.

Let's think back. What we're doing is trying to prepare our data into the densest, cleanest format, so that we can apply modelling or machine learning or some kind of statistical test to work out what's going on and draw knowledge from our data. This is going to be an iterative process: we're going to clean the data, we're going to transform the data, and then we're going to reduce the data. Transforming data is what we're going to do today. So let's imagine that you've cleaned your data.
We've got rid of as many missing values as possible, hopefully all of them, and we've deleted instances and attributes that just weren't going to work out for us. Now what we're going to try and do is transform our data so that everything's on the same scale and everything makes sense together. And if we're bringing datasets together from different places, we also need to make sure that the units are the same. There's no point in trying to use machine learning or clustering or any other mechanism to draw knowledge from our data if our data is all wrong.

So today we're going to be looking at census data. Census data is a classic example of the kind of data you might look at in data analysis. It's got lots of different kinds of attributes, things that are going to need cleaning up and transforming. So we're back in R, and we're going to read the census data with read.csv, assigning the result to census. We've downloaded some census data that represents samples from the US population. To begin with we read that in, and you can see that we've got 32,000 observations and 15 attributes, or variables.

So what are these? Let's have a quick look at just a little bit of it so we can see the kind of thing we're looking at. We're going to say head of census, and that's just going to produce the first few rows. You can see we've got age, we've got what working classification that person has, their educational level, a numerical representation of that, whether they're married or not, this kind of thing. So there's a lot of different kinds of data here, and some of it is going to be nominal. Take, for example, the work class: state government, private employee.
That's a nominal value. We might also have ordinal values, or ratio values, or interval values. We're going to have to delve in a little bit closer to find out what these are. What we do to transform this data into a usable format for clustering or machine learning is going to depend on exactly what the types of these columns are and what we want to do with them. So let's look at just a couple of the attributes and see what we can do with them.

We're going to use a process called codification. The idea is that things like random forests or multi-layer perceptrons, you know, neural networks, aren't going to be very amenable to text-based inputs, so what we want to do is replace these attributes with a numerical score. Let's look at, for example, the work class, and also the educational level, so education.

Work class is the kind of class of worker that we're looking at here: for example a state worker, or someone in the private sector, or someone that works in a school, or something like this. This is a nominal value. That means there's no order to this data at all. We can't say that someone in state is higher or lower than someone in private, and we also can't say that state is two times more or less than some other one. That makes no sense at all.

So what can we do? We can replace this with numbers. Let's say we replace private with zero, state with one, self-employed with two, and so on. That's a perfectly reasonable thing to do, but it's still nominal data. What we can't do is then calculate a mean and say, ah, the mean is halfway between private and public. That doesn't make any sense. Just because something has been replaced by a numerical score doesn't mean that it actually represents something that we can quantify in that way.
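The codification of a nominal attribute described above might be sketched like this. This is a minimal illustration in plain Python rather than the R session being described; the category names and the six sample values are made up for the example, and the integer codes are arbitrary labels, not quantities.

```python
from collections import Counter

# Hypothetical work-class values; the real census column has more categories.
workclass = ["Private", "State-gov", "Private", "Self-emp", "Private", "State-gov"]

# Give each distinct category an integer code in order of first appearance.
# The codes are labels only: 2 is not "twice" 1.
codes = {}
for value in workclass:
    codes.setdefault(value, len(codes))

encoded = [codes[value] for value in workclass]

# The mode is still meaningful for nominal data: it is the most common category.
mode_code, mode_count = Counter(encoded).most_common(1)[0]
decode = {code: name for name, code in codes.items()}

print(codes)
print(encoded)
print("most common:", decode[mode_code], "seen", mode_count, "times")
```

Keeping the `decode` mapping around is a deliberate choice in this sketch: it makes it easy to turn a code back into its category, which is a reminder that the number only stands in for a name.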
It's still nominal data. So the best advice I can give is: feel free to codify your data into easy-to-read numbers, but bear in mind that you can calculate the mode, the most common value, but you can't calculate the median and you can't calculate the mean.

Another example would be something like the educational level. Now, this is ordinal data. We could say that someone with an undergraduate degree is maybe slightly higher, in terms of the amount of time they spent in education, than someone with a high school diploma. But we don't know exactly what the distance is. What's the distance between, let's say, high school and a degree, and then a PhD, an MD, and things like this?

We can represent these using numbers, and probably in order. We could say that zero is no education, one is the end of primary school, two is the end of high school, and so on and so forth. But again, it's difficult to calculate distances between these things. We don't know that high school is two times more than primary school and half of a degree, or something like that. That doesn't really make sense. So you might be able to calculate a median on this, or a mode, but you can't calculate a mean. You can't say the average level of education is halfway between high school and undergraduate. That doesn't make any sense either.

So any kind of attribute that is nominal, or possibly ordinal, and is represented using text, we can codify so that it's more amenable to things like decision trees, depending on the library you're using.
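For an ordinal attribute such as education, the same idea can be sketched with codes that respect the ordering, so that a median becomes meaningful. Again this is plain Python under stated assumptions: the list of levels, their order, and the five sample people are hypothetical, not taken from the census file.

```python
import statistics

# Hypothetical education levels, listed from lowest to highest (an assumption).
levels = ["None", "Primary", "High school", "Undergraduate", "PhD"]
rank = {name: position for position, name in enumerate(levels)}

people = ["High school", "Undergraduate", "Primary", "High school", "PhD"]
ranks = sorted(rank[person] for person in people)

# median_low always returns a value that occurs in the data,
# so it can be mapped straight back to a real level.
median_rank = statistics.median_low(ranks)

print("ranks:", ranks)
print("median level:", levels[median_rank])
```

The sketch avoids taking a mean of the ranks on purpose: the gaps between levels are unknown, so an average rank of, say, 2.4 would not correspond to anything.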
But you just have to be careful. All machine learning algorithms will take any number you give them, so you have to be sure that this makes sense to do. What you would do is go through your data and begin to systematically replace appropriate attributes with numerical versions of themselves, remembering all the time that they don't necessarily represent true numbers in a ratio or interval sense. So for any text-based value, we're going to start by replacing it, possibly, with a numerical score.

What about the numerical values? Well, they might be okay, but the issue is going to be one of scale. You might find, for example, in this census data that one of the dimensions, one of the attributes, is much, much larger than another one. This dataset has hours per week, which is obviously going to be somewhere between naught and maybe 60 or 70 hours for someone that's got a very strong work ethic, and it has salary, or income, or some other measure of monetary gain. Hours per week is going to be in the tens, and salary could be in the tens of thousands, maybe even the hundreds of thousands. Those scales are not even close to being the same.

That means if you're doing clustering or machine learning on this kind of data, you're going to find that the salary is overbearing everything. It's going to be very easy for your clustering to find differences in salary, and much harder for it to spot differences in hours, because they're so small in comparison.
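To see why the larger attribute swamps the smaller one, here is a small sketch using Euclidean distance, which is what many clustering methods rely on. The three people and their numbers are invented for illustration, and salary is treated as a plain number here, which is an assumption made for the example.

```python
import math

# Hypothetical people as (hours_per_week, salary); illustrative values only.
a = (20, 30000)
b = (60, 30500)   # works three times the hours of a, earns 500 more
c = (20, 30600)   # works the same hours as a, earns 600 more

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

d_ab = euclidean(a, b)
d_ac = euclidean(a, c)

# Share of the squared a-to-b distance that comes from the hours attribute.
hours_share = (a[0] - b[0]) ** 2 / d_ab ** 2

print("a to b:", round(d_ab, 1))
print("a to c:", round(d_ac, 1))
print("hours share of a-to-b distance:", round(hours_share, 4))
```

On these unscaled numbers, a ends up closer to b than to c, even though a and c work identical hours: the 40-hour gap contributes less than one percent of the squared distance.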
So we need to start to bring everything onto the same scale. The more attributes you have, which is another way of saying the more dimensions you have to your data, the further everything is going to be spread around. If we can scale all of these values to between, let's say, around 0 and 1, then everything gets more tightly controlled in the middle, and it gets much easier to do clustering or machine learning or any kind of analysis we want.

So let's look back at our data and see what we can do to scale some of this into the right range. We're going to look at the head of our data again. Our numerical values are things like the capital gain and the capital loss, which I guess is presumably how much money they've made or lost that year, on some scale, and then things like the hours per week that they work, and their salary, which in this case is greater than or less than 50,000.

Let's have a quick look at the range of values we're dealing with, so we can see if scaling is even necessary. Maybe we got lucky and the person did it before they sent us the data. We're going to apply a function across all the columns and calculate the range of the data. So this is going to be apply on the census data, with dimension 2, so that's all of our columns, and we're going to use the range function. This tells us, for example, that age ranges from 17 to 90 and the educational level from 1 to 16. It gives you the range for nominal values as well, but those don't really make any sense. I mean, work class ranges from question mark to without pay, which is meaningless. Then capital gain ranges from zero to nearly one hundred thousand, capital loss from zero to four thousand, and finally hours per week ranges from 1 to 99. So you can see that capital gain is several orders of magnitude larger in scale than hours per week. We're going to need to scale this data.

We'll begin, to make our lives a little bit easier, by focusing on just the numerical attributes, so we don't have to worry about the nominal values, which we've not codified yet. We're going to select all the columns from the data that are numeric. So that's this line here. We use sapply, which applies is.numeric over each of the fields, and that gives us a logical list that says true or false depending on whether each column is numeric. Then we select from this list any that are true and find their names. So those are the names of the numeric columns.

Let's have a look at just the range of these attributes. I'm going to run this line, and this is a simplified version of what I was just showing you. You can see that capital gain is massive compared to the hours per week, for example.

Let's have a look at the standard deviation. Recall that the standard deviation is roughly the average distance from the mean, so it gives us an idea of the spread of some data. Is it very tight, with everyone earning roughly the same, or is it very spread out with huge deviations? The answer is that there are pretty huge deviations. Age has a standard deviation of 13, which means that most people are going to be somewhere in the middle, and on average they're going to be 13 years younger or older than the mean. But you can see that things like capital gain have a standard deviation of 7,000, which is a huge amount. To give you some idea of what we're aiming for, it's very common to standardize this kind of data so that the standard deviation is 1. So 7,000 is much too big.

Let's plot an example that gives you some idea of the problem when we have these massive ranges. I'm going to plot a graph of age versus capital gain.
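The range and standard deviation checks above, and the two kinds of rescaling they motivate, can be sketched as follows. This is plain Python on a tiny invented sample rather than the R calls in the session, so the column values and the helper names `min_max` and `standardize` are assumptions for illustration only.

```python
import statistics

# Hypothetical mini-sample; the real data has about 32,000 rows.
columns = {
    "age": [25, 38, 52, 41, 19],
    "hours_per_week": [40, 50, 35, 60, 20],
    "capital_gain": [0, 0, 15000, 99999, 0],
}

def min_max(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to mean 0 and rescale to standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

# Inspect range and spread before deciding whether scaling is needed.
for name, values in columns.items():
    spread = round(statistics.stdev(values), 1)
    print(name, "range:", (min(values), max(values)), "sd:", spread)

scaled = {name: min_max(values) for name, values in columns.items()}
z_scores = {name: standardize(values) for name, values in columns.items()}
```

Min-max scaling matches the "between around 0 and 1" goal, while standardizing matches the "standard deviation is 1" goal; which one suits a given analysis is a judgment call that this sketch does not settle.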
We know age goes between about one and a hundred, and capital gain is much, much larger. So if I run this, the figure basically makes no sense at all. Because capital gain ranges from zero to one hundred thousand, and there are a few people earning right at the top of the scale, everything else is squished down at the bottom. We can't see anything that's going on. There's no way of telling whether the capital gain of an individual is related to their age. I mean, it probably is, because retired people and people who are very young perhaps earn slightly less. We can't really