Placeholder Image

Subtitles section Play video

  • Well, we're looking at chocolate datasets today, so I thought I'd bring some research I'm

  • Yeah, good and definitely relevant

  • We've been looking at techniques like data visualization to try and explore our data and start to draw some initial

  • You know conclusions or hypotheses

  • we're going to start to move towards kind of modeling our data and actually trying to extract proper knowledge from this data because remember just

  • Because we've got data doesn't mean we've got knowledge. Now. This is going to be a kind of iterative process

  • We're going to need to clean up our data to make it as useful as possible

  • we need to transform it so that we can combine datasets together and

  • Statistically measure our datasets and then we're going to need to reduce it

  • sometimes if our data set is too big and unmanageable and this combination of cleaning data and

  • Transforming data and reducing data is a kind of cycle where we're going to iterate this until our data set

  • It's a smallest most useful form

  • It can be so if we've got redundant variables which are basically the same as others or we've got duplicates

  • These are all problems that we need to sort out because if we're going to be analyzing data with these kinds of issues

  • We're just making our life slightly more difficult

  • It's computationally inefficient and you know in the worst case we could draw the wrong conclusions

  • you might be surprised and disappointed when you get your first data set that actually it's not quite as nice as you were hoping and

  • It's gonna need some cleaning up. Maybe there's missing values

  • Maybe there's outliers that need to be dealt with because this yeah

  • they're warping your distributions and your and your medians and means

  • Right and perhaps you also got noise in your data set right these few things we can start to address with cleaning

  • So cleaning data is essentially the idea of trying to correct or fill in any missing

  • Values or remove those bits completely you might be surprised but there's missing data at all

  • I mean Oh, what are we not paying attention?

  • Like we've got one job and that was to collect the data and we seem to have got missing data

  • But actually it's quite common because I mean, for example if you're if you're trying to track patient records over time

  • Maybe we didn't show up to an appointment or maybe you in a hospital when they weren't there when they needed to have their temperature

  • Taken, right and then your trend line of your temperature over time is going to be missing some values

  • maybe you've got data going back for hundreds of years and they didn't have certain techniques and certain measurement systems back then so they only

  • Have other kinds of data so missing data is we common rye and we're gonna have to be able to deal with it

  • So the data set we're looking at today is a kind of ratings for chocolate bars

  • This is why ate my chocolate or at least that's why I'm telling myself

  • So we'll read the data in we've got ten different variables

  • We've got about 1,700 nearly observations and let's have a quick look using the summary

  • So we've got things like the company would produce for chocolate the name of the chocolate

  • Reviews the cocoa percentage the type of being this kind of information

  • Right and you can imagine what you might do if you were trying to produce better chocolate

  • It's a look at a huge amount of this kind of data and work out what it is that customers like and what it is

  • They don't like but this is going to be quite common in

  • market research

  • So the first thing we're going to do right we've we've received this data

  • We know now what the columns are, but we don't really know anything else other than this

  • So we're going to have to start looking through and seeing first of all, is there any missing data?

  • So we're going to use the S supply function for this the s supply function will apply a function

  • Over our data set so for each column or each attribute of our data

  • We're going to apply this and the function we're going to use we're lighting ourselves

  • so it's going to be the sum for any time where our item is either blank or

  • Na now blank means an empty string and na means literally not applicable

  • Right, which is something that comes up in data from time to time

  • Alright, so in any case both of these are missing values

  • And so we're going to treat them both the same

  • So if we apply this to our chocolate data set then we're going to see that for example, there are eight missing names

  • There are seven missing review dates and there are four missing Coco percentages

  • So for each row in our data, there are four rows where the Coco percent is missing right now

  • That's not too bad for I mean, this is as data set of nearly 1,700 items for it's not too bad

  • That's quite quite expected

  • You might imagine that if you're pooling this data from lots of different sources

  • People are going to forget to add datum in or they weren't able to record data on that day

  • There's a huge amount of reasons

  • Why you might have missing data and now it starts to become a little bit of a problem when we look at things like bean

  • Type because bean type has got 1200 missing values

  • That's a huge proportion of a data set. And in that case we might have to do something about this

  • So the only issue we've got is that 1200 is not relish. It's just an amount of rows

  • It's not relative to the of a data set. So we're going to use the exact same function any empty rows

  • but this time we're going to divide by the total number of rows so we can get a

  • percentage for how much of a missing data we've got so we can see for example

  • But company name has zero missing data. Whereas beam type has 74 percent missing data

  • So that's a huge problem

  • Now a kind of general rule of thumb is if you've got over half your data is missing

  • It's going to be quite hard to estimate or guess what that data is going to be

  • That's when you might want to start thinking about removing it. So what we want to do is we want to extract

  • Any of the names of any of our attributes that have this sort of over let's say 60% missing

  • All right

  • So we're going to start by calculating always percentages and saving them in a variable and then we're going to select only those percentages

  • Where the value is over point 6 right 60 percent?

  • so we're gonna say any attribute where the attribute is over point 6 and that is just being typed at point 7 for right or

  • 74% so we can now delete beam type ourselves so we could say something like chaco all the rows

  • for being type is

  • Null and by setting that to nah, that's just going to delete that column

  • We can also do it automatically so we could actually pass in those attributes that we just calculated as a parameter

  • So that would be this line here

  • So be something like chaco all rows that's here the names of any attributes where the percentage

  • Missing is greater than 0.6. And that's going to just delete being type right? It's not a lot we can do about being type

  • We've only got 25% issue of a data. It's not enough to start guessing

  • What beam types are going to be in other chocolate bars?

  • Let's have a look at now our rows of data voters for each instance

  • They're going to have a number of attributes now

  • There's 9 left and we want really to keep the instances that have the majority of their data, right?

  • So we're gonna apply my so this is going to be row wise to dimension 1

  • So that's the rows we're gonna count any of it a blank or n/a?

  • For each row over our data set and we're going to put this into missing

  • so what it is going to do is return a

  • List of values for every single row but tells you how many missing items are there in that rug so we can now look

  • at the first few missing

  • items, so we're going to order them right by larges first and then we're going to show just the first few and you can see

  • But actually some of them are missing seven and six attributes. That's quite a serious situation

  • Because it was only nine lights only only got a couple of entries in their fields

  • Now let's do this again as a percentage of the number of attributes

  • So this is exactly the same thing but this time we're dividing by the number of columns

  • which is nine and we're going to have a look at the top of these and so you can see that we're missing 77% of

  • Some of these initial attributes that's a real problem

  • Missing is the same length as the number of rows we've got so we can actually look up any

  • Rows where there's a greater percentage of missing values and we want and just remove them from the data set

  • So what we're going to do that is a bit like this. So there's a choco is

  • Choco, anywhere. We're

  • Missing is less than or point seven and then all the columns and what that's going to do is select only the rows we want

  • Where they've got a nice amount of data

  • So machaca dataset is going to be a little bit smaller now, but much more useful to us

  • We don't really want to be trying to do things like machine learning or statistics when 70% of some of the data is missing

  • Right, that isn't going to be a good idea. So it's quite easy just to delete data, right?

  • I mean in some sense, it's just more convenient to do that in general

  • The rule is that if you've got more than 50 or 60% missing data, it's a good idea to delete it, right?

  • Delete are the instances or the attributes depending on how much data you've got missing and where if you're missing a huge amount of data

  • Then you're not going to be able to recreate it by let's say using an average, right?

  • We've got so little later that an average isn't going to be reliable if we have got sufficient data that we could maybe start to

  • Infer what these missing values might be we can start to try and replace them instead of deleting them

  • So what we might do is we might for example set them all to zero

  • Maybe if an attribute is missing we can say well, okay

  • If it's missing, we'll just not use it and we'll say it's zero now whether you do that

  • It's going to depend on what the attribute is. Something zero is not a useful property

  • Right and we'll look at that in the chocolate dataset in a moment

  • What we might also do is we might start to add the data set mean into those attributes

  • So maybe we don't know what the rating for this chocolate bar is but we can guess but it's going to be around the average

  • Rating for any chocolate bar again. This is going to depend on your situation, right? You're still making up datum in some sense

  • You've got to be very careful about what you do here. So

  • We've deleted as much of our chakra data as we feel comfortable doing now. Let's see if we can

  • Fill in some of missing values with appropriate replacements. So let's have a look at our attributes

  • Alright, so we've got company name reference things like this being tab has been removed

  • But we still got things like the bean origin and the ratings and there's a few of these missing from our data set

  • Can we estimate these rather than completely removing them from the data set?

  • obviously the less data you use for less useful things like machine learning are going to be so let's look at an

  • Attribute and see what we can do

  • So if we look at bar price, I bats the price of each chocolate bar

  • We can see that there's a few missing values somewhere around 3% That's something we want to deal with

  • But we've got enough data, you know 97% Maybe we can start to guess what the prices of these chocolate bars might be

  • Now this is a good instance of a time when you wouldn't want to just populate with zeros, right? No chocolate bar is free

  • I wish and so what we need to do is produce a reliable

  • Value to represent an unknown price rather than just setting them all to be zero

  • so what we could do here is something like this we could set every missing bar price to be the average of all the

  • Chocolate bar prices and that way at least we're not warping our distribution up or down

  • We're keeping it exactly the same. We're gonna say for the chocolate data set for any row

  • Where bar price is n/a and for all columns?

  • We're going to set the bar price to be

  • The mean of all the bar prices and we're gonna obviously remove any na s from that calculation of what it's not going to work

  • And that's already worked. So now if we recalculate our missing values, you'll see that bar price now has zero missing values

  • So we've fixed that problem great. So that was quite an easy one

  • Right bar pie seems to me to be quite an intuitive time when you would just calculate an average and put it in right now

  • actually, maybe not because

  • You know bar price might depend on where in the world we're selling it or you know, what company is producing the chocolate bar

  • So could we do the same thing for rating if we look if we take the sum of all the NA values in rating?

  • It's eight like so there are eight chocolate bars for which there is no rating

  • So what we can do is we could do something called a stratified replacement

  • We could say well, let's group our chocolate bars by country or by company calculate those averages and then we can specifically fill in