What is data? I'm pretty sure that's data. Is this data, this picture? Is that data? What is data?
So we talked a lot about data in the last video, and why it's important that we can analyze and understand data. But what is data? Everybody has data, everybody's generating it. Companies are generating it on us, and we're generating it ourselves when we use social media and so on. But what is it? Understanding what it is is a prerequisite for being able to use it properly.
Perhaps the most important thing as far as we're concerned, as people who are trying to analyze data scientifically, is that the data has to be measurable. So the idea is, if you're going to do a survey on what people like, everyone's got to be using the same scale and the same rating system. Otherwise, it doesn't make any sense. We can't have someone rating things from one to five and someone else saying "I thought it was good", because which one of one to five is "good"? We don't know. So everyone has got to be doing the same thing: your data's got to be in a consistent format, and once that's achieved, at least we're a little bit closer to being able to make some sense of it.

Broadly speaking, when we talk about data we have four different types, and we summarize this with this nice word, NOIR: N, O, I, R. And with each of these different types of data we can do different things.
So N, that's the first type: this is nominal data. Nominal data is where we have no distance between the values that we can measure, because they're not really quantities, and we can't order them. A good example would be colors. So maybe your favorite color is red and my favorite color is blue. I don't know which is better than the other. There is no measurement between them. Is blue closer to green than it is to red? That doesn't make any sense. We're not talking about wavelengths, we're just talking about the colors.
Another good example would be, let's say in football, the player numbers on your back. Now symbolically, sometimes certain player numbers have a meaning, but you can't compare and contrast them. You can't say that 8 is two times better than 4. That doesn't make any sense. You also can't really order them in general: player 16 doesn't go before or after player 13 in a list. That doesn't make any sense either.
So nominal data is useful, it could be really important, but it's data where we have labels and no way of ordering those labels. You can still analyze it, but you can't, for example, calculate the average, the mean. That wouldn't make any sense. What you can do is calculate the mode, the most common one. So you could say that more people prefer red to blue, but you couldn't say that the average color people like is a sort of muddy brown. That doesn't make any sense at all.
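As a minimal sketch of that point, assuming Python rather than the R used later in the video: the colour answers below are made up for illustration, and `statistics.mode` simply picks the most frequent label.

```python
from statistics import mode

# Hypothetical survey answers: labels only, with no order and no distance.
favourite_colours = ["red", "blue", "red", "green", "red", "blue"]

# The mode is the most common label, which is well defined for nominal data.
most_common_colour = mode(favourite_colours)
print(most_common_colour)

# There is deliberately no mean here: the labels are not numbers,
# so there is nothing meaningful to average.
```

The same call works on any list of labels, such as shirt numbers, because it only ever asks whether two values are equal.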
So as we go down this list, the types of data get slightly more informative in some sense. The next one is ordinal. In ordinal data we have an order, but we can't measure distances between things. A good example would be the positions people finished in a race. So maybe I finished first, I'm super quick, and you finished third. But how far apart we are isn't included in that kind of data.
You'd have to have a separate value for that. Another example that we're all familiar with is rating systems. So perhaps I rate a film from one to five stars and you rate the film from one to five stars, but you can't really say that a film that's got four stars is two times better than one that scored two, because that's very subjective and there's no real measurable distance between these stars. If you have ordinal data, you can calculate the mode again, the most common value of all the values that were returned, or you can calculate the median, the one that sits in the middle. So with fifty runners in a race, the 25th position, roughly speaking, is going to be around the median.
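A short sketch of the mode and median on ordinal values, again assuming Python. The star ratings are invented; `statistics.median_low` is used because it always returns one of the actual values rather than averaging two of them, which suits data where distances are not meaningful.

```python
from statistics import median_low, mode

# Hypothetical star ratings (1 to 5) that seven people gave one film.
star_ratings = [4, 2, 5, 4, 3, 4, 1]

most_common_rating = mode(star_ratings)    # the mode still works for ordinal data
middle_rating = median_low(star_ratings)   # the median only needs an ordering

# Finishing positions in a 50-runner race: 1st, 2nd, ..., 50th.
positions = list(range(1, 51))
middle_position = median_low(positions)

print(most_common_rating, middle_rating, middle_position)
```

With an even number of runners there are two middle positions (25th and 26th); `median_low` reports the lower one.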
So it's still not hugely useful. Next up, we have interval data. With interval data we have an order and we have a distance, but we have no absolute zero for the scale. A good example would be degrees Celsius or degrees Fahrenheit. Zero degrees Celsius isn't no temperature, it's a specific temperature. So we can't say that fifty degrees is half of a hundred degrees. The number is half, but it doesn't really make sense. We can say that a hundred degrees is hotter than fifty, which is hotter than zero.
So this is interval data. Interval data lets us do a few more things than we could with ordinal: as well as being able to calculate the mode and median, we can now calculate the mean temperature. That's okay. And we could also calculate things like the range, the minimum and maximum temperatures for a certain window.
So that's pretty useful. Another good example of interval data would be pH level. Again, a pH of zero means very acidic. It doesn't mean there is no acidity at all, or no pH at all. We can say that a pH of 13 is higher than a pH of 7, which is higher than a pH of 3, and we know how far apart these numbers are, but we can't necessarily say that one is double another.
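One way to see why means and ranges are fine on an interval scale while ratios are not is to change the units. The sketch below assumes Python and three illustrative readings (0, 50 and 100 degrees Celsius, the values mentioned above); the helper name `celsius_to_fahrenheit` is just for this example.

```python
from statistics import mean

def celsius_to_fahrenheit(celsius):
    """Standard conversion: scale by 9/5, then shift by 32."""
    return celsius * 9 / 5 + 32

# Illustrative readings in degrees Celsius.
readings_c = [0.0, 50.0, 100.0]
readings_f = [celsius_to_fahrenheit(c) for c in readings_c]

# Mean and range are meaningful on an interval scale.
mean_c = mean(readings_c)
range_c = max(readings_c) - min(readings_c)
range_f = max(readings_f) - min(readings_f)   # same spread, just rescaled by 9/5

# Ratios are not: "100 is double 50" only holds in Celsius.
ratio_c = readings_c[2] / readings_c[1]       # 100 / 50
ratio_f = readings_f[2] / readings_f[1]       # 212 / 122

print(mean_c, range_c, range_f, ratio_c, round(ratio_f, 3))
```

Because the zero point is arbitrary, the "double" relationship disappears as soon as the same temperatures are written in Fahrenheit.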
So the final kind of data we're going to look at is ratio data. This is exactly like interval, except we now have a true zero value. A good example of this would be temperature in kelvin. Kelvin has an absolute zero, which is the absolute absence of any kind of heat, and it goes upwards from there. So we can say that in terms of kelvin, a hundred is half of 200 and so on, and we can get to zero. Another example would be number of children. Zero children means the absence of any children, and you can also say that four children is double two children. And too many to look after, in my opinion. So that is an example of ratio data.
Now, ratio data is quite similar to interval in terms of what you can calculate, but it allows some more complicated statistical measures, such as a t-test.
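To illustrate what the true zero buys you, here is a small Python sketch. The Rankine scale is not mentioned in the video; it is brought in here only as an assumption-free second absolute scale (same zero as kelvin, different step size) to show that a ratio survives a change of units.

```python
def kelvin_to_rankine(kelvin):
    """Rankine shares kelvin's absolute zero, so conversion is a pure scaling."""
    return kelvin * 9 / 5

cool_k, warm_k = 100.0, 200.0

# "A hundred is half of 200" holds in kelvin...
ratio_k = cool_k / warm_k

# ...and still holds after converting both values to another absolute scale.
ratio_r = kelvin_to_rankine(cool_k) / kelvin_to_rankine(warm_k)

# Counts behave the same way: four children is double two children.
ratio_children = 4 / 2

print(ratio_k, ratio_r, ratio_children)
```

Compare this with Celsius and Fahrenheit, where the shifted zero means the same two temperatures give different ratios in each unit.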
So these are the types of data. Now, it's actually quite important how you structure your data in general. We can't just have it sitting in some massive spreadsheet with no thought given to where everything is. There's actually a pretty standard way of doing this that we're going to look at. Data comes in lots of forms: different types of measurements, different experiments, and people are going to collect it in different ways. But there's a very standard way that we use to represent data once it's on a computer, so we can have some kind of table of our data.
We almost always represent our data in a matrix like this, a two-dimensional table, because it's much easier to work with. Along the top we're going to have our attributes, which are the things we've been measuring. So an example would be, maybe we're collecting data on people, so we could have name, which would be some nominal data, and then age, height. So the columns are attributes, all the things we've been measuring. The rows are the instances, or the samples, we've got, so that's all the individual people. So here's person 1, person 2, person 3, and person 3 is called John, and they're 54 and 5 foot 11 or whatever, and so on. And you can have as many rows as you want.
So when we talk about attributes, we're talking about the columns. People use lots of different terms for these. I like to think of them as features; attributes is another one. And we have instances, or samples, down the rows. Now, quite often in the very last column of your data, sometimes separated out but that's not really important, we'll have our output. Maybe we're trying to make a decision based on these people. Maybe these are candidates for a football team and we're saying, are they going to be on the team or not? So this is yes or no. John's made it: yes. No, no, and so on. That way we could perhaps analyze our decision-making process and ask whether there's any aspect of these things that informs it, as an example. Now, we always structure data in this way, because if we don't it becomes a huge problem: you end up spending all this time formatting and trying to work out what's what, and why is John listed down the table and not across the table? Nothing makes any sense anymore.
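The layout described above can be sketched in Python as a list of rows under a list of column names. Only John's values come from the example; the other two rows, and the column names `height` and `on_team`, are placeholders invented for illustration.

```python
# Columns are the attributes (features); the last one is the output.
attributes = ["name", "age", "height", "on_team"]

# Each row is one instance (sample). Rows 1 and 2 are made-up placeholders.
people = [
    ["Person 1", 23, "6 ft 0 in", "no"],
    ["Person 2", 31, "5 ft 8 in", "no"],
    ["John", 54, "5 ft 11 in", "yes"],
]

n_instances = len(people)        # number of rows
n_attributes = len(attributes)   # number of columns

# One instance: pair each attribute name with that row's value.
john = dict(zip(attributes, people[2]))

# One attribute: take the same column from every row.
age_column = attributes.index("age")
ages = [row[age_column] for row in people]

print(n_instances, n_attributes, john, ages)
```

Keeping instances down the rows and attributes across the columns is what makes both of these lookups a one-liner.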
So let's look at an actual data set and we'll see all this in action. We have here a data set of whether someone goes to play tennis, and whether or not they go is going to depend a little bit on what the weather conditions are. We don't like to play, for example, when it's too hot. The tennis data set has just the same structure as the data set we looked at already. We're going to load it into R. It's held in a CSV file, so: tennis, read.csv, tennis. Now, we're using R for this because it's free and it has a load of decent functions for analyzing, examining, and visualizing data, so we're going to be using it throughout these videos. Obviously you could use MATLAB or Python or some other library if you wanted to. I think you should use whatever you're most comfortable with.
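For anyone following along in Python instead of R, a rough equivalent of the loading step might look like the sketch below. It uses only the standard library's `csv` module. The file contents are a three-row stand-in, not the real data set: the column names and most values are assumptions, and only the outlook values and the day 3 row follow what is read out in the video.

```python
import csv
import io

# Stand-in for the CSV file on disk. Column names and the values for
# days 1 and 2 (apart from "sunny") are placeholders, not the real data.
TENNIS_CSV = """day,outlook,temp,humidity,wind,play
1,sunny,25,high,no,no
2,sunny,28,high,yes,no
3,overcast,5,high,no,yes
"""

def load_table(text):
    """Parse CSV text into a list of row dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

tennis = load_table(TENNIS_CSV)
column_names = list(tennis[0].keys())

# Note: csv.DictReader returns every value as a string, including numbers.
print(len(tennis), "rows x", len(column_names), "columns")
```

To read a real file, the same `csv.DictReader` can be given an open file object instead of the `io.StringIO` wrapper.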
Looking at these rows and tables, it looks a lot like something like Microsoft Excel. You could do this data analysis in Excel?

Some people would disagree, but no, Excel is perfectly good for what it does, and you could do data analysis in it. I think that Excel doesn't enforce anything to do with observations versus variables and things like that. These are distinctions that are not really made in Excel. Obviously if you enforce those rules yourself that's going to work, but you have to be a little bit more regimented and rule-based about it. I think the consensus would be that if you really want to get into data analysis and start doing things like principal component analysis or more advanced statistical measures, something like R or Python is going to help a lot more.
Okay, so I've loaded the data set, and if we look at the top few rows of the data, you'll see that there are six different variables, or six attributes, and this data set has 14 instances, or observations. R calls them observations. So what we're saying is we have six columns and fourteen rows in our data set, and this data set is structured exactly like the people data set that I was looking at a minute ago.
We can examine a single instance. We can ask, what is it about day three? So let's have a look at day three: tennis on day three. And we can say that on day three it was overcast, the temperature was only five degrees, the humidity was high, and there wasn't any wind, so they decided to play tennis. It's a bit chilly, but I guess they gave it a go. We could also look at all the different temperatures, for example, or all the different forecasts: tennis$outlook. And we can look at all the outlooks in the data set, so we've got sunny, sunny, overcast, rainy, rainy, rainy, and so on, and we can get a feel for what kind of weather we're looking at here as well. Using something like R, you can examine the instances, you can examine the individual attributes, you can group them together or not as you see fit, and then you can start to drill into what this data set means.
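The three operations just described (look at one instance, look at one attribute, group by an attribute) can be sketched in Python as follows. The rows are placeholders: only day 3 and the three outlook values follow the walkthrough, and the column names are assumptions.

```python
from collections import defaultdict

# Placeholder rows; only day 3 and the outlook values come from the video.
tennis = [
    {"day": 1, "outlook": "sunny", "temp": 25, "humidity": "high", "wind": "no", "play": "no"},
    {"day": 2, "outlook": "sunny", "temp": 28, "humidity": "high", "wind": "yes", "play": "no"},
    {"day": 3, "outlook": "overcast", "temp": 5, "humidity": "high", "wind": "no", "play": "yes"},
]

# A single instance: the row for day three (index 2, counting from zero).
day_three = tennis[2]

# A single attribute: every outlook value, roughly what tennis$outlook gives in R.
outlooks = [row["outlook"] for row in tennis]

# Grouping: which days fall under each outlook.
days_by_outlook = defaultdict(list)
for row in tennis:
    days_by_outlook[row["outlook"]].append(row["day"])

print(day_three)
print(outlooks)
print(dict(days_by_outlook))
```

Each lookup relies on the table layout from earlier: one dictionary per instance, one key per attribute.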
Now, this data set has in it the final column, which is whether they actually played, so you could use something like machine learning to predict that final column based on the other columns. That's something you could do. One other thing about this data set that's quite interesting is that it has a few examples of the different kinds of data we were looking at earlier. So remember, we have nominal, ordinal, interval, and ratio. For example, outlook is really a nominal field, a nominal data type. You could perhaps suggest that you could order it from rainy through to sunny, but then with cloudy, overcast, it doesn't really make any sense. So this is kind of nominal. You could calculate, for example, the mode, and say that most of the days were rainy, or something like this.
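A sketch of that mode calculation in Python, using only the first six outlook values read out earlier (the remaining eight of the fourteen rows are not listed, so the result for the full data set may differ).

```python
from collections import Counter

# The first six outlook values as read out; the other eight are not given.
outlooks = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy"]

# Count each label, then take the most frequent one and its count.
outlook_counts = Counter(outlooks)
modal_outlook, modal_count = outlook_counts.most_common(1)[0]

print(modal_outlook, modal_count)
```

Counting labels is all that is needed, which is why the mode is available even for a nominal field like outlook.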