  • Let's talk about data visualization, so that we can avoid problems like this, where we've got some kind of graph and who knows what it means?

  • Loads and loads of lines, none of them labeled. I think the thick one is more important. That's what I've learned from this.

  • Data visualization is another method we can use, along with statistics, to have a look at our data, explore our data and try and work out what's going on.

  • It's a way of trying to understand our data better so that we can then perform more rigorous statistical tests, or actually start to draw conclusions or model our data.

  • It's a very important tool, but you've got to use it properly.

  • You can't just plot anything and everything.

  • Every chart you use has got to support your hypothesis, or it's got to try and show the story you're trying to tell, right? You don't just plot something because it could be plotted. There's got to be a point to it.

  • There's a lot of problems with using inappropriate graphs and only picking subsets of your data. That's a huge problem, right?

  • That is not just a problem for data visualization. That's a problem for your statistical tests as well.

  • If you're only using some of your data, is that okay? It's going to depend on the situation, right?

  • But I think there's a strong argument for saying you've got to be really, really careful, and you've got to be really structured and regimented, and document everything you do.

  • The core problem with visualization is that people just plot stuff and they do it badly.

  • Maybe they use an inappropriate plot type, or they don't scale the axes properly, and that leads to huge misunderstandings and can actually be quite misleading, right?

  • This happens a lot in the media.

  • So, for example, you might get a sort of political message through your door that says these are the different parties.

  • So this is party one, this is party two, this is party three, and maybe, you know, party one's got this many votes, party two's got this many votes, and party three is right down here.

  • And party two are trying to make the case that just a few more votes and they're going to win in this area.

  • But actually, written down here, this is twenty thousand, this is ten thousand and this is, you know, eight thousand, just in the small labeling they've got here.

  • They've completely skewed the axis, right? Ten thousand is half of twenty thousand, yet here we are, up here.

  • If you misuse plots, it's actually misleading. When it's on your own data, you're going to draw the wrong conclusions and then spend quite a while researching into an area that doesn't make sense or ends up in failure.

  • Or, if it's something you're presenting to someone else, you can mislead that person, whether intentionally or by accident.
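
One way to put a number on the leaflet example is to compare how tall each bar is drawn with how large the value behind it really is. The sketch below is an editorial illustration in Python (the talk itself works in R); the vote counts are the ones quoted in the talk, but the drawn bar heights in `drawn_mm` are invented for illustration, since the leaflet's real proportions are not given.

```python
# Vote counts as quoted in the talk.
votes = {"Party 1": 20_000, "Party 2": 10_000, "Party 3": 8_000}

# ASSUMPTION: hypothetical bar heights in millimetres, as a skewed leaflet
# might draw them. These numbers are made up for illustration.
drawn_mm = {"Party 1": 50.0, "Party 2": 45.0, "Party 3": 10.0}


def relative_to(values, reference):
    """Express every value as a fraction of the reference entry."""
    ref = values[reference]
    return {name: value / ref for name, value in values.items()}


true_share = relative_to(votes, "Party 1")
drawn_share = relative_to(drawn_mm, "Party 1")

for name in votes:
    print(f"{name}: data {true_share[name]:.2f}, drawn {drawn_share[name]:.2f}")
```

If the two columns disagree, the chart is showing something the data does not support. Here Party 2 has half of Party 1's votes but is drawn at nine tenths of its height.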

  • And that's never a good thing. I'm back in R, and I just wanted to show a couple of plots where, you know, it's not misleading necessarily, but you can easily infer the wrong kind of information.

  • So there are these websites online you can go to to look at the ratings for different TV shows, right? Now, one of my favorite TV shows is Frasier. I think it's amazing.

  • If you go on to these sites and you plot the ratings for all these Frasier episodes, it's all over the place. Sometimes it's very highly regarded and sometimes it's not.

  • So I'm just going to plot this using the ggplot tool, and we can see, if we look at the graph, that it's absolutely everywhere, right? You've got good episodes, you've got bad episodes, and it seems to maybe be going slightly downhill towards the end.

  • But it's difficult to say, right, because it's all over the place.

  • Now, what's actually happened is I've just plotted using a default function and it's auto-scaled my rating axis, right?

  • So my y-axis is the rating of the episodes, and it's going between seven and about nine and a half. Now, that isn't representative, because it's spreading out my data.

  • If I plot the exact same data, but this time from naught to ten, like an actual rating system, you can see that most episodes get almost the exact same rating, somewhere between around seven and a half and eight, which I think's pretty good.

  • I would rate them a 10, but, you know, it's just me.

  • You can see that if you're not careful, even if you do it by accident, auto-scaling of axes and things like this can cause a real problem.

  • Another classic example you'll see in the news is when they show something like a currency exchange rate.

  • So if we look at here, I've downloaded some sample data of the Japanese yen versus the US dollar, and I've simplified this by extracting just a period of about 60 days in the middle of some time. I can't remember exactly what it is.

  • If we plot this, you can see that there's a big sort of cliff edge. Something terrible has happened around day 30 and the value of the Japanese yen is just plummeting.

  • And of course, this is absolute nonsense, right? Because this scale goes between 108 and 114.

  • And so if we plot it with proper axes, you can see that actually it's almost completely flat.
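
The effect of the axis limits can be worked out with simple arithmetic. The following is a minimal Python sketch, not the R code used in the talk. It assumes the series moves across the whole 108 to 114 range mentioned above (the real data is not reproduced here), and the helper name `fraction_of_axis` is hypothetical.

```python
def fraction_of_axis(low, high, axis_min, axis_max):
    """Share of the vertical axis taken up by a move between low and high."""
    return (high - low) / (axis_max - axis_min)


# ASSUMPTION: the exchange rate spans the full auto-scaled range from the talk.
rate_low, rate_high = 108.0, 114.0

# Auto-scaled axis: the limits hug the data, so the move fills the whole plot.
auto_scaled = fraction_of_axis(rate_low, rate_high, 108.0, 114.0)

# Axis starting at zero: the same move is a small slice of the plot height.
zero_based = fraction_of_axis(rate_low, rate_high, 0.0, 114.0)

print(f"auto-scaled axis: {auto_scaled:.0%} of the plot height")
print(f"zero-based axis:  {zero_based:.1%} of the plot height")
```

The same six-yen move fills the entire plot on the auto-scaled axis and only about a twentieth of it when the axis starts at zero, which is why one plot looks like a cliff edge and the other looks flat.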

  • If your business relies on the exchange rate of the Japanese yen to the US dollar, obviously these small changes might be important, right? But if you're presenting this in the news, it's very easy to claim that something terrible's happened when in fact, maybe, this is just a normal blip up and down.

  • So you can misuse plots to serve your purpose, or you can do it accidentally and waste a huge amount of time.

  • Let's have a look at the standard plots you might see and could use, on a very basic level, and see, you know, what they are appropriate for, because one of the most important things is that you use these plots and these charts appropriately.

  • So perhaps the most common one that everyone sees is going to be a bar chart.

  • You've got two axes. You've got some kind of attributes or labels down here, and then you've got some quantity or amount of some attribute here, and then you're going to have different bars like this.

  • Now, this is a very nice graph to use. It's simple but it's effective, because you can very easily see what the difference between these different levels is. So it's often going to be your go-to graph for lots of things.

  • Now, some people try and replace this graph with a pie chart, right? This is a bad idea in general.

  • I mean, I like pie as much as the next person, but if you've got different things like this and one of them is big, I mean, you can see that this one's bigger than this one, but how much bigger is it? I don't know.

  • You can't see the relative sizes quite so easily. This all gets worse if you combine this into a doughnut plot, and then you've got multiple pies embedded in each other, none of them align and nothing makes any sense anymore, right?

  • So if in doubt, don't use a pie chart. It's a bad idea. I mean, they look very nice for presentations. That's about all I can say for it.

  • If we're going to be measuring some kind of quantity, then a bar chart's going to be what we want, right? But what we might also do is replace quantity with the frequency, or the amount of something.

  • So this is going to be frequency. Again, on the bottom here we've got our labels, and these are going to be bins for some single attribute.

  • So this is maybe naught to 10, this is maybe 10 to 20 of whatever the thing is, and this is the frequency, the amount that fall into that range.

  • What this allows us to do is work out very easily what the distribution is. Is it normally distributed? Is it bimodally distributed, with two peaks? Is it skewed to the left, skewed to the right?

  • We can see very easily the shape of our data, and it can be really helpful.
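
The binning described here (naught to 10, 10 to 20, and so on) can be sketched in a few lines. This is an illustrative Python version rather than the R used in the talk; the `sample` values are made up, and the function name `histogram` is just a label for this sketch.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per half-open bin [start, start + bin_width).

    Bins with no values are simply left out of the result.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# ASSUMPTION: made-up measurements, purely for illustration.
sample = [3, 7, 12, 15, 18, 21, 22, 25, 29, 34]
width = 10
bins = histogram(sample, width)

# Crude text rendering: one '#' per value in the bin.
for start, count in bins.items():
    print(f"{start:>2}-{start + width:<2} {'#' * count}")
```

Even in text form the shape is visible: the counts rise to a peak in the 20 to 30 bin and then fall away.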

  • Another way of looking at the shape, or the range of our data in particular, is a box plot, right?

  • Now, you'll see box plots come up from time to time in scientific documents, but they're very easy to produce in tools like R and they can be quite useful.

  • So here we're going to have a single attribute, so some label again or some attribute here, and this is going to be the quantity of this attribute.

  • And what a box plot does is label the range of that data. So we're going to have a box here, and it's going to look a little bit like this. I'll use a different color pen.

  • This line in the center is our median, typically, and then this is going to be the third quartile here, this is going to be the first quartile, and then these are the max and the min.

  • In this one plot we've got the absolute range of our data, we've got where 50% of our data is, this interquartile range here, and we know where the midpoint of our data is.
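
The five numbers a box plot draws can be computed directly. Below is a small Python sketch using the standard `statistics` module; it is not the R code from the talk, the `sample` data is invented, and the choice of `method="inclusive"` is an assumption, since different tools use slightly different quartile conventions.

```python
import statistics


def five_number_summary(values):
    """Return the min, first quartile, median, third quartile and max."""
    # quantiles(..., n=4) returns the three cut points between quarters.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# ASSUMPTION: made-up values, purely for illustration.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(sample)
iqr = summary["q3"] - summary["q1"]  # the box covers the middle 50% of the data

print(summary)
print("interquartile range:", iqr)
```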

  • So we can very easily see whether we've got outliers, and we can plot this next to a different attribute, have two box plots next to each other, and see very quickly a comparison between these two things. So that can be really useful.

  • Now, the final ones we're going to be talking about are scatter plots and trend lines. A scatter plot is very simple. We've got two attributes, this is attribute one and this is attribute two, and we want to see how they vary with respect to each other.

  • So when one goes up, does the other one go up, or does it go down? Are they even related?

  • So you'll see something like this, and it'll be all over the place often, but you can see maybe there's a kind of trend where, as attribute one increases, attribute two increases, right?

  • Now, this is a correlation being shown here, not a causation. So you can't say they're definitely related, but you can say that, generally speaking, when one is big, so is the other. That's sometimes useful.

  • A trend line is going to be where we're plotting something over time. So this has to be a continuous variable, or at least a variable we believe can be inferred between our points, because it's unlikely that you're going to have all the points.

  • So what you might have is a plot where you've got time down here, so maybe time in months, for example, and we've got some amount of something, and we're just going to plot it like this, and we can sort of have a trend line going like this.

  • If it's a situation where we can infer the amount between two time points, then this is okay, right? Because we can say, well, look, we've got a reading here, we've got a reading here, and it's reasonable to assume that between these two points, this is the amount. Nothing too funny has gone on between these two points, right?

  • If you can't assume that, then you shouldn't really be using a trend line, and you probably want to be using a bar graph.

  • Does that depend on the kind of data, then? Yes, it'll depend on it. This is a judgment call based on the kind of data.

  • I mean, time is a good example. We don't tend to measure in infinitely small increments. We're going to be measuring daily or hourly or something like this.

  • But we can kind of make an assumption a lot of the time about our readings. Take temperature over time, for example: if you're at 20 and then the next hour you're at 25, you're probably halfway between the two in between those two times, right?
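
Drawing a straight line between two readings amounts to linear interpolation. The sketch below, in Python rather than the talk's R, shows the assumption a trend line makes, using the 20 and 25 readings from the temperature example; the function name `interpolate` is hypothetical.

```python
def interpolate(t, t0, y0, t1, y1):
    """Linearly interpolate the value at time t between two readings.

    (t0, y0) and (t1, y1) are the readings on either side of t.
    """
    if not t0 <= t <= t1:
        raise ValueError("t must lie between the two readings")
    return y0 + (y1 - y0) * (t - t0) / (t1 - t0)


# Readings of 20 at hour 0 and 25 at hour 1, as in the temperature example.
halfway = interpolate(0.5, 0, 20.0, 1, 25.0)
print(halfway)
```

This only makes sense when values between the readings are meaningful, which is the judgment call being described.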

  • It's going to depend on your data. I mean, a good example would be if you were plotting something like operating system usage per student.

  • So we've got OS X here, Linux here, and we've got Windows. This many people use OS X, this many people use Linux, this many people use Windows.

  • Well, these are discrete data points. You can't fit a trend line to these. There is no operating system that's 50% between Linux and Windows that I know of, and we can't infer how many students are going to be using it. That makes no sense. That should be a bar chart.
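
For categorical data like this, the natural summary is a count per category, drawn as separate bars. Here is a minimal Python sketch of that idea with a text-only bar chart; the survey answers are invented for illustration and are not from the talk.

```python
from collections import Counter

# ASSUMPTION: made-up survey answers, one per student.
answers = ["Windows", "Linux", "OS X", "Windows", "Windows", "Linux"]

# Count how many students gave each answer.
usage = Counter(answers)

# One bar per category, most common first. There is deliberately no line
# joining the bars: nothing exists "between" two operating systems.
for system, count in usage.most_common():
    print(f"{system:<8} {'#' * count} {count}")
```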

  • So let's look at an actual data set and see how we can use some of this visualization in practice.

  • So I've got here a chicken data set, and this data set is about weighing chickens on different diets over a period of weeks, and also measuring how many eggs they produced.

  • I'm not a farmer, but let's imagine that what we wanted to do was see if one of these diets produces a better weight gain and maybe more eggs per week. Let's have a look.

  • So I'm going to load the chicken data set. This is stored in a CSV, just like before. Let's have a quick look at just the first few rows of this data to see what they look like.

  • So that's going to be the head function, and you can see we've got six attributes.

  • We've got the week that the measurement was taken; the chicken, in this case chicken number one, but there'll obviously be other chickens; the diet they're on, A, diet B or diet C; the age of the chicken in months; the weight of the chicken in kilograms; and the number of eggs they produced that week.

  • All right, so there's going to be lots of combinations of weeks and chickens in this data set.
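
As a rough picture of what loading the CSV and looking at the first few rows involves, here is a Python sketch using only the standard library. It is not the R session from the talk: the column names and every value in `raw` are assumptions made up to match the six attributes described, and `head` is a hypothetical helper written for this sketch.

```python
import csv
import io
from itertools import islice

# ASSUMPTION: stand-in for the CSV file. Column names and values are invented
# to mirror the six attributes described; the real file is not shown.
raw = """week,chicken,diet,age,weight,eggs
1,1,A,120,3.1,4
1,2,B,200,3.0,3
1,3,C,250,3.2,3
2,1,A,121,3.2,4
"""


def head(text, n=6):
    """Return the first n rows of CSV text as dictionaries keyed by column."""
    reader = csv.DictReader(io.StringIO(text))
    return list(islice(reader, n))


first_rows = head(raw, n=3)
for row in first_rows:
    print(row)
```

Note that `csv.DictReader` leaves every field as a string, so numeric columns would need converting before any calculation.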

  • Now what we want to try and do is see if there's any kind of relationship between the diet they're on and the number of eggs they're producing, or the weight of the chicken, or anything like this.

  • So the first thing we could do is have a look at the aggregate function. I'm going to paste this down here and we'll talk through it.

  • What the aggregate function does is let us produce, let's say, a summary, or calculate some means or medians over a data set, but this time grouping by a certain attribute.

  • So in this case, what we're going to do is aggregate the weight of the chickens, but in groups by their diet, so all the As, all the Bs and all the Cs, and then for each of those we're going to calculate a summary.
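
The idea of grouping by one attribute and summarising another can be sketched as follows. This is a plain Python illustration of the concept, not R's aggregate function and not the call used in the talk; the weights in `rows` are invented, and `summarise_by_group` is a hypothetical helper.

```python
import statistics
from collections import defaultdict

# ASSUMPTION: made-up (diet, weight in kg) pairs, purely for illustration.
rows = [
    ("A", 3.6), ("A", 3.8), ("A", 4.0),
    ("B", 3.3), ("B", 3.5),
    ("C", 3.2), ("C", 3.4), ("C", 3.6),
]


def summarise_by_group(pairs):
    """Group values by key, then compute a small summary for each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {
        key: {
            "min": min(vals),
            "median": statistics.median(vals),
            "mean": statistics.fmean(vals),
            "max": max(vals),
        }
        for key, vals in sorted(groups.items())
    }


summary = summarise_by_group(rows)
for diet, stats in summary.items():
    print(diet, stats)
```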

  • So let's run that, and you can see that we've got our groups down here. For A, we've got the minimum, the maximum, the median, the mean, and we can see some slight differences, perhaps, in these data sets.

  • I mean, the mean, for example, of group A is 3.8, whereas the mean for group C is 3.4. So maybe there's a slight difference in these things.

  • Okay, so let's try a different aggregate function. This time we're going to aggregate the number of eggs produced, grouped again by the diet. So this is going to be all the As, all the Bs and all the Cs, and then we're going to produce a summary.

  • So we can see that the median number of eggs produced for group A is 4 per week, and for group B and group C it's 3 per week. So maybe, again, there's a slight difference.

  • We're starting to learn a little bit about our data. So let's start with a histogram, right?

  • So what we're going to do is use this histogram function, which is mostly labels. The hist function in R produces a histogram.

  • And we're going to produce a histogram of the ages of the chickens. So what's the distribution of the ages? Are they old? Are they young?

  • And we're going to use 15 breaks. That means we're going to take the whole range and break it into 15 columns, 15 bands, right?

  • Now actually, R will do just a few checks behind the scenes to make sure 15 is an appropriate number, and might adjust it up or down slightly.

  • So we can see from this histogram that, broadly speaking, our chickens are evenly distributed among the different ages.

  • We've got some young ones that are sort of 60 or 70 weeks old, older ones that are 350 weeks old, and then for some reason we've got a peak around 250.

  • I don't know why that is, but maybe we got a batch of chickens of a certain age in.

  • And finally, let's look at the box plot. So we talked about the box plot. A box plot will tell us the minimum and the maximum for an attribute, and also the median and the range, right? So this is really helpful.

  • So we're just going to have a look just at age, just for all chickens.

  • So you can see that the median is around 220, something like that, and then the majority of the chickens, so 50% of the chickens, fall between about 150 weeks old and 300 weeks old. But you can see there are some very young ones and some very old ones.

  • This kind of plot will let us really size up where our data sits before we start to make any assumptions.

  • So let's imagine now that we want to try and drill down into this data a bit and work out whether the diet actually had any effect on the number of eggs or the weight of the chicken, right?

  • So what we're going to do is use the aggregate function again to calculate the means of all the weights per week. I'm going to copy that down here.

  • So we're going to say: aggregate the weight of the chickens by both the week and the diet, so combinations: week one, diet A; week two, diet A; and so on. And I want it to calculate the mean for all chickens.

  • Run that, and that produces some statistics on the different average weights of chickens over time.
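
Grouping by two attributes at once works the same way as grouping by one, with the pair of values acting as the group key. The Python sketch below illustrates this; it is not the R call from the talk, the `records` are invented, and `mean_by` is a hypothetical helper.

```python
import statistics
from collections import defaultdict

# ASSUMPTION: made-up measurements, purely for illustration.
records = [
    {"week": 1, "diet": "A", "weight": 3.0},
    {"week": 1, "diet": "A", "weight": 3.2},
    {"week": 1, "diet": "B", "weight": 3.1},
    {"week": 2, "diet": "A", "weight": 3.4},
    {"week": 2, "diet": "A", "weight": 3.6},
    {"week": 2, "diet": "B", "weight": 3.1},
]


def mean_by(rows, keys, value):
    """Average `value` over every distinct combination of the `keys` columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[value])
    return {group: statistics.fmean(vals) for group, vals in sorted(groups.items())}


mean_weight = mean_by(records, ("week", "diet"), "weight")
for (week, diet), weight in mean_weight.items():
    print(f"week {week}, diet {diet}: {weight:.2f} kg")
```

The result has one row per week and diet combination, which is the shape needed to draw one line per diet over time.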

  • I'm going to rename the columns so that they're a little bit more informative, so I'll run that line there.

  • And then finally, we're going to plot this. Now, we're going to use ggplot for this. You know, whether you use the inbuilt R plot functions or another library like ggplot will kind of depend on what plot you want to do.

  • In general, you can get quite nice plots with ggplot, but they're a little bit more involved. All right, so I'm going to run this line here.

  • Looking at this data, we can kind of see that maybe diet A is having a positive effect, right?

  • So down at the bottom, where no weeks have passed, at the beginning of our experiment, they were roughly the same weight, and then the average weight of A actually does seem to increase.

  • So I guess that's something interesting about our data, right?

  • Now let's look at number of eggs, right? So we're going to do the same thing. This time we're going to aggregate the number of eggs by week and by diet, so I'm going to copy that, and I'm going to give it some helpful labels as well, and then we're going to plot the data.

  • Let's see over time whether or not any of the diets have any effect on the eggs, and it's looking pretty good.

  • All right, so this is the frequency, the number of eggs they're producing.

  • The weeks are the twelve weeks of our experiment, and you can see that diet B and