Exploratory Data Analysis with Pandas Python

  • So, you've found a new dataset that you're excited to explore.

  • Or maybe you just want to get familiar with using Python for data analysis.

  • Well, in this video, I'll show you some of the essential skills that you'll need to know in order to perform data analysis with Python and Pandas.

  • Hi, my name is Rob, and I'm a data scientist and three-time Kaggle Grandmaster.

  • I've spent a lot of my time exploring new datasets on Kaggle.

  • In this video, I'm going to walk through step-by-step the process I usually take when exploring a brand new dataset.

  • If you're completely new to using Python and Pandas, I suggest you first watch my tutorial on an introduction to Pandas.

  • In this video, I'll be working entirely in a Kaggle notebook.

  • So this means if you don't have Python already set up on your computer, you don't need to. Just click the link to the Kaggle notebook.

  • You can fork or copy it, and then step through the code along with me or after this video is over.

  • So with that, let's get started.

  • Okay, so here we are in a Kaggle notebook.

  • A Kaggle notebook is very similar to a Jupyter notebook: just an environment where we can run code and write descriptions around it in plain text.

  • So in general, I always import these packages.

  • Of course, Pandas, and we're going to import that as pd.

  • Then we also have NumPy, which we'll import as np.

  • We're going to use a few different visualization libraries, matplotlib.

  • So, import matplotlib.pylab as plt.

  • And then we're also going to import a package called seaborn, which is really helpful for doing exploratory plots with our data.

  • And we're going to import seaborn as sns.

  • A few other things we want to do before we get too far into this: we're going to use a style sheet for our matplotlib and seaborn plots.

  • So the way we do that is plt.style.use('ggplot').

  • I think it looks pretty nice.

  • Another setting we're going to add is to expand the number of columns that are shown when we display a data frame in our notebook.

  • So the way we do that is pd.set_option.

  • And then we set the max columns to, let's say, 200 just to make sure we can see them all.

  • Now I'm going to comment that part out now and show you why it's necessary later.
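
Put together, the setup cell looks something like this (a sketch of the imports and settings described above):

```python
import pandas as pd             # dataframes
import numpy as np              # numeric operations
import matplotlib.pylab as plt  # plotting
import seaborn as sns           # statistical visualization

plt.style.use('ggplot')                    # nicer-looking default plots
pd.set_option('display.max_columns', 200)  # show up to 200 columns when displaying a dataframe
```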

  • So let's go ahead and run this.

  • That cell has run.

  • And now next thing we need to do is import our data.

  • And with Kaggle, we can go here to this tab on the side.

  • I've already added our data set.

  • What we're going to be looking at is a data set that I created with a bunch of information about over a thousand roller coasters.

  • It has information like the speed of the roller coaster, what material was used to make the roller coaster, and some cool things like that.

  • But if you're starting in a notebook and want to look at a different data set, you can always click on add data and look through a bunch of these data sets that already exist or upload your own data.

  • And then, of course, if you're working on a notebook on your own machine, you can just reference whatever CSV file you have.

  • So I'm going to import our data and just call it df for data frame.

  • And we're going to use read_csv to read in the data.

  • So I know this is in the input folder, and it's called CoasterDB.
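
A sketch of the read step; the exact path and filename here are assumptions based on the input-folder description:

```python
# Read the roller coaster data into a dataframe
# (path and filename are illustrative)
df = pd.read_csv('../input/coaster-db/coaster_db.csv')
```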

  • So that we've read in very quickly, and we have now a data frame we can work with.

  • So the next step is to do some basic data understanding of our data.

  • So very simple stuff, but the first thing I like to do is do a df.shape.

  • So what does this tell us?

  • This tells us the shape of the data frame or the data that we just loaded in.

  • In this data set, we have 1,087 rows and 56 columns of data.

  • We can also run a head command on this.

  • This shows us the first five rows.
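
A minimal sketch of those first-look commands:

```python
df.shape   # returns (rows, columns) of the dataframe
df.head()  # first five rows; pass a number, e.g. df.head(20), for more
```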

  • We could change the number of rows we want to see.

  • Let's say if we want to see the first 20.

  • And now if you notice here, when we do this head command (let's put it back at five), it shows us all the different columns, but at a certain point we see just dots, and then it picks up again with the last few columns.

  • This is because pandas by default does not show you every single column in the data frame.

  • But for exploration, I find it easier to show them all.

  • So going back up here to our PD set option, we're going to make this 200, which is plenty enough to see every column in our data set.

  • And then rerun this head command, and we'll see all of the different columns in the data set.

  • Now, another thing we want to do is just list out all the columns.

  • Since this data frame has a lot of columns, we can just do df.columns to see them all.

  • And now we've seen all the columns listed.

  • Eventually, if we want to subset our data set, we can remove some of these columns, and I'm going to show you how to do that later.

  • So the next thing we might want to do is, for each of these columns, find out what dtype pandas has decided it is.

  • So if you remember from the earlier tutorial, in a pandas data frame, every column is actually a series, and every pandas series has a type.

  • So if we type in df.dtypes, we can actually see, for each of these columns, what the type of column it is.

  • So we have a lot of objects here, which are just string-type columns, and then we have some that are float values.

  • Actually, down here at the bottom, since I created this data set, I added numeric versions of some features, like the height of the roller coaster and the speed in miles per hour, at the end.

  • Okay, so one of the last things I want to show you here is just the describe function.

  • So if you type in df.describe, there we go, what it'll show us is some information and statistics about the numeric data in our data set.

  • So we can see here the height of the roller coasters.

  • We have a count of 171 values, a mean value of 101 feet, and all this information that gives us a good understanding for the data before we dive into it any further.
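
Together, the basic data-understanding commands are roughly:

```python
df.columns     # list every column name
df.dtypes      # the dtype pandas inferred per column (object means string-like)
df.describe()  # count, mean, std, min/max and quartiles for numeric columns
```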

  • Okay, so now it's time to move on to the second step, which is data preparation.

  • We have a general understanding of the columns in our data and how many rows we have, but we want to do some cleaning before we actually get into analysis.

  • And it's very important that we first drop any of the columns or rows that we don't want to keep before we continue on, or else we might waste time cleaning up columns that we won't end up using.

  • So let's run a df.head again on our data set to remember what sort of columns we have.

  • One thing to note about this data set is we actually have some columns that have values that are string versions, like speed and length.

  • And we also have similar columns that I have created when I created this data set over to the right here that have the numeric version of those.

  • So it's stripped out all the text of mile per hour and converted everything into the same unit here, which is miles per hour.

  • Same thing with gforce, inversions.

  • So what we're going to want to do is we're going to want to subset this data set just to the columns we want to keep.

  • And there are two ways that I like to do this.

  • So if we run df.columns again, we can pull in all the columns and we can actually just subset by copying this list of columns here.

  • And then with two brackets, we put in the list of columns, and that will show us all of the columns again.

  • It's pretty much the same data set.

  • But now we can start removing column by column what we've decided we don't want to keep.

  • So I already know I want to keep the roller coaster name.

  • I want to keep the manufacturer.

  • So I'm just commenting out the lines in this list that we don't want to keep.

  • And this makes it easy to keep track of what we've actually dropped.

  • Height restriction, I don't think we want to keep.

  • Inversions, cost, trains, park section, all of these things may be interesting to look at later.

  • But at this time, we're not going to take a look at them.

  • But we do want to keep some of this information at the end.

  • So year introduced, latitude, longitude, I think we want to keep the type of material used here.

  • Opening date clean.

  • Speed values.

  • All right, so here for speed, I think we just want to keep the miles per hour.

  • For the height, we just want to keep the value in feet.

  • And yeah, this looks like a good subset of the data that we want to keep.

  • Actually, let's add in location and status.

  • Now, if I run this cell, I see that we actually have a smaller data set.

  • Our columns have been reduced just to the columns that we want to keep.

  • We still have plenty of information to work with, though.

  • Now, there's a second way we can deal with subsetting our data sets columns, and that's by using the drop command.

  • So let's show an example of how to do that.

  • So if we only wanted to drop one column, we can just write drop and provide it either one or a list of columns that we want to drop.

  • Let's say we want to remove status or opening date as an example.

  • We'll run this cell, and we actually have to provide it axis equals one so that it knows to drop not a row but a column.

  • Now, if you look here, opening date, that column now has disappeared.

  • So that's one way we can drop columns. I'll just keep this cell as an example of dropping a single column.
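
Roughly, that drop call looks like this (the column name is illustrative):

```python
# Example of dropping a single column.
# axis=1 tells pandas to drop a column rather than a row.
df.drop(['Opening date'], axis=1)
```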

  • But we're not using that way to drop our columns here.

  • I'm just going to put that up here as an example.

  • We're going to use our subsetting of a list of columns, and we're actually going to reassign our data frame to be this new subset of data frame.

  • So we're going to rewrite the data frame by doing df equals this new subsetted data frame, with one more thing added at the end.

  • Now, this can be important.

  • We want to add the copy command to the end of our subset of data frame.

  • This makes sure that Python knows that it's a brand new data frame and not just a reference to the old one, and when you're manipulating the data frame later on, it's going to be nice to run this copy.
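
A sketch of the subset-and-copy pattern; the column names are illustrative stand-ins for the ones kept in the video:

```python
# Keep only the columns we care about; commented-out lines record what we dropped.
df = df[['coaster_name',
         # 'Length', 'Speed',
         'Location', 'Status',
         # 'Opening date', 'Type',
         'Manufacturer',
         # 'Height restriction', 'Inversions', 'Cost', 'Trains', 'Park section',
         'year_introduced', 'latitude', 'longitude',
         'Type_Main', 'opening_date_clean',
         'speed_mph', 'height_ft',
         'Inversions_clean', 'Gforce_clean']].copy()  # .copy() makes a new dataframe, not a view
```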

  • So let's go ahead and run the cell, and if I run df shape now, we see we still have 1,087 rows, but we only have 13 columns, and if we run dtypes on this, we see that we have coaster name, location, and it looks pretty good.

  • Now, one thing I'm noticing here is that opening date clean should be a date.

  • It's saying that it's an object column, but we don't want that, because it's a date. We can force it to be a datetime column by running pd.to_datetime, and now the dtype is datetime64, so this is a way of ensuring that our dtypes are correct for each column.

  • We can rewrite them by running pd.to_datetime.

  • Now, another similar option would be if we had, say, a numeric column that we want to force to be numeric.

  • Now, year introduced we already know is an int column, but let's say it was a string.

  • We could run pd.to_numeric on it, and pandas will automatically try to make it into a numeric column.

  • Not necessary here, but good to know.
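
Both conversions, sketched with assumed column names:

```python
# Force the cleaned opening-date column to a real datetime64 dtype
df['opening_date_clean'] = pd.to_datetime(df['opening_date_clean'])

# Likewise, a string column holding numbers can be coerced to numeric
# (not needed here, since this column is already an int, but good to know)
df['year_introduced'] = pd.to_numeric(df['year_introduced'])
```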

  • Now, we're going to learn how to rename our columns.

  • There are some columns here whose names I'm not too happy with, and there's a pretty easy way to rename them, so let's make sure we're all looking in the right spot, and here we go.

  • So we can, again, run df.columns to see what they are. We have some differences between lowercase and uppercase names, and I also think it's important not to have spaces in column names, which we luckily don't have here. I'm still going to rename some of these, so I'll run df.rename, and we can ask it to rename columns by writing columns equals and providing a dictionary.

  • Don't worry, this is pretty easy.

  • All we have to do is put curly braces for our dictionary, with the old name and then the new name, so this lets us rename the column name to uppercase, and we'll go ahead and do this to a few of the other ones to put them in uppercase-first-letter format, so here we go.

  • Now I've renamed all of these columns to names I'm happier with.

  • They're all starting with uppercase.

  • I've removed this underscore-clean suffix, because these are now the only columns we have for inversions and g-force, and I'm going to go ahead and rewrite my data frame with these newly named columns.
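
A sketch of the rename, with assumed old and new column names:

```python
# Rename columns with a dict of {old_name: new_name}
df = df.rename(columns={'coaster_name': 'Coaster_Name',
                        'year_introduced': 'Year_Introduced',
                        'opening_date_clean': 'Opening_Date',
                        'speed_mph': 'Speed_mph',
                        'height_ft': 'Height_ft',
                        'Inversions_clean': 'Inversions',
                        'Gforce_clean': 'Gforce',
                        'latitude': 'Latitude',
                        'longitude': 'Longitude'})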

  • Now, if I run a data frame head on this, we can see beautiful column names, and everything looks pretty good.

  • Okay, so the next step here is to identify where missing values exist in the data frame and how often they occur, and the command we're going to run to identify missing or null values is isna.

  • If you run isna on this data frame, it'll tell us, for every single row and column, whether there is a null value. But this is a little overwhelming to look at at first, and we'd rather see a sum of the number of null values per column. Here we can see that for this data set, the status is null for 213 of our rows.

  • Latitude and longitude are missing for some.

  • G-force is missing for some.
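
In code, the missing-value check is simply:

```python
df.isna()        # True/False for every cell; overwhelming on its own
df.isna().sum()  # count of missing values per column, far easier to scan
```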

  • Now we have a general understanding of where we might have missing values. Similarly, we'll want to see whether any of our data is duplicated. There might be issues where we have two rows that are completely identical, which we would not want in our data set, and the way we look at that is by running duplicated on the data frame.

  • Duplicated, by default, will flag the second and any additional occurrences of rows that are duplicated in the data frame.

  • It'll ignore the first row that is a duplicate. Why that's nice is because it gives us a list of true or false for whether each row is duplicated, and then we can simply do a loc on the duplicated values to see which ones are duplicated, and none of them are in this data set, which is nice.

  • We can also, with duplicated, run this on a subset of columns, so let's just see if there are any duplicated coaster names by running subset on coaster name, and actually, it looks like there are some.

  • If we run a df.loc on this list of duplicated rows, it'll show us just the rows the second time they occur, so we can see that there are actually 97 rows that have a duplicated coaster name.

  • You might want to think about reasons why that could be.

  • To get an idea of why we have duplicated rows, let's look at one of these coaster names and see the multiple rows that are duplicated. We'll do this using the query command, df.query, and we'll search this data frame for where the coaster name equals, in quotes, the coaster name itself: this Crystal Beach Cyclone.

  • Indeed, it does have two rows.
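
Sketched out, the duplicate checks look roughly like this (the column name and coaster name follow the discussion above):

```python
# Flag rows whose coaster name already appeared earlier in the dataframe
# (subset limits the duplicate check to that one column); .sum() counts them
df.duplicated(subset=['Coaster_Name']).sum()

# Show just the duplicated rows
df.loc[df.duplicated(subset=['Coaster_Name'])]

# Inspect one duplicated coaster by name
df.query('Coaster_Name == "Crystal Beach Cyclone"')
```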

  • Most of the values look identical, though, but if we look very closely here, oh, what is it?

  • Year introduced, this is not identical.

  • It actually has multiple years where it was introduced.

  • This might be an error in our dataset, or potentially the roller coaster was put online, taken offline, and put back online, and we really only want the first time it was introduced. So that was checking an example duplicate.

  • What we want to do here is remove any duplicates where a certain set of columns is the same, so let's use our same df.duplicated to identify duplicates, and we're going to run this on a subset of columns: if the coaster name, location, and opening date are duplicated, let's identify those.

  • We can put a sum on this to see what the count of those are.

  • There are 97 rows where it's duplicated. Then we want to take the inverse and select just the rows that are not duplicates, just the first version of each. The way we write the inverse is with this tilde in front, and now, where we had trues before, the trues have swapped with falses. We can then locate just the rows whose values are not duplicates over this subset of columns, and we'll save this off as our data frame.

  • Now one other thing I like to do before we save this off is now that we've subsetted our data frame, we're actually dropping rows in our data frame.

  • Well, before we were dropping columns, now we're dropping rows, and this will make our index not necessarily jump up by a single number each time.

  • The way we can maintain that is by actually running a reset index, which will reset our index, but when we run that, it adds this index column, so a way to have it not keep that index column is to add drop equals true when we call reset index.

  • All right, there we go, so now we have our index is going from zero to our final number.

  • We have a subset of columns.

  • It's looking good.

  • I'm going to go ahead and copy this again and make this our data frame.
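
The whole de-duplication step, as one sketched line (column names assumed):

```python
# Keep only rows that are NOT duplicates of the name/location/date combination.
# ~ inverts the boolean mask; reset_index(drop=True) renumbers rows 0..n-1
# without keeping the old index as a column; .copy() makes a fresh dataframe.
df = df.loc[~df.duplicated(subset=['Coaster_Name', 'Location', 'Opening_Date'])] \
       .reset_index(drop=True).copy()
```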

  • We can do a shape.

  • Now we have 990 unique roller coasters with 13 different feature columns that we're going to explore.

  • All right, now is where the real fun begins.

  • We're done with cleaning up our data set.

  • We have a good subset of the data and a good understanding of where missing values occur, and now we're going to take a look at each feature itself. This is very important for understanding the distribution of each feature and spotting potential outliers in the data set. This is also known as univariate analysis, and we can run it on, let's say, the year introduced column, which would be a good feature to look at.

  • One of the very common things we can run on just a single column (remember, a single column in a data frame is just a series) is value_counts.

  • This is very powerful.

  • I use it all the time. What it does is count how many times each unique value occurs, and it automatically orders the result from most to least frequent. We can see that in the year 1999, 46 roller coasters in this data set were introduced, and the next highest are the years 2000 and 1998, so this gives us an idea of which years were the most common for roller coasters to be introduced and which were the least.
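
The call itself is one line (column name assumed from the renames above):

```python
# Count how often each year appears, ordered from most to least common
df['Year_Introduced'].value_counts()
```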

  • Let's say we want to take this value counts on year introduced and make it into a plot, so we can see the top years roller coasters were introduced.

  • We could take value counts, though plotting every single year would be a bit much, so let's just run head on this and look at the 10 most common years for roller coasters to be introduced, and then we can run a plot.

  • Now, these backslashes that I'm running just allow me to break the lines of code I have up into separate lines, and it makes it a little bit cleaner to read, but this would be the same as me writing this all on one long line without these backslashes.

  • All right, so we're going to make a plot.

  • We're going to make a kind, a bar plot.

  • What does that do?

  • That shows us each year and the counts, but we don't have any axis labels, and that's not good. I actually think this might look better as a horizontal bar plot.

  • No, let's do a normal bar plot.

  • When we're doing plots, we can also add titles here.

  • The title is Top Years Coasters Introduced, and we can actually save this as a matplotlib axis by doing ax equals. With the axis saved, we can add some additional information to it.

  • We can set the x label to Year Introduced.

  • Now we have Year Introduced as the x label, and we can set the y label as Count.

  • There we go, so we see the top 10 years coasters were introduced, so let's retitle it Top 10 Years Coasters Introduced.
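
A sketch of the full plotting cell, with backslashes breaking the chain across lines as described below:

```python
# Bar plot of the 10 most common years; saving the returned matplotlib
# axis lets us set the axis labels afterwards
ax = df['Year_Introduced'].value_counts() \
       .head(10) \
       .plot(kind='bar', title='Top 10 Years Coasters Introduced')
ax.set_xlabel('Year Introduced')
ax.set_ylabel('Count')
```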

  • Another thing that we commonly want to do when doing analysis on just one column of data is to get an idea of the distribution of the column, so I'm going to take, for example, this speed in miles per hour.

  • Now, a lot of these early roller coasters don't have a speed value, so those will just be missing, but for later years it's pretty common to have the speed in miles per hour. We're going to visualize this by making a plot, so we'll call the plot command, and the kind of plot we want is a histogram.

  • A histogram shows us the count of values falling in each bin, so for a continuous value like speed, it's a great choice. Sometimes I find it helpful to try different bin sizes to get a better picture of the distribution; right now I'm not sure how many bins it defaults to, but with 20 bins it's a little clearer. And we'll make sure we always add a title, so our title is Coaster Speed, and that's going to be in miles per hour.

  • We can again save off this plot as an axis if we want to add some additional features to it, and set the x label to Speed (mph).

  • There we go, and now we have a plot of the distribution of speed in miles per hour.
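
A sketch, assuming the renamed speed column:

```python
# Histogram of coaster speed; bins=20 gives a clearer view than the default
ax = df['Speed_mph'].plot(kind='hist', bins=20, title='Coaster Speed (mph)')
ax.set_xlabel('Speed (mph)')
```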

  • Now, we've noticed some things here.

  • We noticed that there's a very common speed, probably around 50 miles per hour, and there are also some speeds way out at the end that might be interesting to look at later. We could run this on all of our features, and if you're doing exploratory data analysis, I'd encourage you to look at this for each feature.

  • Now, a very similar way to look at this is a density plot instead of a histogram. It's like a histogram but a little less cluttered, and easier to interpret when you're comparing multiple distributions, because they're all normalized. So I'm going to take the same code and make one change, which is kind equals kde, for a kernel density estimate, and plot it here, and we can see what the KDE plot of roller coaster speed looks like.

  • We can see this kind of hump here at 35 miles per hour and also at 50 miles per hour.

  • It's very similar to the histogram.
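
The same plot with one change:

```python
# Kernel density estimate of the same distribution: smoother and normalized
ax = df['Speed_mph'].plot(kind='kde', title='Coaster Speed (mph)')
ax.set_xlabel('Speed (mph)')
```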

  • All right, getting even more fun.

  • Now, we're going to look at feature relationships. We've looked at each feature individually, at distributions and other characteristics of a given column in our dataset, but what we really want to know is: how do the different features relate to each other in our dataset?

  • So there's a lot of things that we can do to compare the different features in the dataset, and one of them is to just compare two features side by side by making a scatter plot, so let's go ahead here and let's take the data frame and make a plot.

  • The kind is going to be scatter.

  • The X value is going to be our speed in miles per hour, and the Y value is going to be the height in feet, and if we plot this, we can see here now we have a scatter plot where there is a dot representing each of the rows in our dataset, and on the X axis, we have the speed, and the Y axis is our height.

  • We're going to also add a title called Coaster Speed Versus Height.

  • Now, one other thing to keep in mind: this plot command in pandas creates a matplotlib object, and to make the output look a little cleaner in our notebook, we add plt.show() after our plot call, which keeps the axes object itself from being displayed. You can see that if we remove it, the subplot information is printed out, so plt.show() is just a cleaner way of showing the figure. This scatter plot looks nice, but we made it using basic built-in pandas functionality; using another package like seaborn, we can do somewhat more advanced analysis and plots with this data. If you remember, earlier we imported seaborn as sns, and we're going to use seaborn's scatterplot function to make a very similar plot to the one above.
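
A sketch of the pandas scatter plot (column names assumed):

```python
# Scatter plot straight from pandas; plt.show() suppresses the printed
# axes object so only the figure is displayed
df.plot(kind='scatter', x='Speed_mph', y='Height_ft',
        title='Coaster Speed vs. Height')
plt.show()
```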

  • What this asks for are similar values to what we provided above.

  • If we click inside the scatterplot call and hold Shift+Tab (this is a nice trick), we can see the function's docstring, and we can see that it takes an x value, a y value, and the data, which is going to be our data frame, so let's copy some of this from before.

  • We want our X to be our speed, our Y to be our height, and then data is going to be our data frame.

  • There we go, very similar-looking plot to above, but there's some cool stuff that we can do with Seaborn we can't with matplotlib out of the box.

  • We can actually have the year introduced, or other features, be the color, and the way we do that is by passing it as the hue in our scatterplot call. Look here: now we have different colored dots based on the year the roller coaster was introduced.

  • There we go, just a different type of scatter plot where we were able to color it based on another variable.
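
The seaborn version, sketched with the hue added:

```python
# Seaborn scatter plot; hue colors each dot by a third variable
ax = sns.scatterplot(x='Speed_mph', y='Height_ft',
                     hue='Year_Introduced', data=df)
ax.set_title('Coaster Speed vs. Height')
plt.show()
```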

  • So far, we've looked at comparing two features against each other in our dataset, but what if we want to compare more than two?

  • Well, Seaborn has a pretty nicely built-in functionality called a pair plot where we can compare multiple features against each other in our dataset, so again, I'm going to hold down shift tab within this function.

  • We're going to see what variables need to be provided for this function to work. Instead of a single x and y, we can provide multiple x variables and y variables, and it lets us choose what type of comparison to show between them.

  • It defaults to a scatter, and then along the diagonal, which you'll see, it'll also show the distribution of the individual features, so let's go ahead here and add the data.

  • The data is going to be our data frame, and let's pick the features we want to compare. This might take a little while to run, and while it's running, I'm going to add a plt.show() here, as before, to make sure it displays correctly. And what do we find?

  • Okay, so now we have, and I'm going to have to zoom out here a little bit for you all to see this, but now we have, similar to what we did before, where we have the distribution of each feature, and then we have the relationship between pairs of features using a scatter plot, but we have this in a matrix form, so for each of the features that we provided, how do they interact with each other?

  • Pair plots are awesome, and we can actually take this up a level by adding a hue to the pair plot; let's use the type of material as the color of the dots. Now it's going to be a very similar plot, but with the dots in our scatter plots colored by the type of material used, as sketched below.
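
A sketch of the pair plot; the feature list and material column name are assumptions:

```python
# Pairwise scatter plots for several features at once, with the distribution
# of each feature on the diagonal; hue colors dots by coaster material
sns.pairplot(df,
             vars=['Year_Introduced', 'Speed_mph',
                   'Height_ft', 'Inversions', 'Gforce'],
             hue='Type_Main')
plt.show()
```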

  • There we go, so we can see on the right side the type of material: red is wood, blue is another type, and the purplish is steel. Looking at year introduced, in the really early years there isn't much steel, and then steel coasters start appearing around the 1950s and become ever more popular.

  • What else can we see here?

  • A lot of interesting stuff we can see here just from this pair plot, so pair plots are really nice to use.

  • I'll zoom out again so you can kind of see it all at once, so now we know what a pair plot is and how powerful that can be.

  • Another thing you might want to do when comparing features is look at the correlation. Luckily, in pandas this is very easy: we can take just the subset of features we know are numeric, drop any null values, and run the corr function on the result.

  • What does this show us?

  • This shows us the correlation between different features: between speed and height, the correlation is 0.73; some pairs have a negative correlation, and something like year introduced might not correlate strongly with anything. Another way I like to look at this is with seaborn's heatmap, so if we run sns.heatmap, we can pass in this correlation data frame; let's call it df_corr, and we can still print it out to see what it looks like.

  • We can feed it in the correlation data frame, and now we have a heat map showing how correlated the values are to each other, so just another way of seeing it, and I also like to, within this, add in the annotations, so we can see the raw values of what the correlation is.

  • Remember, every value is going to be a perfect correlation with itself, that's why we see ones across the diagonal, but otherwise, this kind of lets us see interesting correlations and relationships in the data.
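
Sketched together (numeric column names assumed):

```python
# Correlation matrix of the numeric features (nulls dropped first)
df_corr = df[['Year_Introduced', 'Speed_mph',
              'Height_ft', 'Inversions', 'Gforce']].dropna().corr()

# Heatmap of the correlations; annot=True prints the raw value in each cell
sns.heatmap(df_corr, annot=True)
```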

  • Okay, we're here at the final step of the exploratory data analysis process, and it can be one of the toughest parts: asking a question that we want to answer with our dataset.

  • Now that we have a good feel for our dataset, I think we can go ahead and ask a question. I'm interested in a column we haven't really looked at much yet: the location. We know each location can have multiple roller coasters, but what I'm curious about, and I'm going to write the question out here, is: which locations have the fastest roller coasters, with a minimum of 10 coasters at that location? In other words, if we wanted to go to the park with the fastest roller coasters on average, which would it be?

  • So we're going to do some things that we've learned before, so one of the first things I'm going to do is go here to location, and let's go ahead and just do a value counts on this, and we notice here that there's an other location.

  • Other location is not truly a location, so we're just going to ignore that for this analysis, and we're going to query where location does not equal other, so this will basically just filter out those other locations, and then we're going to group by these locations.

  • All right, so now each location is grouped, and we can look at the speed in miles per hour column. What we want to find out about this column, per location, is a few things: what's the average speed, and how many coasters are at that location?

  • We can do this in one step pretty easily using the agg function.

  • This will aggregate by location, and we can get the mean value and the count value, so what do we have here?

  • Now we have all these different locations, the average speed and the count, and we're going to run another filter on this, so we're going to filter this.

  • We're going to query where the count is greater than or equal to 10, so a minimum of 10 roller coasters: which locations have the fastest values? And just to make this a little cleaner, we're going to sort the values by the mean speed.

  • Remember, this is a mean speed, and this is great.

  • Now we have for each location with more than 10 coasters, what's the average speed of the roller coasters, and let's go ahead here and make this a plot.

  • It would be a lot better as a plot, so we'll plot with kind set to a horizontal bar plot, plotting just the mean value, and give it the title Average Coaster Speed by Location.

  • We can save this axis as we've done before, so we can set the x label to Average Coaster Speed, and there we go.

  • Now we have a plot showing that Busch Gardens has the highest average coaster speed, followed by Cedar Point and so on. We've answered our question of which parks have the highest average speed with a minimum of 10 coasters, all in just a few lines of code.
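
The whole answer, sketched as one chained expression (column names assumed):

```python
# For each real location: mean speed and coaster count, keep locations with
# at least 10 coasters, sort by mean speed, and plot horizontally
ax = df.query('Location != "Other"') \
       .groupby('Location')['Speed_mph'] \
       .agg(['mean', 'count']) \
       .query('count >= 10') \
       .sort_values('mean')['mean'] \
       .plot(kind='barh', figsize=(12, 5),
             title='Average Coaster Speed by Location')
ax.set_xlabel('Average Coaster Speed (mph)')
plt.show()
```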

  • Now, when you ask a question like this, it's going to take you a while to know which steps to go through to get this result, but by asking the question, you're going to be forced to search for how to use Pandas to get you this sort of solution.

  • Thanks for sticking around this long.

  • I hope you enjoyed the tutorial.

  • By now, you should have some basic understanding of how to use Pandas and Python to do simple data exploration.

  • If you enjoyed it, please give me a like and subscribe.

  • Also, follow me on Twitch where I do stream live coding from time to time.

  • See you around.
