  • Hello, and welcome back to R for Statistics and Data Science.

  • In the next few lessons, we will dive deep into the Star Wars data and will learn how

  • to transform data sets in various creative and not-so-creative ways.

  • Let’s get to it!

  • This is the first real lesson in which we will use the dplyr package.

  • For the distracted souls out there, dplyr is part of the tidyverse and we got it when

  • we installed the tidyverse ecosystem of packages.

  • It specializes in data manipulation tools that deal with filtering, mutating, and summarizing

  • data.

  • First things first, let’s fire up the Star Wars data frame that comes with dplyr.

  • This time, I will save it as “star”.

  • Notice that the data are saved as a tibble instead of a base R data frame.

  • Let’s keep it this way and use some of the tibble properties.

  • Tibbles come in handy here because this is a relatively big dataset and we don’t want

  • to see the entire thing every time we do an operation and print to see our results.

  • Tibbles limit the printing to just a few rows.
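The code typed on screen is not captured in the subtitles, so here is a minimal sketch of this step. It assumes dplyr is installed and that the built-in data set is the one dplyr exports as `starwars`; the name `star` comes from the lesson.

```r
library(dplyr)

# Copy the built-in Star Wars data into an object called star
star <- starwars

# A tibble carries the "tbl_df" class on top of "data.frame"
class(star)

# Printing a tibble shows only the first few rows and the columns
# that fit on screen, rather than the whole data set
star
```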

  • Okay, although we’ve already looked at it before, if you want to see the data in all

  • its glory, run View(star).

  • This will open the viewer and you can scroll through the values to your heart’s content.

  • Right!

  • Transforming data!

  • The filter() function does what we think it does: subsets data according to a set of criteria.

  • It works like this: we pass the data, and then the expression according to which we

  • want our data filtered.

  • There can be more than one criterion, of course.

  • For instance, I can select all the droids in the data frame.

  • And now I can call only the ones from Tatooine.
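As a sketch of what these two calls might look like (the column names `species` and `homeworld` come from the `starwars` data; the object names `droids` and `tatooine_droids` are made up for illustration):

```r
library(dplyr)

star <- starwars

# One criterion: keep only the rows whose species is "Droid"
droids <- filter(star, species == "Droid")

# Two criteria separated by a comma are combined with AND:
# droids whose homeworld is Tatooine
tatooine_droids <- filter(star, species == "Droid", homeworld == "Tatooine")

tatooine_droids$name
```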

  • Right.

  • Yes, that makes sense, it was young Anakin Skywalker who re-built C-3PO while still on

  • Tatooine.

  • And R5-D4…

  • I am not sure I know anything about that little R5-unit!

  • Okay.

  • The filter() function also works with logical operators, so, for example, I can call every character

  • that has red, orange or yellow as an eye colour.
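One way this could be written, assuming the eye colours are stored as lower-case strings in the `eye_color` column (the object names are illustrative):

```r
library(dplyr)

star <- starwars

# The | operator means OR: a row is kept if any of the conditions is TRUE
warm_eyes <- filter(
  star,
  eye_color == "red" | eye_color == "orange" | eye_color == "yellow"
)

# %in% is a more compact way of expressing the same set of alternatives
warm_eyes_in <- filter(star, eye_color %in% c("red", "orange", "yellow"))

select(warm_eyes, name, species, eye_color)
```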

  • Okay, the majority of these aren’t human.

  • I wonder if there are any more humans with weird eyes apart from Darth Vader and Palpatine.
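To check, the species criterion can be added to the eye-colour criterion. This is only a sketch; `human_warm_eyes` is an illustrative name.

```r
library(dplyr)

star <- starwars

# Humans only, AND one of the three eye colours
human_warm_eyes <- filter(
  star,
  species == "Human",
  eye_color %in% c("red", "orange", "yellow")
)

select(human_warm_eyes, name, eye_color)
```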

  • No?

  • Yikes.

  • Alright, next we have the select() function.

  • Now, our data set may not have hundreds of variables, but looking at the column names,

  • it does feel like I genuinely don’t need to know about some of these things.

  • To narrow down the data to the information I want, I can use select().

  • This selects specific individual columns, by name.
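For example, a minimal sketch that keeps three columns (which columns the video picks is not recorded in the subtitles, so `name`, `height` and `mass` are an assumption):

```r
library(dplyr)

star <- starwars

# Keep only the named columns, in the order they are listed
basics <- select(star, name, height, mass)

names(basics)
```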

  • If I want to select a column and then everything between two other columns, I can do this...
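A sketch of the range syntax, using columns chosen for illustration: `first:last` selects every column from `first` through `last` in the order they appear in the data.

```r
library(dplyr)

star <- starwars

# name on its own, then every column from homeworld through species
origins <- select(star, name, homeworld:species)

names(origins)
```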

  • Isn’t this already a lot easier to do than with the base R functions we learned earlier?

  • But check this out, too: select() works nicely with a couple of nifty functions like starts_with(),

  • or ends_with(), which let us subset data in a super intuitive way.

  • So, if I wanted to get all the columns that have to do with coloration, I can run this...
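A possible version of this call, assuming the coloration columns are the ones whose names end in "color" (keeping `name` alongside them is my addition, so the rows stay identifiable):

```r
library(dplyr)

star <- starwars

# ends_with() matches column names by their suffix
colours <- select(star, name, ends_with("color"))

names(colours)
```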

  • Okay, new scenario: there are a bunch of interesting variables you want to look at but you also

  • don’t want to ignore the rest of the data: what do you do?

  • Well, you can use the everything() function with select, to move the variables you want

  • to the beginning of the table, and then show everything else.

  • Like this.
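Roughly, with three columns picked for illustration:

```r
library(dplyr)

star <- starwars

# The listed columns move to the front; everything() then adds
# all remaining columns in their original order
reordered <- select(star, name, homeworld, species, everything())

names(reordered)
```

No columns are dropped here; only their order changes.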

  • Sweet, right?

  • Finally, let’s look at the mutate() function.

  • The mutate() function is dplyr’s easy way of creating new variables from variables that already

  • exist in the data set.

  • For example, I can calculate the BMI for our characters because the Star Wars data has

  • recorded both height and mass information.
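A sketch of the calculation, assuming `height` is in centimetres and `mass` in kilograms (as the `starwars` documentation describes) and using the usual formula, weight in kg divided by height in metres squared. The column name `bmi` is illustrative.

```r
library(dplyr)

star <- starwars

# Convert height from cm to m, then compute mass / height^2.
# Characters with a missing height or mass get NA.
star_bmi <- mutate(star, bmi = mass / (height / 100)^2)

select(star_bmi, name, height, mass, bmi)
```

All the original columns are still in `star_bmi`; the new `bmi` column is appended at the end.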

  • Of course, this is largely uninformative, because the BMI scale is extremely human-centred,

  • but, you know, anything to get the point across!

  • Now, if mutate() is the function to use when you want to add a column to your data while

  • also retaining all the other columns in your data frame, then transmute() is what you will

  • opt for if you only want to keep the new variable you create.

  • Let me show you what I mean...
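The same BMI calculation with transmute(), again as a sketch with an illustrative column name:

```r
library(dplyr)

star <- starwars

# Same formula as with mutate(), but only the new column is returned
bmi_only <- transmute(star, bmi = mass / (height / 100)^2)

names(bmi_only)
```

Newer dplyr releases document transmute() as superseded in favour of mutate() with `.keep = "none"`, but it still works as shown in the lesson.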

  • See?

  • Effectively, transmute() created my new variable and allowed me to extract it without dragging

  • everything else along as well.

  • Great.

  • Okay!

  • I will end this lesson here because otherwise I am at risk of going into way too much detail

  • about side comments I make.

  • Thanks for watching, everyone!

  • In the next lesson we will pick it up right where we left off.

  • See you there!
