
  • Let's imagine that you work for a major streaming media provider, right? So you have, I don't know, some 100 million subscribers

  • So you've got, I don't know, ten thousand videos on your site, or many more audio files, right?

  • So for each user you're going to have collected information on what they've watched, when they've watched it, how long they've watched it for, whether they went from this one to this one. Did that work? Was that good for them?

  • So maybe you've got 30,000 data points per user

  • We're now talking about trillions of data points and your job is to try and predict what someone wants to watch or listen to next

  • best of luck

  • So we've cleaned the data, we've transformed our data, everything's on the same scale, and we've joined data sets together

  • The problem is, because we've joined data sets together, perhaps our data set has got quite large now

  • Or maybe we just work for a company that has a lot, a lot of data. Certainly the general consensus these days is to collect as much data as you can, but this isn't always a good idea

  • What we want, remember, is the smallest, most compact and useful data set we can get; otherwise you're just going to be wasting CPU hours or GPU hours training on this, wasting time

  • We want to get to the knowledge as quickly as possible

  • And if you can do that with a small amount of data that's going to be great

  • So we've got quite an interesting data set to look at today based on music

  • It's quite common these days, when you're building something like a streaming service, for example Spotify, that you might want to have a recommender system

  • This is an idea where you've maybe clustered people who are similar in their tastes: you know what kind of music they're listening to, and you know the attributes of that music

  • And if you know that, you can say, well, this person likes high-tempo music, so maybe they'd like this track as well. And this is how playlists are generated

  • One of the problems is that you're going to have to produce descriptions of the audio, things like tempo and how upbeat a track is, in order to machine learn on this kind of system

  • Right, and that's what this data set is about. So we've collected a data set here today that is lots and lots of metadata on music tracks

  • These are freely available tracks and freely available data, and there's a link in the description if you want to have a look at it yourself

  • I've cleaned it up a bit already because obviously I've been through the process of cleaning and transforming my data

  • So we're going to load this now. This takes quite a long time to do, because there are quite a lot of attributes and quite a lot of instances

  • It's loaded, right? How much data is this? Well, we've got 13,500 observations, that's instances, and we've got 762 attributes
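A minimal sketch of this loading step in R (the video doesn't show the exact call, and the file name here is hypothetical):

```r
# Load the cleaned music metadata (file name is an assumption)
music.all <- read.csv("music_metadata.csv", stringsAsFactors = FALSE)

# dim() reports instances (rows) and attributes (columns)
dim(music.all)  # e.g. 13500  762
```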

  • So another way of putting this, in sort of machine learning parlance, is that we've got thirteen thousand instances and 760 features. Now, these features are a combination of things, so let's have a quick look at the columns we're looking at, so we can see what this data set is about

  • So, names of music.all: we've got some 760 features or attributes, and you can see there's a lot of slightly meaningless text here

  • But if we look at the top you'll see some actual things that may be familiar to us

  • So we've got the track ID, the album ID, the genre, right?

  • So genre is an interesting one, because maybe we can start to use some of these audio descriptions to predict what genre a piece of music is, or something like that

  • Things like the track number and the track duration

  • Then we get on to the actual audio description features. Now, these have been generated by two different libraries

  • The first is called librosa, which is a publicly available library for taking an MP3 and calculating sort of musical attributes of it

  • What we're trying to do here is represent our data in terms of attributes. An MP3 file is not an attribute, it's a lot of data. So can we summarize it in some way?

  • Can we calculate, by looking at the MP3, what the tempo is, what the amplitude is, how loud the track is, these kinds of things? This is the kind of thing we're measuring, and a lot of these are going to go into a lot of detail, down at kind of a waveform level

  • So we have the librosa features first, and then if we scroll down, after a while we get to some Echo Nest features. Echo Nest is a company that produces very interesting features on music, and actually these are the features that power Spotify's recommender system, and numerous others

  • We've got things like acousticness: how acoustic does it sound? We've got instrumentalness, and, I'm not convinced speechiness is a word, but there it is: to what extent is it speech or not speech?

  • And then things like tempo, how fast is it, and valence: how happy does it sound? A track at zero would be quite sad, I guess, and a track at one would be really happy and upbeat

  • And then of course we've got a load of features I've labeled temporal here, and these are going to be based on the actual music data themselves

  • Often when we talk about data reduction, what we're actually doing is dimensionality reduction, right?

  • One way of thinking about it: as we started, we were looking at things like attributes, and we were saying, what is the mean or the standard deviation of some attribute of our data?

  • But actually, when we start to talk about clustering and machine learning, we're going to talk a little bit more about dimensions

  • Now, in many ways the number of attributes is the number of dimensions; it's just another term for the same thing. But certainly from a machine learning background, we refer to a lot of these things as dimensions. So you can imagine if you've got some data here

  • So you've got your instances down here and you've got your attributes across here

  • So in this case, our music data, we've got each song: this is song one, this is song two, song three, and then all the attributes: the temporal attributes, the Echo Nest attributes, the tempo, and things like this

  • These are all dimensions in which this data can vary. So tracks can be different in the first dimension, which is the track ID, but they can also, down here, be different in this dimension, which is, say, tempo

  • When we say some data is seven hundred dimensional, what that actually means is that it has seven hundred different ways, or different attributes, in which it can vary. And you can imagine that, first of all, this is going to get quite big quite quickly

  • Right, seven hundred attributes seems like a lot to me

  • And depending on what algorithm you're running, it can get quite slow on this kind of size of data. And maybe this is a relatively small data set compared to what Spotify might deal with on a daily basis

  • But another way to think about this data is as actual points in a space. So we have some 700 different attributes that can vary, and when we take a specific track, it sits somewhere in this space

  • So if we were looking at it in just two dimensions, you know, track one might be over here, track two over here, and track three over here; and in three dimensions, track four might be back at the back here. You can imagine that the more dimensions we add, the further spread out these things are going to get

  • But we can still do all the same things in 700 dimensions that we can in three dimensions. It just takes a little bit longer
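To make that concrete, here's a small sketch, not from the video, of measuring the distance between two tracks as points in feature space; the formula is the same whether you have 3 numeric columns or 700:

```r
# Treat two tracks as points in feature space, using numeric columns only
num.cols <- sapply(music.all, is.numeric)
track.a <- unlist(music.all[1, num.cols])
track.b <- unlist(music.all[2, num.cols])

# Euclidean distance: square root of the summed squared differences,
# one term per dimension -- 700 dimensions just means a longer sum
sqrt(sum((track.a - track.b)^2))
```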

  • So one of the problems is that some machine learning methods don't like to have too many dimensions

  • So things like linear regression can get quite slow if you have tens of thousands of attributes or dimensions

  • So remember that perhaps the default response of anyone collecting data is just to collect it all and worry about it later. This is the point when you have to worry about it

  • What we're trying to do is remove any redundant variables. If you've got two attributes of your music, like tempo and valence, that turn out to be exactly the same, why are we using both? We're just making our problem a little bit harder

  • Now, in actual fact, Echo Nest's features are pretty good; they don't tend to correlate that strongly. But you might find, when we've collected some data at a big scale, that actually a lot of the variables are very, very similar all the time, and you can just remove some of them, or combine some of them together, and make your problem a little bit easier
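One simple way to hunt for that kind of redundancy, sketched here rather than taken from the video, is to compute pairwise correlations between the numeric features and flag pairs above some cutoff (the 0.95 is an arbitrary choice):

```r
# Correlation matrix over the numeric columns only
num.cols <- sapply(music.all, is.numeric)
cors <- cor(music.all[, num.cols], use = "pairwise.complete.obs")

# Flag distinct pairs correlated above 0.95; we could drop one of each
high <- which(abs(cors) > 0.95 & upper.tri(cors), arr.ind = TRUE)
data.frame(a = rownames(cors)[high[, 1]], b = colnames(cors)[high[, 2]])
```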

  • So let's look at this on the music data set and see what we can do

  • So the first thing we can do is remove duplicates, right? That sounds like an obvious one, and perhaps one that we could also do during cleaning, but exactly when you do it doesn't really matter, as long as you're paying attention

  • What we're going to say is music.all equals unique of music.all, and what that's going to do is find any duplicate rows and remove them. The number of rows we've got will drop by some amount. Let's see
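In R that step looks something like this (music.all follows the variable name spoken in the video):

```r
nrow(music.all)                 # row count before
music.all <- unique(music.all)  # drop rows identical in every attribute
nrow(music.all)                 # row count after: about 40 fewer here
```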


  • Actually, this is quite a slow process

  • You've got to consider that we're going to look through every single row and try and find any other rows that match

  • Okay, so this has removed about 40 rows

  • So this meant we had some duplicate tracks

  • You can imagine that things might get accidentally added to the database twice, or maybe two tracks are actually identical because they were released multiple times, or something like this

  • Now, what this is doing, the unique function, is actually finding rows that are exactly the same for every single attribute, or every single dimension. Of course, in practice, you might find that you have two versions of the same track which differ by one second; they might have slightly different attributes, but hopefully they'll be very, very similar

  • So what we could also do is have a threshold where we say: these are too similar, they're the same thing. The name is the same, the artist is the same, and the audio descriptors are very, very similar, so maybe we should just remove one of them; there's a rough sketch of that idea below
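That threshold idea isn't implemented in the video; a rough sketch might look like this, where the column names, the helper, and the tolerance are all hypothetical:

```r
# Hypothetical near-duplicate test for two one-row data frames a and b:
# same title and artist, and every audio descriptor within a tolerance
near.dup <- function(a, b, feature.cols, tol = 0.01) {
  a$title == b$title && a$artist == b$artist &&
    all(abs(unlist(a[feature.cols]) - unlist(b[feature.cols])) < tol)
}
```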

  • Now, the other thing we're going to do, just for demonstration, is focus on just a few of the genres in this data set, right, just to make things a little bit clearer for visualizations

  • We're going to select just the classical, jazz, pop and spoken-word genres, right, because these have a good distribution of different amounts in the data set

  • So we're going to run that: we're creating a list of genres, and we're going to say music is music.all anywhere the genre is in that list of genres we just produced

  • And that's going to produce a much smaller data set of 1,600 observations, with the same number of attributes or dimensions
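A sketch of that subsetting step (the exact genre labels and the genre column name are assumptions based on what's said here):

```r
# Keep only rows whose genre is in our shortlist (labels assumed)
genres <- c("Classical", "Jazz", "Pop", "Spoken")
music <- music.all[music.all$genre %in% genres, ]
dim(music)  # ~1600 rows, same number of columns
```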

  • Now, normally you would obviously keep most of your data; this is just for a demonstration. But removing genres that aren't useful to you for your experiment is a perfectly reasonable way of reducing your data size, if that's a problem

  • Assuming they've been labeled right in the first place, right? That's someone else's job

  • Let's imagine that 1,600 is still too many. Now, actually, computers are getting pretty quick, and maybe 1,600 observations is fine, but perhaps we want to remove some more

  • The first thing we could do is just chop the data off halfway and keep about half. So let's try that first of all: we're going to say music.first, that's the first few rows of our music, is rows 1 to 835 and all the columns

  • So we're going to run that, and that's even smaller, right? So we can start to whittle down our data. This is not necessarily a good idea, though
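That slicing step as a sketch (music.first echoes the name spoken in the video):

```r
# Naively keep the first 835 rows and every column -- order-dependent!
music.first <- music[1:835, ]
dim(music.first)
```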

  • We're assuming here that our genres are equally, you know, randomly sampled around our data set. That might not be true: you might have all the rock first and then all the pop, or something like that

  • If you take the first few rows, you're just going to get all the rock, right? Depending on what you like, that might not be for you

  • So let's plot the genres in the normal data set, and you can see that we've got very little spoken word, but it is there; we have some classical, international, jazz and pop in sort of roughly the same amounts

  • If we plot after we've selected the first 50 percent, you can see we've lost two of the genres: we only have classical, international and jazz, and there's hardly any jazz. That's not a good idea, so don't do that unless you know that your data is randomized
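Plots like the ones described can be produced with something along these lines (a sketch, not the video's exact code):

```r
# Genre counts before and after the naive slice
barplot(table(music$genre), main = "All 1,600 tracks")
barplot(table(music.first$genre), main = "First 835 rows only")
```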

  • So this is not giving us a good representation of genres. If we wanted to predict genre, for example, based on the musical features, cutting out half the genres seems like an unwise decision

  • So a better thing to do would be to sample randomly from the data set

  • What we're going to do is use the sample function to give us 835 random indices into this data, and then we're going to use that to index our music data frame instead. Alright, that's this line here
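That line, reconstructed as a sketch from the description (music.random is a hypothetical name for the result):

```r
# Draw 835 random row indices, then use them to subset the data frame
idx <- sample(nrow(music), 835)
music.random <- music[idx, ]
dim(music.random)
```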

  • And hopefully this will give us a better distribution. If we plot the original again, it looks like this, and you can see we've got a broad distribution

  • And then if we plot the randomized version, you can see we've…