  • Let's imagine that you work for a major streaming media provider, right? So you have, I don't know, some 100 million users.

  • You've got, I don't know, ten thousand videos on your site, or many more audio files.

  • So for each user you're going to have collected information on what they've watched, when they've watched it, how long they've watched it for, and whether they went from this one to this one. Did that work? Was that good for them?

  • So maybe you've got 30,000 data points per user.

  • We're now talking about trillions of data points, and your job is to try and predict what someone wants to watch or listen to next.

  • Best of luck.
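As a quick check on the scale described above, the arithmetic can be sketched in Python. The two figures are the rough ones quoted in the talk, not measured values.

```python
# Rough figures quoted in the talk (illustrative, not measured)
users = 100_000_000          # "some 100 million" users
points_per_user = 30_000     # "maybe 30,000 data points per user"

total_points = users * points_per_user
print(f"{total_points:,} data points")  # 3,000,000,000,000, i.e. 3 trillion
```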

  • So we've cleaned the data, we've transformed our data, everything's on the same scale, and we've joined data sets together.

  • The problem is, because we've joined data sets together, perhaps our data set has got quite large.

  • Or maybe we just work for a company that has a lot of data. Certainly the general consensus these days is to collect as much data as you can, but this isn't always a good idea.

  • What we want, remember, is the smallest, most compact and most useful data set we can get. Otherwise you're just going to be wasting CPU hours or GPU hours training on this, and wasting time.

  • We want to get to the knowledge as quickly as possible.

  • And if you can do that with a small amount of data, that's going to be great.

  • So we've got quite an interesting data set to look at today, based on music.

  • It's quite common these days, when you're building something like a streaming service, for example Spotify, that you might want to have a recommender system.

  • This is the idea where you've maybe clustered people who are similar in their tastes. You know what kind of music they're listening to and you know the attributes of that music, and if you know that, you can say: well, this person likes high-tempo music, so maybe he'd like this track as well. And this is how playlists are generated.

  • One of the problems is that you're going to have to produce descriptions of the audio, things like tempo and how upbeat it is, in order to do machine learning on this kind of system.

  • And that's what this data set's about. So we've collected a data set here today that is lots and lots of metadata on music tracks. These are freely available tracks and freely available data, and I'll put a link in the description if you want to have a look at it yourself.

  • I've cleaned it up a bit already, because obviously I've been through the process of cleaning and transforming my data.

  • So we're going to load this now. This takes quite a long time to do, because there are quite a lot of attributes and quite a lot of instances.

  • It's loaded, right? How big is this data? Well, we've got 13,500 observations, that's instances, and we've got seven hundred and sixty-two attributes.

  • Another way of putting this, in machine learning parlance, is that we've got thirteen thousand instances and 760 features. Now, these features are a combination of things, so let's have a quick look at the columns we're looking at, so we can see what this data set's about. So, names of music.all.

  • So we've got some 760 features or attributes, and you can see there's a lot of slightly meaningless text here.

  • But if we look at the top, you'll see some actual things that may be familiar to us.

  • So we've got the track ID, album ID and the genre. Genre is an interesting one, because maybe we can start to use some of these audio descriptions to predict what genre the music is, or something like that.

  • Then there are things like the track number and the track duration.

  • Then we get on to the actual audio description features. Now, these have been generated by two different libraries.

  • The first is called librosa, which is a publicly available library for taking an MP3 and calculating musical attributes of it.

  • What we're trying to do here is represent our data in terms of attributes. An MP3 file is not an attribute, it's a lot of data. So can we summarize it in some way? Can we calculate, by looking at the MP3, what the tempo is, what the amplitude is, how loud the track is? This is the kind of thing we're measuring, and a lot of these are going to go into a lot of detail, down at a waveform level.

  • So we have the librosa features first, and then if we scroll down, after a while we get to some Echo Nest features. The Echo Nest is a company that produces very interesting features on music, and actually these are the features that power Spotify's recommender system and numerous others.

  • We've got things like acousticness: how acoustic does it sound? We've got instrumentalness. I'm not convinced that's a word. Speechiness: to what extent is it speech or not speech?

  • And then things like tempo, how fast is it, and valence, how happy does it sound. A track of zero would be quite sad, I guess, and a track of one would be really happy and upbeat.

  • And then of course we've got a load of features I've labeled temporal here, and these are going to be based on the actual music data themselves.

  • Often when we talk about data reduction, what we actually mean is dimensionality reduction.

  • A way of thinking about it is this: when we started, we were looking at things like attributes, and we were asking what the mean or the standard deviation of some attribute in our data is.

  • But when we start to talk about clustering and machine learning, we're going to talk a little bit more about dimensions. In many ways, the number of attributes is the number of dimensions. It's just another term for the same thing, but certainly from a machine learning background we refer to a lot of these things as dimensions.

  • So you can imagine you've got some data here, with your instances down here and your attributes across here. In this case, our music data, we've got each song: this is song one, this is song two, song three. And then all the attributes: the temporal attributes, the Echo Nest attributes, tempo and things like this.

  • These are all dimensions in which this data can vary. So tracks can be different in the first dimension, which is the track ID, but they can also, down here, be different in this dimension, which is tempo.

  • When we say some data is seven-hundred-dimensional, what that actually means is that it has seven hundred different ways, or different attributes, in which it can vary.

  • You can imagine that, first of all, this is going to get quite big quite quickly. Seven hundred attributes seems like a lot to me, and depending on what algorithm you're running, it can get quite slow when you're running on this kind of size of data. And this is a relatively small data set compared to what Spotify might deal with on a daily basis.

  • But another way to think about this data is as points in this space.

  • We have some 700 different attributes that can vary, and when we take a specific track, it sits somewhere in this space.

  • So if we were looking at it in just two dimensions, track one might be over here, track two over here and track three over here, and in three dimensions track four might be back at the back here. You can imagine that the more dimensions we add, the further spread out these things are going to get.

  • But we can still do all the same things in 700 dimensions that we can in three dimensions. It just takes a little bit longer.
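To make the "points in a space" picture concrete, here is a minimal Python sketch (the video itself works in R). It assumes each track is a plain list of numeric attribute values. The function name and the example vectors are illustrative, not taken from the data set.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points with any number of dimensions."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two dimensions: the familiar 3-4-5 triangle
d2 = euclidean_distance([0.0, 0.0], [3.0, 4.0])

# 700 dimensions: exactly the same code, it just loops for longer
track_a = [0.0] * 700
track_b = [1.0] * 700
d700 = euclidean_distance(track_a, track_b)

print(d2, d700)
```

Two points that differ by 1 in every attribute are sqrt(n) apart in n dimensions, which is one way to see why points spread out as dimensions are added.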

  • So one of the problems is that some things, like machine learning, don't like to have too many dimensions. Things like linear regression can get quite slow if you have tens of thousands of attributes or dimensions.

  • So remember that perhaps the default response of anyone collecting data is just to collect it all and worry about it later.

  • This is the point at which you have to worry about it. What we're trying to do is remove any redundant variables. If you've got two attributes of your music, like tempo and valence, that turn out to be exactly the same, why are we using both and making our problem a little bit harder?

  • Now, in actual fact, Echo Nest's features are pretty good. They don't tend to correlate that strongly. But you might find, where you've collected some data on a big scale, that a lot of the variables are very, very similar all the time, and you can just remove some of them or combine some of them together and make your problem a little bit easier.

  • So let's look at this on the music data set and see what we can do.

  • So the first thing we can do is remove duplicates. That sounds like an obvious one, and perhaps one that we could also do during cleaning, but exactly when you do it doesn't really matter as long as you're paying attention.

  • What we're going to say is music.all equals unique of music.all, and what that's going to do is find any duplicate rows and remove them. The number of rows we've got will drop by some amount. Let's see.

  • It's thinking. Actually, this is quite a slow process.

  • You've got to consider that it's going to look through every single row and try to find any other rows that match.

  • Okay, so this has removed about 40 rows. So this meant we had some duplicate tracks.
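The exact-duplicate removal described above can be sketched in Python (the video uses R's unique function). This is a minimal illustration on a toy table; the rows and the helper name `drop_duplicate_rows` are made up for the example.

```python
def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row; rows match only if every value is equal."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row)          # rows must contain hashable values
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows

# Toy data: (track_id, genre, tempo)
tracks = [
    ("t1", "jazz", 120.0),
    ("t2", "pop", 98.5),
    ("t1", "jazz", 120.0),   # exact duplicate of the first row
]

deduped = drop_duplicate_rows(tracks)
print(len(tracks) - len(deduped), "duplicate row(s) removed")
```

Remembering rows already seen in a set avoids comparing every row against every other row, which matters once the table has many rows.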

  • You can imagine that things might get accidentally added to the database twice, or maybe two tracks are actually identical because they were released multiple times, or something like this.

  • Now, what is this doing? The unique function finds rows that are exactly the same for every single attribute, every single dimension. Of course, in practice you might find that you have two versions of the same track which differ by one second. They might have slightly different attributes, but hopefully they'll be very, very similar.

  • So what we could also do is have a threshold where we say: these are too similar, they're the same thing. The name is the same, the artist is the same and the audio descriptors are very, very similar.

  • Maybe we should just remove one of them.
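One way the threshold idea above might look, as a hedged Python sketch: two tracks count as the same if name and artist match and their audio descriptors are within some distance of each other. The field names (`name`, `artist`, `features`), the distance measure and the threshold value are all assumptions for illustration, and would need tuning on real, scaled data.

```python
import math

def is_near_duplicate(a, b, threshold):
    """Same name and artist, and audio descriptors within `threshold` of each other."""
    same_identity = a["name"] == b["name"] and a["artist"] == b["artist"]
    return same_identity and math.dist(a["features"], b["features"]) <= threshold

def drop_near_duplicates(tracks, threshold=0.05):
    """Keep a track only if it is not a near-duplicate of one already kept."""
    kept = []
    for track in tracks:
        if not any(is_near_duplicate(track, other, threshold) for other in kept):
            kept.append(track)
    return kept

# Hypothetical tracks; features are assumed to be on a common scale already
tracks = [
    {"name": "Song A", "artist": "Artist X", "features": [0.50, 0.80]},
    {"name": "Song A", "artist": "Artist X", "features": [0.51, 0.80]},  # re-release
    {"name": "Song B", "artist": "Artist X", "features": [0.50, 0.80]},
]

kept = drop_near_duplicates(tracks)
print([t["name"] for t in kept])
```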

  • So that's the other thing you could do. Just for demonstration, what we're going to do is focus on just a few of the genres in this data set, to make things a little bit clearer for visualizations.

  • We're going to select just the classical, jazz, pop and spoken-word genres, because these have a good distribution of different amounts in the data set.

  • So we're going to run that. We're creating a list of genres, and we're going to say music is music.all wherever the genre is in that list of genres we just produced.

  • And that's going to produce a much smaller data set of 1,600 observations, with the same number of attributes or dimensions.

  • Now, normally you would obviously keep most of your data in; this is just for a demonstration. But removing genres that aren't useful to you for your experiment is a perfectly reasonable way of reducing your data size, if that's a problem.

  • Assuming they've been labeled right in the first place, right? That's someone else's job.
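The genre filter described above could be sketched in Python as follows (the video does this in R). The rows, the column names and the exact genre labels are placeholders for the example, not values from the real data set.

```python
# Genres to keep; the exact label strings are an assumption for this example
wanted_genres = {"Classical", "Jazz", "Pop", "Spoken"}

# Toy stand-in for the full table
music_all = [
    {"track_id": 1, "genre": "Rock"},
    {"track_id": 2, "genre": "Jazz"},
    {"track_id": 3, "genre": "Pop"},
    {"track_id": 4, "genre": "Hip-Hop"},
    {"track_id": 5, "genre": "Classical"},
]

# Keep only the rows whose genre is in the list; the columns are unchanged
music = [row for row in music_all if row["genre"] in wanted_genres]
print(len(music_all), "->", len(music), "rows")
```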

  • Let's imagine that 1,600 is still too many. Now, actually, computers are getting pretty quick, so maybe 1,600 observations is fine, but perhaps we want to remove some more.

  • The first thing we could do is just chop the data off halfway and keep about half. So let's try that first of all. We're going to say the first music, that's the first few rows of our music, is rows 1 to 835 and all the columns. So we're going to run that, and that's even smaller. So we can start to whittle down our data.

  • This is not necessarily a good idea. We're assuming here that our genres are randomly distributed around our data set. That might not be true. You might have all the rock first and then all the pop, or something like that. If you take the first few, you're just going to get all the rock. Depending on what you like, that might not be for you.

  • So let's plot the genres in the normal data set. You can see that we've got very little spoken word, but it is there, and we have some classical, international, jazz and pop in roughly the same amounts.

  • If we plot after we've selected the first 50%, you can see we've lost two of the genres. We only have classical, international and jazz, and there's hardly any jazz. That's not a good idea, so don't do that unless you know that your data is randomized.

  • So this is not giving us a good representation of genres. If we wanted to predict genre, for example, based on the musical features, cutting out half the genres seems like an unwise decision.
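The problem with taking the first half can be reproduced on a toy list in Python. The genre counts below are invented, and the list is deliberately ordered by genre, which is the situation the talk warns about.

```python
from collections import Counter

# Toy genre column, stored genre by genre (an assumption made to show the failure)
genres = (
    ["Classical"] * 4
    + ["International"] * 4
    + ["Jazz"] * 4
    + ["Pop"] * 3
    + ["Spoken"] * 1
)

first_half = genres[: len(genres) // 2]   # "rows 1 to n/2", no shuffling

full_counts = Counter(genres)
head_counts = Counter(first_half)
missing = sorted(set(full_counts) - set(head_counts))

print("full data :", dict(full_counts))
print("first half:", dict(head_counts))
print("genres lost:", missing)
```

Comparing the two counts before and after any reduction step is a cheap way to catch this kind of bias.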

  • So a better thing to do would be to sample randomly from the data set.

  • What we're going to do is use the sample function to give us 835 random indices into this data, and then we're going to use those to index our music data frame instead. That's this line here.

  • Hopefully this will give us a better distribution. If we plot the original again, it looks like this, and you can see we've got a broad distribution. Then if we plot the randomized version, you can see we've still got some spoken. It's actually gone up slightly, but the distributions are broadly the same.

  • So this has worked exactly how we want.
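A minimal Python version of the random-index approach described above (the video uses R's sample function). The helper name `random_subset`, the seed and the toy rows are illustrative assumptions.

```python
import random

def random_subset(rows, n, seed=None):
    """Pick n distinct row indices at random, then return those rows."""
    rng = random.Random(seed)                   # seed only for reproducibility
    indices = rng.sample(range(len(rows)), n)   # n indices, no repeats
    return [rows[i] for i in indices]

# Toy stand-in for the filtered table: one integer per row
music = list(range(1670))
half = random_subset(music, 835, seed=42)
print(len(half), "rows sampled")
```

Because the indices are drawn without replacement, no row appears twice in the sample.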

  • So how you select your data, if you're trying to make it a little bit smaller, is very, very important to consider. Obviously we only had 1,600 rows here, and even the whole data set is only about 13,000 rows. You could imagine that you might have tens of millions of rows, and you've got to think about this before you start just getting rid of them completely.

  • Randomized sampling is a perfectly good way of selecting your data. Obviously it has a risk: if the distributions of your genres are a little bit off, and maybe you haven't got very much of a certain genre, you can't guarantee that the distributions are going to be the same on the way out. And if you're trying to predict genre, that's going to be a problem.

  • So perhaps the best approach is stratified sampling. This is where we try to maintain the distribution of our classes, in this case genre. So we could say we had 50% rock, 30% pop and 20% spoken, and we want to maintain that kind of distribution on the way out, even if we only sample about 50%.

  • This is a little bit more complicated in R, but it can be done.
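Stratified sampling as described above can be sketched in Python by sampling the same fraction from each class separately. This is one simple way to do it, not necessarily how the video's R code works. The helper name and the toy rows (built to match the 50/30/20 split mentioned in the talk) are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, label_of, fraction, seed=None):
    """Sample the same fraction of rows from every class, preserving class proportions."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    sample = []
    for members in groups.values():
        k = round(len(members) * fraction)   # very small classes may round to 0
        sample.extend(rng.sample(members, k))
    return sample

# Toy data: 50% rock, 30% pop, 20% spoken
rows = (
    [("rock", i) for i in range(50)]
    + [("pop", i) for i in range(30)]
    + [("spoken", i) for i in range(20)]
)

half = stratified_sample(rows, label_of=lambda r: r[0], fraction=0.5, seed=0)
counts = Counter(label for label, _ in half)
print(dict(counts))
```

The sampled rows come out grouped by class, so shuffle them afterwards if the order matters for later steps.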

  • And this is a good approach if you want to make absolutely sure that the distributions of your sample data are the same as your original data.

  • We've just looked at some ways we can reduce the size of our data set in terms of the number of instances, the number of rows. Can we make the number of dimensions, the number of attributes, smaller? Because that's often one of the problems. And the answer is yes.

  • There are lots of different ways we can do this, some more powerful and useful than others.

  • One of the ways we can do this is something called correlation analysis.

  • A correlation between two attributes basically tells us that when one of them increases, the other one either increases or decreases, in general, in relation to it. So you might have some data like this: we've got attribute one, and we might have attribute two, and they sort of look like this.

  • These are the data points for all of our different data. Obviously we've got a lot of data points, and you can see that, roughly speaking, they increase in this sort of direction here.

  • Now it might be that this correlation is very, very strong. So basically attribute two is a copy of attribute one, more or less. Maybe it doesn't make sense to have attribute two in our data set. Maybe we can remove it without too much of a problem.

  • What we can do is something called correlation analysis, where we pit all of the attributes against all of the other attributes. We look for high correlations and we decide ourselves whether to remove them. Sometimes it's useful just to keep everything in and try not to remove them too early.
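The all-against-all comparison described above can be sketched in plain Python using the Pearson correlation coefficient. The attribute names, the values and the 0.95 cut-off are illustrative assumptions. Deciding what to do with a flagged pair is still left to the analyst.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric columns.
    Assumes neither column is constant (otherwise the denominator is zero)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def highly_correlated_pairs(columns, threshold=0.95):
    """Compare every attribute with every other and report strongly correlated pairs."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:   # strong positive or strong negative
            pairs.append((name_a, name_b, r))
    return pairs

# Toy columns (hypothetical values)
columns = {
    "attribute_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "attribute_2": [2.1, 3.9, 6.2, 8.0, 9.8],   # roughly 2 x attribute_1
    "attribute_3": [3.0, 1.0, 4.0, 1.0, 5.0],   # no clear relationship
}

flagged = highly_correlated_pairs(columns)
for name_a, name_b, r in flagged:
    print(f"{name_a} vs {name_b}: r = {r:.3f}")
```

With several hundred attributes there are hundreds of thousands of pairs, so in practice a numerical library would normally compute the whole correlation matrix at once.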

  • But on the other hand, if you've got a huge amount of data and your correlations are very high, this could be one way of doing it.

  • Another option is something called forward or backward attribute selection. This is the idea that maybe we have a machine learning model or clustering algorithm in mind. We can measure the performance of that, and then we can remove features and see if the performance remains the same, because if it does, maybe we didn't need those features.

  • So what we could do is train our model on, let's say, a 720-dimensional data set, get a certain level of accuracy and record that. Then we could try it again, removing one of the dimensions and training on 719, and maybe the accuracy is exactly the same, in which case we can say: well, we didn't really need that dimension at all. And we can start to whittle down our data set this way.

  • Another option is forward attribute selection.

  • This is where we literally train our machine learning on just one of the attributes, and then we see what our accuracy is, and we keep adding attributes in and retraining until our performance plateaus and we can say: you know what, we're not gaining anything now by adding more attributes.
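Here is a hedged Python sketch of forward attribute selection as described above. Two things are assumptions rather than the talk's method: the "model" is a tiny leave-one-out nearest-neighbour classifier standing in for whatever model you actually have in mind, and at each step the sketch adds whichever remaining attribute scores best (a common greedy variant). The toy data is invented so that only the first attribute carries information about the class.

```python
def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier that only
    looks at the given feature indices (a stand-in for any model)."""
    if not features:
        return 0.0
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[f] - rows[j][f]) ** 2 for f in features),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

def forward_selection(rows, labels, n_features, score=loo_accuracy):
    """Add one attribute at a time; stop when the score no longer improves."""
    selected, best = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        scored = [(score(rows, labels, selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best:          # performance has plateaued
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best

# Toy data: feature 0 separates the classes, features 1 and 2 are noise
ROWS = [
    [1.0, 5.0, 0.3], [1.2, 1.0, 0.9], [0.9, 9.0, 0.1], [1.1, 3.0, 0.5],
    [3.0, 4.0, 0.2], [3.2, 8.0, 0.8], [2.9, 2.0, 0.4], [3.1, 6.0, 0.6],
]
LABELS = ["a", "a", "a", "a", "b", "b", "b", "b"]

selected, best = forward_selection(ROWS, LABELS, n_features=3)
print("kept features:", selected, "accuracy:", best)
```

Each step retrains once per remaining attribute, so the cost grows quickly with the number of attributes.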

  • Obviously, there's the question of which order you try the attributes in. Usually randomly.

  • So what you would do, for example for backward attribute selection, is train on all the data, then take one attribute out at random. If your performance stays the same, you can leave it out. If your performance gets much worse, you put it back in and you don't try that one again.

  • And you try a different one, and you slowly start to take dimensions away and hopefully whittle down your data.
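The backward procedure just described might look like this in Python. As a stand-in for a real model, the sketch scores a feature set with a small leave-one-out nearest-neighbour classifier. That scorer, the rule "leave it out if the score does not drop", and the toy data are all assumptions made for illustration.

```python
import random

def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier that only
    looks at the given feature indices (a stand-in for any model)."""
    if not features:
        return 0.0
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[f] - rows[j][f]) ** 2 for f in features),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

def backward_selection(rows, labels, n_features, score=loo_accuracy, seed=None):
    """Start with every attribute, try removing each once in random order,
    and leave an attribute out only if the score does not get worse."""
    rng = random.Random(seed)
    selected = list(range(n_features))
    best = score(rows, labels, selected)
    order = selected[:]
    rng.shuffle(order)                     # random order, as in the talk
    for f in order:
        candidate = [g for g in selected if g != f]
        s = score(rows, labels, candidate)
        if s >= best:                      # no loss: leave the attribute out
            selected, best = candidate, s
        # otherwise put it back (keep `selected`) and do not try it again
    return selected, best

# Toy data: feature 0 separates the classes, features 1 and 2 are noise
ROWS = [
    [1.0, 5.0, 0.3], [1.2, 1.0, 0.9], [0.9, 9.0, 0.1], [1.1, 3.0, 0.5],
    [3.0, 4.0, 0.2], [3.2, 8.0, 0.8], [2.9, 2.0, 0.4], [3.1, 6.0, 0.6],
]
LABELS = ["a", "a", "a", "a", "b", "b", "b", "b"]

selected, best = backward_selection(ROWS, LABELS, n_features=3, seed=1)
print("kept features:", selected, "accuracy:", best)
```

Because each attribute is tried only once, the result can depend on the random order; running it with a few different seeds shows how stable the selection is.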

  • Let's have a quick look at correlation analysis on this data set. You might imagine that if we're calculating features based on the MP3 from librosa or Echo Nest, maybe they're quite similar a lot of the time and maybe we can remove them.

  • Let's have a quick look. We're just going to focus on one set of librosa features, just for simplicity. So we're going to select only the attributes that contain this chroma kurtosis field, which is one of the attributes that you can calculate using librosa.

  • So I'm going to run that. We're going to rename them, just for simplicity, to Kurt one, Kurt -