  • You came back.

  • Welcome to part two of the Deep Learning with Python and PyTorch tutorials. In this video, we're gonna be talking about data, and really, that's gonna be it.

  • The actual input to our neural network.

  • Now, I think a lot of times everyone wants to focus on the neural network itself.

  • But actually, the steps of acquiring data, preprocessing your data, and how you're going to iterate over your data pretty much consume, I would say, 90% of your time and energy.

  • Uh, at least in terms of thinking about your model.

  • Obviously, training time, when you're just waiting for the model to learn stuff, can take a long time.

  • But in terms of the work that you're going to be putting in, this is the step where you're gonna be doing probably the majority of your work.

  • So to begin, what we're gonna do is work with a kind of toy data set.

  • It's gonna be MNIST.

  • This is a really popular data set to use for beginners, one, because it's a machine-learnable data set; it's really simple.

  • There's no question we can definitely learn this, and we can tinker with it.

  • Um, it's just a good starting data set.

  • So that's what we're gonna use. To get it, we're going to use a package called torchvision, which you should already have installed.

  • If you haven't, just pip install it.

  • That should just be pip install torchvision, and then you can get it. So, fun fact.

  • I also tried to look up how to change the font size in JupyterLab.

  • Uh, I found so many responses about various things you could do.

  • None of them worked.

  • So I'm just curious, for my future videos; by the time anybody tells me, this video will be over.

  • But, like, what would I need to change?

  • What do I need to modify to change just the font size?

  • So I want to zoom in.

  • I even tried editing some of the CSS files that dictate JupyterLab, and it didn't work. Okay, anyway, totally baffled.

  • So please, if you actually know a way, don't just link me to a post.

  • If you actually know a way, let me know.

  • So, on to the tutorial. You guys have got to give back to me sometimes.

  • Okay.

  • So what we're gonna do first is we're gonna import torch, and then we're going to import torchvision, and... I just hit my microphone here.

  • Uh, the other thing we're gonna do is, from torchvision, we're going to import transforms and datasets.

  • Let's make sure that import works.

  • At least it does. Cool.
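
A minimal sketch of those imports, assuming torch and torchvision are installed:

```python
import torch
import torchvision
from torchvision import transforms, datasets
```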

  • So what is torchvision?

  • So torchvision comes with a bunch of data sets, and I can't remember now if there are data sets in torchvision that aren't for vision tasks.

  • But basically, it's a collection of data that is used for vision.

  • So most training data sets with neural networks have something to do with vision just because vision seems to be the big thing that we benchmark against.

  • I wish there were more data sets for other tasks.

  • I mean, there are some, but vision is clearly the main interest that people are working on, because really, neural networks are sort of solving vision tasks, and other machine learning algorithms just haven't been able to do that.

  • So just in terms of money from investments and business interests, it tends to be vision tasks, as that's a low-hanging fruit that we can do something with right now.

  • Um, but obviously there are other tasks as well.

  • Like, advertising is a huge market where we want to be able to predict what people are going to do, and funneling and all that.

  • But anyways, um, torchvision just has a bunch of vision data for us, and as you're going to see as we continue on the series, this is basically cheating, to use their built-in data.

  • Like I said, most of your time is going to be spent getting data, preparing your data, formatting it in such a way that it's gonna work with a neural network, right?

  • Because, again, you gotta convert things like categories, or, you know, words have to be converted to numbers, all that kind of stuff.

  • So you're gonna spend a long time doing that kind of thing here.

  • Here, it's already done for us, and then the other thing we typically need to do is batching, which we'll talk about in a minute, and again, it's gonna be done for us. Now, I will say, not in the next tutorial, because we'll probably still be working on this.

  • But in the next kind of model that we build, the next data set, we're going to use completely our own data set.

  • Because I know this isn't what most people want to do.

  • Most people are gonna want to probably apply this to their own problem.

  • We're going to do this so it makes other things easier to learn.

  • And like I said, we're going to slowly increase complexity as we go through the series. So anyways, transforms and datasets.

  • And so now what we're gonna do is we're gonna define our two major data sets.

  • And when it comes to just machine learning in general, you're gonna have two different things.

  • You're gonna have a training data set, and then you're gonna have a testing data set, and it's important that you separate these out as soon as possible, before you forget, because in order to validate your data, or your model rather, you want to have what's called out-of-sample testing data.

  • This is the most realistic test that we can have.

  • Basically, it's data that has never been seen before by your machine, because if you use in-sample data, and the machine has learned to overfit, it will do really well on your in-sample data and very poorly on your out-of-sample data.

  • So it's essential that we have truly out-of-sample data, meaning data that the machine has simply never seen before.

  • Because when you actually go to use this model in practice, it's going to be, you know, out of sample, right?

  • It's gonna be new data that it has never seen, and you're hoping that it performs well. And with neural networks, like I said before, because you've got millions of trainable, tunable parameters, it is highly likely that if you train long enough, your machine will overfit.

  • So as we progress through the series, we'll be talking a little bit more about how long to train, what stats to watch, and stuff like that.

  • But anyways, you need train and testing data.

  • So for this, I am just going to write it out. This is a bunch of stuff, but basically we're just working with this datasets stuff.

  • So we're gonna say datasets.MNIST, capital MNIST, and then you're gonna specify where you want the data to go.

  • I just want it to go locally.

  • So I'm gonna do open and close quotes there.

  • And we're going to say this is train=True, download=True.

  • And then any transforms, so transform=transforms.Compose, capital C.

  • And then in here you would put all the transforms.

  • These are things we want to apply to the data, and I don't know why, but even though we're grabbing it from torchvision, the data is not natively already in tensors, so we have to convert it to tensors.

  • So one of the transforms that we're going to apply here is going to be transforms.ToTensor().

  • Uh, this one is already built into torchvision, obviously, because we imported it here.

  • But later on, if you really enjoy this, you can actually write your own data set, and you can use this same kind of syntax, because as you'll see, there are a lot of things that, especially in this tutorial, won't become as obvious.

  • But in the coming tutorials, it will start to become pretty obvious how tedious iterating over a data set, for example, can be. I almost think it would be just as tedious to convert it to one of these datasets as well.

  • But anyway, just know that you can do that.

  • And then you can also write your own kind of custom transforms.
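
For reference, a custom dataset follows the same pattern by subclassing torch.utils.data.Dataset; this is just a hedged sketch with hypothetical names (MyDigits, images, labels), not code from this series:

```python
from torch.utils.data import Dataset

class MyDigits(Dataset):
    """Hypothetical custom dataset wrapping tensors of images and labels."""
    def __init__(self, images, labels, transform=None):
        self.images = images        # e.g. a float tensor of shape [N, 28, 28]
        self.labels = labels        # e.g. an int tensor of shape [N]
        self.transform = transform  # any callable, built in or custom

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        x, y = self.images[idx], self.labels[idx]
        if self.transform:
            x = self.transform(x)   # custom transforms plug in here
        return x, y
```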

  • But for now, we'll just use the one that's built in here, and I believe that's a valid line.

  • So, actually, I'm going to go ahead and take this, copy, paste; train will be set to False.

  • Do we still want to download?

  • Uh, I think we're good.
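
Putting that together, the two dataset definitions described here would look roughly like this, downloading MNIST into the current directory:

```python
from torchvision import transforms, datasets

# Training and testing splits of MNIST, downloaded locally and converted to tensors
train = datasets.MNIST("", train=True, download=True,
                       transform=transforms.Compose([transforms.ToTensor()]))
test = datasets.MNIST("", train=False, download=True,
                      transform=transforms.Compose([transforms.ToTensor()]))
```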

  • So let's go ahead and run that real quick and you can see it's downloading.

  • That might take a moment.

  • Hopefully, that won't take too long.

  • And now, while that's going, I think we'll just start writing, uh, writing our next two lines.

  • So basically, this gets us our data, um, downloaded and kind of in a sort of variable here.

  • The next thing we want to do is actually load this into another type of object that's going to help us iterate over that data.

  • So let me write these two lines, and then I'll explain why we're doing that, because it might not seem obvious why. Okay, we've got our training and testing data.

  • Why do we need to talk about how to iterate over it?

  • So the next thing I'm just going to go ahead and write here is... oh, I did not think it was going to do that.

  • That's kind of funny.

  • Anyways, it's done.

  • So we're gonna say trainset.

  • So this will be our training data set.

  • We're going to say that is equal to datasets.M... uh, no wait, that would be re-grabbing the data.

  • Anyway, torch.utils.data.DataLoader, and what we want to do is load in train.

  • So we're just gonna load in the data that's currently assigned to, or currently is, this train variable here.

  • So we load that in and then we specify what batch size we want.

  • I'll talk about what that is here in a moment; I'm gonna say 10, just so it's totally easy for everybody to do this.

  • And then shuffle will be equal to True.

  • And we'll talk about shuffle as well.

  • So first, let me just copy paste this line.

  • We're gonna do that same thing for testset, so, test... and yes.
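
The two DataLoader lines being described would look something like this, assuming the train and test datasets from the earlier cell:

```python
import torch

# Wrap the datasets so we can iterate over them in shuffled batches of 10
trainset = torch.utils.data.DataLoader(train, batch_size=10, shuffle=True)
testset = torch.utils.data.DataLoader(test, batch_size=10, shuffle=True)
```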

  • Okay, so we've separated these out. First: batch size.

  • This is how many items at a time we want to pass to our model.

  • So, in theory, some of these data sets... like, this one's not really that big of a data set.

  • You could probably... so, like, if I open that up, this is what it is, MNIST right here.

  • Um, I don't even know what those two are, but I bet if we just go to properties, we could see... ah, yep.

  • So it's only 116 megabytes.

  • So it definitely fits into your RAM.

  • Almost certainly it would fit onto any GPU you'd be able to, you know, use a GPU with.

  • So definitely we could fit this entire data set through our model in one go.

  • But that's not very practical, because where deep learning really starts to shine over other traditional machine learning algorithms is when we have huge data, so millions, or like tens of millions or hundreds of millions or even billions of samples.

  • At some point, it does get to the point where there are more samples than you can fit into your memory.

  • So that's one reason that you have to use batches of data.

  • So rather than doing the whole data set, we're gonna feed 10 items at a time through our model, and the model will be optimized each step of the way, in little tiny increments, based on those 10 samples at a time as they get fed through.

  • And in general, a common batch size is gonna be anything from, like, I don't know, 8 to 64.

  • Everybody really likes to use base eight numbers.

  • You're going to see that a lot.

  • People ask why this?

  • Why that? Uh, it's just... well, first of all, the base eight is just for fun; like, there's really no reason for it.

  • It's just everybody tends to do it, so just expect it.

  • And then, in terms of, like, how many neurons per layer, it's always trial and error.

  • And we'll talk a little bit more about how you might at least pick a starting one, and then how you might iterate through this trial-and-error type process.

  • But it's basically a gradient descent operation that you're gonna be doing yourself as well. So anyway, more on that later.

  • But so, one reason we have to batch is that our data is just gonna be so big that we probably can't, in any realistic example, fit it all on the GPU.

  • The second reason is because we hope this data will generalize.

  • So if you take your entire data set and you pass it through the model, what's gonna happen is, the model starts to optimize all those little weights and all those little connections.

  • Remember, there are millions of these.

  • If you pass your whole data set at once, your machine is gonna maybe learn some generalizations.

  • But it's also probably gonna be able to figure out some weights that are just arbitrary.

  • The machine doesn't know: when I tweak this and it improves the loss, is that because this is generally true, or is it just because this is in-sample data?

  • The machine has no way to know that.

  • So instead, what we do is pass through batches at a time, and each time it optimizes, the optimizations that stick around are generally going to be the general ones, the generalizations, the correct things it assumed; those stick around, and then the actual overfitment cases kind of just get erased out when we're working with batches.

  • So it's not the case that you... sorry, punching my mic... it's not the case that you want to have the biggest batch size you possibly can.

  • There's always kind of a sweet-spot batch size, and like I said, it's usually between 8 and 64, regardless of how big your memory is, though sometimes you're going to go even bigger than that.

  • And the reason why, with your batch size, you do want it to be as big as possible is because generally that's going to impact how quickly you can train through your data.

  • Um, but anyways, a lot of information there on batch size. But again, there are not that many things that you get to tweak as the engineer here, so that's another thing that you get to tweak.

  • And it's another thing that has general rules, but for the most part, you're just going to trial-and-error it again.

  • I tend to just go with the biggest batch size I can get away with, because that helps training time.

  • Finally shuffle.

  • You definitely want to shuffle. The reason why is, for example, in our case, we're doing MNIST, which I realize now I haven't even explained.

  • But MNIST is a hand-drawn numbers data set.

  • So it's hand-drawn digits from 0 to 9.

  • So it'll just be one digit, and it's a 28 by 28 image of this hand-drawn digit.

  • So all the nines look a little bit different, but obviously they're all nines, and so on.

  • If you took a neural network and you fed through a bunch of zeros first, the neural network would learn to optimize and say everything is a zero, and then when you got to your ones, it would be like, wait, wait.

  • Everything's a one, and so on, until you got to your nines, and it would end by saying everything is a nine. Whereas if you shuffle... again, the name of the game is generalization.

  • So we want to do everything we can do to give the neural network the opportunity to learn general principles rather than just simply figuring out little tricks.

  • Because if there is a quicker route to decreasing loss, really, the neural network is going to take that route.

  • And so you have to constantly be thinking, how can I better avoid overfitment?

  • So anyway, cool.

  • So we've talked about trainset, batch size, shuffle=True. Cool.

  • So once we've done that... again, in most cases, even here, like training and testing, you would have to do that split all on your own.

  • You have to shuffle your data all on your own.

  • You have to somehow batch through your data, all on your own.

  • Again, we will be doing that all on our own soon.

  • But first I figured, Hey, let's just talk about it.

  • Let's keep it as simple as possible and then we'll slowly ratchet up the complexity.

  • So let's go ahead and run this cell. Cool.

  • And now, how do we actually iterate over this data?

  • So let's just use trainset, for example.

  • So it's actually super simple for us to iterate.

  • So we're gonna say for data in trainset.

  • Uh, let's just print data, and then we're gonna break, because we don't want to run over all of them; we just run once.
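
A sketch of that loop, which prints the first batch and then stops:

```python
# Grab a single batch from the DataLoader and look at it
for data in trainset:
    print(data)   # a list of two tensors: a batch of images and a batch of labels
    break
```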

  • So what you have here is, uh, it's that entire batch.

  • So it'll be 10 examples of handwritten digits and then 10 tensors of the actual output.

  • So the first example should be a three, the second one a seven, then a one, a six, and so on.

  • So the way that we can actually confirm that is by saying X comma y.

  • So this might be a little unclear, but remember, you guys are beyond the basics.

  • So in a for loop, whatever the last temporary variable was in the for loop can still be accessed.

  • So in this case, data.

  • So we're just gonna come down here to a new line; we're gonna say X, y = data.

  • And then again, we want to access the zeroth element.

  • And I'm trying to think here, uh, it should be the zeroth tensor.

  • I just wanna make sure I'm right here.

  • So yes. So as you iterate over here, data is a tensor object.

  • It's a tensor object containing, first, a tensor of tensors that is your images.

  • And then, second, a tensor of tensors that are your labels.

  • So the way that we can reference a three, for example, would be data[0], because that is your images.

  • And then we'll say the zeroth image.

  • So this should be an image of, uh, a three.

  • Uh, and then we would say, for the y, that would be data[1], right? So these are tensors, and we want the zeroth tensor.

  • So now let's just print.

  • Um, let's print y real quick and make sure.

  • Yeah.

  • Okay, cool.
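
Pulled apart, that access looks roughly like this; the printed label values depend on the shuffled batch you happen to get:

```python
X, y = data          # X: batch of 10 image tensors, y: batch of 10 labels
print(y)             # the 10 labels in this batch
print(y[0])          # label of the first image, e.g. tensor(3)
```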

  • So the next thing that we can do is we can visualize this and make sure that we are correct, so we can import matplotlib.pyplot as plt.

  • If you don't have this, just pip install matplotlib. And again, you guys aren't at the basics, I assume, but you would type that into your terminal, not into the Jupyter notebook. Anyway, uh, and then we're gonna say plt.imshow, and we want to imshow data[0][0]. Now, here's an issue, and it's not with my typing.

  • Um, I'm gonna make a new cell real quick.

  • Let's print data[0][0].

  • Boom.

  • Um, this one... actually, that's curious.

  • Is it because... let's do data[0][0].shape. Cool.
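
The shape check being described:

```python
print(data[0][0].shape)   # torch.Size([1, 28, 28]), note the leading 1
```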

  • So, as you can see, here is an immediate reason why, if you just did tutorials and only used torchvision, you would immediately, out of the gate, be like, wait, what?

  • Because as soon as you tried to do your own data set... notice this shape: it's a 1 by 28 by 28.

  • That is not a typical image.

  • So if you just loaded in an image and converted it to grayscale and then converted it to a tensor, it wouldn't be a 1 by 28 by 28.

  • It would be a 28 by 28.

  • There would be no 1, but PyTorch wants that 1 there.

  • Right?

  • So that would be just one immediate hurdle that you'd come across. And again, these are the things you have to start thinking about when you want to start feeding data through a neural network.

  • The shaping stuff, uh, throws a lot of people off really quickly.

  • So I'm gonna do my best to make it clear when we do what each number in a shape actually means.

  • Even the TensorFlow and PyTorch docs are all the same.

  • A lot of times they throw these shapes at you, and they don't explain where that number came from.

  • They just, like, pull it out of nowhere.

  • And it's really frustrating when you have no idea how they got it.

  • Uh, and I'll talk about that when we get there. Soon... one day, one day we'll get there.

  • Um, but yeah, that can make learning very difficult.

  • So anyways, this one's a pretty easy one.

  • You would just have that one.

  • But if we want to show an image of it, uh, we can plt.imshow this, and we would end it with plt.show. Let's do it.

  • We should get an error.

  • Yeah, and it basically just says, hey, that's not a valid shape for an image.

  • We actually need to reshape it; we're gonna say .view, because remember, that's basically like reshape, but it's not in place. It'll work here, though.

  • So we're just saying, hey, show the image of that data as a 28 by 28. Boom.

  • It's a three.

  • So sure enough, um, that actual shaping works.
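
The matplotlib cell being described, reshaping the 1 by 28 by 28 tensor to 28 by 28 before plotting:

```python
import matplotlib.pyplot as plt

# .view reshapes the [1, 28, 28] tensor to [28, 28] so imshow will accept it
plt.imshow(data[0][0].view(28, 28))
plt.show()
```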

  • So, uh, let me think Here.

  • The next thing is, I'm trying to decide if I want to...

  • I think we'll just talk about iterating over our data and then possibly balancing.

  • Chances are this data is already balanced, is my guess. Again, this is a thing that is super convenient, huh?

  • But as soon as you try a real-world data set, it's not gonna work out.

  • So, what is balancing?

  • So remember, like I said, if the model can find a shorter path to decreasing loss, it will take it; the model doesn't know, it really doesn't.

  • The model doesn't have a priori knowledge of what the lowest loss could be.

  • So as the optimizer is trying to decrease our loss, it doesn't know how good we could get, right?

  • It has no idea.

  • It's just going to try to decrease loss as best as possible, and generally as easily as possible.

  • So what's going to happen if, in our data set, right, our data set of zeros, ones, all the way through nines...

  • What if... let's make an easier example to look at; it doesn't really matter. What if 60% of our data set is the number three, like it's threes, and then the rest are like 5% ones and 2% nines, and so on?

  • Our model is going to immediately learn the quickest way to decrease loss, because loss is really just measured on that output, the output from the neural network. It's going to adjust weights in such a way that it can, as quickly as possible, always predict a three, and then it's gonna get stuck, because it can't figure out how to get out of this hole.

  • Because any way out of that hole means it's got to get a lot worse before it can get better again.

  • And it's gonna be stuck there forever, like you're not gonna be able to train out of that hole.

  • So, yet another thing you have to think about when you go to input data into the neural network: besides, you know, shuffling and all this stuff and batches and so on, you gotta make sure your data is balanced.

  • Now, there are ways to get around imbalanced data sets by modifying the weights of specific classes when calculating loss.

  • But I have never had that work out for me; there are resources that suggest you could get away with that somehow.

  • It has never worked for me.

  • You generally want your data set to be as balanced as possible.
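
For completeness, the class-weighting idea mentioned above usually means handing per-class weights to the loss function; this is a hedged sketch with made-up weight values, not something used in this series:

```python
import torch
import torch.nn as nn

# Hypothetical per-class weights: under-represented classes get larger weights
class_weights = torch.tensor([0.5, 0.5, 1.0, 0.2, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0])
loss_function = nn.CrossEntropyLoss(weight=class_weights)
```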

  • So one way that we can kind of, you know, at least confirm a data set's balance is, for example, we can just make a counter.

  • So I'm gonna say counter_dict.

  • Um, there's probably a better way to do this.

  • Actually, we could just make a list and then use Counter.

  • Mmm.

  • Part of me wants to do that.

  • I can't really decide if I want to do it this way or that way.

  • We'll just continue this way, because I actually want a percentage; that will be easier to read.

  • So we're gonna say 0: 0, and we'll go all the way to, like, nine.

  • So 1, 2, 3, 4, 5, 6, 7, 8, 9. We'll make this... 1, 2, 3, 4, 5, 6, 7... did I really not do the math?

  • Okay.

  • Cool.

  • Nine.

  • Okay, so now what we're gonna do is we're gonna iterate over all of this data.

  • So for data in trainset, we're gonna say the Xs, capital X, let's just stick with that syntax, equals data.

  • So we've set that, because remember, data is not a tuple, but we can treat it like one.

  • Right? It has two sets of tensors.

  • So now we've got Xs and ys stored in, uh, variables.

  • Then we're going to say for y in ys, counter_dict, and then we're gonna convert it to an int, int(y), because we're trying to use it as a dictionary key here, and then we're just gonna say += 1.

  • So as we iterate over all of our data, all we're gonna do is add to these counters, so we can kind of see where we are in terms of how many samples we have for each number.

  • And then once we've done that, we can just print counter_dict. So yeah, we totally could have done from collections import Counter, and then just made, like, a big list and then made a Counter object off that list.

  • But anyway... ah, yes.

  • Zero is 5000-something, basically 6000, and up to almost 7000 for a one.

  • And so on.

  • Ah, one thing that we could do, and it's always a good thing to do: for each y, let's also do a total, so total += 1.

  • Uh, and then what I'll say down here is: for i in counter_dict.

  • Print... we'll do an f-string.

  • Um, we'll say i, and then a colon, and then we'll say counter_dict[i] divided by however many total we have, and then we'll just multiply, maybe multiply that by 100.

  • Where did this parenthesis come from?

  • I have no idea.

  • When did that get introduced? Okay, cool.
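
Put together, the balance-checking cell being narrated looks roughly like this:

```python
total = 0
counter_dict = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}

# Count how many of each digit appears in the training data
for data in trainset:
    Xs, ys = data
    for y in ys:
        counter_dict[int(y)] += 1
        total += 1

print(counter_dict)

# Print each digit's share of the data as a percentage
for i in counter_dict:
    print(f"{i}: {counter_dict[i] / total * 100.0}%")
```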

  • Yeah.

  • Okay, so this gives us the percentage kind of distribution.

  • So, as you can see, the number one is by far the most common... well, not by far, but it's 11% versus, let's say, the lowest at 9%.

  • So in this case, this is probably balanced enough; you're probably not gonna have any issue, especially because... so the quickest way to an optimization, at best, if it predicted everything is a one, would give it 11%, whereas random is going to be 10%.

  • So chances are it's going to be able to deal with that kind of slight imbalance. They're not all perfectly 10%.

  • But the optimizer is gonna be able to work around that.

  • But if one was, for example, 45% or something like that, it probably would not be able to work around that kind of data.

  • So, okay, uh, I think that's enough for this tutorial. In the next tutorial, we'll actually build the neural network and talk about all that.

  • But honestly, data is more important than the neural network.

  • So, uh, yeah, um quick shoutout to some Channel members who have been with me now for a month.

  • We've got Liam ends be Moussa Kurt Lauren to unique does Dylan di, Tim Gettings and auto Capex Key.

  • Thank you guys very much for your support.
