
  • You came back.

  • Welcome to part two of the Deep Learning with Python and PyTorch tutorials. In this video, we're gonna be talking about data, and really, that's gonna be it.

  • The actual input to our neural network.

  • Now, I think a lot of times everyone wants to focus on the neural network itself.

  • But actually, the step of acquiring data, preprocessing your data, and how you're going to iterate over your data pretty much consumes, I would say, 90% of your time and energy.

  • Uh, at least in terms of thinking about your model.

  • Obviously, training time, when you're just waiting for the model to learn stuff, can take a long time.

  • But in terms of the work that you are going to be putting in, this is the step where, uh, where you're gonna be doing probably the majority of your work.

  • So to begin, what we're gonna do is work with a kind of toy data set.

  • It's gonna be MNIST.

  • This is a really popular data set to use for beginners, one, because it's a machine-learnable data set, like it's really simple.

  • There's no question we can definitely learn this, and we can tinker with it.

  • Um, it's just a good starting data set.

  • So that's what we're gonna use, and we're going to use a package called torchvision, which you should already have installed.

  • If you haven't, just pip install it.

  • That should just be pip install torchvision, and then you can get it. So, fun fact:

  • I also tried to look up how to change the font size in JupyterLab.

  • Uh, I found so many responses about various things you could do.

  • None of them worked.

  • So I'm just curious for my future videos; by the time anybody tells me, it'll probably be over.

  • But, like, what would I need to change?

  • What do I need to modify to change just the font size?

  • So I want to zoom in.

  • I even tried editing some of the CSS files that dictate JupyterLab's styling, and it didn't work. So, okay, anyway, totally baffled.

  • So please, if you actually know a way, don't just link me to a post.

  • If you actually know a way, let me know.

  • So, in return for the tutorial, you guys have got to give back to me sometimes.

  • Okay.

  • So what we're gonna do first is we're gonna import torch, and then we're going to import torchvision. And I just hit my microphone here.

  • Uh, the other thing we're gonna do is, from torchvision,

  • we're going to import transforms and datasets.

  • Let's make sure that import works.

  • At least it does. Cool.

  • So what is torchvision?

  • So torchvision comes with a bunch of data sets, and I can't remember now if data sets exist in torchvision that aren't for vision tasks.

  • But basically, it's a collection of data that is used for vision.

  • So most training data sets with neural networks have something to do with vision just because vision seems to be the big thing that we benchmark against.

  • I wish there were more data sets for other tasks.

  • I mean, there's some, but vision is clearly, like, the main interest that people are working with, because really, neural networks are sort of solving vision tasks, and other machine learning algorithms just haven't been able to do that.

  • So just in terms of money from, like, investments and business interests, it tends to be vision tasks, since that's, like, the low-hanging fruit that we can do something with right now.

  • Um, but obviously there are other tasks as well, especially when it comes to things like advertising.

  • Advertising is a huge market where we want to be able to predict what people are going to do, uh, and funneling and all that.

  • But anyways, um, torchvision just has a bunch of vision data for us, and as you're going to see as we continue on the series, this is basically cheating, to use their built-in data.

  • Like I said, most of your time is going to be spent getting data, preparing your data, and formatting it in such a way that it's gonna work within your neural network, right?

  • Like, cause again, you gotta convert things like categories, or, you know, words, to numbers, all that kind of stuff.

  • So you're gonna spend a long time doing that kind of thing; here, it's already done for us.

  • And then the other thing we typically need to do is batching, which we'll talk about in a minute, and again, it's gonna be done for us. Now, I will say, not in the next tutorial, because we'll probably still be working on this,

  • but in the next kind of model that we build, the next data set, we're going to use completely our own data set.

  • Because I know this isn't what most people want to do.

  • Most people are gonna want to probably apply this to their own problem.

  • We're going to do this so it makes other things easier to learn.

  • And like I said, we're going to slowly increase complexity as we go through the series. So anyways, transforms and datasets.

  • And so now what we're gonna do is we're gonna define our two major data sets.

  • And when it comes to just machine learning in general, you're gonna have two different things.

  • You're gonna have a training data set, and then you're gonna have a testing data set. And it's important that you separate these out as soon as possible, before you forget, because in order to validate your model, you want to have what's called out-of-sample testing data.

  • This is the most realistic test that we can have.

  • Basically, it's data that has never been seen before by your machine, because if you use in-sample data, and the machine has learned to overfit, it will do really well on your in-sample data and very poorly on your out-of-sample data.

  • So it's essential that we have truly out-of-sample data, meaning data that the machine has just simply never seen before.

  • Because when you actually go to use this model in practice, it's going to be, you know, out of sample, right?

  • It's gonna be new data that it has never seen, and you're hoping that it performs well. And with neural networks, like I said before, because you've got millions of trainable, tunable parameters, it is highly likely that if you train long enough, your machine will overfit.

  • So as we progress through the series, we'll be talking a little bit more about how long to train, what stats to watch, and stuff like that.

  • But anyways, you need train and testing data.

  • So for this, I am just going to write it out. This is a bunch of stuff, but basically we're just working with this datasets stuff.

  • So we're gonna say datasets.MNIST (capital MNIST), and then you're gonna specify where you want the data to go.

  • I just want it to go locally.

  • So I'm gonna do open and close parentheses with empty quotes.

  • And we're going to say train=True, download=True.

  • And then any transforms: transform=transforms.Compose (capital C).

  • And then in here you would pass all the transforms.

  • These are things we want to apply to the data. And I don't know why, but even though we're grabbing it from torchvision, the data is not natively already a tensor.

  • So we have to convert it to tensors.

  • So one of the transforms that we're going to apply here is going to be transforms.ToTensor().

  • Uh, this one is already built into torchvision, obviously, because we imported it here.

  • But later on, if you really enjoy this, you can actually write your own data set, and you can use this same kind of syntax. Because, as you'll see, there are a lot of things that won't become as obvious in this tutorial,

  • but in the coming tutorials it will start to become pretty obvious how tedious iterating over a data set, for example, can be. I almost think it would be just as tedious to convert it to one of these data sets as well.

  • But anyway, just know that you can do that.

  • And then you can also write your own kind of custom transforms.

  • But for now, we'll just use the one that's built in here, and I believe that's a valid line.

  • So, actually, I'm going to go ahead and copy-paste this; train will be set to False.

  • We do still want to download.

  • Uh, I think we're good.

  • So let's go ahead and run that real quick and you can see it's downloading.

  • That might take a moment.

  • Hopefully, that won't take too long.

  • And now, while that's going, I think we'll just start writing, uh, our next two lines.

  • So basically, this gets us our data, um, downloaded and kind of in a sort of variable here.

  • The next thing we want to do is actually load this into another type of object that's going to help us iterate over that data.

  • So let me write these two lines, and then I'll explain why we're doing that, because it might not seem obvious why. Okay, we've got our training and testing data.

  • Why do we need to talk about how to iterate over it?

  • So the next thing I'm just going to go ahead and write here is... oh, I did not think it was going to do that.

  • That's kind of funny.

  • Anyways, it's done.

  • So we're gonna say trainset.

  • So this will be our training data set.

  • We're going to say that is equal to datasets.M... no, wait, that would be re-grabbing the data.

  • Anyway: torch.utils.data.DataLoader, and what we want to do is load in train.

  • So we're just gonna load in the data that's currently in this train variable here.

  • So we load that in and then we specify what batch size we want.

  • I'll talk about what that is here in a moment. I'm gonna say 10, just so it's totally easy for everybody to do this.

  • And then shuffle will be equal to true.

  • And we'll talk about shuffle as well.

  • So first, let me just copy paste this line.

  • We're gonna do that same thing for the test set, uh, with test. And yes.
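To see what a DataLoader actually gives you without downloading MNIST, here is a sketch using a synthetic TensorDataset with the same shapes; the fake images and labels are my stand-ins, while the batch_size and shuffle arguments are the ones from the video.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in for MNIST: 100 one-channel 28x28 "images" with labels 0-9.
images = torch.rand(100, 1, 28, 28)
labels = torch.randint(0, 10, (100,))

trainset = DataLoader(TensorDataset(images, labels), batch_size=10, shuffle=True)

# Each element the loader yields is [batch_of_images, batch_of_labels].
batch = next(iter(trainset))
print(batch[0].shape)  # torch.Size([10, 1, 28, 28])
print(batch[1].shape)  # torch.Size([10])
```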

  • Okay, so, we've separated these out. And first, batch size.

  • This is: how many at a time do we want to pass to our model?

  • So, in theory, some of these data sets... well, this one's not really that big of a data set.

  • You could probably... so, like, if I open that up, this is what it is, MNIST right here.

  • Um, I don't even know what those two files are, but I bet if we just go to properties, we can see. Ah, yes.

  • So it's only 116 megabytes.

  • So definitely fits into your ram.

  • It almost certainly would fit onto any GPU you'd be able to, you know, use a GPU with.

  • So definitely, we could fit this entire data set through our model in one go.

  • But that's not very practical, because where deep learning really starts to shine over other traditional machine learning algorithms is when we have huge data, so millions, or, like, tens of millions or hundreds of millions or even billions of samples.

  • At some point, it gets to the point where there are more samples than you can fit into your memory.

  • So that's one reason that you have to use batches of data.

  • So rather than doing the whole data set, we're gonna feed through 10 items at a time through our model, um, and the model will be optimized each step of the way, in little tiny increments,

  • uh, based on those 10 samples at a time as they get fed through.

  • And in general, a common batch size is gonna be anything from, like, I don't know, 8 to 64.

  • Everybody really likes to use base-8 numbers.

  • You're going to see that a lot.

  • People ask, why this? Why that?

  • Uh, well, first of all, the base-8 thing is just for fun; like, there's really no reason for it.

  • It's just everybody tends to do it, so just expect it.

  • And then, in terms of, like, how many neurons per layer, it's always trial and error.

  • And we'll talk a little bit more about how you might at least pick a starting one, and then how you might iterate through this, like, trial-and-error type process.

  • But it's basically a gradient descent operation that you're gonna be doing yourself as well. So anyway, more on that later.

  • But so, one reason we have to batch is because our data is just gonna be so big that, in any realistic example, we probably can't fit it all on the GPU.

  • The second reason is because we hope this data will generalize.

  • So if you take your entire data set and you pass it through the model, what's gonna happen is, as the model starts to optimize all those little weights and all those little connections...

  • Remember, there's millions of these.

  • If you pass your whole data set at once, your machine is gonna maybe learn some generalizations.

  • But it's also gonna probably be able to figure out some weights that are just arbitrary.

  • The machine doesn't know. It's, you know, when I tweak this and it improves the loss, right?

  • Is that because this is generally true, or is it just because this is in-sample data?

  • The machine has no way to know that.

  • So instead, what we do is, if we pass through batches at a time, each time it optimizes, the optimizations that stick around are generally going to be the generalizations, the correct things it assumed. Those ones stick around,

  • and then the actual, just, like, overfitment cases kind of just get erased out when we're working with batches.

  • So it's not the case... sorry, punching my mic... it's not the case that you want to have the biggest batch size you possibly can.

  • There really is always kind of a sweet-spot batch size, and like I said, it usually is between 8 and 64, regardless of, like, how big your memory is, but sometimes you're going to go even bigger than that.

  • And the reason why you'd want, for the most part, your batch size... you do want it to be as big as possible, because generally that's going to impact how quickly you can train through your data.

  • Um, but anyways, a lot of information there on batch size.

  • But again, there are not that many things that you get to tweak as the engineer here, so that's another thing that you get to tweak.

  • And it's another thing that has general rules, but for the most part, you're going to just trial-and-error it again.

  • I tend to just go with the biggest batch size I can get away with, because that helps training time.

  • Finally, shuffle.

  • You definitely want to shuffle. The reason why is, like, for example, in our case, we're doing MNIST, which I realize now I haven't even explained.

  • But MNIST is a hand-drawn digits data set.

  • So it's hand drawn digits from 0 to 9.

  • So it'll just be one digit, and it's a 28-by-28 image of this hand-drawn digit.

  • So all the nines look a little bit different, but obviously they're all nines and so on.

  • If you took a neural network and you fed through a bunch of zeros first, the neural network would learn to optimize and say everything is a zero. And then when you got to your ones, it would be like, wait, wait,

  • everything's a one, and so on, until you got to your nines, and it would end by saying everything is a nine. Whereas if you shuffle... again, the name of the game is generalization.

  • So we want to do everything we can do to give the neural network the opportunity to learn general principles rather than just simply figuring out little tricks.

  • Because if there is a quicker route to get to increasing or decreasing loss, really, the neural network is going to take that route.

  • And so you have to constantly be thinking: how can I better, like, offset overfitment?
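A torch-free sketch of why shuffling matters: with sorted labels, every early batch contains only one class, while shuffling mixes the classes. The list of labels here is made up for illustration.

```python
import random

# 100 of each digit, in order: all the 0s, then all the 1s, and so on.
labels = [digit for digit in range(10) for _ in range(100)]

first_batch_sorted = labels[:10]  # the first batch of 10 is all zeros
print(set(first_batch_sorted))    # {0}

random.seed(0)                    # seeded so the sketch is reproducible
shuffled = labels[:]
random.shuffle(shuffled)
first_batch_shuffled = shuffled[:10]  # a mix of digits
print(set(first_batch_shuffled))
```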

  • So anyway, cool.

  • So we've talked about trainset, batch_size, shuffle=True.

  • Cool.

  • So once we've done that... so again, in most cases, even here, like training and testing, ah, you would have to do that split all on your own.

  • You have to shuffle your data all on your own.

  • You have to somehow batch through your data, all on your own again.

  • We will be doing that all on our own soon.

  • But first I figured, Hey, let's just talk about it.

  • Let's keep it as simple as possible and then we'll slowly ratchet up the complexity.

  • So let's go ahead and run this cell cool.

  • And now, how do we actually iterate over this data?

  • So let's just use train set for example.

  • So it's actually super simple for us to iterate.

  • So we're gonna say for data in trainset:

  • Uh, let's just print data, and then we're gonna break, because we don't want to run over all of them; so we just run once.

  • So what you have here is, uh, that entire batch.

  • So it'll be 10 examples of handwritten digits, and then 10 tensors of the actual output.

  • So the first example should be a three, the second one a seven, then a one, a six, and so on.

  • So the way that we can actually confirm that is by saying X, y.

  • So this might be a little unclear, but remember, you guys are beyond the basics.

  • So in a for loop, whatever the last value of the temporary variable was in the for loop can still be accessed.

  • So in this case, data.

  • So we're just gonna come down here to a new line, and we're gonna say X, y = data.

  • And then again, we want to access, ah, the zeroth element.

  • And I'm trying to think here, uh, it should be two tensors... I just wanna make sure I'm right here.

  • So yes, as you iterate over here, data is a tensor object.

  • But it's a tensor object containing, first, a tensor of tensors that is your images, and then, second, a tensor of tensors that are your labels.

  • So the way that we can reference, ah, a three, for example, would be data[0], because that is your images.

  • And then we'll say the zeroth image.

  • So this should be an image of, uh, a three.

  • Uh, and then we would say, for the y, that would be data[1]. Right, so these are tensors, and we want the zeroth tensor.

  • So now let's just print y real quick and make sure.

  • Yeah.

  • Okay, cool.
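The unpacking just described, again on a synthetic batch so it runs stand-alone; the shapes match MNIST, but the data itself is made up.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic loader with MNIST-like shapes (the data itself is random).
trainset = DataLoader(
    TensorDataset(torch.rand(50, 1, 28, 28), torch.randint(0, 10, (50,))),
    batch_size=10,
    shuffle=True,
)

for data in trainset:
    print(data[1])  # the batch of 10 labels
    break

# `data` is still bound after the loop, so we can unpack it:
X, y = data
print(X[0].shape)  # the zeroth image: torch.Size([1, 28, 28])
print(y[0])        # its label, a scalar tensor
```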

  • So the next thing that we can do is, we can visualize this and make sure that we are correct. So we can import matplotlib.pyplot as plt.

  • If you don't have this, just pip install matplotlib. And again,

  • you guys aren't beginners, I assume, but you would type that into your terminal, not into the Jupyter notebook. Anyway, uh, and then we're gonna say plt.imshow, and we want to imshow data[0][0]. Now, here's an issue, and it's not with PyTorch.

  • Um, I was gonna make a new one real quick.
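A sketch of the visualization step, including a fix for the issue being hinted at: `data[0][0]` has shape `[1, 28, 28]` (an extra channel dimension imshow rejects), so it needs to be reshaped to 28x28 first. The random image here is my stand-in for an actual MNIST digit.

```python
import torch
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

image = torch.rand(1, 28, 28)  # stand-in for data[0][0]
print(image.shape)             # torch.Size([1, 28, 28]): an extra channel dim

# imshow wants a 2-D array, so reshape away the channel dimension first.
plt.imshow(image.view(28, 28))
plt.show()
```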