  • All right.

  • Hello, world.

  • This is CS50 Live, where we do all kinds of things.

  • We program things from scratch.

  • We look at technical concepts.

  • We looked at Kali Linux last week.

  • Yeah, that was super exciting.

  • Time ago when he was holding.

  • This is CS50 on video.

  • I'm Nick Wong.

  • What are we talking about today?

  • Yeah.

  • So today I think, as advertised, we're talking about images and machine learning and classification, some of the three, like, key buzzwords.

  • They're kind of fun non-buzzword buzzwords, but machine learning is out there, classification is out there.

  • We'll actually be doing a little bit of unsupervised learning, as opposed to supervised learning, things like that, I think.

  • Previously on the stream we talked about a binary classifier that could tell you if something was, like, a cartoon.

  • That was the very first one, the very first thing we did at that time.

  • In episode, like, four or five, by my, uh, count.

  • I see the goggle joke is back on.

  • The people actually put up a website.

  • That's fantastic.

  • Somebody asked whether we're gonna have captions for this video, and we are in the process of captioning all of our streams.

  • We actually are, so this one will have captions eventually.

  • Maybe within a couple of weeks or so. And shout-out to all the people that are currently in the chat.

  • We really appreciate it.

  • Chat is very popping.

  • Um, are we gonna caption my incessant laugh? You'd write it as "ha ha."

  • Your shirt, by the way, is very ethereal.

  • Thank you.

  • Yeah, I feel very light today.

  • You know, it's kind of a nice-ish day.

  • It's really nice out, and Sunday was awesome.

  • True.

  • True. The weather recently has been very nice to me.

  • Somebody in the chat says it's freezing, by the way.

  • Yeah, but sorry.

  • Sorry.

  • Oh, it is beautiful.

  • So we talked about classification, and that ties into this in a sort of a way.

  • Right.

  • So we classified kind of between two different groups that were completely non-intersecting.

  • And now we're trying to answer: given a set of data, can we figure out how many classes there should be?

  • More or less.

  • Is this acceptable?

  • Leading into it, we sort of advertised k-means clustering.

  • Yes. So k-means, uh, for those who are kind of unfamiliar with the term, is a form of unsupervised machine learning.

  • Unsupervised meaning:

  • I don't tell it what the right answer is.

  • It just figures out what its guess as to the right answer is.

  • And in this case, that manifests itself as there being k clusters, if you will, of data.

  • And so let's say I have my entire data set.

  • And actually, we'll just talk about the data set we use today, which is, I think, around 40 or 50 images.

  • So really not that much.

  • Not that much data, but it is 40 or 50 images.

  • And there are, I think, four or five classes, and those four or five classes are the different streams that we've done that had the same little box.

  • So you took multiple screenshots throughout one stream.

  • So within each cluster, there should be around 10 images.

  • And those 10 images should be all from the same stream.
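
For anyone following along, loading a data set like that might look something like this. It's a minimal sketch, not the stream's actual code: the folder layout, file type, and 64x64 size are assumptions.

```python
import numpy as np
from pathlib import Path
from PIL import Image

def load_screenshots(root="screenshots", size=(64, 64)):
    """Load every image under root into a matrix, one flattened row per image.

    Assumes one subfolder per stream, e.g. screenshots/stream1/img01.png.
    The folder names come back as labels, but only for checking clusters
    afterwards; k-means itself never sees them.
    """
    vectors, labels = [], []
    for path in sorted(Path(root).glob("*/*.png")):
        img = Image.open(path).convert("L").resize(size)  # grayscale, fixed size
        vectors.append(np.asarray(img, dtype=float).ravel())
        labels.append(path.parent.name)  # folder name = which stream it came from
    return np.stack(vectors), labels
```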

  • Um, so essentially, what our classifier will want to do, what a k-means classifier will do, is it says: okay, I have this large data set, and to start I'm going to just grab a bunch of random pieces from the data set, and those will be my initial clusters.

  • So the initial means are pretty random.

  • They're all roughly the same-ish.

  • Well, it depends on the distribution you've got, but they're pretty reasonable.

  • So we just kind of grab some images, throw them into a cluster, and we say: all right, now tell me, for each image, which cluster is that image closest to, by some definition of distance.

  • So on the stream today we'll probably use Euclidean distance.

  • It's pretty common.

  • It's really easy.

  • Geometrically, in x and y, I just take the squared differences, sum them together, and take the square root of that sum: d = sqrt((x - x0)^2 + (y - y0)^2), and that works as a distance formula.
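
In numpy that distance is essentially one line. A minimal sketch, where a and b are any two equal-length vectors, such as two flattened images:

```python
import numpy as np

def euclidean(a, b):
    """sqrt((x - x0)^2 + (y - y0)^2 + ...) generalized to any number of dimensions."""
    return np.sqrt(np.sum((a - b) ** 2))

# np.linalg.norm(a - b) computes the same thing
```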

  • Right?

  • The linear algebra favorite.

  • Like, the same great concept used for collision detection.

  • Yeah, it's actually like in a game with circle collision.

  • Yeah, but same, right?

  • Same concept. And so we'll basically be doing that on every iteration of the algorithm.

  • It says: hey, you know, for each image, which cluster are you closest to? And then I'm gonna stick you in that cluster.

  • And once you're in that cluster, I'm then going to recompute the new mean of that cluster. And so things that were kind of naturally closer to some set of images are going to end up kind of clustering themselves, and those means will get closer to the images that are all in that cluster.

  • As you iterate over the algorithm. And one of the benefits of k-means clustering is that it actually doesn't take that many iterative steps.

  • You can usually do it in a couple, like 10 or so. So you really just kind of iterate over it, you know, on the order of 10 times.

  • And after that, your loss doesn't really get any better.

  • You tend to stick around.

  • It depends on your situation.

  • Depends on what you built it in, and on the randomization parameters you have in there.

  • But in general, you don't need that many iterations to get a pretty accurate representation of what things look like.
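
Putting those steps together, a bare-bones k-means loop might look like this. It's a sketch under the assumptions above, not the stream's code: X is the n-by-d matrix of flattened images, and the iteration count is simply fixed at 10 rather than checked for convergence.

```python
import numpy as np

def kmeans(X, k, iterations=10, seed=0):
    """Plain k-means: random initial means, then repeat {assign, recompute}."""
    rng = np.random.default_rng(seed)
    # grab k random images from the data set as the initial means
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iterations):  # on the order of 10 passes is usually plenty
        # Euclidean distance from every image to every mean: an (n, k) matrix
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)  # nearest mean wins
        # recompute each mean from the images now assigned to it
        for j in range(k):
            if np.any(assignments == j):
                means[j] = X[assignments == j].mean(axis=0)
    return means, assignments
```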

  • One of the kind of cool parts of k-means is that I have this kind of center of a cluster.

  • And so I would generally just refer to that as, like, the mean image for that cluster.

  • And those mean images are often representative of what is in that cluster.

  • So let's say one cluster is from our Kali Linux stream.

  • Then if I take whatever the mean or the center of that cluster is, it should be pretty representative of what we on average looked like in that stream.
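
To actually look at those mean images, you just un-flatten each mean vector back into the image shape. A sketch: means comes from the k-means loop above, and 64x64 matches the load-time size assumed earlier.

```python
import matplotlib.pyplot as plt

# each row of means is one cluster's "mean image" as a flat vector
for j, m in enumerate(means):
    plt.subplot(1, len(means), j + 1)
    plt.imshow(m.reshape(64, 64), cmap="gray")  # un-flatten back to 64x64
    plt.axis("off")
plt.show()
```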

  • Now, we haven't tested these, right? So we're pretty sure it'll work reasonably well, but we left it a little untested, so verifying that things work could be kind of funny.

  • And it gives you guys some room to experiment, I guess.

  • And we get to experiment and show you guys cool things.

  • So you will see some kind of entertaining stuff with our K means clustering.

  • But our prediction is roughly that.

  • You should see what we looked like on average on a given stream, and, given a new image, it'll tell us which cluster it belongs to.

  • And it'll then classify it for us.

  • Do we usually, like, manually choose the start of the clusters, or do we just kind of go randomly, or how do we get to the beginning?

  • So that's a great question.

  • A lot of times we actually initialize it with just completely random beginnings.

  • So we just say, Hey, here's my data set.

  • I'm gonna use, like, a numpy operation to randomly pick some set of images, or randomly subdivide the images into these clusters, so our initialization can be random.
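
Either initialization is a couple of lines of numpy. A sketch, with X and k as in the earlier loop; the second variant is the "randomly subdivide" version:

```python
import numpy as np

rng = np.random.default_rng()

# variant 1: pick k random images to serve as the starting means
means = X[rng.choice(len(X), size=k, replace=False)].astype(float)

# variant 2: randomly subdivide all images into k clusters, then average each
# (with a tiny data set a cluster could come up empty; a real version would re-draw)
assignments = rng.integers(0, k, size=len(X))
means = np.stack([X[assignments == j].mean(axis=0) for j in range(k)])
```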

  • But there is another version of this that is called semi supervised learning.

  • So let's say you were able to label about half of your data before your research intern quit. And by "you labeled your data"...

  • I mean, your intern labeled half your data and then they quit.

  • So you have this data set of, you know, a million images, and 500,000 of them have been labeled manually by some poor college student, and the other 500,000 have not been labeled.

  • So this is where, if everything was labeled originally, if all one million images had been labeled, then we would call that supervised learning, because we know the answer.

  • And then we have some test data set, or some way of accumulating things that we want to predict upon, and those are unlabeled.

  • But we're going to use our trained model to then predict on those. Now, in semi-supervised learning,

  • we have this situation where maybe half our data is labeled and half of it is not.

  • And so we take maybe the half that's labeled and we assign them to the right clusters.

  • So this is where we actually start with: like, um, all of this cluster, we know, is in the right class.

  • All of this cluster is all in the right class and so on, but the other 500,000 images, we don't know which cluster they belong to.

  • So we'll then iterate the algorithm over those images and say: hey, which cluster are you closest to? And so that can be a way of, like, tightening up what your k-means classifier will actually do.

  • And it gives it more data, which allows it to spread over, you know, different kinds of variance, or whatever the underlying distribution of those things is.
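
A sketch of that semi-supervised setup (the stream didn't show this code, and the variable names are hypothetical): seed one mean per known class from the labeled half, then let the usual loop pull in the unlabeled half.

```python
import numpy as np

def semi_supervised_kmeans(X_labeled, y_labeled, X_unlabeled, iterations=10):
    """Seed the cluster means from labeled data, then run ordinary k-means over everything."""
    classes = np.unique(y_labeled)
    # one starting mean per known class, computed only from the labeled half
    means = np.stack([X_labeled[y_labeled == c].mean(axis=0) for c in classes])
    X = np.vstack([X_labeled, X_unlabeled])
    for _ in range(iterations):
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        for j in range(len(classes)):
            if np.any(assignments == j):
                means[j] = X[assignments == j].mean(axis=0)
    return classes[assignments]  # a predicted class for every image

```

A stricter variant would keep the labeled points pinned to their known clusters on every pass; this version lets them drift, which is simpler.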

  • SGST before was saying, even the second time... you know, I was actually, we had a bug with our Facebook deployment, so I was just focused on getting our stream to Facebook, which we now are, I guess. As we were initially publishing this, OBS sort of bugged out and caused the Facebook stream to drop. No, no, sorry.

  • I was, like, zoomed in on getting that focus.

  • But yes, she's here.

  • She asked another question earlier.

  • Which was, um, where was it?

  • "I'm proficient in Python, but know nothing about ML."

  • "Is this stream for me?"

  • What do you say?

  • Yeah, I think this is a very reasonable stream.

  • Even if you're not super great in Python, or if you're, like, very adequate, I would count myself as, like, decent in Python.

  • This would still be pretty reasonable for you.

  • So our goal is, kind of, at the beginning:

  • we're gonna do this kind of high-level approach.

  • Like, what are we going to go through?

  • What are, kind of, the concepts?

  • But we're gonna repeat those concepts pretty much throughout the stream so that you really get a sense of what we're actually talking about.

  • So, for instance, if you're really, like, uh, you want to do this yourself, then you can code along.

  • We should be going at a reasonable speed, and you can always pause the video, go back, and reiterate.

  • That should be easy for you.

  • If that's how you learn.

  • If you learn through listening to what we say, then maybe a different iteration will help you if this one doesn't work. If this makes perfect sense to you, then that's great.

  • There are many different ways we're gonna try and attack the same problem. And I saw somebody said: please also talk about limitations of k-means clustering.

  • As in, if there are two concentric circles kinds of data, then your k-means clustering might have some problems with, like, identifying which one is which. So, for example:

  • Let's say I have some data set, and then some other data set that is a subset of the original data set.

  • This clustering method doesn't really work super well for classifying between those two data sets, right?

  • So let's say I wanted to classify between, like, all kinds of dogs versus, like, poodles, for example, right?

  • It might be a little bit tricky for me, and actually that might not be the best example, but it's kind of the idea, which is that I have a subset of the data that I want to kind of classify as separate from the rest of the data.

  • And that can be really difficult for k-means, because k-means by definition is really just looking at: what mean image can I get out of some data set?

  • And how can I find things that are closest to that mean by some measure of distance? And so when, kind of on average, the mean image of the entire data set is very, very close to the mean image of the subset data set, then k-means can't distinguish the two.

  • But maybe you could look at other feature-discrimination algorithms, or something like deep learning.
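
That concentric-circles failure is easy to reproduce with scikit-learn's toy data generators. A sketch; scikit-learn isn't part of the stream's code:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# two concentric rings; the "right" answer is inner ring vs. outer ring
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# scores near 0 mean the clustering is no better than chance: k-means just
# splits the plane in half instead of separating the rings
print(adjusted_rand_score(y, pred))
```

Algorithms that use neighborhood structure instead of distance to a single mean, such as spectral clustering or DBSCAN, handle this shape much better.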

  • There's a lot of things, a lot of complexity.

  • It's very interesting, and I'm very excited to learn, I think, a little bit more about it.

  • I'm actually gonna transition to your laptop.