  • Okay, so welcome to lecture two of CS231N.

  • On Tuesday we, just recall, we, sort of, gave you

  • the big picture view of what is computer vision,

  • what is the history,

  • and a little bit of the overview of the class.

  • And today, we're really going to dive in, for the first time,

  • into the details.

  • And we'll start to see, in much more depth,

  • exactly how some of these learning algorithms

  • actually work in practice.

  • So, the first lecture of the class

  • is probably, sort of, the largest big picture vision.

  • And the majority of the lectures in this class

  • will be much more detail orientated,

  • much more focused on the specific mechanics,

  • of these different algorithms.

  • So, today we'll see our first learning algorithm

  • and that'll be really exciting, I think.

  • But, before we get to that,

  • I wanted to talk about a couple of administrative issues.

  • One, is Piazza.

  • So, I saw it when I checked yesterday,

  • it seemed like we had maybe 500 students

  • signed up on Piazza.

  • Which means that there are several hundred of you

  • who are not yet there.

  • So, we really want Piazza to be the main source

  • of communication between the students and the course staff.

  • So, we've gotten a lot of questions to the staff list

  • about project ideas or questions about midterm attendance

  • or poster session attendance.

  • And, any, sort of, questions like that

  • should really go to Piazza.

  • You'll probably get answers to your questions faster

  • on Piazza, because all the TAs know to check that.

  • And it's, sort of, easy for emails to get lost

  • in the shuffle if you just send to the course list.

  • It's also come to my attention that some SCPD students

  • are having a bit of a hard time signing up for Piazza.

  • SCPD students are supposed to receive a

  • @stanford.edu email address.

  • So, once you get that email address,

  • then you can use the Stanford email to sign into Piazza.

  • Probably that doesn't affect those of you who are

  • sitting in the room right now,

  • but, for those students listening on SCPD.

  • The next administrative issue is about assignment one.

  • Assignment one will be up later today,

  • probably sometime this afternoon,

  • but I promise, before I go to sleep tonight,

  • it'll be up.

  • But, if you're getting a little bit antsy

  • and really want to start working on it right now,

  • then you can look at last year's version

  • of assignment one.

  • It'll be pretty much the same content.

  • We're just reshuffling it a little bit to make it,

  • like, for example, upgrading to work with Python 3,

  • rather than Python 2.7.

  • And some of these minor cosmetic changes,

  • but the content of the assignment will still be the same

  • as last year.

  • So, in this assignment you'll be implementing your own

  • k-nearest neighbor classifier,

  • which we're going to talk about in this lecture.

  • You'll also implement several different linear classifiers,

  • including the SVM and Softmax,

  • as well as a simple two-layer neural network.

  • And we'll cover all this content

  • over the next couple of lectures.

  • So, all of our assignments are using Python and NumPy.

  • If you aren't familiar with Python or NumPy,

  • then we have written a tutorial that you can find

  • on the course website to try and get you up to speed.

  • But, this is, actually, pretty important.

  • NumPy lets you write these very efficient vectorized

  • operations that let you do quite a lot of computation

  • in just a couple lines of code.

  • So this is super important for pretty much

  • all aspects of numerical computing and machine learning

  • and everything like that,

  • is efficiently implementing these vectorized operations.

  • And you'll get a lot of practice with this

  • on the first assignment.
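
To make the point about vectorization concrete, here is a minimal sketch (not taken from the assignment) that computes the same quantity two ways: with explicit Python loops and with a single NumPy expression. The array sizes and the function names `l1_distances_loop` and `l1_distances_vectorized` are made up for illustration.

```python
import numpy as np

# Hypothetical data: 200 vectors of 64 numbers each (sizes picked for illustration).
rng = np.random.default_rng(0)
data = rng.random((200, 64))
query = rng.random(64)


def l1_distances_loop(data, query):
    """Explicit Python loops: one row at a time, one element at a time."""
    out = np.zeros(data.shape[0])
    for i in range(data.shape[0]):
        total = 0.0
        for j in range(data.shape[1]):
            total += abs(data[i, j] - query[j])
        out[i] = total
    return out


def l1_distances_vectorized(data, query):
    """The same computation as one broadcasted NumPy expression."""
    return np.abs(data - query).sum(axis=1)


# Both versions should agree up to floating-point rounding.
print(np.allclose(l1_distances_loop(data, query),
                  l1_distances_vectorized(data, query)))
```

On larger arrays the vectorized version is typically much faster, because the looping happens inside NumPy's compiled code rather than in the Python interpreter.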

  • So, for those of you who don't have a lot of experience

  • with Matlab or NumPy or other types of vectorized

  • tensor computation, I recommend that you start looking

  • at this assignment pretty early

  • and also, read carefully through the tutorial.

  • The other thing I wanted to talk about

  • is that we're happy to announce that

  • we're officially supported through Google Cloud

  • for this class.

  • So, Google Cloud is somewhat similar to Amazon AWS.

  • You can go and start virtual machines up in the cloud.

  • These virtual machines can have GPUs.

  • We're working on the tutorial for exactly how to use

  • Google Cloud and get it to work for the assignments.

  • But our intention is that you'll be able to just download

  • some image, and it'll be very seamless

  • for you to work on the assignments

  • on one of these instances on the cloud.

  • And because Google has, very generously,

  • supported this course,

  • we'll be able to distribute to each of you

  • coupons that let you use Google Cloud credits for free

  • for the class.

  • So you can feel free to use these for the assignments

  • and also for the course projects

  • when you want to start using GPUs and larger machines

  • and whatnot.

  • So, we'll post more details about that,

  • probably, on Piazza later today.

  • But, I just wanted to mention,

  • because I know there had been a couple of questions

  • about, can I use my laptop?

  • Do I have to run on corn?

  • Do I have to, whatever?

  • And the answer is that, you'll be able to run on

  • Google Cloud and we'll provide you some coupons for that.

  • Yeah, so,

  • those are, kind of, the major administrative issues

  • I wanted to talk about today.

  • And then, let's dive into the content.

  • So, the last lecture we talked a little bit

  • about this task of image classification,

  • which is really a core task in computer vision.

  • And this is something that we'll really focus on

  • throughout the course of the class.

  • Is, exactly,

  • how do we work on this image classification task?

  • So, a little bit more concretely,

  • when you're doing image classification,

  • your system receives some input image,

  • which is this cute cat in this example,

  • and the system is aware of some predetermined set

  • of categories or labels.

  • So, these might be, like, a dog or a cat or a truck

  • or a plane, and there's some fixed set of category labels,

  • and the job of the computer is to look at the picture

  • and assign it one of these fixed category labels.

  • This seems like a really easy problem,

  • because so much of your own visual system in your brain

  • is hardwired to doing these, sort of,

  • visual recognition tasks.

  • But this is actually a really, really hard problem

  • for a machine.

  • So, if you dig in and think about, actually,

  • what does a computer see when it looks at this image,

  • it definitely doesn't get this holistic idea of a cat

  • that you see when you look at it.

  • And the computer really is representing the image

  • as this gigantic grid of numbers.

  • So, the image might be something like 800 by 600 pixels.

  • And each pixel is represented by three numbers,

  • giving the red, green, and blue values for that pixel.

  • So, to the computer,

  • this is just a gigantic grid of numbers.

  • And it's very difficult to distill the cat-ness

  • out of this, like, giant array of thousands, or whatever,

  • very many different numbers.
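
As a rough sketch of what "a gigantic grid of numbers" means in NumPy terms: an 800 by 600 RGB image can be held in an array with three values per pixel. The (rows, columns, channels) layout and the 8-bit value range are assumptions made for this sketch; the lecture does not specify a storage format.

```python
import numpy as np

# Assumed layout: 600 rows (height), 800 columns (width), 3 color channels.
height, width, channels = 600, 800, 3
image = np.zeros((height, width, channels), dtype=np.uint8)

# Each pixel is three numbers: red, green, and blue, each in 0..255.
image[0, 0] = [255, 128, 0]

print(image.shape)  # (600, 800, 3)
print(image.size)   # 1440000 numbers in total
```

So for an image of this size the classifier is looking at roughly 1.4 million numbers, none of which individually says "cat".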

  • So, we refer to this problem as the semantic gap.

  • This idea of a cat, or this label of a cat,

  • is a semantic label that we're assigning to this image,

  • and there's this huge gap between the semantic idea of a cat

  • and these pixel values that the computer is actually seeing.

  • And this is a really hard problem because

  • you can change the picture in very small, subtle ways

  • that will cause this pixel grid to change entirely.

  • So, for example, if we took this same cat,

  • and if the cat happened to sit still

  • and not even twitch, not move a muscle,

  • which is never going to happen,

  • but we moved the camera to the other side,

  • then every single grid, every single pixel,

  • in this giant grid of numbers

  • would be completely different.

  • But, somehow, it's still representing the same cat.

  • And our algorithms need to be robust to this.

  • But, not only viewpoint is one problem,

  • another is illumination.

  • There can be different lighting conditions going on

  • in the scene.

  • Whether the cat is appearing in this very dark, moody scene,

  • or in this very bright, sunlit scene, it's still a cat,

  • and our algorithms need to be robust to that.

  • Objects can also deform.

  • I think cats are, maybe, among the more deformable

  • of animals that you might see out there.

  • And cats can really assume a lot of different, varied poses

  • and positions.

  • And our algorithms should be robust to these different

  • kinds of transforms.

  • There can also be problems of occlusion,

  • where you might only see part of a cat, like, just the face,

  • or in this extreme example, just a tail peeking out

  • from under the couch cushion.

  • But, in these cases, it's pretty easy for you, as a person,

  • to realize that this is probably a cat,

  • and you still recognize these images as cats.

  • And this is something that our algorithms

  • also must be robust to,

  • which is quite difficult, I think.

  • There can also be problems of background clutter,

  • where maybe the foreground object of the cat,

  • could actually look quite similar in appearance

  • to the background.

  • And this is another thing that we need to handle.

  • There's also this problem of intraclass variation,

  • that this one notion of cat-ness, actually spans a lot of

  • different visual appearances.

  • And cats can come in different shapes and sizes

  • and colors and ages.

  • And our algorithm, again, needs to work

  • and handle all these different variations.

  • So, this is actually a really, really challenging problem.

  • And it's sort of easy to forget how hard this is

  • because so much of your brain is specifically tuned

  • for dealing with these things.

  • But now if we want our computer programs

  • to deal with all of these problems, all simultaneously,

  • and not just for cats, by the way,

  • but for just about any object category you can imagine,

  • this is a fantastically challenging problem.

  • And it's, actually, somewhat miraculous

  • that this works at all, in my opinion.

  • But, actually, not only does it work,

  • but these things work very close to human accuracy

  • in some limited situations.

  • And take only hundreds of milliseconds to do so.

  • So, this is some pretty amazing, incredible technology,

  • in my opinion, and over the course of the rest of the class

  • we will really see what kinds of advancements

  • have made this possible.

  • So now, if you, kind of, think about

  • what is the API for writing an image classifier,

  • you might sit down and try to write a method in Python

  • like this.

  • Where you want to take in an image

  • and then do some crazy magic

  • and then, eventually, spit out this class label

  • to say cat or dog or whatnot.

  • And there's really no obvious way to do this, right?

  • If you're taking an algorithms class

  • and your task is to sort numbers

  • or compute a convex hull

  • or, even, do something like RSA encryption,

  • you, sort of, can write down an algorithm

  • and enumerate all the steps that need to happen

  • in order for these things to work.

  • But, when we're trying to recognize objects,

  • or recognize cats in images,

  • there's no really clear, explicit algorithm

  • that makes intuitive sense,

  • for how you might go about recognizing these objects.

  • So, this is, again, quite challenging,

  • if you think about,

  • if it was your first day programming

  • and you had to sit down and write this function,

  • I think most people would be in trouble.

  • That being said,

  • people have definitely made explicit attempts

  • to try to write, sort of, hand-coded rules

  • for recognizing different animals.

  • So, we touched on this a little bit in the last lecture,

  • but maybe one idea for cats is that,

  • we know that cats have ears and eyes and mouths and noses.

  • And we know that edges, from Hubel and Wiesel,

  • we know that edges are pretty important

  • when it comes to visual recognition.

  • So one thing we might try to do is

  • compute the edges of this image

  • and then go in and try to categorize all the different

  • corners and boundaries, and say that, if we have maybe

  • three lines meeting this way, then it might be a corner,

  • and an ear has one corner here and one corner there

  • and one corner there,

  • and then, kind of, write down this explicit set of rules

  • for recognizing cats.

  • But this turns out not to work very well.

  • One, it's super brittle.

  • And, two, say, if you want to start over for another

  • object category, and maybe not worry about cats,

  • but talk about trucks or dogs or fishes or something else,

  • then you need to start all over again.

  • So, this is really not a very scalable approach.

  • We want to come up with some algorithm, or some method,

  • for these recognition tasks

  • which scales much more naturally to all the variety

  • of objects in the world.

  • So, the insight that, sort of, makes this all work

  • is this idea of the data-driven approach.

  • Rather than sitting down and writing these hand-specified

  • rules to try to craft exactly what is a cat or a fish

  • or what have you,

  • instead, we'll go out onto the internet

  • and collect a large dataset of many, many cats

  • and many, many airplanes and many, many deer

  • and different things like this.

  • And we can actually use tools like Google Image Search,

  • or something like that,

  • to go out and collect a very large number of examples

  • of these different categories.

  • By the way, this actually takes quite a lot of effort

  • to go out and actually collect these datasets

  • but, luckily, there's a lot of really good, high quality

  • datasets out there already for you to use.

  • Then once we get this dataset,

  • we train this machine learning classifier

  • that is going to ingest all of the data,

  • summarize it in some way,

  • and then spit out a model

  • that summarizes the knowledge of how to recognize

  • these different object categories.

  • Then finally, we'll use this trained model

  • and apply it on new images

  • that will then be able to recognize

  • cats and dogs and whatnot.

  • So here our API has changed a little bit.

  • Rather than a single function

  • that just inputs an image and recognizes a cat,

  • we have these two functions.

  • One, called, train, that's going to input images and labels

  • and then output a model,

  • and then, separately, another function called, predict,

  • which will input the model and then make predictions

  • for images.
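
The two-function API described above can be sketched as follows. The function names `train` and `predict` come from the lecture; everything inside them is a placeholder assumed here so the sketch runs. The "model" simply remembers the most common training label, which is not a method the lecture proposes.

```python
from collections import Counter


def train(images, labels):
    """Take training images and labels, return a model.

    Placeholder learning step (an assumption for this sketch):
    the model is just the most common label in the training set.
    """
    most_common_label, _ = Counter(labels).most_common(1)[0]
    return {"default_label": most_common_label}


def predict(model, test_images):
    """Take a model and new images, return one predicted label per image."""
    return [model["default_label"] for _ in test_images]


# Toy usage: the "images" are stand-in strings, not real pixel data.
model = train(["img_a", "img_b", "img_c"], ["cat", "cat", "dog"])
print(predict(model, ["img_x", "img_y"]))  # ['cat', 'cat']
```

The point is the shape of the interface: all knowledge about the categories enters through the data passed to `train`, rather than through rules written by hand.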

  • And this is, kind of, the key insight

  • that allowed all these things to start working really well

  • over the last 10, 20 years or so.

  • So, this class is primarily about neural networks

  • and convolutional neural networks

  • and deep learning and all that,

  • but this idea of a data-driven approach is much more general

  • than just deep learning.

  • And I think it's useful to, sort of,

  • step through this process

  • for a very simple classifier first,

  • before we get to these big, complex ones.

  • So, probably, the simplest classifier you can imagine

  • is something we call nearest neighbor.

  • The algorithm is pretty dumb, honestly.

  • So, during the training step we won't do anything,

  • we'll just memorize all of the training data.

  • So this is very simple.

  • And now, during the prediction step,

  • we're going to take some new image

  • and go and try to find the most similar image

  • in the training data to that new image,

  • and now predict the label of that most similar image.

  • A very simple algorithm.

  • But it, sort of, has a lot of these nice properties

  • with respect to data-drivenness and whatnot.

  • So, to be a little bit more concrete,

  • you might imagine working on this dataset called CIFAR-10,

  • which is very commonly used in machine learning,

  • as kind of a small test case.

  • And you'll be working with this dataset on your homework.

  • So, the CIFAR-10 dataset gives you 10 different classes,

  • airplanes and automobiles and birds and cats and different

  • things like that.

  • And across those 10 categories

  • it provides 50,000 training images,

  • evenly distributed, 5,000 per category.

  • And then 10,000 additional testing images

  • that you're supposed to test your algorithm on.

  • So here's an example of applying this simple

  • nearest neighbor classifier to some of these test images

  • on CIFAR-10.

  • So, in this grid on the right,

  • the leftmost column

  • gives a test image in the CIFAR-10 dataset.

  • And now on the right, we've sorted the training images

  • and show the most similar training images

  • to each of these test examples.

  • And you can see that they look kind of visually similar

  • to the training images,

  • although they are not always correct, right?

  • So, maybe on the second row, we see that the testing,

  • this is kind of hard to see,

  • because these images are 32 by 32 pixels,

  • you need to really dive in there

  • and try to make your best guess.

  • But, this image is a dog and its nearest neighbor is also

  • a dog, but this next one, I think is actually a deer

  • or a horse or something else.

  • But, you can see that it looks quite visually similar,

  • because there's kind of a white blob in the middle

  • and whatnot.

  • So, if we're applying the nearest neighbor algorithm

  • to this image,

  • we'll find the closest example in the training set.

  • And now, the closest example, we know its label,

  • because it comes from the training set.

  • And now, we'll simply say that this testing image is also

  • a dog.

  • You can see from these examples that this is probably not

  • going to work very well,

  • but it's still kind of a nice example to work through.

  • But then, one detail that we need to know is,

  • given a pair of images,

  • how can we actually compare them?

  • Because, if we're going to take our test image and compare it

  • to all the training images,

  • we actually have many different choices

  • for exactly what that comparison function should look like.

  • So, in the example in the previous slide,

  • we've used what's called the L1 distance,

  • also sometimes called the Manhattan distance.

  • So, this is a really sort of simple, easy idea

  • for comparing images.

  • And that's that we're going to just compare individual pixels

  • in these images.

  • So, supposing that our test image is maybe just a tiny

  • four by four image of pixel values,

  • then we'll take this upper-left hand pixel

  • of the test image,

  • subtract off the value in the training image,

  • take the absolute value,

  • and get the difference in that pixel between the two images.

  • And then, sum all these up across all the pixels

  • in the image.

  • So, this is kind of a stupid way to compare images,

  • but it does some reasonable things sometimes.

  • But, this gives us a very concrete way

  • to measure the difference between two images.

  • And in this case, we have this difference of 456

  • between these two images.
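
The L1 comparison just described is the sum of absolute pixel differences, d1(I1, I2) = sum over pixels p of |I1[p] - I2[p]|. Here is a small worked sketch on two four by four images. The pixel values are illustrative numbers chosen so that the total comes out to the 456 mentioned above; treat them as an assumption rather than a copy of the slide.

```python
import numpy as np

# Illustrative 4x4 single-channel images (values assumed for this sketch).
test_image = np.array([
    [56, 32, 10, 18],
    [90, 23, 128, 133],
    [24, 26, 178, 200],
    [2, 0, 255, 220],
])
training_image = np.array([
    [10, 20, 24, 17],
    [8, 10, 89, 100],
    [12, 16, 178, 170],
    [4, 32, 233, 112],
])

# Per-pixel absolute differences, then one sum over all pixels.
pixel_differences = np.abs(test_image - training_image)
l1_distance = pixel_differences.sum()

print(pixel_differences)
print(l1_distance)  # 456
```

These arrays use NumPy's default signed integer type. With real 8-bit images stored as `uint8`, the subtraction would wrap around for negative differences, so casting to a wider signed type first is the safer choice.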

  • So, here's some full Python code

  • for implementing this nearest neighbor classifier

  • and you can see it's pretty short and pretty concise

  • because we've made use of many of these vectorized

  • operations offered by NumPy.

  • So, here we can see that this training function,

  • that we talked about earlier,

  • is, again, very simple, in the case of nearest neighbor,

  • you just memorize the training data,

  • there's not really much to do here.

  • And now, at test time, we're going to take in our image

  • and then go in and compare using this L1 distance function,

  • our test image to each of these training examples

  • and find the most similar example in the training set.

  • And you can see that, we're actually able to do this

  • in just one or two lines of Python code

  • by utilizing these vectorized operations in NumPy.

  • So, this is something that you'll get practice with

  • on the first assignment.
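
The slide with the code is not reproduced in this transcript, so the following is a reconstruction under assumptions rather than the code shown in lecture: the class name `NearestNeighbor`, the attribute names, and the convention of one flattened image per row are all choices made for this sketch.

```python
import numpy as np


class NearestNeighbor:
    def train(self, X, y):
        """X is N x D, one flattened image per row; y holds the N labels.

        Training just memorizes the data.
        """
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        """X is M x D; returns one predicted label per row."""
        num_test = X.shape[0]
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)
        for i in range(num_test):
            # L1 distance from test image i to every training image at once.
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            # Copy the label of the closest training image.
            min_index = np.argmin(distances)
            Ypred[i] = self.ytr[min_index]
        return Ypred


# Toy usage with two 4-pixel "images" (values assumed for illustration).
Xtr = np.array([[0, 0, 0, 0], [255, 255, 255, 255]])
ytr = np.array([0, 1])
nn = NearestNeighbor()
nn.train(Xtr, ytr)
print(nn.predict(np.array([[10, 0, 5, 0], [250, 240, 255, 255]])))  # [0 1]
```

The distance to all N training images is computed in one vectorized line, but that line still touches every training example, which is where the test-time cost discussed next comes from.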

  • So now, a couple questions about this simple classifier.

  • First, if we have N examples in our training set,

  • then how fast can we expect training and testing to be?

  • Well, training is probably constant

  • because we don't really need to do anything,

  • we just need to memorize the data.

  • And if you're just copying a pointer,

  • that's going to be constant time

  • no matter how big your dataset is.

  • But now, at test time we need to do this comparison step

  • and compare our test image

  • to each of the N training examples in the dataset.

  • And this is actually quite slow.

  • So, this is actually somewhat backwards,

  • if you think about it.

  • Because, in practice,

  • we want our classifiers to be slow at training time

  • and then fast at testing time.

  • Because, you might imagine, that a classifier might go

  • and be trained in a data center somewhere

  • and you can afford to spend a lot of computation

  • at training time to make the classifier really good.

  • But then,

  • when you go and deploy the classifier at test time,

  • you want it to run on your mobile phone

  • or in a browser or some other low power device,

  • and you really want the testing time performance

  • of your classifier to be quite fast.

  • So, from this perspective, this nearest neighbor algorithm,

  • is, actually, a little bit backwards.

  • And we'll see that once we move to

  • convolutional neural networks,

  • and other types of parametric models,

  • they'll be the reverse of this.

  • Where you'll spend a lot of compute at training time,

  • but then they'll be quite fast at testing time.

  • So then, the question is,

  • what exactly does this nearest neighbor algorithm

  • look like when you apply it in practice?

  • So, here we've drawn, what we call the decision regions

  • of a nearest neighbor classifier.

  • So, here our training set consists of these points

  • in the two dimensional plane,

  • where the color of the point represents the category,

  • or the class label, of that point.

  • So, here we see we have five classes

  • and some blue ones up in the corner here,

  • some purple ones in the upper-right hand corner.

  • And now for each pixel in this entire plane,

  • we've gone and computed what is the nearest example

  • in these training data,

  • and then colored the point of the background

  • corresponding to what is the class label.

  • So, you can see that this nearest neighbor classifier

  • is just sort of carving up the space

  • and coloring the space according to the nearby points.
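
A picture like this can be computed by labeling every point on a grid with the class of its nearest training point. The sketch below does that for three made-up training points. The coordinates, the 50 by 50 grid, and the use of Euclidean distance are assumptions of the sketch, since the transcript does not say how the figure was generated.

```python
import numpy as np

# Hypothetical training set: three 2D points, one per class.
train_points = np.array([[0.2, 0.8], [0.8, 0.8], [0.5, 0.2]])
train_labels = np.array([0, 1, 2])

# A 50x50 grid of query points covering the unit square.
xs, ys = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
grid = np.stack([xs.ravel(), ys.ravel()], axis=1)  # shape (2500, 2)

# Distance from every grid point to every training point: shape (2500, 3).
diffs = grid[:, None, :] - train_points[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=2))

# Each grid point takes the label of its nearest training point.
region_labels = train_labels[dists.argmin(axis=1)].reshape(xs.shape)

print(region_labels.shape)       # (50, 50)
print(np.unique(region_labels))  # [0 1 2]
```

Plotting `region_labels` as an image, with the training points drawn on top, gives the kind of carved-up plane described above.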

  • But this classifier is maybe not so great.

  • And by looking at this picture

  • we can start to see some of the problems that might come out

  • with a nearest neighbor classifier.

  • For one, this central region actually contains

  • mostly green points,

  • but one little yellow point in the middle.

  • But because we're just looking at the nearest neighbor,

  • this causes a little yellow island to appear

  • in the middle of this green cluster.

  • And that's, maybe, not so great.

  • Maybe those points actually should have been green.

  • And then, similarly we also see these, sort of, fingers,

  • like the green region pushing into the blue region,

  • again, due to the presence of one point,

  • which may have been noisy or spurious.

  • So, this kind of motivates a slight generalization

  • of this algorithm called k-nearest neighbors.

  • So rather than just looking for the single nearest neighbor,

  • instead we'll do something a little bit fancier

  • and find K of our nearest neighbors,

  • according to our distance metric,

  • and then take a vote among each of our neighbors.

  • And then predict the majority vote

  • among our neighbors.

  • You can imagine slightly more complex ways of doing this.

  • Maybe you'd vote weighted on the distance,

  • or something like that,

  • but the simplest thing that tends to work pretty well

  • is just taking a majority vote.
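
Here is a minimal sketch of the majority vote, assuming L1 distance and a hypothetical helper named `knn_predict`. The toy points are made up to mimic the situation above: one yellow point sitting inside a green cluster. The sketch does nothing special about ties.

```python
import numpy as np
from collections import Counter


def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    distances = np.abs(X_train - x).sum(axis=1)  # L1 distance to every point
    nearest = np.argsort(distances)[:k]          # indices of the k closest
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Four green points with a single yellow outlier among them.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.4, 0.4]])
y_train = np.array(["green", "green", "green", "green", "yellow"])
query = np.array([0.5, 0.5])

print(knn_predict(X_train, y_train, query, k=1))  # yellow
print(knn_predict(X_train, y_train, query, k=3))  # green
```

With K=1 the outlier decides the answer on its own; with K=3 it is outvoted by the two green points next closest to the query.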

  • So here we've shown the exact same set of points

  • using this K=1 nearest neighbor classifier,

  • as well as K=3 and K=5 in the middle and on the right.

  • And once we move to K=3, you can see that that spurious

  • yellow point in the middle of the green cluster

  • is no longer causing the points near that region

  • to be classified as yellow.

  • Now this entire green portion in the middle

  • is all being classified as green.

  • You can also see that these fingers

  • of the red and blue regions

  • are starting to get smoothed out

  • due to this majority voting.

  • And then, once we move to the K=5 case,

  • then these decision boundaries

  • between the blue and red regions

  • have become quite smooth and quite nice.

  • So, generally when you're using nearest neighbors

  • classifiers,

  • you almost always want to use some value of K,

  • which is larger than one

  • because this tends to smooth out your decision

  • boundaries and lead to better results.

  • Question?

  • [student asking a question]

  • Yes, so the question is,

  • what is the deal with these white regions?

  • The white regions are where there was no majority

  • among the k-nearest neighbors.

  • You could imagine maybe doing something slightly fancier

  • and maybe taking a guess or randomly selecting among

  • the majority winners,

  • but for this simple example we're just coloring it white

  • to indicate there was no majority

  • at those points.

  • Whenever we're thinking about computer vision

  • I think it's really useful to kind of flip

  • back and forth between several different viewpoints.

  • One, is this idea of high dimensional points in the plane,

  • and then the other is actually looking at concrete images.

  • Because the pixels of the image actually

  • allow us to think of these images as high dimensional

  • vectors.

  • And it's sort of useful to ping pong back and forth

  • between these two different viewpoints.

  • So then, sort of taking this k-nearest neighbor

  • and going back to the images

  • you can see that it's actually not very good.

  • Here I've colored in red and green

  • which images would actually be classified correctly

  • or incorrectly according to their nearest neighbor.

  • And you can see that it's really not very good.

  • But maybe if we used a larger value of K

  • then this would involve actually voting among

  • maybe the top three or the top five

  • or maybe even the whole row.

  • And you could imagine that that would end up being

  • a lot more robust to some of this noise that we see

  • when retrieving neighbors in this way.

  • So another choice we have when we're working

  • with the k-nearest neighbor algorithm

  • is determining exactly how we should be comparing

  • our different points.

  • In the examples we've shown so far,

  • we've talked about this L1 distance

  • which takes the sum of the absolute differences

  • between the pixels.

  • But another common choice is the L2 or Euclidean distance

  • where you take the square root of the sum of the squared differences

  • and take this as your distance.

  • Choosing different distance metrics actually

  • is a pretty interesting topic

  • because different distance metrics

  • make different assumptions about the underlying

  • geometry or topology that you'd expect in the space.

  • So, for this L1 distance, the shape drawn underneath is actually

  • a circle according to the L1 distance

  • and it forms this square shape thing

  • around the origin.

  • Where each of the points on this, on the square,

  • is equidistant from the origin according to L1,

  • whereas with the L2 or Euclidean distance

  • then this circle is a familiar circle,

  • it looks like what you'd expect.

  • So one interesting thing to point out between these two

  • metrics in particular,

  • is that the L1 distance depends on your choice

  • of coordinate system.

  • So if you were to rotate the coordinate frame

  • that would actually change the L1 distance

  • between the points.

  • Whereas changing the coordinate frame in the L2 distance

  • doesn't matter, it's the same thing no matter what

  • your coordinate frame is.

  • Maybe if your input features, if the individual entries

  • in your vector have some important meaning

  • for your task,

  • then maybe somehow L1 might be a more natural fit.

  • But if it's just a generic vector in some space

  • and you don't know which of the different elements,

  • you don't know what they actually mean,

  • then maybe L2 is slightly more natural.

  • And another point here is that

  • by using different distance metrics

  • we can actually generalize the k-nearest neighbor

  • classifier to many, many different types of data,

  • not just vectors, not just images.

  • So, for example, imagine you wanted to classify pieces

  • of text, then the only thing you need to do

  • to use k-nearest neighbors

  • is to specify some distance function

  • that can measure distances between maybe two paragraphs

  • or two sentences or something like that.

  • So, simply by specifying different distance metrics

  • we can actually apply this algorithm very generally

  • to basically any type of data.
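
As a sketch of that generality, here is k-nearest neighbors on short pieces of text. The lecture does not name a text distance, so the word-set Jaccard distance used here is an assumption, as are the function names and the toy sentences.

```python
from collections import Counter


def jaccard_distance(text_a, text_b):
    """1 minus the overlap of the two word sets (0 = identical sets, 1 = disjoint)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


def knn_predict_text(train_texts, train_labels, query, k=3,
                     distance=jaccard_distance):
    """Majority vote among the k training texts closest to the query."""
    ranked = sorted(range(len(train_texts)),
                    key=lambda i: distance(train_texts[i], query))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


train_texts = [
    "the cat sat on the mat",
    "a cat chased the mouse",
    "stocks fell sharply today",
    "the market rallied after the report",
]
train_labels = ["pets", "pets", "finance", "finance"]

print(knn_predict_text(train_texts, train_labels,
                       "the cat slept on the sofa", k=3))  # pets
```

Nothing in `knn_predict_text` knows it is working with text. Swapping in a different `distance` function is the only change needed to handle another type of data.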

  • Even though it's a kind of simple algorithm,

  • in general, it's a very good thing to try first

  • when you're looking at a new problem.

  • So then, it's also kind of interesting to think about

  • what is actually happening geometrically

  • if we choose different distance metrics.

  • So here we see the same set of points on the left

  • using the L1, or Manhattan distance,

  • and then, on the right, using the familiar L2,

  • or Euclidean distance.

  • And you can see that the shapes of these decision

  • boundaries actually change quite a bit

  • between the two metrics.

  • So when you're looking at L1 these decision boundaries

  • tend to follow the coordinate axes.

  • And this is again because the L1 depends on our choice

  • of coordinate system.

  • Where the L2 sort of doesn't really care about the

  • coordinate axis, it just puts the boundaries

  • where they should fall naturally.

  • My confession is that each of these examples

  • that I've shown you is actually from this interactive

  • web demo that I built,

  • where you can go and play with this k-nearest neighbor

  • classifier on your own.

  • And this is really hard to work on a projector screen.

  • So maybe you'll do that on your own time.

  • So, let's just go back to here.

  • Man, this is kind of embarrassing.

  • Okay, that was way more trouble than it was worth.

  • So, let's skip this, but I encourage you

  • to go play with this in your browser.

  • It's actually pretty fun

  • and kind of nice to build intuition about

  • how the decision boundary changes

  • as you change the K

  • and change your distance metric

  • and all those sorts of things.

  • Okay, so then the question is

  • once you're actually trying to use this algorithm

  • in practice, there's several choices

  • you need to make.

  • We talked about choosing different values of K.

  • We talked about choosing different distance metrics.

  • And the question becomes

  • how do you actually make these choices for your problem

  • and for your data?

  • So, these choices, of things like K and the distance metric,

  • we call hyperparameters,

  • because they are not necessarily learned from the training

  • data,

  • instead these are choices about your algorithm that you make

  • ahead of time

  • and there's no way to learn them directly from the data.

  • So, the question is how do you set these things

  • in practice?

  • And they turn out to be very problem-dependent.

  • And the simple thing that most people do is simply

  • try different values of hyperparameters for your data

  • and for your problem, and figure out which one works best.

  • There's a question?

  • [student asking a question]

  • So, the question is, where L1 distance might be preferable

  • to using L2 distance?

  • I think it's mainly problem-dependent,

  • it's sort of difficult to say

  • in which cases you think one might be better

  • than the other.

  • But I think that because L1 has this sort of coordinate

  • dependency, it actually depends on the coordinate system

  • of your data,

  • if you know that you have a vector,

  • and maybe the individual elements of the vector

  • have meaning.

  • Like maybe you're classifying employees for some reason

  • and then the different elements of that vector correspond

  • to different features or aspects of an employee.

  • Like their salary or the number of years they've been

  • working at the company or something like that.

  • So I think when your individual elements actually

  • have some meaning,

  • is where I think maybe using L1 might make a little bit

  • more sense.

  • But in general, again, this is a hyperparameter

  • and it really depends on your problem and your data

  • so the best answer is just to try them both

  • and see what works better.

  • Even this idea of trying out different values

  • of hyperparameters and seeing what works best,

  • there are many different choices here.

  • What exactly does it mean to try hyperparameters

  • and see what works best?

  • Well, the first idea you might think of

  • is simply choosing the hyperparameters that give you

  • the best accuracy or best performance

  • on your training data.

  • This is actually a really terrible idea.

  • You should never do this.

  • In the concrete case of the nearest neighbor

  • classifier, for example,

  • if we set K=1, we will always classify the training data

  • perfectly.
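
A quick way to see this is the Python sketch below. It is an illustration only: the helper names and the toy data are assumptions, and it assumes no two training points coincide. The labels are assigned at random, so there is nothing to learn, yet K=1 still scores perfectly on the training set.

```python
import random
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to query (squared L2)."""
    order = sorted(range(len(train_x)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_x, train_y, xs, ys, k):
    """Fraction of (xs, ys) classified correctly using the training set."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(xs, ys))
    return hits / len(xs)

rng = random.Random(0)
# Hypothetical toy data: random 2D points with random labels.
train_x = [(rng.random(), rng.random()) for _ in range(50)]
train_y = [rng.choice(["red", "blue"]) for _ in range(50)]

train_acc_k1 = accuracy(train_x, train_y, train_x, train_y, k=1)
print(train_acc_k1)  # 1.0: every training point is its own nearest neighbor
```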

  • So if we use this strategy we'll always pick K=1,

  • but, as we saw from the examples earlier,

  • in practice it seems that setting K equal to larger values

  • might cause us to misclassify some of the training data,

  • but, in fact, lead to better performance

  • on points that were not in the training data.

  • And ultimately in machine learning

  • we don't care about fitting the training data,

  • we really care about how our classifier,

  • or how our method,

  • will perform on unseen data after training.

  • So, this is a terrible idea, don't do this.

  • So, another idea that you might think of,

  • is maybe we'll take our full dataset

  • and we'll split it into some training data

  • and some test data.

  • And now I'll try training my algorithm with different

  • choices of hyperparameters on the training data

  • and then I'll go and apply that trained classifier

  • on the test data and now I will pick

  • the set of hyperparameters that cause me to perform best

  • on the test data.

  • This seems like maybe a more reasonable strategy,

  • but, in fact, this is also a terrible idea

  • and you should never do this.

  • Because, again, the point of machine learning systems

  • is that we want to know how our algorithm will perform.

  • So, the point of the test set

  • is to give us some estimate of how our method will do

  • on unseen data that's coming out from the wild.

  • And if we use this strategy of training many different

  • algorithms with different hyperparameters,

  • and then, selecting the one which does the best

  • on the test data,

  • then, it's possible, that we may have just picked

  • the right set of hyperparameters

  • that caused our algorithm to work quite well

  • on this testing set,

  • but now our performance on this test set

  • will no longer be representative

  • of our performance on new, unseen data.

  • So, again, you should not do this, this is a bad idea,

  • you'll get in trouble if you do this.

  • What is much more common, is to actually split your data

  • into three different sets.

  • You'll partition most of your data into a training set

  • and then you'll create a validation set

  • and a test set.

  • And now what we typically do is go and train our algorithm

  • with many different choices of hyperparameters

  • on the training set,

  • evaluate on the validation set,

  • and now pick the set of hyperparameters

  • which performs best on the validation set.

  • And now, after you've done all your development,

  • you've done all your debugging,

  • after you've done everything,

  • then you'd take that best performing classifier

  • on the validation set

  • and run it once on the test set.

  • And now that's the number that goes into your paper,

  • that's the number that goes into your report,

  • that's the number that actually is telling you how

  • your algorithm is doing on unseen data.
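
The three-way split can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the course's code: the toy data (two overlapping Gaussian blobs), the 60/20/20 split sizes, the candidate values of K, and the helper names are all made up for illustration.

```python
import random
from collections import Counter

def knn_predict(train, query, k):
    """train is a list of (point, label); majority vote of the k nearest points."""
    nearest = sorted(train,
                     key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, eval_set, k):
    """Fraction of eval_set classified correctly by k-NN over train."""
    return sum(knn_predict(train, x, k) == y for x, y in eval_set) / len(eval_set)

rng = random.Random(0)
# Hypothetical toy data: two overlapping 2D Gaussian blobs.
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(100)]
data += [((rng.gauss(1.5, 1), rng.gauss(1.5, 1)), "b") for _ in range(100)]
rng.shuffle(data)

# Split sizes are an arbitrary choice here: 60% train, 20% validation, 20% test.
train, val, test = data[:120], data[120:160], data[160:]

candidate_ks = [1, 3, 5, 7, 9]
val_acc = {k: accuracy(train, val, k) for k in candidate_ks}
best_k = max(candidate_ks, key=lambda k: val_acc[k])   # chosen on validation only

test_acc = accuracy(train, test, best_k)               # computed once, at the very end
print(best_k, val_acc[best_k], test_acc)
```

The test set appears in exactly one line, after the hyperparameter has already been fixed.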

  • And this is actually really, really important

  • that you keep a very strict separation between

  • the validation data and the test data.

  • So, for example, when we're working on research papers,

  • we typically only touch the test set

  • at the very last minute.

  • So, when I'm writing papers,

  • I tend to only touch the test set for my problem

  • in maybe the week before the deadline or so

  • to really ensure that we're not

  • being dishonest here and we're not reporting a number

  • which is unfair.

  • So, this is actually super important

  • and you want to make sure to keep your test data

  • quite under control.

  • So another strategy for setting hyperparameters

  • is called cross validation.

  • And this is used a little bit more commonly

  • for small data sets, not used so much in deep learning.

  • So here the idea is we're going to take our test data,

  • or we're going to take our dataset,

  • as usual, hold out some test set to use at the very end,

  • and now, for the rest of the data,

  • rather than splitting it into a single training

  • and validation partition,

  • instead, we can split our training data

  • into many different folds.

  • And now, in this way, we've cycled through choosing which

  • fold is going to be the validation set.

  • So now, in this example,

  • we're using five fold cross validation,

  • so you would train your algorithm with one set of

  • hyperparameters on the first four folds,

  • evaluate the performance on fold five,

  • and now go and retrain your algorithm on folds

  • one, two, three, and five,

  • evaluate on fold four,

  • and cycle through all the different folds.
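
The fold cycling can be sketched in Python as follows. This is a minimal sketch rather than a reference implementation: the toy data, the way folds are formed (every fifth item), and the names `cross_validate` and `knn_predict` are assumptions for illustration.

```python
import random
from collections import Counter

def knn_predict(train, query, k):
    """train is a list of (point, label); majority vote of the k nearest points."""
    nearest = sorted(train,
                     key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_validate(data, k, num_folds=5):
    """Return one validation accuracy per fold for a given K."""
    folds = [data[i::num_folds] for i in range(num_folds)]
    accuracies = []
    for i, val_fold in enumerate(folds):
        # Train on every fold except fold i, then evaluate on fold i.
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in val_fold)
        accuracies.append(hits / len(val_fold))
    return accuracies

rng = random.Random(0)
# Hypothetical toy data: two overlapping 2D Gaussian blobs.
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(60)]
data += [((rng.gauss(1.5, 1), rng.gauss(1.5, 1)), "b") for _ in range(60)]
rng.shuffle(data)

test, dev = data[:20], data[20:]   # the test set is held out and never used below

results = {k: cross_validate(dev, k) for k in (1, 3, 5, 7, 9)}
for k, accs in results.items():
    print(k, [round(a, 2) for a in accs], round(sum(accs) / len(accs), 3))
best_k = max(results, key=lambda k: sum(results[k]) / len(results[k]))
```

Each value of K yields five accuracies, one per validation fold, so the spread across folds is visible alongside the mean.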

  • And, when you do it this way,

  • you get much higher confidence about

  • which hyperparameters are going to perform

  • more robustly.

  • So this is kind of the gold standard to use,

  • but, in practice in deep learning

  • when we're training large models

  • and training is very computationally expensive,

  • this doesn't get used too much in practice.

  • Question?

  • [student asking a question]

  • Yeah, so the question is,

  • a little bit more concretely,

  • what's the difference between the training and the

  • validation set?

  • So, if you think about the k-nearest neighbor classifier

  • then the training set is this set of images with labels

  • where we memorize the labels.

  • And now, to classify an image,

  • we're going to take the image and compare it to each element

  • in the training data,

  • and then transfer the label from the nearest training point.

  • So now our algorithm will memorize everything

  • in the training set,

  • and now we'll take each element of the validation set

  • and compare it to each element in the training data

  • and then use this to determine what is the accuracy

  • of our classifier when it's applied on the validation set.

  • So this is the distinction between training

  • and validation.

  • Where your algorithm is able to see the labels

  • of the training set,

  • but for the validation set,

  • your algorithm doesn't have direct access to the labels.

  • We only use the labels of the validation set

  • to check how well our algorithm is doing.

  • A question?

  • [student asking a question]

  • The question is, whether the test set,

  • is it possible that the test set might not be

  • representative of data out there in the wild?

  • This definitely can be a problem in practice,

  • the underlying statistical assumption here is that

  • your data are all independently and identically distributed,

  • so that all of your data points should be

  • drawn from the same underlying probability distribution.

  • Of course, in practice, this might not always be the case,

  • and you definitely can run into cases

  • where the test set might not be super representative

  • of what you see in the wild.

  • So this is kind of a problem that dataset creators and

  • dataset curators need to think about.

  • But when I'm creating datasets, for example,

  • one thing I do,

  • is I'll go and collect a whole bunch of data all at once,

  • using the exact same methodology for collecting the data,

  • and then afterwards you go and partition it randomly

  • between train and test.

  • One thing that can screw you up here is

  • maybe if you're collecting data over time

  • and you make the earlier data, that you collect first,

  • be the training data,

  • and the later data that you collect be the test data,

  • then you actually might run into this shift

  • that could cause problems.

  • But as long as this partition is random

  • among your entire set of data points,

  • then that's how we try to alleviate this problem

  • in practice.

  • So then, once you've gone through this

  • cross validation procedure,

  • then you end up with graphs that look something like this.

  • So here, on the X axis, we are showing the value of K

  • for a k-nearest neighbor classifier on some problem,

  • and now on the Y axis, we are showing what is the accuracy

  • of our classifier on some dataset

  • for different values of K.

  • And you can see that, in this case,

  • we've done five fold cross validation over the data,

  • so, for each value of K we have five different examples

  • of how well this algorithm is doing.

  • And, actually, going back to the question about

  • having some test sets that are better or worse

  • for your algorithm,

  • using K fold cross validation

  • is maybe one way to help quantify that a little bit.

  • And, in that, we can see the variance of how this algorithm

  • performs on the different validation folds.

  • And that gives you some sense of,

  • not just what is the best,

  • but, also, what is the distribution of that performance.

  • So, whenever you're training machine learning models

  • you end up making plots like this,

  • where they show you what is your accuracy,

  • or your performance as a function of your hyperparameters,

  • and then you want to go and pick the model,

  • or the set of hyperparameters,

  • at the end of the day,

  • that performs the best on the validation set.

  • So, here we see that maybe about K=7 probably works

  • about best for this problem.

  • So, k-nearest neighbor classifiers on images

  • are actually almost never used in practice.

  • Because of all of these problems that we've talked about.

  • So, one problem is that it's very slow at test time,

  • which is the reverse of what we want,

  • which we talked about earlier.

  • Another problem is that

  • these things like Euclidean distance, or L1 distance,

  • are really not a very good way

  • to measure distances between images.

  • These, sort of, vectorial distance functions

  • do not correspond very well to perceptual similarity

  • between images.

  • How you perceive differences between images.

  • So, in this example, we've constructed,

  • there's this image on the left of a girl,

  • and then three different distorted images on the right

  • where we've blocked out her mouth,

  • we've actually shifted down by a couple pixels,

  • or tinted the entire image blue.

  • And, actually, if you compute the Euclidean distance

  • between the original and the boxed,

  • the original and the shifted,

  • and the original and the tinted,

  • they all have the same L2 distance.

  • Which is, maybe, not so good

  • because it sort of gives you the sense that

  • the L2 distance is really not doing a very good job

  • at capturing these perceptual distances between images.
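
A tiny numeric version of the same effect, as a Python sketch: the four-pixel "images" and their values are made up for illustration and are not the images from the slide. One distortion changes a single pixel a lot, the other nudges every pixel a little, and both land at the same L2 distance from the original.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length pixel vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 4-pixel "image"; the values are made up for illustration.
original = [10.0, 20.0, 30.0, 40.0]
boxed = [10.0, 20.0, 30.0, 44.0]    # one pixel changed a lot (like an occluding box)
tinted = [12.0, 22.0, 32.0, 42.0]   # every pixel nudged a little (like a color tint)

print(l2_distance(original, boxed))   # 4.0
print(l2_distance(original, tinted))  # 4.0
```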

  • Another, sort of, problem with the k-nearest neighbor

  • classifier has to do with something we call the curse

  • of dimensionality.

  • So, if you recall back this viewpoint we had of the

  • k-nearest neighbor classifier,

  • it's sort of dropping paint around each of the training

  • data points and using that to sort of partition the space.

  • So that means that if we expect the k-nearest neighbor

  • classifier to work well,

  • we kind of need our training examples to cover the space

  • quite densely.

  • Otherwise our nearest neighbors could actually be quite far

  • away and might not actually be very similar to our testing

  • points.

  • And the problem is,

  • that actually densely covering the space,

  • means that we need a number of training examples,

  • which is exponential in the dimension of the problem.

  • So this is very bad, exponential growth is always bad,

  • basically, you're never going to get enough images

  • to densely cover this space of pixels

  • in this high dimensional space.

  • So that's maybe another thing to keep in mind

  • when you're thinking about using k-nearest neighbor.

  • So, kind of the summary is that we're using

  • k-nearest neighbor to introduce this idea

  • of image classification.

  • We have a training set of images and labels

  • and then we use that

  • to predict these labels on the test set.

  • Question?

  • [student asking a question]

  • Oh, sorry, the question is,

  • what was going on with this picture?

  • What are the green and the blue dots?

  • So here, we have some training samples

  • which are represented by points,

  • and the color of the dot maybe represents the category

  • of the point, of this training sample.

  • So, if we're in one dimension,

  • then you maybe only need four training samples

  • to densely cover the space,

  • but if we move to two dimensions,

  • then, we now need, four times four is 16 training examples

  • to densely cover this space.

  • And if we move to three, four, five, many more dimensions,

  • the number of training examples that we need

  • to densely cover the space,

  • grows exponentially with the dimension.
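
The arithmetic behind this is easy to check with a short Python sketch. The function name `samples_to_cover` and the figure of four samples per axis (taken from the one-dimensional case above) are assumptions for illustration.

```python
def samples_to_cover(points_per_axis, dimension):
    """Grid points needed if each axis needs points_per_axis samples."""
    return points_per_axis ** dimension

for d in (1, 2, 3, 10):
    print(d, samples_to_cover(4, d))   # 4, 16, 64, 1048576

# A CIFAR-10 image has 32 * 32 * 3 = 3072 pixel values; the same count there
# is a number with well over a thousand digits.
print(len(str(samples_to_cover(4, 32 * 32 * 3))))
```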

  • So, this is kind of giving you the sense,

  • that maybe in two dimensions

  • we might have this kind of funny curved shape,

  • or you might have sort of arbitrary manifolds of labels

  • in different dimensional spaces.

  • Because the k-nearest neighbor algorithm

  • doesn't really make any assumptions about these

  • underlying manifolds,

  • the only way it can perform properly

  • is if it has quite a dense sample of training points

  • to work with.

  • So, this is kind of the overview of k-nearest neighbors

  • and you'll get a chance to actually implement this

  • and try it out on images in the first assignment.

  • So, if there's any last minute questions about KNN,

  • I'm going to move on to the next topic.

  • Question?

  • [student is asking a question]

  • Sorry, say that again.

  • [student is asking a question]

  • Yeah, so the question is,

  • why do these images have the same L2 distance?

  • And the answer is that, I carefully constructed them

  • to have the same L2 distance.

  • [laughing]

  • But it's just giving you the sense that the L2 distance

  • is not a very good measure of similarity between images.

  • And these images are actually all different from

  • each other in quite disparate ways.

  • If you're using KNN,

  • then the only thing you have to measure distance

  • between images,

  • is this single distance metric.

  • And this kind of gives you an example where

  • that distance metric is actually not capturing

  • the full description of distance or difference

  • between images.

  • So, in this case, I just sort of carefully constructed these

  • translations and these offsets to match exactly.

  • Question?

  • [student asking a question]

  • So, the question is,

  • maybe this is actually good,

  • because all of these things

  • are actually having the same distance to the image.

  • That's maybe true for this example,

  • but I think you could also construct examples where

  • maybe we have two original images

  • and then by putting the boxes in the right places

  • or tinting them,

  • we could cause it to be nearer to pretty much

  • anything that you want, right?

  • Because in this example, we can kind of like do arbitrary

  • shifting and tinting

  • to kind of change these distances nearly arbitrarily

  • without changing the perceptual nature of these images.

  • So, I think that this can actually screw you up

  • if you have many different original images.

  • Question?

  • [student is asking a question]

  • The question is,

  • whether or not it's common in real-world cases

  • to go back and retrain on the entire dataset

  • once you've found those best hyperparameters?

  • So, people do sometimes do this in practice,

  • but it's somewhat a matter of taste.

  • If you're really rushing for that deadline

  • and you've really got to get this model out the door

  • then, if it takes a long time to retrain the model

  • on the whole dataset,

  • then maybe you won't do it.

  • But if you have a little bit more time to spare

  • and a little bit more compute to spare,

  • and you want to squeeze out maybe that extra 1%

  • of performance, then that is a trick you can use.

  • So we kind of saw that the k-nearest neighbor

  • has a lot of the nice properties

  • of machine learning algorithms,

  • but in practice it's not so great,

  • and really not used very much in images.

  • So the next thing I'd like to talk about is

  • linear classification.

  • And linear classification is, again, quite a simple learning

  • algorithm, but this will become super important

  • and help us build up to whole neural networks

  • and whole convolutional networks.

  • So, one analogy people often talk about

  • when working with neural networks

  • is we think of them as being kind of like Lego blocks.

  • That you can have different kinds of components

  • of neural networks and you can stick these components

  • together to build these large different towers of

  • convolutional networks.

  • One of the most basic building blocks that we'll see

  • in different types of deep learning applications

  • is this linear classifier.

  • So, I think it's actually really important to

  • have a good understanding of what's happening

  • with linear classification.

  • Because these will end up generalizing quite nicely

  • to whole neural networks.

  • So another example of kind of this modular nature

  • of neural networks

  • comes from some research in our own lab on image captioning,

  • just as a little bit of a preview.

  • So here the setup is that we want to input an image

  • and then output a descriptive sentence

  • describing the image.

  • And the way this kind of works is that

  • we have one convolutional neural network that's looking

  • at the image,

  • and a recurrent neural network that knows

  • about language.

  • And we can kind of just stick these two pieces together

  • like Lego blocks and train the whole thing together

  • and end up with a pretty cool system

  • that can do some non-trivial things.

  • And we'll work through the details of this model as we go

  • forward in the class,

  • but this just gives you the sense that,

  • these deep neural networks are kind of like Legos

  • and this linear classifier

  • is kind of like the most basic building blocks

  • of these giant networks.

  • But that's a little bit too exciting for lecture two,

  • so we have to go back to CIFAR-10 for the moment.

  • [laughing]

  • So, recall that CIFAR-10 has these 50,000 training examples,

  • each image is 32 by 32 pixels and three color channels.

  • In linear classification, we're going to take a bit

  • of a different approach from k-nearest neighbor.

  • So, the linear classifier is one of the simplest examples

  • of what we call a parametric model.

  • So now, our parametric model actually has two different

  • components.

  • It's going to take in this image, maybe, of a cat on the left,

  • and this,

  • that we usually write as X for our input data,

  • and also a set of parameters, or weights,

  • which is usually called W, also sometimes theta,

  • depending on the literature.

  • And now we're going to write down some function

  • which takes in both the data, X, and the parameters, W,

  • and this'll spit out now 10 numbers describing

  • what are the scores corresponding to each of those 10

  • categories in CIFAR-10.

  • With the interpretation that, like the larger score for cat,

  • indicates a larger probability of that input X being cat.

  • And now, a question?

  • [student asking a question]

  • Sorry, can you repeat that?

  • [student asking a question]

  • Oh, so the question is what is the three?

  • The three, in this example, corresponds to the three color

  • channels, red, green, and blue.

  • Because we typically work on color images,

  • that's nice information that you don't want to throw away.

  • So, in the k-nearest neighbor setup

  • there was no parameters, instead,

  • we just kind of keep around the whole training data,

  • the whole training set,

  • and use that at test time.

  • But now, in a parametric approach,

  • we're going to summarize our knowledge of the training data

  • and stick all that knowledge into these parameters, W.

  • And now, at test time, we no longer need the actual

  • training data, we can throw it away.

  • We only need these parameters, W, at test time.

  • So this allows our models to now be more efficient

  • and actually run on maybe small devices like phones.

  • So, kind of, the whole story in deep learning

  • is coming up with the right structure for this

  • function, F.

  • You can imagine writing down different functional forms

  • for how to combine weights and data in different

  • complex ways, and these could correspond to different

  • network architectures.

  • But the simplest possible example

  • of combining these two things

  • is just, maybe, to multiply them.

  • And this is a linear classifier.

  • So here our F of X, W is just equal to the W times X.

  • Probably the simplest equation you can imagine.

  • So here,

  • if you kind of unpack the dimensions of these things,

  • we recall that our image was maybe 32 by 32 by 3 values.

  • So then, we're going to take those values and then stretch

  • them out into a long column vector

  • that has 3,072 by one entries.

  • And now we want to end up with 10 class scores.

  • We want to end up with 10 numbers for this image

  • giving us the scores for each of the 10 categories.

  • Which means that now our matrix, W,

  • needs to be ten by 3072.

  • So that once we multiply these two things out

  • then we'll end up with a single column vector

  • 10 by one, giving us our 10 class scores.

  • Also sometimes, you'll typically see this,

  • we'll often add a bias term

  • which will be a constant vector of 10 elements

  • that does not interact with the training data,

  • and instead just gives us some sort of data independent

  • preferences for some classes over another.
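
The shapes can be checked with a short Python sketch of f(x, W) = Wx + b. It uses plain lists rather than a numerical library, and the image and weights are random stand-ins, so nothing here is trained; the names `flatten` and `linear_scores` are assumptions for illustration.

```python
import random

NUM_CLASSES, HEIGHT, WIDTH, CHANNELS = 10, 32, 32, 3
NUM_PIXELS = HEIGHT * WIDTH * CHANNELS   # 3072

def flatten(image):
    """Stretch a HEIGHT x WIDTH x CHANNELS nested list into one long vector."""
    return [value for row in image for pixel in row for value in pixel]

def linear_scores(W, x, b):
    """f(x, W) = W x + b: one score per row of W."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

rng = random.Random(0)
# Random stand-ins for an image and for weights (hypothetical; nothing is trained here).
image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(WIDTH)]
         for _ in range(HEIGHT)]
W = [[rng.gauss(0, 0.01) for _ in range(NUM_PIXELS)] for _ in range(NUM_CLASSES)]
b = [0.0] * NUM_CLASSES

x = flatten(image)
scores = linear_scores(W, x, b)
print(len(x), len(W), len(W[0]), len(scores))   # 3072 10 3072 10
```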

  • So you might imagine that if your dataset was

  • unbalanced and had many more cats than dogs,

  • for example, then the bias elements corresponding

  • to cat would be higher than the other ones.

  • So if you kind of think about pictorially

  • what this function is doing,

  • in this figure we have an example on the left

  • of a simple image with just a two by two image,

  • so it has four pixels total.

  • So the way that the linear classifier works

  • is that we take this two by two image,

  • we stretch it out into a column vector

  • with four elements,

  • and now, in this example, we are just restricting to

  • three classes, cat, dog, and ship,

  • because you can't fit 10 on a slide,

  • and now our weight matrix is going to be three by four,

  • so we have four pixels and three classes.

  • And now, again, we have a three element bias vector

  • that gives us data independent bias terms

  • for each category.

  • Now we see that the cat score is going to be the inner

  • product between the pixels of our image

  • and this row in the weight matrix

  • added together with this bias term.

  • So, when you look at it this way

  • you can kind of understand linear classification

  • as almost a template matching approach.

  • Where each of the rows in this matrix

  • correspond to some template of the image.

  • And now the inner product or dot product

  • between the row of the matrix and the column

  • giving the pixels of the image,

  • computing this dot product kind of gives us

  • a similarity between this template for the class

  • and the pixels of our image.

  • And then bias just, again, gives you this data

  • independent scaling offset to each of the classes.
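
Here is a worked version of the template-matching view as a Python sketch. The pixel values, weights, and biases are made up for illustration and are not the numbers from the slide; the layout assumed is one row of W per class and one column per pixel.

```python
def linear_scores(W, x, b):
    """Each score is the dot product of one row (template) with x, plus a bias."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

classes = ["cat", "dog", "ship"]
x = [1.0, 2.0, 3.0, 4.0]          # 2x2 image stretched into 4 pixels (made-up values)
W = [[1.0, 0.0, 0.0, 1.0],        # cat template
     [0.0, 1.0, 0.0, 0.0],        # dog template
     [1.0, 1.0, 1.0, 1.0]]        # ship template
b = [0.5, -1.0, 0.0]              # data-independent offset per class

scores = dict(zip(classes, linear_scores(W, x, b)))
print(scores)   # {'cat': 5.5, 'dog': 1.0, 'ship': 10.0}
```

The cat score, for example, is 1*1 + 0*2 + 0*3 + 1*4 + 0.5 = 5.5: the template picks out the first and last pixels, and the bias shifts the result.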

  • If we think about linear classification

  • from this viewpoint of template matching

  • we can actually take the rows of that weight matrix

  • and unravel them back into images

  • and actually visualize those templates as images.

  • And this gives us some sense of what a linear

  • classifier might actually be doing

  • to try to understand our data.

  • So, in this example, we've gone ahead and trained

  • a linear classifier on our images.

  • And now on the bottom we're visualizing

  • what are those rows in that learned weight matrix

  • corresponding to each of the 10 categories

  • in CIFAR-10.

  • And in this way we kind of get a sense for what's

  • going on in these images.

  • So, for example, in the left, on the bottom left,

  • we see the template for the plane class,

  • kind of consists of this like blue blob,

  • this kind of blobby thing in the middle

  • and maybe blue in the background,

  • which gives you the sense that this linear classifier

  • for plane is maybe looking for blue stuff

  • and blobby stuff, and those features are going to cause

  • the classifier to like planes more.

  • Or if we look at this car example,

  • we kind of see that there's a red blobby thing

  • through the middle and a blue blobby thing at the top

  • that maybe is kind of a blurry windshield.

  • But this is a little bit weird,

  • this doesn't really look like a car.

  • No individual car actually looks like this.

  • So the problem is that the linear classifier

  • is only learning one template for each class.

  • So if there's sort of variations in how that class

  • might appear,

  • it's trying to average out all those different variations,

  • all those different appearances,

  • and use just one single template

  • to recognize each of those categories.

  • We can also see this pretty explicitly in the horse

  • classifier.

  • So in the horse classifier we see green stuff on the bottom

  • because horses are usually on grass.

  • And then, if you look carefully, the horse actually

  • seems to have maybe two heads, one head on each side.

  • And I've never seen a horse with two heads.

  • But the linear classifier is just doing the best

  • that it can, because it's only allowed to learn

  • one template per category.

  • And as we move forward into neural networks

  • and more complex models,

  • we'll be able to achieve much better accuracy

  • because they no longer have this restriction

  • of just learning a single template per category.

  • Another viewpoint of the linear classifier

  • is to go back to this idea of images

  • as points in high dimensional space.

  • And you can imagine that each of our images

  • is something like a point in this high dimensional space.

  • And now the linear classifier is putting in these

  • linear decision boundaries to try to draw linear

  • separation between one category

  • and the rest of the categories.

  • So maybe up on the upper-left hand side

  • we see these training examples of airplanes

  • and throughout the process of training

  • the linear classifier will go and try to draw this

  • blue line to separate out with a single line

  • the airplane class from all the rest of the classes.

  • And it's actually kind of fun if you watch during

  • the training process these lines will start out randomly

  • and then go and snap into place to try to separate

  • the data properly.

  • But when you think about linear classification

  • in this way, from this high dimensional point of view,

  • you can start to see again what are some of the problems

  • that might come up with linear classification.

  • And it's not too hard to construct examples

  • of datasets where a linear classifier will totally fail.

  • So, one example, on the left here,

  • is that, suppose we have a dataset of two categories,

  • and these are all maybe somewhat artificial,

  • but maybe our dataset has two categories,

  • blue and red.

  • And the blue category is anything where the number of pixels

  • in the image which are greater than zero is odd.

  • And anything where the number of pixels greater

  • than zero is even, we want to classify as the red category.

  • So if you actually go and draw what these different

  • decision regions look like in the plane,

  • you can see that our blue class with an odd number of pixels

  • is going to be these two quadrants in the plane,

  • and even will be the opposite two quadrants.

  • So now, there's no way that we can draw a single linear line

  • to separate the blue from the red.
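
A small brute-force check of this, as a Python sketch: four points stand in for the four quadrants, with opposite quadrants sharing a label, and the search tries every line on a coarse grid of weights and biases. The points, the grid, and the helper name `accuracy` are assumptions for illustration.

```python
import itertools

# Hypothetical 2D stand-in for the parity picture: opposite quadrants share a label.
points = [((1, 1), "blue"), ((-1, -1), "blue"), ((1, -1), "red"), ((-1, 1), "red")]

def accuracy(w1, w2, b):
    """Fraction of points a single line w1*x + w2*y + b = 0 labels correctly."""
    hits = 0
    for (x, y), label in points:
        predicted = "blue" if w1 * x + w2 * y + b > 0 else "red"
        hits += predicted == label
    return hits / len(points)

grid = [i / 2 for i in range(-4, 5)]   # -2.0, -1.5, ..., 2.0
best = max(accuracy(w1, w2, b) for w1, w2, b in itertools.product(grid, repeat=3))
print(best)   # 0.75: no line in the grid gets all four points right
```

The grid search is only a demonstration, but the limit is not an artifact of the grid: scoring both blue points above zero forces b to be positive, while scoring both red points at or below zero forces b to be non-positive, so no single line can get all four right.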

  • So this would be an example where a linear classifier

  • would really struggle.

  • And this is maybe not such an artificial thing after all.

  • Instead of counting pixels,

  • maybe we're actually trying to count whether the number

  • of animals or people in an image is odd or even.

  • So this kind of a parity problem

  • of separating odds from evens

  • is something that linear classification

  • really struggles with traditionally.

  • Other situations where a linear classifier really struggles

  • are multimodal situations.

  • So here on the right,

  • maybe our blue category has these three different islands

  • of where the blue category lives,

  • and then everything else is some other category.

  • So, something like horses,

  • which we saw in the previous example,

  • is a case where this actually might be happening

  • in practice.

  • Where there's maybe one island in the pixel space of

  • horses looking to the left,

  • and another island of horses looking to the right.

  • And now there's no good way to draw a single linear

  • boundary between these two isolated islands of data.

  • So anytime where you have multimodal data,

  • like one class

  • that can appear in different regions of space,

  • is another place where linear classifiers might struggle.
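
The same effect shows up even in one dimension. The sketch below assumes a made-up 1-D feature where blue sits in two separate islands with red in between; the data and the names `xs`, `ys`, and `accuracy` are illustrative only.

```python
# Hypothetical 1-D feature: blue (+1) lives in two separate islands,
# with the red class (-1) sitting in between them.
xs = [-3.2, -3.0, -2.8, -0.2, 0.0, 0.2, 2.8, 3.0, 3.2]
ys = [+1, +1, +1, -1, -1, -1, +1, +1, +1]


def accuracy(w, b):
    """Fraction of points on the correct side of the boundary w*x + b = 0."""
    hits = sum(1 for x, y in zip(xs, ys) if (1 if w * x + b > 0 else -1) == y)
    return hits / len(xs)


# In one dimension a linear classifier is just a threshold plus a direction,
# so scanning both directions and many offsets covers the interesting cases.
offsets = [i / 10 for i in range(-50, 51)]
best_accuracy = max(accuracy(w, b) for w in (-1.0, 1.0) for b in offsets)
print(f"best accuracy: {best_accuracy:.2f}")
# prints "best accuracy: 0.67"
```

A single threshold can only call one side of the axis blue, so at best it gets one blue island plus the red points right, or labels everything blue. Either way that is six of the nine points.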

  • So there's kind of a lot of problems with

  • linear classifiers, but it is a super simple algorithm,

  • super nice and easy to interpret and easy to understand.

  • So you'll actually be implementing these things

  • on your first homework assignment.

  • At this point,

  • we kind of talked about

  • what is the functional form corresponding to a

  • linear classifier.

  • And we've seen that this functional form

  • of matrix vector multiply

  • corresponds to this idea of template matching

  • and learning a single template for each category

  • in your data.

  • And then once we have this trained matrix

  • you can use it to actually go and get your scores

  • for any new example.
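
As a rough sketch of that matrix vector multiply, assuming three classes and a four-pixel "image": the weights, pixel values, and class names below are made up for illustration, and the bias term is left out to keep it short.

```python
# Hypothetical tiny setup: 3 classes, 4-pixel "images".
# Each row of W plays the role of the learned template for one class.
class_names = ["cat", "dog", "ship"]
W = [
    [0.2, -0.5, 0.1, 2.0],   # cat template
    [1.5, 1.3, 2.1, 0.0],    # dog template
    [0.0, 0.25, 0.2, -0.3],  # ship template
]
x = [56.0, 231.0, 24.0, 2.0]  # flattened pixel values of one image


def scores(W, x):
    """Matrix-vector multiply: one inner product (template match) per class."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]


s = scores(W, x)
# The predicted class is the one whose template gives the highest score.
predicted = class_names[max(range(len(s)), key=lambda k: s[k])]
print(predicted)
# prints "dog"
```

Each score is just the inner product of the image with one row of `W`, which is why a row can be read as the single template for its category.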

  • But what we have not told you is

  • how do you actually go about choosing the right W

  • for your dataset.

  • We've just talked about what is the functional form

  • and what is going on with this thing.

  • So that's something we'll really focus on next time.

  • And next lecture we'll talk about

  • what are the strategies and algorithms

  • for choosing the right W.

  • And this will lead us to questions

  • of loss functions and optimization

  • and eventually ConvNets.

  • So, that's a bit of the preview for next week.

  • And that's all we have for today.
