## Subtitles section

• The following content is provided under a Creative Commons license.

• Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free.

• To make a donation, or to view additional materials

• from hundreds of MIT courses, visit MIT OpenCourseWare

• at ocw.mit.edu.

• GILBERT STRANG: So this is a pretty key lecture.

• This lecture is about principal component analysis, PCA--

• which is a major tool in understanding a matrix of data.

• So what is PCA about?

• Well first of all, let me remember what

• was the whole point of last--

• yesterday's lecture-- the singular value decomposition,

• that any matrix A could be broken into r rank 1 pieces--

• r being the rank of the matrix.

• And each piece has a U times a V transpose.

• And the good-- special thing is, the U's are orthonormal,

• and also, the V's are orthonormal.

• OK.

• So that's the whole matrix.
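That decomposition can be checked directly. Here is a small numpy sketch (the matrix size and random data are just for illustration) that rebuilds A as the sum of its r rank 1 pieces sigma_i u_i v_i transpose:

```python
import numpy as np

# A small random matrix; np.linalg.svd returns A = U diag(sigma) V^T.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
r = np.linalg.matrix_rank(A)  # the rank r (here 3, almost surely)

# Sum the r rank-1 pieces sigma_i * u_i * v_i^T.
pieces = [sigma[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r)]
A_rebuilt = sum(pieces)

print(np.allclose(A, A_rebuilt))
```

The columns of U and the rows of Vt are the orthonormal u's and v's from the lecture.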

• But we have a big matrix, and we want

• to get the important information out of it--

• not all the information.

• And people say, in machine learning,

• if you've learned all the training data, you haven't

• learned anything, really.

• You've just copied it all in.

• The whole point of neural nets and the process

• of machine learning is to learn important facts about the data.

• And now, here we're at the most basic stage of that.

• And I claim that the important facts about the matrix

• are in its largest k singular values--

• the largest k pieces.

• We can take-- k equal 1 would tell us

• the largest single piece.

• But maybe we have space and computing power

• to handle a hundred pieces.

• So I would take k equal 100.

• The matrix might have rank in the thousands.

• So I claim that Ak is the best.

• Now here's the one theorem for today, that Ak--

• using the first k pieces of the SVD--

• is the best approximation to A of rank k.

• So I'll write that down.

• So that really says why the SVD is perfect.

• OK.

• So that statement says, that if B--

• another matrix-- has rank k, then the distance from A to B--

• the error you're making in just using B--

• that error is greater than or equal to the error

• you make for the best guy.

• Now that's a pretty straightforward,

• beautiful fact.
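The theorem can be spot-checked numerically. In this sketch (sizes, k, and the competing rank k matrix B are arbitrary choices, not the lecture's own example), the truncated SVD A_k should never lose to B in the 2-norm:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 6))
k = 2

# Best rank-k approximation: keep the first k pieces of the SVD.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

# Any other rank-k matrix B -- here a random one -- should do no better.
B = rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))

err_best = np.linalg.norm(A - A_k, 2)  # equals sigma_{k+1}
err_B = np.linalg.norm(A - B, 2)
print(err_B >= err_best)
```

In the 2-norm the best error is exactly the first singular value that was thrown away.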

• And it goes back to people who discovered

• the SVD in the first place.

• But then a couple of psychologists

• gave a proof in a later paper--

• and it's often called the Eckart-Young Theorem.

• There is the theorem.

• Isn't that straightforward?

• And the hypothesis is straightforward.

• That's pretty nice.

• But of course, we have to think, why is it true?

• Why is it true?

• And to give meaning to the theorem,

• we have to say what these double bars are.

• Do you know the right name for this?

• So that double bar around a matrix is called the--

• the norm of the matrix, the norm.

• So I have to say something about matrix norms.

• How big is-- that's a measure of how big it is.

• And what I have to say is, there are many different measures

• of a matrix--

• how large that matrix is.

• Let me tell you, for today, three possible measures

• of a matrix.

• So different ways to measure--

• I'll call the matrix just A, maybe.

• But then I'm going to apply the measure to A minus B,

• and to A minus AK, and show that that is smaller.

• OK.

• So I want to tell you about the norm of A--

• about some possible norms of A. And actually, the norms I'm

• going to take today will be--

• will have the special feature that they can be found--

• computed by their singular values.

• So let me mention the L2 norm.

• That is the largest singular value.

• So that's an important measure of the--

• sort of the size of a matrix.

• I'm talking here about a general m by n matrix

• A. Sigma 1 is an important norm--

• often called the L2 norm.

• And that's where that index 2 goes.

• Oh.

• I should begin with norms of vectors-- and then build to the norms of matrices.

• Let me do norms of vectors over on this side.

• The L2 norm of a vector--

• do we know what that is?

• That's the regular length of the vector that we all expect--

• the square root of v1 squared up to vn squared.

• The hypotenuse-- the length of the hypotenuse

• in n dimensional space.

• That's the L2 norm, because of that 2.

• The L1 norm of a vector is just add up those pieces

• without squaring and square rooting them.

• That's the L1 norm.

• And you might say, why do we want two norms?

• Or there are more norms.

• Let me just tell you one more.

• The infinity norm-- and there is a reason for the 1 and the 2

• and the infinity--

• is the largest of the v's.

• OK.
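Those three vector norms can be computed on a small example (the vector is just an illustration; numpy's `norm` handles all three orders):

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])

l2 = np.linalg.norm(v)            # sqrt(3^2 + 4^2) = 5
l1 = np.linalg.norm(v, 1)         # |3| + |-4| + |0| = 7
linf = np.linalg.norm(v, np.inf)  # largest |v_i| = 4

print(l2, l1, linf)
```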

• Have you met norms before?

• I don't know.

• These are vector norms, but maybe you have met.

• Then we're going to have matrix norms, that maybe will be new.

• So this is the norm that we usually think of.

• But this one has become really, really important,

• and let me tell you just why.

• And then we'll-- later section of the notes and a later

• lecture in this course will develop that--

• develop this.

• This is the L1 norm.

• So this is L2, L1, and L infinity--

• [INAUDIBLE]

• Well, it just turned out-- and it was only discovered

• that when you minimize some function using the L1 norm,

• you minimize some, let's say, signal the noise,

• or whatever you minimize--

• some function.

• If you use L1, the winning vector--

• the minimizing vector-- turns out to be sparse.

• And what does sparse mean?

• Sparse means mostly zero components.

• Somehow, when I minimize in L2--

• which historically goes back to Gauss,

• the greatest mathematician of all time.

• When you minimize something in L2, you do the least squares.

• And you find that the guy that gives you the minimum

• has a lot of little numbers--

• lot of little components.

• Because when you square those little ones,

• they don't hurt much.

• But Gauss-- so Gauss didn't use the L1 norm.

• That has different names-- basis pursuit.

• And it comes into signal processing and sensing.

• Right.

• And then it was discovered that if you minimize--

• as we'll see in that norm--

• you amazingly get-- the winning vector has--

• is mostly zeros.

• And the advantage of that is that you can understand

• what its components are.

• The one with many small components,

• you have no interpretation for that answer.

• But for an answer that just has a few non-zero components,

• you really see what's happening.
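A hedged sketch of that sparsity effect, assuming scipy is available: basis pursuit-- minimize the L1 norm of x subject to Ax = b-- can be written as a linear program by splitting x = u - v with u, v >= 0 (a standard reformulation; the sizes and random data below are arbitrary, not the lecture's example):

```python
import numpy as np
from scipy.optimize import linprog

# Underdetermined system: 3 equations, 6 unknowns -> many solutions.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 6))
b = rng.standard_normal(3)
n = A.shape[1]

# Least-squares (L2) picture: typically every component is small but nonzero.
x_l2 = np.linalg.lstsq(A, b, rcond=None)[0]

# Basis pursuit (L1): min sum(u + v) subject to A(u - v) = b, u, v >= 0.
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
x_l1 = res.x[:n] - res.x[n:]

print(np.sum(np.abs(x_l2) > 1e-6))  # count of nonzeros in the L2 answer
print(np.sum(np.abs(x_l1) > 1e-6))  # typically fewer: the L1 winner is sparse
```

A vertex solution of this LP has at most as many nonzeros as there are equations, which is exactly the sparsity the lecture is pointing at.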

• And then this is a important one, too.

• OK.

• Now I'm going to turn just to-- so

• what's the property of a norm?

• Well, you can see that the norm of C times a vector is--

• just multiplying by 6, or 11, or minus

• pi, or whatever-- is the size of C times the norm of the vector. Norms

• have that nice property.

• They're homogeneous, or whatever word.

• If you double the vector, you should double the norm--

• double the length.

• That makes sense.

• And then the important property is that--

• is the famous triangle inequality--

• that if v and w are two sides of a triangle,

• and you take the norm of v and add to the norm of w--

• the two sides--

• you get more than the straight norm along the hypotenuse.

• Yeah.

• So those are properties that we require,

• and the fact that the norm is positive, which is--

• I won't write down.

• But it's important too.

• OK.

• So those are norms, and those will apply also

• to matrix norms.

• So if I double the matrix, I want to double its norm.

• And of course, that works for that 2 norm.

• And actually, probably-- so the triangle inequality for this

• norm is saying that the largest singular value of A plus B--

• two matrices-- is less than or equal to the largest

• singular value of A plus the largest singular value of B.

• And that's-- we won't take class time to check minor,

• straightforward things like that.
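Those two requirements-- homogeneity and the triangle inequality, for vectors and for the matrix 2-norm-- can be sanity-checked on random data (a numerical spot check, of course, not a proof):

```python
import numpy as np

rng = np.random.default_rng(3)

# Vector properties: ||c v|| = |c| ||v||, and ||v + w|| <= ||v|| + ||w||.
v, w = rng.standard_normal(4), rng.standard_normal(4)
c = -3.0
print(np.isclose(np.linalg.norm(c * v), abs(c) * np.linalg.norm(v)))
print(np.linalg.norm(v + w) <= np.linalg.norm(v) + np.linalg.norm(w))

# Matrix 2-norm version: sigma_1(A + B) <= sigma_1(A) + sigma_1(B).
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
print(np.linalg.norm(A + B, 2) <= np.linalg.norm(A, 2) + np.linalg.norm(B, 2))
```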

• So now I'm going to continue with the three norms

• that I want to tell you about.

• That's a very important one.

• Then there is another norm that's named--

• has an F. And it's named after Frobenius.

• And what is that norm?

• That norm looks at all the entries in the matrix--

• just like it was a long vector--

• and squares them all, and adds them up.

• So in a way, it's like the 2 norm for a vector.

• It's-- so the squared--

• or shall I put square root?

• Maybe I should.

• It's the square root of the sum of the squares of all the entries in the matrix.

• So a11 squared, up to a1n squared, plus the next row, a21 squared, and so on.

• You finally get to amn squared.

• You just treat the matrix like a long vector.

• And take this square root just like so.

• That's the Frobenius norm.

• And then finally, not so well known,

• is something that's more like L1.

• It's called the nuclear norm.

• So it is the sum of the sigmas-- the singular values.

• I guess there are r of them.

• So that's where we would stop.

• Oh, OK.

• So those are three norms.
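All three of those matrix norms can be read off from the singular values, which is easy to verify in numpy on a random example:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 4))
sigma = np.linalg.svd(A, compute_uv=False)

spectral = np.linalg.norm(A, 2)       # L2 norm: sigma_1
frobenius = np.linalg.norm(A, 'fro')  # sqrt of the sum of all entries squared
nuclear = np.linalg.norm(A, 'nuc')    # sigma_1 + sigma_2 + ... + sigma_r

print(np.isclose(spectral, sigma[0]))
print(np.isclose(frobenius, np.sqrt(np.sum(sigma**2))))
print(np.isclose(nuclear, np.sum(sigma)))
```

The Frobenius identity holds because the squared entries sum to the sum of the sigma_i squared-- that is the sense in which these norms depend only on the sigmas.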

• Now why do I pick on those three norms?

• And here's the point--

• that for those three norms, this statement is true.

• I could cook up other matrix norms

• for which this wouldn't work.

• But for these three highly important norms,

• this Eckart-Young statement holds-- that the closest rank k

• approximation is found from the first k pieces.

• You see, that's a good thing, because this is

• what we compute from the SVD.

• So now we've solved an approximation problem.

• We found the best B is Ak.

• And the point is, it could use all the-- any of those norms.

• So there would be a--

• well, somebody finally came up with a proof that

• does all three norms at once.

• In the notes, I do that one separately from Frobenius.

• And actually, I found--

• in an MIT thesis--

• I was just reading a course 6 PhD thesis--

• and the author-- who is speaking tomorrow, or Friday in IDSS--

• Dr. [? Cerebro ?] found a nice new proof of Frobenius.

• And it's in the notes, as well as an older proof.

• OK.

• You know, as I talk here, I'm not too sure

• whether it is essential for me to go through the proof,

• either in the L2 norm--

• which takes half a page in the notes--

• or in the Frobenius norm, which takes more.

• I'd rather you see the point.

• The point is that, in these norms-- and now,

• what is special about these norms of a matrix?

• These depend only on the sigmas--