Lecture 32: ImageNet is a Convolutional Neural Network (CNN), The Convolution Rule


Description

Professor Strang begins the lecture talking about ImageNet, a large visual database used in visual object recognition software research. The deep network trained on ImageNet that he describes is a convolutional neural network (CNN). The rest of the lecture focuses on convolution.

Summary

Convolution matrices have \(\leq n\) parameters (not \(n^2\)).
Fewer weights to compute in deep learning
Component \(k\) from convolution \(c*d\): Add all \(c(j)d(k-j)\)
Convolution Rule: \(F(c*d) = Fc\) times \(Fd\) (component by component)
\(F\) = Fourier matrix with \(j\), \(k\) entry \(= \exp (2 \pi i j k /n)\)

Related section in textbook: IV.2

Instructor: Prof. Gilbert Strang

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

GILBERT STRANG: I'm determined to tell you something about the convolution rule. I just get close to it, but haven't quite got there. And actually, I'd like to say something also about why convolution is so important. I mentioned here a paper about images in deep learning by-- it has three authors, and these are two of them. Maybe you recognize Hinton's name. He's originally English. He was in San Diego for quite a few years, and now he's in Canada. So Toronto and Montreal are big centers now for deep learning. And he's really one of the leaders, and so is Sutskever.

So maybe you know that the progress of deep learning can often be measured in these competitions that are held about every year for how well people design and execute a whole neural net. And this was a competition about images. So that is really demanding, because, as I said last time, an image has so many samples, so many pixels, that the computational problem is enormous. And that's when you would go to convolutional neural nets, CNNs, because a convolutional net takes fewer weights, because the same weights appear along diagonals. It doesn't need a full matrix of weights, just one top row of weights.

Anyway, so this is one of the historical papers in the history of deep learning. I'll just read a couple of sentences. We trained-- so this is the abstract. We trained a large deep convolutional neural network. I'll just say that it ran for five days on two GPUs. So it was an enormous problem, as we'll see. So we trained a large deep network, CNN, to classify 1.2 million high res images in ImageNet. So ImageNet is a source of millions of images.

And on the test data, they-- well, the last sentence is maybe the key. We entered a variant of this model in the 2012 competition, and we achieved a winning top-5 test error rate of 15%, compared to 26% for the second place team. So 15% error, while 26% was the best that the rest of the world did.

And so that-- and when he shows the network, you realize what's gone into it. It has convolution layers, and it has some normal layers, and it has max pooling layers to cut the dimension down a little bit. And half the samples go on one GPU and half another. And at certain points, layers interconnect between the two GPUs.

And so to reduce overfitting-- you remember that. A key problem is to reduce overfitting in the fully connected layers. Those are the ordinary layers with full weight matrices. We employed a recently developed regularization method called dropout. So dropout is a tool which, if you're in this world, you-- I think Hinton proposed it, again, by seeing that it worked. It's just a careful dropout of some of the data. It reduces the amount of data, and it doesn't harm the problem.

So the neural network has 60 million parameters. 60 million. With 650,000 neurons, five convolutional layers, and three fully connected layers. I just mention this. If you just Google these two names on the web, this paper would come up. So we're talking about the real thing here. Convolution is something everybody wants to understand.

And I'd like to-- since I started several days ago, I'd like to remember what convolution means. Let me-- so if I convolve two vectors c and d and I look for the k-th component of the answer, the formula is I add up all the c's times d's where the indices i plus j add to k. Why do you do such a thing? Because c might be represented by a polynomial, say c0 plus c1 x up to cN x to the N. And d might be represented by another, d0 plus d1 x up to dM x to the M, let's say.

And convolution arises when I multiply those polynomials. Because for a typical-- and then collect terms. Because a typical power of x, say x to the k, the coefficients are-- well, how do we get x to the k in multiplying these? I multiply c0 times a dk. Somewhere in here would be a dk x to the k. So a c0 times a dk would give me an x to the k term. And a c1 times-- everybody sees this coming now? c1 has an x in it already. So over there, we would look at dk minus 1 with one less x. So it would be c1 dk minus 1.

This is just what you do when you multiply polynomials. And the point is that the way we recognize those terms is that the exponents 0 and k, the exponents 1 and k minus 1, always add to k. So that's where this formula comes from. We take a c times a d hiding behind a ci x to the i and a dj x to the j, and when i plus j is k, this is x to the k. And that's the term we're capturing. So this is the coefficient of that term.

And let me write it in a slightly different way, where you actually see even more clearly convolution operating. So j is k minus i, right? So it's the sum of ci dj, but the j has to be k minus i. So this is the way to remember the formula for the coefficients in c star d, in the convolution. You look at c's times d's. It's a form of multiplication. It comes from ordinary multiplication of polynomials. And when you collect terms, you're collecting the i-th c and the (k minus i)-th d, and you're taking all possible i's. So it's a sum over all possible i's there to give you the k-th answer.
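
A quick numerical check of that coefficient formula, not from the lecture: a minimal Python/NumPy sketch (example vectors are my own) that compares the explicit double sum with np.convolve, which performs exactly this polynomial-multiplication convolution.

```python
import numpy as np

def conv_coefficients(c, d):
    """(c*d)_k = sum over i of c[i] * d[k-i], the coefficients of the product polynomial."""
    n = len(c) + len(d) - 1          # degrees (p-1) + (q-1), plus one for the constant term
    out = np.zeros(n)
    for i, ci in enumerate(c):
        for j, dj in enumerate(d):
            out[i + j] += ci * dj    # exponents i and j add to k = i + j
    return out

c = np.array([1.0, 2.0, 3.0])        # 1 + 2x + 3x^2
d = np.array([4.0, 5.0])             # 4 + 5x
print(conv_coefficients(c, d))       # [ 4. 13. 22. 15.]
print(np.convolve(c, d))             # same answer from NumPy
```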

Well, just to see if you got the idea, what would be the convolution of two functions? Suppose I have a function f of x. And I want to convolve that with a function g of x. OK. And notice that I have not circled this symbol. So I'm not doing periodic convolution. I'm just doing straightforward convolution. So what are we going to have in the case of two functions? What would that mean, a convolution of functions?

I'm in parallel here with a convolution of two vectors. So think of these as now having become functions. The k-th component has become-- really, I should say f star g at x. That's really the parallel to this. So let me-- so I'm telling you the answer at x. Here I told you the answer at k. The k-th component looks like that. What does the x value of the convolution look like for functions?

OK, I'm just going to do this. I'm going to do the same as this. Instead of summing, what will I do? Integrate. Instead of c sub i, I'll have f of x. The index i is changing over to the continuous variable x. And now g instead of dk minus i, what do I have here? So it's the k minus i component. That will go to-- let me just write it down-- t minus x.

So in this translation, f is being translated to c. Or sorry, f corresponds to c. g corresponds to d. k corresponds to x. Oh no, sorry. i corresponds to x. And k minus i corresponds to t minus x. So k corresponds to t. This would be the convolution of two functions. Oh, it's a function of t. Bad notation.

The t is sort of the amount of shift. See, I've shifted g. I've reversed it. I've flipped it and I'm shifting it by different amounts t. It's what you have in a filter. It's always present in signal processing. So that would be a definition. Or, if you like, if you want an x variable to come out, let me make an x variable come out by exchanging t and x. So this would be f of t times g of x minus t, dt. I like that, actually, a little better. And it's the integral over t from minus infinity to infinity if our functions were on the whole line.
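
Not in the lecture, but to make that integral concrete: a small sketch with functions of my own choosing (Gaussians, which have a closed-form convolution), approximating (f*g)(x) by a Riemann sum.

```python
import numpy as np

# My own example: f(x) = g(x) = exp(-x^2); their convolution is sqrt(pi/2) * exp(-x^2 / 2).
f = lambda x: np.exp(-x**2)
g = lambda x: np.exp(-x**2)

def conv_at(x, a=-10.0, b=10.0, m=200_000):
    """Riemann-sum approximation of (f*g)(x) = integral over t of f(t) * g(x - t) dt."""
    t = np.linspace(a, b, m)
    dt = t[1] - t[0]
    return np.sum(f(t) * g(x - t)) * dt

x = 1.3
print(conv_at(x))                              # numerical value of the convolution at x
print(np.sqrt(np.pi / 2) * np.exp(-x**2 / 2))  # closed form; agrees to many decimal places
```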

So there will be a convolution rule for that. This will connect to the Fourier transform of the two functions. Over here, I'm connecting it to the discrete Fourier transform of the two functions. And I've been making the convolution cyclic. So what does-- can I add cyclic now? This is ordinary convolution. This is what you had in the first lab, I think, from Raj Rao.

The first lab, you remember you had to figure out how many components the convolution would have? And you didn't make it cyclic. So for a cyclic convolution, if this has n components and this has n components, then the convolution has n components. Because n is the key number there, the length of the period.

And similarly, over here, if f is 2 pi periodic and g is 2 pi periodic, then we might want to do a periodic convolution and get an answer that also has period 2 pi. So you could compute the convolution of sine x with cos x, for example.

OK, let's stick with vectors. So what's the deal when I make it cyclic? When I make it cyclic, then in this multiplication, I really should use-- I've introduced w as that instead of x. So cyclic. x becomes this number w, which is e to the 2 pi i over n and has the property then that w to the n-th is 1 so that all vectors of length greater than n can be folded back using this rule to a vector of length n. So we get a cyclic guy.

So how does that change the answer? Well, I only want k going from 0 to n minus 1 in the cyclic case. I don't want infinitely many components. I've got to bring them back again. And let me just say what the rule would be. You just ask, say, i plus j. You would look at that modulo n. That's what a number theory person would call it. We only look at the remainder when we divide by n. So now the sums go only from 0 to n minus 1, and I only get an answer from 0 to n minus 1. Well, I've done that pretty quickly. That's if I wanted to do justice to--

So the difference between non-periodic-- so non-periodic and periodic will be the difference between-- so I have some number t0 on the diagonal, then t1, t2, t minus 1, t minus 2, and so on. Constant diagonals. So the key name there is Toeplitz. And if it's periodic, then I have, I'll say, c0, c0, c0 down the diagonal. And then the next one will be c1, c1, coming around to c1. And c2 coming around. So it's n by n, period n. So it's a circulant matrix, n by n.

OK. That's the big picture. And I think in that first lab, you were asked to do the non-circulant case. Because that's the one where you have to do a little patience. What will be the length? Yeah, what would be the length of a non-circulant? So not circulant. Now, suppose the c vector has p components and the d vector has q components. How many components in their convolution? Shall I write that question down? Because that brings out the difference here.

So if c has p components and d has q components, then the convolution of c and d has how many? So I'm multiplying. So really this corresponds to a polynomial of degree p minus 1, right? A polynomial of degree p minus 1.

And this guy would be degree q minus 1. Degree q minus 1. And when I multiply them, what's the degree? Just add. And how many coefficients? Well, one more I have to remember for that stupid 0 order term. So this would have p plus q minus 1 components. So that would have been the number that you've somehow had to work out in that first lab.

So if this had n components and this had n, this would have 2n minus 1. It's just what you would have-- like, say, 3 plus x times 1 plus 2x. In this case, p is 2, q is 2, two components, two components. And if I multiply those, I get 3, then 6x plus x is 7x, and 2x squared. And so I have 2 plus 2 minus 1 equals 3 components: the constant, x, and x squared. Yeah, clear, right?

Yeah, so that's not the-- that's what I would get if I multiplied these matrices, if I had a two diagonal Toeplitz matrix times a two diagonal Toeplitz matrix. That would give me a three diagonal answer. But if I am doing it periodically, I would only have two. That 2x squared would come back if I-- come back as a 2, so I just have 5 plus 7x. Right, good, good, good.
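
A hedged sketch of that folding (my own code, not the lab's): ordinary convolution of [3, 1] and [1, 2], then a cyclic version that wraps the index k back modulo n, reproducing 3 + 7x + 2x squared and then 5 + 7x.

```python
import numpy as np

def cyclic_convolve(c, d):
    """Cyclic convolution: component k sums c[i] * d[j] over all i + j = k (mod n)."""
    n = len(c)
    assert len(d) == n
    out = np.zeros(n)
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] += c[i] * d[j]
    return out

c = np.array([3.0, 1.0])              # 3 + x
d = np.array([1.0, 2.0])              # 1 + 2x
print(np.convolve(c, d))              # [3. 7. 2.]  ->  3 + 7x + 2x^2
print(cyclic_convolve(c, d))          # [5. 7.]     ->  5 + 7x  (the 2x^2 wraps back into the constant)
```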

OK. So that's a reminder of what convolution is. Cyclic and non-cyclic, vectors and functions. OK, then eigenvalues and eigenvectors are the next step, and then the convolution rule is the last step. So eigenvectors. Eigenvectors of the circulant. Of course, I can only do square matrices.

So I'm doing the periodic case. So the eigenvectors are the columns of the eigenvector matrix. And I'm going to call it F for Fourier. So F is-- the first eigenvector is all 1s. The next eigenvector is powers of i, the fourth root of 1: 1, i, i squared, i cubed. The next is 1, i squared, i fourth, i sixth. And finally, 1, i cubed, i sixth, i ninth. OK, that's F. Those are the four eigenvectors of the permutation P and of any polynomial in P. So my circulant is some c0 I plus c1 P plus c2 P squared plus c3 P cubed.

OK. And finally, this is the step we've been almost ready to do but didn't quite do. What are the eigenvectors-- what eigenvectors are its eigenvectors? So those are the eigenvectors of p. And now we have just a combination of p's. So I think the eigenvectors I just multiply. I take that same combination of the eigenvectors. Does that look right?

So sorry. I'm sorry. Its eigenvectors, they're the columns of F. The question I meant to ask is what are its eigenvalues? That's the key question. What are the eigenvalues? And I think that if I just multiply F times c, I get the eigenvalues of the matrix C.

That's the beauty. That's the nice formula. If my matrix is just P alone, then this is 0, 1, 0, 0, and I get 1, i, i squared, i cubed. But if c is some other combination of the p's, then I take the same combination of the eigenvectors to see-- yeah. Do you see it?

So I'm claiming that I'll get four eigenvalues of C from this multiplication. So of course, if there's only c0, then I only get c0, c0, c0, c0. It's four times repeated. But if it's this combination, then that matrix multiplication takes the same combination of-- this is a combination of the eigenvectors. And that gives us the right thing. OK. Now I just have one more step for this convolution rule, and then I'm happy.
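
To see that claim numerically, here is a minimal sketch of my own (n = 4, example values for c are mine): build the cyclic permutation P in the convention where its eigenvalues are 1, w, w squared, w cubed, form the circulant c0 I + c1 P + c2 P squared + c3 P cubed and the Fourier matrix F with entries w to the jk, and check that each column of F is an eigenvector of C with eigenvalue given by F times c.

```python
import numpy as np

n = 4
w = np.exp(2j * np.pi / n)                        # w^n = 1
F = w ** np.outer(np.arange(n), np.arange(n))     # Fourier matrix, entry (j, k) is w^(jk)

P = np.roll(np.eye(n), 1, axis=1)                 # cyclic permutation with eigenvalues 1, w, w^2, w^3
c = np.array([1.0, 2.0, 3.0, 4.0])                # c0, c1, c2, c3 (my example values)
C = sum(c[m] * np.linalg.matrix_power(P, m) for m in range(n))  # circulant c0 I + c1 P + c2 P^2 + c3 P^3

lam = F @ c                                       # claimed eigenvalues of C
print(np.allclose(C @ F, F * lam))                # True: column k of F is an eigenvector with eigenvalue lam[k]
```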

Really, the convolution rule is stating what we-- it's stating a relation between multiplication, which we saw here, and the convolution, which we saw for the coefficients. So the convolution rule is a connection between multiplying and convolution. And so let me say what that convolution rule is and let me write it correctly.

So here I take a cyclic convolution. I'm dealing with square matrices. Everything is cyclic here. And then I get-- if I multiply by F, what do I have now? What does that represent? This was c and d, and I convolve them. So I got another circulant matrix.

So up here, the multiplication of matrices is C times D. I want to connect multiplying those matrices with convolving the c's. I want to make that connection. And that connection is the convolution rule. So this would be the eigenvalues of CD.

Let's just pause there. Why am I looking at the eigenvalues of CD? Because if I do that multiplication, I get another circulant matrix, C times D. And the polynomial-- the coefficients associated with the diagonals of C times D are the coefficients of the convolution. So its diagonals come from convolving c with d cyclically. OK.

Now I want to find the same eigenvalues in a second way and match-- and the equation will be the convolution rule. So how can I find the eigenvalues of CD? Well, amazingly, they are the eigenvalues of C times the eigenvalues of D. I'm going to test this rule on 2 by 2. So you'll see everything happening. So this is the main-- this is the fact that I want to use.

Because C and D commute. C and D commute. They have the same eigenvectors. And then the eigenvalues just multiply. So I can multiply. I can get that in a second way by taking the eigenvalues of c and multiplying those by the eigenvalues of d. And I multiply component by component. I multiply the eigenvalue for the all 1s vector by the eigenvalue for the all 1s vector.

Do you know this MATLAB command? Component by component multiplication? This is an important one. There's a guy's name also associated with that. So that's a vector. That's a vector. And what comes out of that operation? If I have a vector with three components-- so n is 3 here-- and I do point star, or dot star, I'm not sure what people usually say. Component by component, a three component vector times a three component vector, I get a three component vector, just like that. So this is the convolution rule. That's the convolution rule.

And the proof is the fact that when matrices commute, the eigenvalues of the product are just these eigenvalues times these eigenvalues, because they have the same-- the eigenvectors are always the same here for all these circulants. So there's the convolution rule that I can convolve and then transform. Or I can transform separately and then multiply.

So maybe I'd better just write that convolution rule. Let's call it the C rule. Convolve, then transform by F. Or transform separately by F, and then multiply pointwise. Element by element. Component by component. OK. So that's the convolution rule.
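
Here is a short check of that rule, a sketch of my own with the same F as above rebuilt here and example vectors of my choosing: F applied to the cyclic convolution of c and d matches the component-by-component product of F c and F d.

```python
import numpy as np

n = 4
w = np.exp(2j * np.pi / n)
F = w ** np.outer(np.arange(n), np.arange(n))    # Fourier matrix, entries w^(jk)

def cyclic_convolve(c, d):
    out = np.zeros(n)
    for i in range(n):
        for j in range(n):
            out[(i + j) % n] += c[i] * d[j]
    return out

c = np.array([1.0, 2.0, 3.0, 4.0])
d = np.array([5.0, 6.0, 7.0, 8.0])

lhs = F @ cyclic_convolve(c, d)                  # convolve, then transform
rhs = (F @ c) * (F @ d)                          # transform separately, then multiply component by component
print(np.allclose(lhs, rhs))                     # True: the convolution rule
```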

And why is it sort of-- why is it so important? Because transforming by F, multiplying by the Fourier matrix, is extremely fast by the FFT. So it's useful because of the FFT, the Fast Fourier Transform, to multiply by F-- to transform. So it's the presence of the FFT that makes this-- it gives us really two different ways to do it.

In fact, which is the faster way? So we can produce the same result this way or this way. And if I don't count the cost of-- if the cost of multiplying by F is low, because I have the FFT, which would you do? Which would you do? So let me just think aloud before we answer that question, and then we're good.

So my vectors have n components. So one way is to do the convolution. How many steps is that? If I take a vector with n components and I convolve with a vector with n components, how many little multiplications do I have to do? N squared, right? Because each of the c's has to multiply each of the d's. So that takes N squared. And Fourier is cheap. It's N log N, log to base 2. So the left hand side is effectively N squared.

What about this one? How many to do these two guys? To find the Fourier transform to multiply by the matrix F. OK, those are fast again. That's just I've got two multiplications by F. So that's 2 N log N. And what's the cost of this? I have a vector with n components. Dot star vector. Another vector with n components.

How many little multiplications do I have to do for a Hadamard product, a component by component product? N, only N. Plus N-- yeah, maybe I should have made that a plus. On the left I had one N log N, plus it took N squared to find that vector, and then N log N. So it's effectively N squared. But this one, where I do the N log N twice, then it only takes me N more. So this is the fast way.

So if you wanted to multiply two really big, long integers, as you would want to do in cryptography, if you had two long integers, say, of length 125, 126, 128 components, to multiply those, you would be better off to separately take the cyclic transform of each of those 128 guys and do it this way. Take the transforms, do the component by component product, and then transform back to get that. The convolution rule is what makes that go.
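
A hedged sketch of that idea (my own, using NumPy's FFT rather than the F above; NumPy's transform uses the conjugate exponent, but the convolution rule works the same way): multiply two integers by convolving their digit arrays with the FFT. One practical detail the spoken description glosses over is that the digit arrays are zero-padded so the cyclic wraparound does not mix terms, and carries are propagated at the end.

```python
import numpy as np

def multiply_via_fft(a: int, b: int) -> int:
    """Multiply two nonnegative integers by FFT-based convolution of their digit arrays."""
    da = [int(ch) for ch in str(a)][::-1]        # least significant digit first
    db = [int(ch) for ch in str(b)][::-1]
    n = 1
    while n < len(da) + len(db) - 1:             # pad so cyclic convolution = ordinary convolution
        n *= 2
    fa = np.fft.fft(da, n)
    fb = np.fft.fft(db, n)
    coeffs = np.rint(np.fft.ifft(fa * fb).real).astype(np.int64)   # transform, multiply, transform back
    carry, digits = 0, []
    for v in coeffs:                             # propagate carries to get base-10 digits
        carry, digit = divmod(int(v) + carry, 10)
        digits.append(digit)
    while carry:
        carry, digit = divmod(carry, 10)
        digits.append(digit)
    return int("".join(map(str, digits[::-1])))

print(multiply_via_fft(123456789, 987654321))    # 121932631112635269
print(123456789 * 987654321)                     # same
```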

Oh, one more thought, I guess, about all this convolution stuff. Suppose we're in 2D. We have to think what is a two dimensional convolution? What does this become in two dimensions? Suppose we have functions. So now I'm gonna do 2D functions of x and y. Periodic or not periodic. But what's a convolution? What's the operation we have to do in two dimensions?

Well, it's a double integral, of course, over t and u. We would do f of t and u times g of x minus t, y minus u, dt du. And that would produce a function. So I'm convolving a function of x and y with another function of x and y. And again, I'm looking for this. This is the key to watch for: x minus t, y minus u. That's the signal of a convolution integral. So that's what we would have in 2D.
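
To make the 2D version concrete in the discrete case (a small sketch of my own, not from the lecture): component (k, l) of a 2D convolution adds all c[i, j] times d[k minus i, l minus j], exactly the double-sum analogue of the double integral.

```python
import numpy as np

def convolve2d_full(c, d):
    """Ordinary (non-cyclic) 2D convolution: out[k, l] = sum of c[i, j] * d[k - i, l - j]."""
    (p1, p2), (q1, q2) = c.shape, d.shape
    out = np.zeros((p1 + q1 - 1, p2 + q2 - 1))
    for i in range(p1):
        for j in range(p2):
            out[i:i + q1, j:j + q2] += c[i, j] * d    # c[i, j] shifts a scaled copy of d by (i, j)
    return out

c = np.arange(4.0).reshape(2, 2)
d = np.ones((3, 3))
print(convolve2d_full(c, d))     # a 4 x 4 array: each entry sums the overlapping products
```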

In general-- so maybe now my final thought is to move to thinking about two dimensional matrices and their products and so on. And this is why you need them. Because if you have two dimensional signals, then the components fit into a matrix. And we just have to operate in both dimensions.

So the key operation in 2D is in MATLAB. The MATLAB command that you need to know to get-- if you know what you're doing in 1D and you want to do it in 2D, the MATLAB command is Kron. So imagine we have one dimensional matrices A and B. And so those are in 1D, and we want to produce a natural two dimensional matrix. So these will be N by N. And we want the sort of natural product, let me call it K for Kron, which will be N squared by N squared.

I want to create a 2D matrix connected to an image that's N in each direction. So it has N squared pixels. These are 1D signals, and K is a 2D one. And this K would be the-- this is the operation to know. Given two one dimensional n by n matrices, Kron produces an N squared by N squared matrix. It's the operation to know. So I'll just write it, and if you know what Kron is, then you know it before I write it.

So I want to produce a big matrix, N squared by N squared. Somehow appropriately multiplying these two guys. And the appropriate way to do it is to take a11 and multiply it by B. So there, what do I have? What size have I got there already just in that one corner? N by N, right? It's a number of times an N by N matrix. Then a12 times B. That's another N by N matrix. Up to a1N times B. So I have now-- sorry, cap N.

So I have cap N matrices in a row. Each of those matrices is N by N. So that row has length N squared. And of course, the next row is-- I've allowed myself to number from 1 to N, but very often that numbering should be 0 to N minus 1. And finally on down here, down to aN1 B to aNN B. So that's the N squared by N squared matrix that you would need to know.
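
NumPy has the same command (a small sketch with an example of my own choosing): np.kron(A, B) builds exactly that block matrix, a11 times B in the top-left corner and so on, of size N squared by N squared.

```python
import numpy as np

N = 3
A = np.arange(1, N * N + 1, dtype=float).reshape(N, N)   # a 3 x 3 example of my own
B = np.eye(N)

K = np.kron(A, B)                            # N^2 by N^2: block (i, j) is A[i, j] times B
print(K.shape)                               # (9, 9)
print(np.allclose(K[:N, :N], A[0, 0] * B))   # True: the top-left block is a11 * B
```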

For example, if you wanted to do a two dimensional Fourier transform, that would be-- yeah, so what would a two dimensional Fourier transform produce? What matrix? Is this the matrix you would use for a 2D? I haven't sort of got started properly on 2D Fourier transforms. So would it be F times F? So let me write down the full name of this guy. Kronecker. So it's called the Kronecker product. It's just the right thing to know in moving from one dimension to two dimensions.

For example. Let me do an example. Oops, that's full. Have I got one board left? Yeah. So here's a standard matrix. Call it A. 2s and minus 1s. So that corresponds to a second derivative or actually minus a second derivative.

Now, suppose I have another, the same matrix, corresponding to second derivatives in the y direction. Same. And what I really want to do is both. I want to have a matrix K that corresponds to minus the second in the x direction minus the second in the y. So this is the Laplace. Laplacian. Which is all over the place in differential equations.

At a typical point, I want to do minus 1, 2, minus 1 in the x direction, and I want to add to that minus 1-- now that 2 becomes a 4-- and minus 1 in the y direction. So I'm looking for the 2 by 2-- sorry, the two dimensional matrix that does that five point scheme. Five weights at each point. It takes 4 on the diagonal and minus 1 on the four neighbors.

And the operation that would do that would be-- you would use Kron. It wouldn't be Kron of A and B. That would just-- Kron of A and B is not what I want. Yeah, that would do one and then the other one. And that would probably produce nine non-zeroes. I want something that adds here. So I want Kron of A with the identity. That gives me the two dimensional thing for this part.

And then I'll add on Kron of I B for the vertical derivative, the derivatives in the y direction. So that's called a Kronecker sum. The other was a Kronecker product. So that would be a Kronecker product. This would be another Kronecker product, and the total is called the Kronecker sum.
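
A minimal sketch of that Kronecker sum (mine; A is the usual second-difference matrix with 2s and minus 1s as in the lecture): K = kron(A, I) + kron(I, A) produces the five-point Laplacian, 4 on the diagonal and minus 1 at each of the four neighbors.

```python
import numpy as np

N = 4
main = 2.0 * np.eye(N)
off = -1.0 * np.eye(N, k=1)
A = main + off + off.T                   # second-difference matrix: 2 on the diagonal, -1 next to it
I = np.eye(N)

K = np.kron(A, I) + np.kron(I, A)        # Kronecker sum: 2D five-point Laplacian, N^2 by N^2
print(K.shape)                           # (16, 16)
print(K[5, [5, 4, 6, 1, 9]])             # [ 4. -1. -1. -1. -1.]: 4 at the point, -1 at its four neighbors
```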

OK. I wanted just to get those notations out. Because really, Fourier transforming is such a central operation in all of applied math, and especially in signal processing.

OK, so I'm good for it today. Let's see. I've got one volunteer so far to talk about a project. Can I encourage an email from anybody that doesn't-- you don't have to be a superstar. You're just willing to do it.

Tell us something about what you've learned. Get comments from the audience. And 10 or 15 minutes is all I'm thinking about. OK, I'll let you send me an email if you'd like to tell us that and get some feedback. OK, good. So I'll see you Wednesday. Thanks.
