The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

GILBERT STRANG: So I've got a list of things I'm hoping to do today. I'll begin with a few final words about saddle points. The reason I'm interested in saddle points is that when we get to the deep learning direction, you know that the big step there is finding a minimum of the total cost function, and gradient descent -- which we'll certainly discuss -- is the usual method, or stochastic gradient descent. And all kinds of issues arise: what happens if you have a saddle point or a degenerate minimum? All these possibilities -- and the understanding of deep learning is focusing more and more on what the gradient descent algorithm produces. So I just thought: minima and maxima we know about. Saddle points are a little hazier.

So this is a perfect example, and I'll just say a few more words about it. Then I want to talk about Lab 3, which I boldly posted on Stellar, and also about projects, just to get us thinking about those. And then my real math topic for today and this week is basic ideas of statistics, particularly the covariance matrix. I'm sure you've met mean and variance. Those are the most used words, and we'll use them again. But then I want to go on to covariance. So that's what's coming today: a few words on saddle points, a lot of words about the lab and anything you want to ask about projects, and then some basic statistics.

OK, saddle points. The example I'm taking is this Rayleigh quotient, R(x) = x transpose S x over x transpose x. And I'm taking a simple matrix S. I might as well take a diagonal matrix -- it's symmetric, of course. And any symmetric matrix I could change variables by a Q matrix, an orthogonal matrix, to get to something like that.
And then the x -- we're in 3D. So we've got a sort of manageable size here. And the x vector is (u, v, w). So this is the quotient. x transpose S x, you see, is just exactly 5 u squared plus 3 v squared plus 1 w squared. And I divide by x transpose x, the squared length, to normalize things.

So what are the main facts that we know -- that I'm not going to prove, but what are the main facts? What's the maximum value of R? What's the minimum value of R, of that function? And is there a saddle point? So: saddle of R. OK, what's the maximum value? How large could you make that ratio, capital R? I just think, you know, this isn't a standard topic in 18.06. But with an example like this, you'll see the whole point. OK, so how large could I make R? Yeah, go ahead and say it.

AUDIENCE: Sigma.

GILBERT STRANG: Sigma 1. And what is it here? Let's just do it with these numbers. How big can I make that ratio R? And what choice of (u, v, w) makes it big? So, how big can I get it?

AUDIENCE: 5.

GILBERT STRANG: 5. That ratio can't be more than 5. You see it would be 5. Well, how do I get to 5? The maximum of R is 5. And what is the (u, v, w) -- so I'll say "at" -- what choice of (u, v, w) would give us 5 here? You see it immediately: (1, 0, 0). And what about the minimum of R? The minimum of this ratio -- how do I make that ratio small? Well, I load stuff onto the w instead of loading it up onto u. It's just clear. So what is the minimum value of R?

AUDIENCE: 1.

GILBERT STRANG: 1, because I'll load everything into w. So the minimum value will be 1. And that will be at the vector (0, 0, 1). I've loaded everything there. And then the point of this short discussion is: is there another place where the derivatives, the first derivatives, of R are all zero?
Of course, the first derivatives are 0 at the max and at the min. But we have three variables here. And we're going to find a third point. And what is that point? You can probably guess. And what will be the saddle value? So you have to see some kind of a surface -- I guess, what are we in? 4D. So we have base coordinates (u, v, w), and R goes vertically. And we plot that surface. And we don't really understand it unless we think a lot about it, which we haven't. But we can pretty well guess what's what. And so what do you think is the saddle value? And where is it going to be reached? Everybody is going to tell me correctly. The saddle value would be? 3, at this middle point, (0, 1, 0).

And what are these three points with respect to the matrix? They're its eigenvectors. What are these three numbers, 5, 3, and 1, with respect to the matrix? They're its eigenvalues. That's why that Rayleigh quotient is such an important function. It's kind of a messy function. If you take its derivative, you've got to use the quotient rule, or use a Lagrange multiplier -- that's the way to make it more manageable. But it's kind of messy. But the results could not be better. The values there are the eigenvalues. And the places where you reach them are the eigenvectors.

And so the max is the most important. So that's sigma 1. Here's sigma 3 -- or lambda and sigma are the same, because the matrix is symmetric positive definite. And here in the middle is sigma 2. And if we want to compute eigenvectors, which I'm not planning to do today -- just to make this remark -- computing eigenvectors, getting the largest one or the smallest one, is a lot quicker in general than getting these ones in the middle. You have to use good codes and pay attention to computing those saddle point values.

So is there anything nice I can do with saddle points? How does one think about saddle points?
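For the record, here is the quotient-rule computation behind that claim, written out as a short worked step that the lecture leaves aside; it uses only the definition of R above:

$$
\nabla R(x) \;=\; \frac{2\,Sx\,(x^\top x) \;-\; 2\,(x^\top S x)\,x}{(x^\top x)^2}
\;=\; \frac{2}{x^\top x}\,\bigl(Sx - R(x)\,x\bigr).
$$

So the gradient vanishes exactly when S x = R(x) x, which says that x is an eigenvector of S and R(x) is the corresponding eigenvalue. For S = diag(5, 3, 1) the critical points are (1, 0, 0), (0, 1, 0), (0, 0, 1), with values 5, 3, 1: the max, the saddle, and the min.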
So again, a saddle point is defined by first derivatives equal to 0 -- that's this -- and the second derivatives. OK, so that's a matrix. Here's a vector, the gradient vector: the derivative with respect to u, the derivative with respect to v, and dR/dw. Just a vector. And all those components are zero; the gradient vector is zero. But what about second derivatives? Well, that's getting more. There are nine of those now, because I've got R_uu, the second derivative with respect to u, but I've also got mixed derivatives, the second derivative of R with respect to u and v. So I have a 3 by 3 matrix. Fortunately, that matrix is symmetric, because we're blessed by that wonderful fact that the derivative with respect to u and then v is the same as v and then u. So we get a symmetric matrix. Well, I won't write it down, but it's got the maximum, minimum, and saddle information built in.

Here's one additional thought that I want to communicate about saddle points, because it's really nice to somehow get back to maxima and minima. So the idea for a saddle point is to be able to write it as the maximum of a minimum. So let me do that, and then I'm all done. So I'm going to say that lambda 2, that value, is the maximum over something of the minimum over something of our function, x transpose S x over x transpose x. Now, of course, I have to tell you what you're maximizing over and what you're minimizing over. But that's the idea: one way to get into the middle place there, where the saddles are sitting, is to have a maximum of a minimum.

And that leads -- that's what I'm about to complete here -- that would lead you, for example, very quickly to the interlacing theorem that I spoke about, for eigenvalues and for singular values, when you perturb S or when you throw away a row and column of S. The eigenvalues go in between. That is the kind of conclusion that this max-min stuff is set up to produce. So here, let me just tell you what it would be.
I'm aiming to get lambda 2. So I'm going to take a maximum over two-dimensional subspaces of R3. We're in 3D, so those are sort of the natural 2-dimensional spaces to look at. Let me give that subspace a name like V. That'll do. Capital V -- everybody can see that that's a capital V. And then this will be the minimum over V. So it's kind of tricky. I take any subspace that's two-dimensional, and I'll figure out the minimum.

Well, suppose I take this subspace V, which is spanned by the first two coordinate vectors -- it's supposed to be a 2D subspace. Suppose I try an example: the span of (1, 0, 0) and (0, 1, 0). In other words, all vectors (u, v, 0). That's a 2D space. What is the minimum of that Rayleigh quotient over that two-dimensional space? So now I'm taking a minimum -- I don't have to think about saddle points. So I'm looking at the same quotient, but w is zero now. Everybody sees that I've squeezed it down to 2D. So w is zero. So what is the minimum now? So this thing would become, for this space, 5 u squared plus 3 v squared over u squared plus v squared, because the w is 0. So what's the minimum of that?

AUDIENCE: 3.

GILBERT STRANG: 3. 3. OK, the minimum is 3 for this particular space. Let me call it V special. For that particular space, the minimum is 3, correct? Everybody sees that. Because I just have u and v to play with, the 5 and the 3, so if I put everything into v, I get to 3. And now I take the maximum. So the maximum is at least 3, because this particular choice of V gave me the answer 3. And now I'm taking the maximum over all possible 2D subspaces. And I got 3 for one of the possible spaces V. And I might get higher than 3 for some other one. But actually I don't.
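A quick numerical way to see that no 2D subspace does better -- a minimal sketch, assuming NumPy and the matrix S = diag(5, 3, 1) from the board. The minimum of R over a subspace with orthonormal basis Q is the smallest eigenvalue of Q transpose S Q, so we can sample random subspaces and check that this minimum never climbs above 3.

```python
import numpy as np

# Max-min characterization: lambda_2 = max over 2D subspaces V of (min of R over V).
S = np.diag([5.0, 3.0, 1.0])

def min_over_subspace(Q):
    """Minimum of R(x) = x'Sx / x'x over the column space of Q (orthonormal columns)."""
    # For x = Qy, R(x) becomes the Rayleigh quotient of the 2x2 matrix Q'SQ,
    # so its minimum over the subspace is the smallest eigenvalue of Q'SQ.
    return np.linalg.eigvalsh(Q.T @ S @ Q).min()

rng = np.random.default_rng(0)
best = -np.inf
for _ in range(10_000):
    Q, _ = np.linalg.qr(rng.standard_normal((3, 2)))  # random 2D subspace of R^3
    best = max(best, min_over_subspace(Q))

Q_special = np.eye(3)[:, :2]          # span of (1,0,0) and (0,1,0)
print(best)                           # gets close to 3 but never exceeds it
print(min_over_subspace(Q_special))   # exactly 3 = lambda_2
```

The dimension count behind it: any 2D subspace of R3 must meet the plane of vectors (0, v, w) in at least a line, and on that line the quotient is at most 3, so no subspace can push its minimum above 3.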
The truth is that this maximum turns out to be 3, which is, of course, exactly what we wanted. So I'm saying that for this particular two-dimensional space, the minimum over it is 3. And now I maximize over all the others. And the idea is that for any other one, the minimum value will be at or below 3. And therefore, when I go for the max of the mins, I get 3.

So I'll just repeat that and then be quiet about this whole subject. It's a maximum over subspaces of a minimum of the Rayleigh quotient. If that subspace is exactly the perfect choice, this one, I get the value 3. And I'm claiming that's the biggest value I can get, because if I pick any other subspace -- what if I picked a subspace that -- suppose another V would be all vectors (0, v, w). What would I get for the minimum of this thing? Now w is in the picture and u is not in the picture. What do I get for the minimum there?

AUDIENCE: 1.

GILBERT STRANG: 1. I'd get 1. The minimum would be when I put everything into w, and I'd get 1. And then when I take the max, it's not a winner. It's thrown out. The winner will be that special space and the 3.

So I guess I'm hoping that you sort of see, in this small example, that you can express this middle saddle value -- it's reasonable to think of it as a maximum in some directions and a minimum in others. Try to think of some surface which is going up in some directions, so it's a minimum in those directions, and going down in other directions, so it's a max in those directions. And the saddle point is perched in there, right at that place. You know, if you're hiking from here to California or something, you're going to pass a saddle point. Actually, you see it on the Mass Pike -- the Mass Pike has an amazing little sign. I don't know if you've noticed it.
If you drive west on the Mass Pike, pretty far west of Boston, there's a little sign telling you the altitude, or elevation, whatever. And it says that this is the highest point until you reach the Rockies, basically. I'd say, like, OK, the Midwest is pretty flat, right? Because that's a long way away. You don't think of Massachusetts as really in the big league with high spots. But there it is. It's the highest one until you get -- and I think it tells you where the next one will be, in Colorado. Anyway, those highest points tend to be saddles. The very, very highest point -- where's that, in Alaska or somewhere -- that's a max, of course, by definition. But there are a lot of saddle points in other places. And those would be maxima of minima or minima of maxima. Good. I'm stopping there. We might see this again when we start gradient descent. But at least, because saddle points don't come up much in teaching calculus, I thought that was good.

OK, the second point is models -- Lab 3 and projects, anything you'd like to ask about projects. So, please, this is your chance to ask. You could also ask by email. If you have a suggestion or an idea for a project, let me encourage you or a team to work on it, or just yourself. And if you'd like some feedback -- does this sound sensible, any suggestions? -- send me an email. I'd be happy to. Of course, I'm a total beginner here, too.

When I created this Lab 3, I was like desperate -- not for model 1. For model 1, have you looked at -- it's reached Stellar, and it's only one printed page. Have people had a look at this? So I'll just repeat quickly. Model 1 is an example of overfitting. And what's going on with model 1? So model 1 says take -- 5 would be enough, but I probably said 10 or something -- so I'll make it six points, and put a curve through them.
So if you put a curve -- and the curve is going to be a polynomial. So we're going to fit by a polynomial. Everybody knows a polynomial is C0 plus C1 x plus ... plus CK x to the K, let's say. For K equals 0 -- well, I don't know if I'd even ask for 0. That would be the best constant; it would run along the average. K equal to 1, that would be a straight-line fit. And you would compute that by least squares, because of course no straight line is going to go through all the points. You're going to have some error, by least squares. K equals 2 would be fitting by a parabola. Again, you'll have some error, but smaller, since parabolas include straight lines. So you can only reduce the total sum of squares error by going to degree two. Then degree 3, and on up -- how high shall we go? Let me just use the same letters I've used here: m is the number of points, and the degree K goes up to -- up to 6, let's say.

And I want to make a comment about 6. No, 5 would do it. Degree 5 will fit the 6 points -- we've got 6 points here. But if I stop at degree 5, I'm fine there, because a degree 5 polynomial also has a constant term, so it really has six coefficients. So there is one degree 5 polynomial, with six numbers, six coefficients, that goes through those six points. And so it's a perfect fit. That would be an exact fit of the data.

So here's the data. Create a polynomial of degree 5 that goes through those points exactly, and look at the result. And what would you see if you look at the result? Would it be smooth? Of course -- it's a polynomial. Would it be nice? No, it will be horrible. To get through those points -- did I get six points? Yeah.
To get through those points, I'm guessing that that fifth-degree polynomial, the perfect fit, is the example of overfitting that occurs to practically everybody. Because making that decision -- perfect fit, learn the data, the training data, exactly -- will send the polynomial -- I don't know what it looks like. I don't want to -- well, I do want to know, but not right now. Anyway, craziness. And, of course, I'm going to ask -- it probably doesn't look like that -- I'm going to ask you to plot the results.

Well, what's the least squares error when you fit by a straight line? When you fit by a horizontal line, a constant, fit by a straight line, move up to parabolas, move up to cubics? But when you hit this top degree, you're not making any error at all. You're not really needing to use least squares. You can solve Ax equals b -- Ac equals b, rather. So this is the b, the data, and c is the vector of coefficients. And the matrix A is bad news when it's 6 by 6, when you get up to a complete fit.

And I guess what I wanted just to see is -- there are a lot of things I don't know, like suppose I change six to 20 or something. Then I'm pretty sure that out there at 18, 19, 20, this thing is really off the map. And you could compute its max, and you'd see a very big number. But, of course, for a straight line, that would be pretty safe. The slope would be pretty moderate. And I don't know where you -- so it's probably underfitting to try to fit this by a straight line. It's not as close as you would want. But fitting by a full perfect fit, a high-degree polynomial, is certainly overfitting. Where is the boundary? I'm sure people know about this, but I think it is something we could learn from. So that's what model 1 is about.

And just to make one final comment, that matrix A has a name in the case where it's a square matrix, where you're fitting exactly -- interpolating would be the word.
So that exact fit corresponds to a square matrix A. And the word for it is interpolation. And I guess it's Lagrange again -- seeing that guy too often here. So it would be Lagrange interpolation. But the matrix has a different name. And whose name is associated with that matrix?

AUDIENCE: Vandermonde.

GILBERT STRANG: Vandermonde. Vandermonde. So this is the square matrix -- so let me write it. It's called a Vandermonde matrix. And it's a matrix that has a crazy large inverse, because, just as I'm saying, the c that comes out from the perfect fit, from the interpolation, from the square matrix -- that c is going to be giant. And so you will construct that matrix, of course, to do this.

So we've heard this word Vandermonde matrix in this class within the last week. Does anybody remember where the word Vandermonde came up in class? It was in Professor Townsend's lecture. So you could go back to that video if you wanted, as an example of a matrix which has a horrible inverse, a giant inverse. The Hilbert matrix was another example -- I think he did two examples, Vandermonde and Hilbert. So this Vandermonde matrix -- I could write it down, but I'll leave that to you -- has a big inverse. And its eigenvalues -- well, no, singular values, because it's not symmetric -- its singular values are way scattered. It has tiny little singular values along with ordinary-sized singular values. So that's the example that I just think you could go with.

And as far as I can see, sending it to the autograder as a Julia file would be even worse than usual -- I think the autograder wouldn't know what to do with it, as far as I can see. So I'm thinking of submissions coming to Gradescope. And I'm thinking of some plots to show what happens as K increases, and some tables of data maybe, and then maybe a paragraph of conclusion: what degree is safe, when does it become risky, and when does it become a disaster.
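Here is a minimal sketch of that model 1 experiment, just as an illustration -- it assumes NumPy, and the six data values below are made up for the example, not the actual Lab 3 data. It fits polynomials of increasing degree by least squares and shows the residual dropping to roundoff at degree 5 while the coefficients grow.

```python
import numpy as np

# Hypothetical data: six points at x = 0, 1, ..., 5 with made-up values b.
x = np.arange(6, dtype=float)
b = np.array([1.0, 0.5, 2.0, 1.5, 3.5, 2.0])

for K in range(6):
    A = np.vander(x, K + 1, increasing=True)    # columns 1, x, ..., x^K
    c, *_ = np.linalg.lstsq(A, b, rcond=None)   # least-squares coefficient vector c
    resid = np.linalg.norm(A @ c - b)           # drops to roundoff at K = 5 (exact fit)
    print(f"degree {K}: residual {resid:.2e}, largest |coefficient| {np.abs(c).max():.2e}")
```

At K = 5 the matrix A is the square 6 by 6 Vandermonde matrix, and solving Ac = b is exactly the interpolation problem described above; plotting the degree-5 polynomial between and beyond the data points is where the overfitting shows up.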
So, stuff like that. Really, these are sort of open-ended labs, and you can use any language. Questions about that example? That's really what I'm expecting to be ready, and quite a good example, for the Wednesday after the break. Questions? Anyway, you can email me. You can probably see what the model looks like.

Then for the second one, I've taken that first jump into networks. I made a very simple network -- without any hidden layers at all, actually -- and just wrote down what I think might work. But you may find that you want to modify model 2. Go for it. I don't have any patent or personal stake in the way model 2 is written. But the idea is: fit data -- well, start with data, but don't make it too perfect, because we want some learning to happen here.

So it's the classification problem. So it won't be least squares with variables like u and v and w. It's just plus 1 or minus 1, or 1, 0, or cat and dog -- whatever that classification is. So that's the basic problem to start with in deep learning. For quite a long time, that's been the natural problem. So it's a classification problem. And the description here suggests one way to set up that training data and execute a neural-net-like experiment, but without getting very far away from ordinary linear algebra. So as I say, if you want to change this, develop it further, get some ideas about it -- that's the whole point here.

Actually, the faculty meeting this week, maybe today -- what's today?

AUDIENCE: Wednesday.

GILBERT STRANG: Wednesday? Yeah, so it's this afternoon. And the faculty doesn't come to much -- of course, it's late in the afternoon. But the faculty meeting this afternoon is about MIT's plans for requirements or courses in computational thinking. And in a way, this course, within the math department, is among the ones that are in that direction. Of course, in other departments, those are further along.
Anyway, when Raj Rao taught the course last spring, he had the Julia system better developed. And it was a chance to bring computers, bring laptops, and do things in class. And you'll have that chance again when he visits in a month. OK, enough.

And I'm open to questions about the project. Should I maybe ask you to email me a rough idea of a project? And tell me if you are in a group, or if you would like to find a group -- maybe two or three people. I'm not thinking of groups of 50. Two or three would be sensible. Questions about projects? I mean, I just introduced this idea of a project, and I apologize for not bringing it up the first week. But I just couldn't see -- I don't want to do exams on linear algebra. We've passed that point. So this seemed the right way to go. But I'm not looking for a PhD thesis here. Questions? Thoughts? I guess I hope you know you can ask. Yeah, oh good.

AUDIENCE: So could you maybe describe the scope of the project?

GILBERT STRANG: Right. How will I -- yeah, so the scope is connected to the time that you would devote to it. And what should I say about scope? Maybe the equivalent of three homeworks or something, because I'll tamp down homeworks as the project date gets closer. Does that give an idea? So it's not infinite, but it's not something tiny and trivial. Yeah, good.

AUDIENCE: Do you have example projects from what was done in past years?

GILBERT STRANG: Well, that's the thing. There aren't really past years. We are the ones. So next year will have examples, if you contribute some good ideas. Maybe I should ask Professor Rao to send us the projects he uses in Michigan? That would give some ideas. But remember that he hasn't, up to now anyway, moved the course toward deep learning. He did other topics, all of which would be fine.
But quite a few people have had some 6.036 or know something about conventional neural nets. And I'm certainly excited to get to that topic. So the project could get there, or it could not -- both totally fine. OK, that's a good idea. I'll ask Raj for the projects -- you'll recognize a couple, because you've done a couple, but there are a bunch more.

Then there was another question or thought? And I'm remembering -- I think maybe everybody got an email or a Stellar announcement that some members of the class took an initiative, which was wonderful, to open the possibility of people just showing up one evening a week -- in the Media Lab, was it? Or was there a location? And has it happened, or is it a future event?

AUDIENCE: It happened.

GILBERT STRANG: It happened. But I hadn't mentioned it in class, so probably you didn't have -- and we're not really into projects yet, so it was probably a quiet evening?

AUDIENCE: Yep.

GILBERT STRANG: Yeah. Yeah, and that's --

AUDIENCE: Productive but quiet.

GILBERT STRANG: Productive but quiet. OK. So will it happen again?

AUDIENCE: Sure, I think maybe now we'll be looking after spring break.

GILBERT STRANG: After spring break. OK, so post again on Stellar the plan for the next meeting that people could come to. So this is David Anderton -- you'll recognize his name. And did you have the meeting in the Media Lab?

AUDIENCE: Yeah, we had it on the Thursday and Friday.

GILBERT STRANG: OK. So with the break coming, and spring hopefully coming after today's potential storm, when we come back -- good. OK. Is that good? I hope some of that is helpful. You'll get an idea. You're seeing about as much as I know, which is: model 1 is definitely doable and very significant. And Vandermonde matrices and so on are truly important. And their instability is a big issue.
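That instability is easy to see numerically. Here's a small sketch -- again just an illustration, assuming NumPy and equally spaced points on [0, 1] -- showing how the condition number of the square Vandermonde matrix blows up as the number of interpolation points grows.

```python
import numpy as np

# Condition number of the n x n Vandermonde matrix for equally spaced points.
for n in (4, 6, 8, 12, 16, 20):
    x = np.linspace(0.0, 1.0, n)
    V = np.vander(x, increasing=True)            # columns 1, x, ..., x^(n-1)
    print(f"n = {n:2d}: cond(V) = {np.linalg.cond(V):.2e}")

# The growth is roughly exponential in n, which is why exact interpolation
# through many points produces giant coefficients and wild oscillations.
```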
But then moving toward weights and training data and test data is where we want to go. Good. OK. So do I have some time? I do -- just enough to speak about mean and variance, the two golden words of statistics, and covariance -- the matrix, the intersection of linear algebra with statistics -- and then some famous inequality. So I'll continue with this on Friday and post some other material; it's coming from a later section of the notes.

OK. So I either have probabilities p1 up to pn adding to 1, or I have a continuous distribution of probabilities, maybe over all x from minus infinity to infinity, again giving 1. Let me work with the discrete example. That's where people naturally start. So what is the mean? So I have n possible outcomes with those probabilities. And I can ask you about the sample mean, or I can ask you about the expected mean.

So for the sample mean, we've done an experiment; we've got some output. The expected mean means we know the probabilities, but we haven't used them yet. So this one uses actual output. And the sample mean is simply -- shall I just say m for mean? Well, these two are importantly different. One is something where you've done the experiment, and this one is before you do the experiment. And the letters get -- maybe mu; I'll change it to mu. I don't want to use S, because S gets used with variance. So it's just the average, the average output from the sample.

Like, I've flipped a coin a million times, and the output was 0 or 1. So I got a million 1s and 0s, and I take the average -- of course, I'm thinking of a fair coin -- so I'm expecting about half a million heads, an average near one half. And the law of large numbers would say that this sample mean does approach 1/2, with probability 1, as the number of samples gets larger. So the sample mean is straightforward. The expected mean -- these are actual sample outputs.
They happened. Whereas the expected mean -- and I'll use m for that -- is just the probability of the first output times that output, plus the probability of the second output times that output, and so on: m = p1 x1 + p2 x2 + ... + pn xn. So the sample mean will approach that expected mean with probability 1 as this number capital N -- notice the difference. Capital N here is the number of samples, the number of trials, and it gets big; we keep doing things more and more. This little n is the number of possible different outputs, with their probabilities. And there you see it. And, of course, in the continuous case, we would take the integral of x p(x) dx. So, just by analogy, you should know what the continuous version is and what the discrete version is. OK, that's the mean.

Now, for variance: sample variance and, shall I say, expected variance? I don't know -- just "variance" is what people would usually say. I don't know if I've remembered the right word there, sample variance. I included this topic in the linear algebra book. Anyway. OK. So what's the sample variance?

So what is the sample variance? What's the variance about, anyway? What's the key point of variance? It's the distance from the mean. So this one will be a distance from the sample mean, and this one will be a distance from the expected mean. So not distance from zero, but distance from mu and m, from the center of the thing.

So the sample variance -- again, we have capital N samples. But for some wonderful reason in statistics, you divide by N minus 1 this time. And the reason has to do with the fact that you've already used one degree of freedom computing the sample mean -- this formula will involve that mean. So this would be the first output minus mu, squared, up to the N-th output minus mu, squared.
So it's the average distance from mu -- the average squared distance from mu -- but with this little twist. Of course, when N is large, there's not a very significant difference between N and N minus 1. I think that's about right. All of this is for just doing one experiment over and over.

Covariance, which is the deeper idea, is where linear algebra comes in. I have a matrix -- because why? Because I'm doing multiple experiments at the same time. I'm flipping two coins. I'm flipping 15 coins. I'm doing other things. So there will be covariances when I'm doing several experiments at once. That will involve matrices of that size.

So what's the variance? I should have given you the usual notation. The expected value of x -- that's the mean. And here, I'm looking at the expected value of what? So when I'm computing a variance using probabilities -- so I'm using expectations, not trial runs; expectation means use the probabilities -- it's the expectation of the squared distance from x to the mean. And when I'm doing an expectation for a discrete set, I take the first probability, which goes with an output x1, and the second probability, which goes with an output x2, and each time I subtract the mean and square. So that's the variance that everybody calls sigma squared: sigma squared = p1 (x1 - m)^2 + p2 (x2 - m)^2 + ... + pn (xn - m)^2.

Now, two minutes left is enough to say a few more words about covariance. Oh, to get to covariance, I really have to speak about joint probabilities. That's the key idea -- joint probabilities. So I'm doing two experiments at once. Each one has its own probabilities. But together, I have to ask -- so here are two easy cases. Suppose I'm flipping two coins. I might get heads heads, heads tails, tails heads, or tails tails -- four possibilities, four possible outputs there, four possible pairs.
And if you're flipping one coin and I'm flipping another one, those are independent results. Those are independent results. There won't be a covariance, where by knowing what my flip was I would know more about your flip. But now, the other possibility would be to glue the coins together. Now, if I do a flip, they always come up heads and heads or tails and tails. So the heads-tails combination is not possible. In fact, one output is totally dependent on the other output. So that's the other extreme. We have independent outputs with covariance 0, and we have totally dependent outputs when the coins are just glued together -- when one result tells us what the other result is. That's a situation where the covariance is a maximum. It couldn't be bigger than that.

And, say, in polling -- if you were polling a family, say political polling -- well, there would be some covariance expected there. The two or three or five people that are living in the same house wouldn't be independent, entirely independent, but nor would all five give the same answer. So their covariance matrix would have some off-diagonal entries, but it would still be invertible.

And actually, what I wanted to tell you about next time, at the start, is that that covariance matrix, which I have to define for you, will be symmetric positive definite, or semidefinite. What's the semidefinite case? Of course, that's the case where the coins are glued together. OK, thanks. So you know what's coming Friday. I know that a holiday is also coming Friday. So just make a good plan, and I'll move on after the break. Good.
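As a concrete preview of that covariance matrix, here is a minimal sketch -- assuming NumPy, and coding heads as 1 and tails as 0, which is a choice made for the illustration, not something fixed in the lecture. It simulates the two easy cases above and compares their 2 by 2 covariance matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000                      # number of trials

# Two independent fair coins, coded 0 = tails, 1 = heads.
x = rng.integers(0, 2, size=N)
y = rng.integers(0, 2, size=N)
print(np.cov(x, y))              # roughly [[0.25, 0], [0, 0.25]]: off-diagonal near 0

# Glued coins: the second flip always matches the first.
z = x.copy()
C = np.cov(x, z)
print(C)                         # roughly [[0.25, 0.25], [0.25, 0.25]]
print(np.linalg.eigvalsh(C))     # eigenvalues near 0.5 and 0: the semidefinite case
```

The glued-coin matrix is rank one -- singular, positive semidefinite -- which is exactly the extreme case mentioned at the end, while the independent coins give a diagonal, positive definite matrix.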