1 00:00:01,550 --> 00:00:03,920 The following content is provided under a Creative 2 00:00:03,920 --> 00:00:05,310 Commons license. 3 00:00:05,310 --> 00:00:07,520 Your support will help MIT OpenCourseWare 4 00:00:07,520 --> 00:00:11,610 continue to offer high quality educational resources for free. 5 00:00:11,610 --> 00:00:14,180 To make a donation or to view additional materials 6 00:00:14,180 --> 00:00:17,795 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,795 --> 00:00:19,026 at ocw.mit.edu. 8 00:00:22,800 --> 00:00:26,180 GILBERT STRANG: Well, OK, I am happy to be back, 9 00:00:26,180 --> 00:00:29,910 and I am really happy about the project 10 00:00:29,910 --> 00:00:32,229 proposals that are coming in. 11 00:00:32,229 --> 00:00:36,840 This is like, OK, this is really a good part of the course. 12 00:00:36,840 --> 00:00:42,360 And so keep them coming, and I'm happy to give whatever 13 00:00:42,360 --> 00:00:45,000 feedback I can on those proposals, 14 00:00:45,000 --> 00:00:49,120 and do make a start there. 15 00:00:49,120 --> 00:00:53,100 They're really good, and if some are completed 16 00:00:53,100 --> 00:00:55,230 before the end of the semester and we 17 00:00:55,230 --> 00:01:00,860 can to offer you a chance to report on them, 18 00:01:00,860 --> 00:01:04,019 that that's good too. 19 00:01:04,019 --> 00:01:07,930 So well done with those proposals. 20 00:01:07,930 --> 00:01:12,830 So today, I'm jumping to part six. 21 00:01:12,830 --> 00:01:15,870 So part six and part seven are optimization 22 00:01:15,870 --> 00:01:17,910 which is the fundamental algorithm that 23 00:01:17,910 --> 00:01:19,980 goes into deep learning. 24 00:01:19,980 --> 00:01:22,140 So we've got to start with optimization. 25 00:01:22,140 --> 00:01:26,010 Everybody has to get that picture, 26 00:01:26,010 --> 00:01:29,580 and then part seven will be the structure 27 00:01:29,580 --> 00:01:33,660 of CNNs, Convolution Neural Nets, 28 00:01:33,660 --> 00:01:37,560 and all kinds of applications. 29 00:01:37,560 --> 00:01:41,430 And so can we start with optimization? 30 00:01:41,430 --> 00:01:44,370 So first, can I like get the basic facts 31 00:01:44,370 --> 00:01:49,840 about three terms of a Taylor series? 32 00:01:49,840 --> 00:01:51,030 So that's the typical. 33 00:01:51,030 --> 00:01:55,590 It's seldom that we would go up to third derivatives 34 00:01:55,590 --> 00:01:56,970 in optimization. 35 00:01:56,970 --> 00:02:01,920 So that's the most useful approximation to a function. 36 00:02:01,920 --> 00:02:03,720 Everybody recognizes it. 37 00:02:03,720 --> 00:02:08,729 Here, I'm thinking of F as just one function, 38 00:02:08,729 --> 00:02:12,990 and x as just one variable, but now 39 00:02:12,990 --> 00:02:18,180 I really want to go to more variables. 40 00:02:18,180 --> 00:02:21,360 So what do I have to change if F is 41 00:02:21,360 --> 00:02:26,790 a function of more variables? 42 00:02:26,790 --> 00:02:31,050 So now, I'm thinking of x as-- 43 00:02:31,050 --> 00:02:32,730 well, now let me see. 44 00:02:37,670 --> 00:02:42,470 Yeah, I want n variables here. 45 00:02:42,470 --> 00:02:45,170 x is x1 up to xn. 46 00:02:48,830 --> 00:02:52,840 So just to get the words straight 47 00:02:52,840 --> 00:02:56,800 so we can begin on optimization, so what 48 00:02:56,800 --> 00:03:02,740 will be the similar step so the function F at x-- 49 00:03:02,740 --> 00:03:06,590 remember, x is n variables. 50 00:03:06,590 --> 00:03:07,330 OK? 51 00:03:07,330 --> 00:03:08,350 Now, what do I have? 52 00:03:08,350 --> 00:03:13,170 Delta x, so what's the point about delta x now? 53 00:03:13,170 --> 00:03:17,500 It's a vector, delta x1 to delta xn, 54 00:03:17,500 --> 00:03:20,190 and what about the derivative of F? 55 00:03:20,190 --> 00:03:22,810 It's a vector too, the derivative 56 00:03:22,810 --> 00:03:26,175 of F with respect to x1, the derivative of F 57 00:03:26,175 --> 00:03:29,660 with respect to x2, and so on. 58 00:03:29,660 --> 00:03:32,420 What do I have to change about that? 59 00:03:32,420 --> 00:03:36,310 I know those guys are vectors, so it's their dot product. 60 00:03:36,310 --> 00:03:44,360 So it's delta x transpose at vector times this dF/dx. 61 00:03:48,340 --> 00:03:52,620 So now I'm replacing this by all the derivatives, 62 00:03:52,620 --> 00:03:58,040 and it's the gradient. 63 00:03:58,040 --> 00:04:03,820 So the gradient of F at x is the derivatives-- 64 00:04:07,257 --> 00:04:09,920 let's see. 65 00:04:09,920 --> 00:04:14,320 It's essential to get the notation straight here. 66 00:04:14,320 --> 00:04:17,140 Yeah, so it'll be the partial derivatives 67 00:04:17,140 --> 00:04:26,290 of the function F. So grad F is the partial derivatives of F 68 00:04:26,290 --> 00:04:30,846 with respect to x1 down to partial derivative with respect 69 00:04:30,846 --> 00:04:32,650 to xn. 70 00:04:32,650 --> 00:04:34,180 OK, good. 71 00:04:34,180 --> 00:04:37,490 That's the linear term, and now what's the quadratic term? 72 00:04:37,490 --> 00:04:42,010 1/2, now delta x isn't a scalar anymore. 73 00:04:42,010 --> 00:04:43,160 It's a vector. 74 00:04:43,160 --> 00:04:47,500 So I'm going to have delta x transpose 75 00:04:47,500 --> 00:04:51,970 and a delta x, and what goes in between 76 00:04:51,970 --> 00:04:55,240 is the second derivatives, but I've 77 00:04:55,240 --> 00:04:57,460 got a function of n variables. 78 00:05:00,190 --> 00:05:04,240 So now, I have a matrix of second derivatives, 79 00:05:04,240 --> 00:05:10,510 and I'll call it H. This is the matrix of second derivatives, 80 00:05:10,510 --> 00:05:19,840 Hjk is the second derivative of F with respect to xj and xk, 81 00:05:19,840 --> 00:05:23,360 and what's the name for this guy? 82 00:05:23,360 --> 00:05:26,485 The Hessian, Hessian matrix. 83 00:05:30,350 --> 00:05:33,260 How the Hessians got into this picture I don't know. 84 00:05:33,260 --> 00:05:34,960 The only Hessians I know are the ones 85 00:05:34,960 --> 00:05:39,530 who fought in the Revolutionary War for somebody. 86 00:05:39,530 --> 00:05:40,030 Who? 87 00:05:40,030 --> 00:05:42,220 Which side were they on? 88 00:05:42,220 --> 00:05:43,750 I think maybe the wrong side. 89 00:05:43,750 --> 00:05:47,470 The French were on our side and-- 90 00:05:47,470 --> 00:05:50,530 Anyway, Hessian matrix, and what are 91 00:05:50,530 --> 00:05:52,345 the facts about that matrix? 92 00:05:52,345 --> 00:05:54,220 Well, the first fact is that it's [INAUDIBLE] 93 00:05:54,220 --> 00:05:56,883 and the key fact is it's symmetric. 94 00:06:00,200 --> 00:06:02,140 Yeah. 95 00:06:02,140 --> 00:06:07,420 OK, and again, it's an approximation. 96 00:06:07,420 --> 00:06:12,880 And everybody recognizes that if n is very large, 97 00:06:12,880 --> 00:06:16,360 and we have a function of many variables. 98 00:06:16,360 --> 00:06:19,540 Then, we had n derivatives to compute here, 99 00:06:19,540 --> 00:06:22,650 and about 1/2 n squared derivatives. 100 00:06:22,650 --> 00:06:25,960 The 1/2 comes from the symmetry, but the key point 101 00:06:25,960 --> 00:06:30,670 is the n squared derivatives to compute there. 102 00:06:30,670 --> 00:06:36,240 So computing the gradient is feasible 103 00:06:36,240 --> 00:06:40,050 if n is small or moderately large. 104 00:06:40,050 --> 00:06:43,900 Actually, by using automatic differentiation, 105 00:06:43,900 --> 00:06:46,050 the key idea of back propagation, 106 00:06:46,050 --> 00:06:54,750 back prop, you can speed up the computation 107 00:06:54,750 --> 00:06:56,820 of derivatives quite amazingly. 108 00:06:56,820 --> 00:07:02,400 But still for the size of deep learning problems 109 00:07:02,400 --> 00:07:04,300 that's out of reach. 110 00:07:04,300 --> 00:07:05,340 OK. 111 00:07:05,340 --> 00:07:10,140 So that's the picture, and then I 112 00:07:10,140 --> 00:07:16,170 will want to use this to solve equations. 113 00:07:16,170 --> 00:07:25,140 There is a parallel picture for a vector f. 114 00:07:25,140 --> 00:07:26,940 So now, this is a vector function. 115 00:07:26,940 --> 00:07:39,210 This is f1 of x up to fn of x, an x is x1 to xn. 116 00:07:42,030 --> 00:07:45,950 So I have n functions of n variables, 117 00:07:45,950 --> 00:07:47,700 n functions of n variables. 118 00:07:47,700 --> 00:07:51,380 Well, that's exactly what I have in the gradient. 119 00:07:54,200 --> 00:07:57,230 Think of these two as parallel, the parallel 120 00:07:57,230 --> 00:08:02,230 being f corresponds to the gradient of F, 121 00:08:02,230 --> 00:08:06,980 n functions of n variables. 122 00:08:10,800 --> 00:08:11,300 OK. 123 00:08:15,010 --> 00:08:18,730 Now maybe, what I'm after here is to solve f equals 0. 124 00:08:18,730 --> 00:08:25,570 So I'm going to think about the f at x plus delta x, 125 00:08:25,570 --> 00:08:28,890 so it starts with f of x. 126 00:08:28,890 --> 00:08:34,539 And then we have the correction times the matrix 127 00:08:34,539 --> 00:08:42,309 of first derivatives, and what's the name for that matrix 128 00:08:42,309 --> 00:08:45,850 of first derivatives? 129 00:08:45,850 --> 00:08:51,670 Well, if I'm just given n functions-- 130 00:08:54,280 --> 00:08:56,830 yeah, what am I after here? 131 00:08:56,830 --> 00:09:05,120 I'm looking for the Jacobian. 132 00:09:05,120 --> 00:09:10,580 So here we'll go the Jacobian, J. This is the Jacobian named 133 00:09:10,580 --> 00:09:16,320 after Jacoby, Jacobian matrix. 134 00:09:16,320 --> 00:09:18,920 And what are its entries? 135 00:09:18,920 --> 00:09:23,060 J, the jk entry is the derivative 136 00:09:23,060 --> 00:09:30,780 of the J function with respect to the kth variable, 137 00:09:30,780 --> 00:09:34,520 and I'm stopping at first order there. 138 00:09:34,520 --> 00:09:38,360 OK, so these are sort of like facts of calculus, 139 00:09:38,360 --> 00:09:41,000 facts of 18.02 you could say. 140 00:09:41,000 --> 00:09:44,550 Multivariable calculus, that's the point. 141 00:09:44,550 --> 00:09:47,870 Notice that we're doing just like the first half of 18.02, 142 00:09:47,870 --> 00:09:52,880 just do differential calculus, derivatives, Taylor series. 143 00:09:52,880 --> 00:09:54,750 We're not doing multiple integrals. 144 00:09:54,750 --> 00:09:57,050 That's not part of our world here. 145 00:09:57,050 --> 00:10:00,330 OK, so that's the background. 146 00:10:00,330 --> 00:10:05,840 Now, I want to look at optimization. 147 00:10:05,840 --> 00:10:12,610 So over here, I want to optimize-- 148 00:10:12,610 --> 00:10:24,290 well, over here, let me try to minimize F of x, 149 00:10:24,290 --> 00:10:27,340 and I'll be in the vector case here. 150 00:10:27,340 --> 00:10:38,610 And over here, I want to solve f equals 0, 151 00:10:38,610 --> 00:10:45,330 and of course, that means f of 1 equals 0 152 00:10:45,330 --> 00:10:48,720 all the way along to fn equals 0. 153 00:10:48,720 --> 00:10:54,080 Here, I have n equations, and n unknowns. 154 00:10:59,720 --> 00:11:02,660 Let me start with that one, and I'll 155 00:11:02,660 --> 00:11:05,450 start with Newton's method, Newton's method 156 00:11:05,450 --> 00:11:09,890 to solve these n equations and n unknowns. 157 00:11:09,890 --> 00:11:22,440 OK, so Newton, Newton's method which is often 158 00:11:22,440 --> 00:11:24,210 not presented in 18.02. 159 00:11:24,210 --> 00:11:29,130 That's a crime, because that's the big application 160 00:11:29,130 --> 00:11:33,180 of gradients in Jacobians. 161 00:11:33,180 --> 00:11:37,150 OK, so I'm trying to solve n equations and n unknowns, 162 00:11:37,150 --> 00:11:42,930 and so I want f at x plus delta x to be 0. 163 00:11:42,930 --> 00:11:43,500 Right? 164 00:11:43,500 --> 00:11:46,740 So I want f of x plus delta x to be 0. 165 00:11:46,740 --> 00:11:49,890 So f at x plus delta x is-- 166 00:11:49,890 --> 00:11:51,150 I'm putting in a 0. 167 00:11:51,150 --> 00:11:53,340 I'm just copying that equation-- 168 00:11:53,340 --> 00:11:58,440 is f at where I am. 169 00:11:58,440 --> 00:12:02,370 Let me use K for the case iteration. 170 00:12:02,370 --> 00:12:04,490 So I'm at a point xK. 171 00:12:04,490 --> 00:12:07,230 I want to get to a point xK plus 1. 172 00:12:07,230 --> 00:12:14,890 And so I have 0 is f of x plus J, at that point, 173 00:12:14,890 --> 00:12:21,700 times delta x which is xK plus 1 minus xK. 174 00:12:21,700 --> 00:12:24,030 Good. 175 00:12:24,030 --> 00:12:25,020 That's Newton's method. 176 00:12:25,020 --> 00:12:29,010 Of course, 0 isn't quite true. 177 00:12:29,010 --> 00:12:35,250 Well, 0 will be true if I'm constructing xK plus 1 here. 178 00:12:35,250 --> 00:12:37,950 I'm constructing xK plus 1. 179 00:12:37,950 --> 00:12:39,030 OK. 180 00:12:39,030 --> 00:12:43,410 So let me just rewrite that, and we've got Newton's method. 181 00:12:43,410 --> 00:12:52,750 So we're looking for this change, xK plus 1 minus xK. 182 00:12:52,750 --> 00:13:00,010 I'll put it on this side as plus xK, so that's this. 183 00:13:00,010 --> 00:13:01,960 Now, I have to invert that and put it 184 00:13:01,960 --> 00:13:04,010 on the other side of the equation. 185 00:13:04,010 --> 00:13:06,280 So that will go with a minus. 186 00:13:06,280 --> 00:13:12,880 This guy will be inverted and f at xK. 187 00:13:15,472 --> 00:13:17,170 So that's Newton's methods. 188 00:13:17,170 --> 00:13:19,980 It's natural. 189 00:13:19,980 --> 00:13:21,310 So let me just repeat that. 190 00:13:21,310 --> 00:13:26,140 You see where the xK plus 1 minus xK is sitting? 191 00:13:26,140 --> 00:13:27,670 Right? 192 00:13:27,670 --> 00:13:32,930 And I moved f of xK to the other side with a minus sign, 193 00:13:32,930 --> 00:13:37,150 and then I multiplied through by J inverse, so I got that. 194 00:13:37,150 --> 00:13:45,020 So that's Newton's method for a system of equations, 195 00:13:45,020 --> 00:13:47,380 and over there, I'm going to write down Newton's method 196 00:13:47,380 --> 00:13:49,030 for minimizing a function. 197 00:13:49,030 --> 00:13:53,890 This is such basic stuff that we have to begin here. 198 00:13:53,890 --> 00:13:58,540 Let me even begin with an extremely straightforward 199 00:13:58,540 --> 00:14:02,440 example of Newton's method here. 200 00:14:02,440 --> 00:14:05,950 Suppose my function-- suppose I've only 201 00:14:05,950 --> 00:14:10,900 got one function actually. 202 00:14:10,900 --> 00:14:14,450 Suppose I only had one function. 203 00:14:14,450 --> 00:14:21,240 So suppose my function is x squared minus 9, 204 00:14:21,240 --> 00:14:26,020 and I want to solve f of x equals 0. 205 00:14:26,020 --> 00:14:28,990 I want to find the square root of 9. 206 00:14:28,990 --> 00:14:33,790 OK, so what is Newton's method for it? 207 00:14:33,790 --> 00:14:37,360 My point is just to see how Newton's method is written 208 00:14:37,360 --> 00:14:43,200 and then rewrite it a little bit so that we see the convergence. 209 00:14:43,200 --> 00:14:48,100 OK, so of course, the Jacobian is 2x. 210 00:14:48,100 --> 00:14:51,580 So Newton's method says that xK plus 1-- 211 00:14:51,580 --> 00:14:56,230 I'm just going to copy that Newton's method-- 212 00:14:56,230 --> 00:14:59,170 minus 1 over 2xK. 213 00:14:59,170 --> 00:15:01,060 Right? 214 00:15:01,060 --> 00:15:08,680 That's the derivative times f at xK which is xK squared minus 9. 215 00:15:12,190 --> 00:15:13,280 OK. 216 00:15:13,280 --> 00:15:17,720 We followed the formula, this determines xK plus 1, 217 00:15:17,720 --> 00:15:20,350 and let's simplify it. 218 00:15:20,350 --> 00:15:26,500 So here I have xK minus that looks like 1/2 of xK, 219 00:15:26,500 --> 00:15:30,730 so I think I have 1/2xK, and then 220 00:15:30,730 --> 00:15:40,137 this times this is 9/2 of 1 over xK. 221 00:15:40,137 --> 00:15:40,720 Is that right? 222 00:15:43,450 --> 00:15:50,100 1/2 of xk from this stuff and plus 9/2 of 1 over xK. 223 00:15:50,100 --> 00:15:50,600 OK. 224 00:15:53,940 --> 00:15:59,850 Can I just like check that I know the answer is 3? 225 00:15:59,850 --> 00:16:04,260 Can I be sure that I get the right answer, 3? 226 00:16:04,260 --> 00:16:10,170 That if xK was exactly 3, then of course, I expect xK plus 1 227 00:16:10,170 --> 00:16:11,730 to stay at 3. 228 00:16:11,730 --> 00:16:13,740 So does that happen? 229 00:16:13,740 --> 00:16:24,330 So 1/2 of 3 and 9/2 of 1/3, what's that, 1/2 of 3 and 9/2 230 00:16:24,330 --> 00:16:25,440 of 1/3? 231 00:16:25,440 --> 00:16:28,720 OK, that's 3/2 and 3/2. 232 00:16:28,720 --> 00:16:31,580 That's 6/2, and that's 3. 233 00:16:31,580 --> 00:16:32,080 OK. 234 00:16:35,690 --> 00:16:38,870 So we've checked that the method is consistent which just means 235 00:16:38,870 --> 00:16:41,690 we kept the algebra straight. 236 00:16:41,690 --> 00:16:46,430 But then the really important point about Newton's method 237 00:16:46,430 --> 00:16:50,700 is to discover how fast it converges. 238 00:16:50,700 --> 00:16:55,350 So now let me do xK plus 1 minus 3. 239 00:16:55,350 --> 00:17:00,650 So now, I'm looking at the error which is, I hope, 240 00:17:00,650 --> 00:17:02,420 approaching 0. 241 00:17:02,420 --> 00:17:03,770 Is it approaching 0? 242 00:17:03,770 --> 00:17:05,540 How quickly is it approaching 0? 243 00:17:05,540 --> 00:17:09,470 These are the fundamental questions of optimization. 244 00:17:09,470 --> 00:17:14,940 So I'm going to subtract 3 from both sides somehow. 245 00:17:14,940 --> 00:17:20,970 OK, from here, I guess, I'm going to subtract 3. 246 00:17:20,970 --> 00:17:24,569 So I was just checking that it was correct. 247 00:17:24,569 --> 00:17:25,950 OK. 248 00:17:25,950 --> 00:17:30,310 Now, so xK plus 1 minus 3, I'm going 249 00:17:30,310 --> 00:17:32,310 to subtract 3 from both sides. 250 00:17:32,310 --> 00:17:38,190 I'm going to subtract 3 there, and then I hope that-- 251 00:17:38,190 --> 00:17:40,110 that box is what goes down here. 252 00:17:40,110 --> 00:17:41,070 Right? 253 00:17:41,070 --> 00:17:48,960 Subtracted 3 from both sides, so I'm hoping now things go to 0. 254 00:17:48,960 --> 00:17:52,200 OK, so what do I have there? 255 00:17:52,200 --> 00:17:55,330 Let me factor out the 1 over xK. 256 00:17:58,860 --> 00:18:01,850 So what do I have then left? 257 00:18:01,850 --> 00:18:08,750 1 over xK, so there's a 9/2 from there, 1 over xK. 258 00:18:08,750 --> 00:18:12,710 So I really have 1/2 of xK squared, 259 00:18:12,710 --> 00:18:15,080 because I've divided by an xK. 260 00:18:15,080 --> 00:18:21,120 And this minus 3, I better put minus 3xK, 261 00:18:21,120 --> 00:18:23,230 because I'm dividing by xK. 262 00:18:23,230 --> 00:18:25,570 I claim that that's-- 263 00:18:25,570 --> 00:18:26,360 now I've got it. 264 00:18:30,740 --> 00:18:34,970 And let's see, let me take out the 2-- 265 00:18:34,970 --> 00:18:41,480 2, forget these 2s, and make that a 6. 266 00:18:41,480 --> 00:18:47,210 So I have 1 over 2xK times 9 plus xK squared minus 6. 267 00:18:47,210 --> 00:18:48,410 Anything good about that? 268 00:18:50,980 --> 00:18:51,880 We hope so. 269 00:18:56,450 --> 00:19:00,450 We hope that that is something attractive. 270 00:19:00,450 --> 00:19:05,730 So this is, again, the error at set K plus 1, 271 00:19:05,730 --> 00:19:11,940 and it's 1 over 2xK times this thing in brackets-- 272 00:19:11,940 --> 00:19:15,000 9 plus xK squared minus 6xK. 273 00:19:15,000 --> 00:19:23,235 And we recognize that as xK minus 3 squared. 274 00:19:25,980 --> 00:19:34,710 xK squared minus 6 of them plus 9, that's xK minus 3 squared. 275 00:19:34,710 --> 00:19:40,370 OK, that was the goal, of course. 276 00:19:40,370 --> 00:19:44,390 That's the goal that shows why Newton's method is fantastic. 277 00:19:44,390 --> 00:19:48,080 If you can execute it, if you can start near enough, 278 00:19:48,080 --> 00:19:50,850 notice that-- 279 00:19:50,850 --> 00:19:54,750 so how do I describe this great equation? 280 00:19:54,750 --> 00:19:58,850 It says that the error is squared at every step, squared 281 00:19:58,850 --> 00:20:00,770 at every step. 282 00:20:00,770 --> 00:20:08,132 So if I'm converging to a limit, it will satisfy the-- 283 00:20:08,132 --> 00:20:15,050 it'll be 3, or I guess minus 3, is that possible? 284 00:20:15,050 --> 00:20:17,940 Yeah, minus 3 is another solution here. 285 00:20:17,940 --> 00:20:20,810 So we've got two solutions. 286 00:20:20,810 --> 00:20:22,955 Newton's method could converge to 3. 287 00:20:25,585 --> 00:20:29,190 Am I right, it could converge to minus 3? 288 00:20:29,190 --> 00:20:34,920 So I'd have a similar equation sort of centered at minus 3, 289 00:20:34,920 --> 00:20:37,920 or does it always do one of those? 290 00:20:40,710 --> 00:20:41,700 It could blow up. 291 00:20:44,730 --> 00:20:48,430 So there are sort of regions of attraction. 292 00:20:48,430 --> 00:20:51,690 They're all the starting points that approach 3, 293 00:20:51,690 --> 00:20:53,190 and the whole point of that equation 294 00:20:53,190 --> 00:20:57,870 is with quadratic convergence the error 295 00:20:57,870 --> 00:20:59,730 being squared at every step. 296 00:20:59,730 --> 00:21:01,760 It zooms in on 3. 297 00:21:01,760 --> 00:21:04,060 Then, there is all the starting points 298 00:21:04,060 --> 00:21:07,420 that would go to minus 3, and then there 299 00:21:07,420 --> 00:21:10,920 are the starting points that would blow up. 300 00:21:10,920 --> 00:21:15,170 And those, maybe for this very simple problem, 301 00:21:15,170 --> 00:21:17,760 the picture is not too difficult to sort out 302 00:21:17,760 --> 00:21:18,675 those three regions. 303 00:21:21,600 --> 00:21:29,010 And this is allowing for a vector, two equations or n 304 00:21:29,010 --> 00:21:31,950 equations, then we're in n variables, 305 00:21:31,950 --> 00:21:35,310 and really you get beautiful pictures. 306 00:21:35,310 --> 00:21:39,900 You get some of the type of pictures that gave rise 307 00:21:39,900 --> 00:21:45,030 to these books on fractals, picture books on fractals 308 00:21:45,030 --> 00:21:46,920 for these basins of attraction. 309 00:21:46,920 --> 00:21:52,410 Does the starting point lead you to one of the solutions, 310 00:21:52,410 --> 00:21:54,780 or does it lead you to infinity? 311 00:21:54,780 --> 00:21:59,790 Here, that would be interesting to just draw it for this, 312 00:21:59,790 --> 00:22:03,900 but the essential point is the quadratic convergence, 313 00:22:03,900 --> 00:22:05,670 if it's close enough. 314 00:22:05,670 --> 00:22:09,210 You see that it has to be close. 315 00:22:09,210 --> 00:22:15,760 If x0 is pretty near 3, then this is about 1/6 of that, 316 00:22:15,760 --> 00:22:21,430 and there would be a good region of attraction in this case. 317 00:22:21,430 --> 00:22:22,140 OK. 318 00:22:22,140 --> 00:22:29,640 So that's Newton's method for equations. 319 00:22:29,640 --> 00:22:31,860 And now I want to do Newton's method. 320 00:22:31,860 --> 00:22:34,230 I just want to convert all those words over 321 00:22:34,230 --> 00:22:38,450 to Newton's method for optimization. 322 00:22:38,450 --> 00:22:45,810 So remember, these boards were solving f equals 0. 323 00:22:45,810 --> 00:22:48,570 This board is minimizing capital F, 324 00:22:48,570 --> 00:22:50,860 and what's the connection between them? 325 00:22:50,860 --> 00:23:02,200 Well of course, this corresponds to solving the gradient 326 00:23:02,200 --> 00:23:04,330 equals 0. 327 00:23:04,330 --> 00:23:07,420 At a minimum, if I'm minimizing, I'm 328 00:23:07,420 --> 00:23:11,750 finding a point where all the first derivatives are 0. 329 00:23:11,750 --> 00:23:16,060 So that will be the match between these. 330 00:23:16,060 --> 00:23:22,510 This grad F in this picture is the small f in that picture. 331 00:23:22,510 --> 00:23:23,010 OK. 332 00:23:25,740 --> 00:23:28,710 Now, I guess here I have-- 333 00:23:28,710 --> 00:23:33,930 and this is sort of the heart of our applications 334 00:23:33,930 --> 00:23:38,070 to deep learning-- we have very complicated loss functions 335 00:23:38,070 --> 00:23:40,260 to minimize, functions of thousands 336 00:23:40,260 --> 00:23:42,900 or hundreds of thousands of variables. 337 00:23:42,900 --> 00:23:44,840 OK. 338 00:23:44,840 --> 00:23:47,840 So that means that we would like to use Newton's method, 339 00:23:47,840 --> 00:23:49,970 but often we can't. 340 00:23:49,970 --> 00:23:54,380 So I need him to put down here two methods-- 341 00:23:54,380 --> 00:23:59,060 one that doesn't involve those high second derivatives 342 00:23:59,060 --> 00:24:02,750 and Newton's that does. 343 00:24:02,750 --> 00:24:10,430 So first, I'll write down a method that does not involve, 344 00:24:10,430 --> 00:24:16,390 so method one, and this will be steepest descent. 345 00:24:25,040 --> 00:24:26,910 And what is that? 346 00:24:26,910 --> 00:24:35,430 That says that xK plus 1-- the new x is the old x minus-- 347 00:24:35,430 --> 00:24:40,110 steepest descent means that I move in the steepest direction 348 00:24:40,110 --> 00:24:45,240 which is the direction of the gradient of F. 349 00:24:45,240 --> 00:24:47,520 I move some distance, and I better 350 00:24:47,520 --> 00:24:51,360 have freedom to decide what that distance should be. 351 00:24:51,360 --> 00:25:01,670 So this is a step size, s, or in the language of deep learning, 352 00:25:01,670 --> 00:25:03,830 it's often called the learning rate, 353 00:25:03,830 --> 00:25:06,450 so if you see learning rate. 354 00:25:11,910 --> 00:25:12,410 OK. 355 00:25:19,000 --> 00:25:23,440 So and it's natural to choose sK. 356 00:25:23,440 --> 00:25:27,340 We're going along, do you see what this right-hand side looks 357 00:25:27,340 --> 00:25:27,970 like? 358 00:25:27,970 --> 00:25:31,480 I'm at a point in n dimensions. 359 00:25:31,480 --> 00:25:34,640 We're in n dimensions here. 360 00:25:34,640 --> 00:25:37,770 We have functions of n variables. 361 00:25:37,770 --> 00:25:39,020 There is a vector. 362 00:25:39,020 --> 00:25:43,780 There is a direction to move down the steepest slope 363 00:25:43,780 --> 00:25:46,360 of the graph. 364 00:25:46,360 --> 00:25:51,910 And here is a distance to move, and we will stop. 365 00:25:51,910 --> 00:25:59,350 We'll have to get off this step, normally. 366 00:25:59,350 --> 00:26:02,650 If we stay on it, it will swing back, 367 00:26:02,650 --> 00:26:06,910 it'll take us off to infinity. 368 00:26:06,910 --> 00:26:12,450 You would like to choose sK so that you minimize 369 00:26:12,450 --> 00:26:20,080 capital F. You take the point on this line, so this a line in R 370 00:26:20,080 --> 00:26:24,350 n, a direction in R n. 371 00:26:28,820 --> 00:26:33,790 And for all the points on that line, in that direction, 372 00:26:33,790 --> 00:26:39,090 F has some value, and what you expect is that initially, 373 00:26:39,090 --> 00:26:45,460 because you chose it sensibly, the value of F will drop. 374 00:26:45,460 --> 00:26:49,830 But then at a certain point, it will turn back on you 375 00:26:49,830 --> 00:26:51,130 and increase. 376 00:26:51,130 --> 00:26:54,430 So that would be the natural stopping point. 377 00:26:54,430 --> 00:26:57,460 I would call that an exact line search. 378 00:26:57,460 --> 00:27:08,930 So I exact line search would be, exact line search is the best 379 00:27:08,930 --> 00:27:09,430 s. 380 00:27:11,950 --> 00:27:16,390 Of course, that would take time to compute, 381 00:27:16,390 --> 00:27:19,360 and you probably, in deep learning, 382 00:27:19,360 --> 00:27:25,690 that's time you can't afford, so you fix the learning rate s. 383 00:27:25,690 --> 00:27:29,240 Maybe you choose 0.01 to be pretty safe. 384 00:27:29,240 --> 00:27:32,800 OK, so that's method one, steepest descent. 385 00:27:32,800 --> 00:27:34,800 Now, method two will be Newton's method. 386 00:27:43,380 --> 00:27:53,070 So now, we have xK plus 1 equal to xK minus something 387 00:27:53,070 --> 00:28:01,810 times delta F, and now I'm going to do the right thing. 388 00:28:01,810 --> 00:28:05,760 I'm going to live right here, and the right thing 389 00:28:05,760 --> 00:28:08,850 is the Hessian, the second derivative. 390 00:28:08,850 --> 00:28:10,770 This was cheap. 391 00:28:10,770 --> 00:28:14,280 We just took the direction and went along it. 392 00:28:14,280 --> 00:28:17,850 Now, we're getting really the right direction 393 00:28:17,850 --> 00:28:23,850 by using the second derivative, so that's H inverse. 394 00:28:23,850 --> 00:28:32,430 OK, and what I've done is to set that 0. 395 00:28:37,900 --> 00:28:40,090 Do you see that's Newton's method? 396 00:28:40,090 --> 00:28:44,110 It's totally parallel to this guy. 397 00:28:44,110 --> 00:28:48,850 Actually, I'm really happy to have these two on the board 398 00:28:48,850 --> 00:28:52,540 parallel to each other, because you have to keep straight, 399 00:28:52,540 --> 00:28:56,800 are you solving equations, or are you minimizing functions? 400 00:28:56,800 --> 00:29:00,430 And you're using different letters in the two problems, 401 00:29:00,430 --> 00:29:02,850 but now you see how they match. 402 00:29:02,850 --> 00:29:08,680 The Jacobian of-- so again the matches, 403 00:29:08,680 --> 00:29:13,600 think of f as the gradient of F. That's 404 00:29:13,600 --> 00:29:16,510 the way you should think of it. 405 00:29:16,510 --> 00:29:24,940 So the Jacobian of the gradient is the Hessian. 406 00:29:24,940 --> 00:29:27,880 The Jacobian of the gradient is the Hessian, 407 00:29:27,880 --> 00:29:30,790 and that makes sense, because the first derivative 408 00:29:30,790 --> 00:29:33,460 of the first derivative is the second derivative. 409 00:29:33,460 --> 00:29:39,640 Only we're doing matrix y, so the Jacobian of the gradient-- 410 00:29:39,640 --> 00:29:41,620 we're doing a vector matrix sentence 411 00:29:41,620 --> 00:29:46,090 instead of a scalar sentence-- the Jacobian of the gradient 412 00:29:46,090 --> 00:29:48,110 is a Hessian. 413 00:29:48,110 --> 00:29:50,600 Yeah, right. 414 00:29:50,600 --> 00:29:53,770 OK, so that's what I wanted to start with, 415 00:29:53,770 --> 00:29:56,110 just to get those basic facts down. 416 00:29:56,110 --> 00:30:03,280 And so the basic facts were the three-term Taylor series. 417 00:30:03,280 --> 00:30:09,400 And then the basic algorithms followed naturally from it 418 00:30:09,400 --> 00:30:13,420 by setting f F at the new point to 0, 419 00:30:13,420 --> 00:30:16,010 if that's what you were solving or by assuming 420 00:30:16,010 --> 00:30:17,020 you had the minimum. 421 00:30:17,020 --> 00:30:19,420 Right, good, good, good, good. 422 00:30:19,420 --> 00:30:21,100 OK. 423 00:30:21,100 --> 00:30:22,900 Now, what? 424 00:30:22,900 --> 00:30:26,380 Now, we have to think about solving these problems, 425 00:30:26,380 --> 00:30:27,730 studying. 426 00:30:27,730 --> 00:30:29,170 Do they converge? 427 00:30:29,170 --> 00:30:30,970 What rate do they converge? 428 00:30:30,970 --> 00:30:35,770 Well, the rate of convergence is like why 429 00:30:35,770 --> 00:30:39,410 I separated off this example. 430 00:30:39,410 --> 00:30:41,800 So the convergence rate for Newton's method 431 00:30:41,800 --> 00:30:43,840 will be quadratic. 432 00:30:43,840 --> 00:30:47,980 The error gets squared, and of course, that 433 00:30:47,980 --> 00:30:53,160 means super-fast convergence, if you start and close enough. 434 00:30:53,160 --> 00:30:55,860 The rate of convergence for a steepest descent 435 00:30:55,860 --> 00:30:57,600 is, of course, not. 436 00:30:57,600 --> 00:31:00,480 You're not squaring errors here, because you're just 437 00:31:00,480 --> 00:31:03,630 taking some number instead of the inverse 438 00:31:03,630 --> 00:31:12,610 of the correct matrix, so you can't expect super speed. 439 00:31:12,610 --> 00:31:15,810 So a linear rate of convergence would be right. 440 00:31:18,430 --> 00:31:21,240 You would like to know that the error is 441 00:31:21,240 --> 00:31:26,850 multiplied at every step by some constant below 1. 442 00:31:26,850 --> 00:31:29,400 That would be a linear rate compared 443 00:31:29,400 --> 00:31:35,420 to being squared at every step. 444 00:31:35,420 --> 00:31:42,310 OK, and so this will be our basic formula 445 00:31:42,310 --> 00:31:50,830 that we build on for really large scale problems. 446 00:31:50,830 --> 00:31:53,590 And there are methods, of course, 447 00:31:53,590 --> 00:31:57,790 people are going to come up with methods that they're sort 448 00:31:57,790 --> 00:32:01,260 of a cheap Newton's method. 449 00:32:01,260 --> 00:32:05,020 Levenberg-Marquardt, and it's in the notes 450 00:32:05,020 --> 00:32:09,850 at the end of this section, at the end of 6.4 451 00:32:09,850 --> 00:32:11,105 that we'll get to. 452 00:32:11,105 --> 00:32:16,130 So Levenberg-Marquardt is a sort of cheap man's Newton's method. 453 00:32:16,130 --> 00:32:22,180 It does not compute to Hessian, but it says, OK, 454 00:32:22,180 --> 00:32:27,160 from the gradient, I can see one term in the Hessian. 455 00:32:27,160 --> 00:32:34,570 So it grabs that term, but it's not fully second order. 456 00:32:34,570 --> 00:32:36,270 OK. 457 00:32:36,270 --> 00:32:40,860 So now, we have to think about problems, 458 00:32:40,860 --> 00:32:45,360 and I guess the message here is, at our starting point, 459 00:32:45,360 --> 00:32:49,100 has to be convexity. 460 00:32:49,100 --> 00:32:55,090 Convexity is the key word for these problems, 461 00:32:55,090 --> 00:32:57,740 for the function that we want to minimize. 462 00:32:57,740 --> 00:33:01,800 If that's a convex function, well first of all, 463 00:33:01,800 --> 00:33:07,570 the convex function is likely to have one minimum. 464 00:33:07,570 --> 00:33:13,850 And the picture that's in our mind of steepest descent, 465 00:33:13,850 --> 00:33:17,410 this picture of a bowl, a bowl is the graph 466 00:33:17,410 --> 00:33:19,480 of the convex function. 467 00:33:19,480 --> 00:33:21,970 So I'm turning to convexity now. 468 00:33:21,970 --> 00:33:28,260 I'll leave that board there, because that's pretty crucial, 469 00:33:28,260 --> 00:33:32,010 and speak about the idea of convexity. 470 00:33:32,010 --> 00:33:39,960 Convex function, convex set, so let's call 471 00:33:39,960 --> 00:33:45,540 the function f of x, and a typical convex set 472 00:33:45,540 --> 00:33:50,590 would be I'll call it K. OK. 473 00:33:50,590 --> 00:33:54,760 So we just want to remember what does that word can convex mean, 474 00:33:54,760 --> 00:33:57,910 and how do you know if you have a convex function 475 00:33:57,910 --> 00:33:59,470 or a convex set? 476 00:33:59,470 --> 00:34:01,930 OK, let me start with convex set. 477 00:34:01,930 --> 00:34:05,290 So because here is my general problem, 478 00:34:05,290 --> 00:34:16,960 my convex minimization, which you hope to have, 479 00:34:16,960 --> 00:34:19,480 and in many applications, you do have. 480 00:34:19,480 --> 00:34:26,199 So you minimize a convex function 481 00:34:26,199 --> 00:34:31,330 for points in a convex set. 482 00:34:31,330 --> 00:34:34,570 So that's like the ideal situation. 483 00:34:34,570 --> 00:34:38,230 That's the ideal situation, to get something 484 00:34:38,230 --> 00:34:41,590 on your side, something powerful, convexity. 485 00:34:41,590 --> 00:34:44,409 The function is convex, and you say, well, 486 00:34:44,409 --> 00:34:47,710 let me draw a convex function, the graph. 487 00:34:47,710 --> 00:34:54,980 OK, so I'll draw a convex function, say a bowl. 488 00:34:54,980 --> 00:35:00,640 So that's a graph of f of x, and then here are the x's. 489 00:35:00,640 --> 00:35:08,800 Let me maybe put x1 and x2 in the base and the graph of f 490 00:35:08,800 --> 00:35:11,680 of 1x x2 up here. 491 00:35:16,110 --> 00:35:16,610 OK. 492 00:35:16,610 --> 00:35:17,820 Actually, I'm over there. 493 00:35:17,820 --> 00:35:21,340 I should be calling this function F, I think. 494 00:35:24,268 --> 00:35:25,732 Is that right? 495 00:35:29,160 --> 00:35:33,220 Yeah, a little f would be the gradient of this guy. 496 00:35:33,220 --> 00:35:34,000 Yeah, I think so. 497 00:35:38,543 --> 00:35:39,043 OK. 498 00:35:43,200 --> 00:35:49,120 Now, I'm minimizing it over certain x's, not all x's. 499 00:35:49,120 --> 00:35:55,750 I might be minimizing, for example, 500 00:35:55,750 --> 00:36:05,230 K might be the set where Ax equals B. K might be, 501 00:36:05,230 --> 00:36:09,460 in that case, a subspace or a shifted subspace. 502 00:36:09,460 --> 00:36:14,350 I said subspace, but then 18.06 is reminding me in my mind 503 00:36:14,350 --> 00:36:16,710 that I only have a subspace when B is 0. 504 00:36:19,330 --> 00:36:23,530 You know the word for a subspace that's sort of moved over? 505 00:36:23,530 --> 00:36:27,110 Affine, so I'll just put that word down here. 506 00:36:30,190 --> 00:36:35,200 Bunch of words to learn for this topic, 507 00:36:35,200 --> 00:36:36,760 but they're worth learning. 508 00:36:36,760 --> 00:36:37,630 OK. 509 00:36:37,630 --> 00:36:42,700 So it's like a plane but not necessarily through the origin. 510 00:36:42,700 --> 00:36:44,640 If B is 0, it doesn't go through it. 511 00:36:44,640 --> 00:36:46,980 If B it's not 0, it doesn't go through the origin. 512 00:36:46,980 --> 00:36:47,480 OK. 513 00:36:47,480 --> 00:36:49,940 Anyway, or I have some other convex set. 514 00:36:49,940 --> 00:36:57,519 Let me just put this convex set K in the base for you, 515 00:36:57,519 --> 00:36:59,470 and did I make it convex? 516 00:36:59,470 --> 00:37:04,060 I think pretty luckily I did. 517 00:37:04,060 --> 00:37:05,200 So now what's the? 518 00:37:07,840 --> 00:37:11,540 Well, the convex sets the constraint, 519 00:37:11,540 --> 00:37:13,840 so this is the constraint set. 520 00:37:13,840 --> 00:37:23,290 Constraint is that x must be in the set K. OK, 521 00:37:23,290 --> 00:37:28,120 and I drew it as a convex blob. 522 00:37:28,120 --> 00:37:32,770 Here was an example where it's flat, not 523 00:37:32,770 --> 00:37:36,110 a blob but a flat plane. 524 00:37:36,110 --> 00:37:41,020 But let me come back to what does convex mean. 525 00:37:41,020 --> 00:37:42,450 What's a convex set? 526 00:37:42,450 --> 00:37:47,430 Yeah, we have to do that, should have done that before. 527 00:37:55,140 --> 00:38:00,040 In the notes, I had the fun of figuring out, 528 00:38:00,040 --> 00:38:04,870 if I took a triangle, is that a convex set? 529 00:38:04,870 --> 00:38:05,830 Let's just be sure. 530 00:38:09,640 --> 00:38:11,530 So what's a convex set? 531 00:38:11,530 --> 00:38:14,020 That is a convex set, because if I 532 00:38:14,020 --> 00:38:19,120 take any two points in the set and draw the line between them, 533 00:38:19,120 --> 00:38:21,710 it stays in the set. 534 00:38:21,710 --> 00:38:31,370 So that's convexity, any edge, line, from x1 535 00:38:31,370 --> 00:38:37,970 to x2 stays in the set. 536 00:38:37,970 --> 00:38:39,920 OK, good. 537 00:38:39,920 --> 00:38:43,380 So here's my little exercise to myself. 538 00:38:43,380 --> 00:38:46,720 What if I took the union of two triangles? 539 00:38:46,720 --> 00:38:52,760 All I want to get you to do is just visualize convex and not 540 00:38:52,760 --> 00:38:54,620 convex possibilities. 541 00:38:54,620 --> 00:39:02,120 Suppose I have one triangle, even if it was obtuse, 542 00:39:02,120 --> 00:39:04,730 that's still convex, right? 543 00:39:04,730 --> 00:39:06,230 No problem. 544 00:39:06,230 --> 00:39:08,630 But now what if I put those two triangles 545 00:39:08,630 --> 00:39:11,310 together, take their union? 546 00:39:11,310 --> 00:39:15,740 Well, if I take them sitting with a big gap between, 547 00:39:15,740 --> 00:39:17,330 like I've lost. 548 00:39:17,330 --> 00:39:20,390 I mean, I never had a chance that way, 549 00:39:20,390 --> 00:39:24,290 because if it was the union of these two-- well, 550 00:39:24,290 --> 00:39:25,880 you know what I'm going to say. 551 00:39:25,880 --> 00:39:28,100 If I'm doing that point and that point, of course, 552 00:39:28,100 --> 00:39:30,270 it goes outside and stupid. 553 00:39:30,270 --> 00:39:30,770 All right. 554 00:39:36,140 --> 00:39:41,150 What if what if that triangle, that lower triangle, 555 00:39:41,150 --> 00:39:42,980 overlaps the upper triangle? 556 00:39:42,980 --> 00:39:44,240 Is that a convex set? 557 00:39:47,000 --> 00:39:49,130 Everybody's right saying no. 558 00:39:49,130 --> 00:39:53,240 Why how do I see that the union of those two triangles 559 00:39:53,240 --> 00:39:56,200 is not a convex set? 560 00:39:56,200 --> 00:39:59,530 Guys, you tell me where to pick two points, 561 00:39:59,530 --> 00:40:00,940 where the line goes out. 562 00:40:00,940 --> 00:40:03,700 Well, I take one from that corner and one 563 00:40:03,700 --> 00:40:08,260 from that corner, and the line between them went outside. 564 00:40:08,260 --> 00:40:17,305 So union is usually not convex. 565 00:40:21,990 --> 00:40:24,240 Well, if I think of the union of two sets, 566 00:40:24,240 --> 00:40:29,560 my mind automatically goes to the other corresponding 567 00:40:29,560 --> 00:40:36,540 possibility which is the intersection of the two sets. 568 00:40:36,540 --> 00:40:39,384 So if I take the intersection of two sets. 569 00:40:43,680 --> 00:40:46,890 Now, what's the deal with that? 570 00:40:46,890 --> 00:40:50,670 When I had two triangles, two separated triangles, 571 00:40:50,670 --> 00:40:54,780 what can we say about the intersection of those two 572 00:40:54,780 --> 00:40:55,440 triangles? 573 00:40:55,440 --> 00:40:56,790 AUDIENCE: [INAUDIBLE] 574 00:40:56,790 --> 00:40:58,510 GILBERT STRANG: It's empty. 575 00:40:58,510 --> 00:41:02,380 So should we regard the empty set as a convex set? 576 00:41:02,380 --> 00:41:03,950 Yes. 577 00:41:03,950 --> 00:41:04,797 Isn't it? 578 00:41:04,797 --> 00:41:06,290 AUDIENCE: Yeah, it's vacuous. 579 00:41:06,290 --> 00:41:09,770 GILBERT STRANG: Vacuous, so it hasn't got any problems. 580 00:41:09,770 --> 00:41:11,430 Right? 581 00:41:11,430 --> 00:41:18,770 OK, but now the intersection is always convex. 582 00:41:18,770 --> 00:41:22,640 I'm assuming the two sets that we start with are. 583 00:41:22,640 --> 00:41:26,640 Now, that's an important fact, that the intersection of convex 584 00:41:26,640 --> 00:41:27,140 sets. 585 00:41:27,140 --> 00:41:30,305 Let's just draw a picture that shows an example. 586 00:41:34,160 --> 00:41:35,840 So what's the intersection? 587 00:41:35,840 --> 00:41:39,560 Just this part and it's convex. 588 00:41:39,560 --> 00:41:42,480 OK, can you give me a little proof 589 00:41:42,480 --> 00:41:48,670 that the intersection is convex? 590 00:41:48,670 --> 00:41:51,060 So I take two points in the intersection-- 591 00:41:51,060 --> 00:41:54,070 let me start the proof. 592 00:41:54,070 --> 00:41:57,550 To test if something's convex, how do you test it? 593 00:41:57,550 --> 00:42:01,570 You take two points in the set in the intersection, 594 00:42:01,570 --> 00:42:03,580 and you want to show that the line between them 595 00:42:03,580 --> 00:42:06,850 is in the intersection. 596 00:42:06,850 --> 00:42:08,420 OK, why is that? 597 00:42:08,420 --> 00:42:14,350 So take two points, take x1 in the intersection. 598 00:42:14,350 --> 00:42:17,530 We've got two sets here, and that's 599 00:42:17,530 --> 00:42:19,870 the symbol for intersection, and we've 600 00:42:19,870 --> 00:42:22,017 got another point in the intersection. 601 00:42:24,640 --> 00:42:27,950 And now, we want to look at the line between them, 602 00:42:27,950 --> 00:42:33,100 the line from x1 to 2x. 603 00:42:33,100 --> 00:42:35,236 What's the deal with that one? 604 00:42:38,070 --> 00:42:41,038 Is that fully in K1? 605 00:42:41,038 --> 00:42:42,830 AUDIENCE: Yes. 606 00:42:42,830 --> 00:42:45,510 GILBERT STRANG: Why is it fully in K1? 607 00:42:45,510 --> 00:42:48,830 I took two points in the intersection, 608 00:42:48,830 --> 00:42:52,850 I'm looking at the line between them, and I'm asking, 609 00:42:52,850 --> 00:42:54,680 is it in the first set K1? 610 00:42:54,680 --> 00:42:57,950 And the answer is yes, because those points 611 00:42:57,950 --> 00:43:02,010 were in K1, and K1's convex. 612 00:43:02,010 --> 00:43:05,520 And is that line between them in K2? 613 00:43:05,520 --> 00:43:11,550 Yes, same reason, the two endpoints were in K2, 614 00:43:11,550 --> 00:43:13,890 so the line between them is in K2. 615 00:43:13,890 --> 00:43:17,490 So the intersection of convex sets is always convex. 616 00:43:17,490 --> 00:43:22,990 The intersection of convex sets is convex. 617 00:43:22,990 --> 00:43:23,490 Good. 618 00:43:26,420 --> 00:43:29,570 So you'll see in the note these possibilities 619 00:43:29,570 --> 00:43:31,560 with two triangles. 620 00:43:31,560 --> 00:43:35,770 Sometimes, you can take the union but not very often. 621 00:43:35,770 --> 00:43:36,590 OK. 622 00:43:36,590 --> 00:43:41,130 Now, what's the next thing I have to do? 623 00:43:41,130 --> 00:43:45,230 Convex functions, we got convex sets, 624 00:43:45,230 --> 00:43:48,590 what are convex functions, and we're good. 625 00:43:48,590 --> 00:43:53,140 Because this is our prototype of a problem, 626 00:43:53,140 --> 00:43:57,930 and I now want to know what it means for that F to be-- 627 00:43:57,930 --> 00:43:58,640 oh, I'm sorry. 628 00:43:58,640 --> 00:44:03,920 I now know what it means for the set K to be convex set, 629 00:44:03,920 --> 00:44:07,525 but now I have to look at the other often more important part 630 00:44:07,525 --> 00:44:08,150 of the problem. 631 00:44:08,150 --> 00:44:10,550 What's the function I'm minimizing, 632 00:44:10,550 --> 00:44:15,110 and I'm looking for functions with this kind of a picture. 633 00:44:15,110 --> 00:44:16,610 OK. 634 00:44:16,610 --> 00:44:22,580 The coolest way is to connect the definition of a convex 635 00:44:22,580 --> 00:44:28,340 function to the definition of a convex set. 636 00:44:28,340 --> 00:44:31,820 This is really the nicest way. 637 00:44:31,820 --> 00:44:32,780 It's a little quick. 638 00:44:32,780 --> 00:44:34,610 It just swishes by you. 639 00:44:34,610 --> 00:44:38,240 But tell me, do you see a convex set in that picture? 640 00:44:38,240 --> 00:44:40,930 [INAUDIBLE] 641 00:44:40,930 --> 00:44:43,550 You see a convex set in that picture. 642 00:44:43,550 --> 00:44:46,300 That's the picture of a graph of a convex function. 643 00:44:46,300 --> 00:44:48,850 It's a picture of a bowl. 644 00:44:48,850 --> 00:44:53,400 Are the points on that surface, is that a convex set? 645 00:44:53,400 --> 00:44:55,650 No, certainly not. 646 00:44:55,650 --> 00:45:03,310 No, but where is a convex set to be found here, in that picture? 647 00:45:03,310 --> 00:45:04,290 Yes. 648 00:45:04,290 --> 00:45:07,230 AUDIENCE: The set of y, if y is greater than [INAUDIBLE] 649 00:45:07,230 --> 00:45:11,210 GILBERT STRANG: Yes, the points on and above the bowl, 650 00:45:11,210 --> 00:45:16,470 inside the bowl, we could say, these points. 651 00:45:16,470 --> 00:45:21,600 So convex function, yes, a function's 652 00:45:21,600 --> 00:45:37,320 convex when the points on and above the graph are convex set. 653 00:45:48,650 --> 00:45:51,110 You could say, OK, mathematicians 654 00:45:51,110 --> 00:45:52,100 are just being lazy. 655 00:45:52,100 --> 00:45:55,190 Having got one definition straight for a convex set, 656 00:45:55,190 --> 00:45:59,045 now they're just using that to give an easy definition 657 00:45:59,045 --> 00:46:00,930 of a convex function. 658 00:46:00,930 --> 00:46:04,190 Actually, it's quite useful for functions 659 00:46:04,190 --> 00:46:09,080 that could maybe equal infinity, sort of generalized functions. 660 00:46:09,080 --> 00:46:15,260 But it's not the quickest way to tell if the function is convex. 661 00:46:15,260 --> 00:46:19,070 It's not our usual test for convex functions. 662 00:46:19,070 --> 00:46:21,520 So now I want to give such a test. 663 00:46:21,520 --> 00:46:22,020 OK. 664 00:46:24,740 --> 00:46:36,030 So now, the definition of convex function, of a smooth convex, 665 00:46:36,030 --> 00:46:36,530 yeah. 666 00:46:39,640 --> 00:46:45,190 This fact, I shouldn't rush off away from it, 667 00:46:45,190 --> 00:46:51,560 from the definition of a convex function 668 00:46:51,560 --> 00:46:55,020 as having a convex set above its graph. 669 00:46:55,020 --> 00:46:59,450 The really official French name for the set above the graph 670 00:46:59,450 --> 00:47:03,780 is the epigraph, but I won't even write that word down. 671 00:47:03,780 --> 00:47:05,690 OK. 672 00:47:05,690 --> 00:47:08,300 Why do I come back to that for a minute? 673 00:47:08,300 --> 00:47:16,720 Because I would like to think about two functions, F1 and F2. 674 00:47:16,720 --> 00:47:18,640 Out of two functions, I can always 675 00:47:18,640 --> 00:47:23,140 create the minimum or the maximum. 676 00:47:23,140 --> 00:47:28,795 So suppose I have to convex functions, convex function 677 00:47:28,795 --> 00:47:30,535 F1 and F2. 678 00:47:34,440 --> 00:47:35,050 OK. 679 00:47:35,050 --> 00:47:38,840 Then, I could choose a minimum. 680 00:47:38,840 --> 00:47:41,450 I could choose my new function. 681 00:47:41,450 --> 00:47:44,090 Shall I call it little m for minimum? 682 00:47:44,090 --> 00:47:51,340 m of x is the minimum of F1 and F2. 683 00:47:51,340 --> 00:47:54,380 And I could choose a maximum function 684 00:47:54,380 --> 00:47:59,330 which would be the maximum of F1 of x and F2 685 00:47:59,330 --> 00:48:03,870 of x at the same point x. 686 00:48:03,870 --> 00:48:07,280 It's just a natural to think, OK, I have two functions. 687 00:48:07,280 --> 00:48:11,212 I've got a bowl and I've got another bowl, 688 00:48:11,212 --> 00:48:12,545 and suppose they're both convex. 689 00:48:18,030 --> 00:48:20,472 So I'm just stretching you to think here. 690 00:48:20,472 --> 00:48:25,060 If I've got the graphs of two convex functions, 691 00:48:25,060 --> 00:48:29,560 and I would like to consider the minimum of those two functions 692 00:48:29,560 --> 00:48:32,150 and also the maximum of those two functions. 693 00:48:32,150 --> 00:48:34,730 I believe life is good. 694 00:48:34,730 --> 00:48:38,830 One of these will be convex, and the other won't. 695 00:48:38,830 --> 00:48:41,980 And can you identify which one is convex 696 00:48:41,980 --> 00:48:43,300 and which one is not convex? 697 00:48:46,310 --> 00:48:49,100 What about the minimum? 698 00:48:49,100 --> 00:48:50,690 Is that a convex function? 699 00:48:50,690 --> 00:48:52,470 So just look at the graph. 700 00:48:52,470 --> 00:48:54,110 What does the minimum look like? 701 00:48:54,110 --> 00:48:58,250 The minimum is this guy until they meet somehow 702 00:48:58,250 --> 00:49:01,190 on some surface and then this guy. 703 00:49:01,190 --> 00:49:02,240 Is that convex? 704 00:49:02,240 --> 00:49:05,900 We have like one minute to answer that question. 705 00:49:05,900 --> 00:49:07,160 Absolutely no. 706 00:49:07,160 --> 00:49:10,370 It's got this bad kink in it. 707 00:49:10,370 --> 00:49:14,220 What about the maximum of the two functions? 708 00:49:14,220 --> 00:49:16,490 So the maximum is the one that is above, 709 00:49:16,490 --> 00:49:22,150 all the points or things that are above or on. 710 00:49:26,010 --> 00:49:27,770 There is the maximum function. 711 00:49:27,770 --> 00:49:29,990 That was the minimum function. 712 00:49:29,990 --> 00:49:31,460 It had a kink. 713 00:49:31,460 --> 00:49:33,380 The maximum function is like that, 714 00:49:33,380 --> 00:49:41,530 and it is convex, so maximum yes, minimum no. 715 00:49:44,310 --> 00:49:50,740 OK, and we could have a maximum of 1,500 functions. 716 00:49:50,740 --> 00:49:53,800 If the 1,500 functions are all convex, 717 00:49:53,800 --> 00:49:55,600 the maximum will be, because it's 718 00:49:55,600 --> 00:50:00,520 the part way above everybody's graph, 719 00:50:00,520 --> 00:50:03,580 and that would be the graph of the maximum. 720 00:50:03,580 --> 00:50:05,410 OK, good. 721 00:50:05,410 --> 00:50:11,260 And now finally, let me just say, how do you know 722 00:50:11,260 --> 00:50:13,480 whether a function is convex? 723 00:50:13,480 --> 00:50:16,570 How to test, how of test. 724 00:50:20,330 --> 00:50:24,890 OK, so let me take just a function of one variable. 725 00:50:24,890 --> 00:50:28,130 What's the test you learned in calculus, freshman 726 00:50:28,130 --> 00:50:36,360 calculus actually, just show that this is a convex function? 727 00:50:36,360 --> 00:50:38,542 What's the test for that? 728 00:50:38,542 --> 00:50:40,067 AUDIENCE: Use second derivative. 729 00:50:40,067 --> 00:50:41,900 GILBERT STRANG: Second derivative should be? 730 00:50:41,900 --> 00:50:42,692 AUDIENCE: Positive. 731 00:50:42,692 --> 00:50:45,200 GILBERT STRANG: Positive or possibly 0, 732 00:50:45,200 --> 00:50:53,360 so second derivative greater or equals 0 everywhere. 733 00:50:53,360 --> 00:50:54,360 That's convex. 734 00:50:57,480 --> 00:51:02,920 OK, final question, suppose F is a vector. 735 00:51:02,920 --> 00:51:09,370 So this is a vector, and so I have n functions of n variable. 736 00:51:09,370 --> 00:51:10,020 No, I don't. 737 00:51:10,020 --> 00:51:14,030 I have one, sorry, I've got one function, 738 00:51:14,030 --> 00:51:15,490 but I'm in n variables. 739 00:51:19,510 --> 00:51:21,680 So this was just one. 740 00:51:21,680 --> 00:51:23,360 What's the test for convexity? 741 00:51:27,350 --> 00:51:33,080 So it would be passed, for example, by x1 squared plus x2 742 00:51:33,080 --> 00:51:35,150 squared. 743 00:51:35,150 --> 00:51:37,683 Would it be passed by-- 744 00:51:37,683 --> 00:51:39,350 so here would be the question-- would it 745 00:51:39,350 --> 00:51:45,830 be passed by x transpose some symmetric matrix S? 746 00:51:45,830 --> 00:51:51,000 That would be a quadratic, a pure quadratic. 747 00:51:51,000 --> 00:51:53,760 Would it be convex? 748 00:51:53,760 --> 00:51:54,780 What would be the test? 749 00:51:57,500 --> 00:52:00,350 I'm looking for an n dimensional equivalent 750 00:52:00,350 --> 00:52:04,460 of positive second derivative. 751 00:52:04,460 --> 00:52:07,370 The n dimensional equivalent of positive second derivative 752 00:52:07,370 --> 00:52:12,560 is convexity, and we have to recognize what's the test. 753 00:52:12,560 --> 00:52:14,990 So I could apply it to this function, 754 00:52:14,990 --> 00:52:20,090 or I could apply it to any function of n variables. 755 00:52:24,170 --> 00:52:25,280 It should be OK. 756 00:52:27,960 --> 00:52:29,220 What's the test here? 757 00:52:29,220 --> 00:52:33,340 Here, I have a matrix instead of a number. 758 00:52:33,340 --> 00:52:38,090 So what's the requirement going to be? 759 00:52:38,090 --> 00:52:39,527 Times out, yeah? 760 00:52:39,527 --> 00:52:43,380 [INAUDIBLE] Positive definite or semidefinite, 761 00:52:43,380 --> 00:52:45,440 or semidefinite just as here. 762 00:52:45,440 --> 00:52:45,940 Yeah. 763 00:52:45,940 --> 00:52:51,680 So the test is positive, semidefinite, Hessian. 764 00:52:54,520 --> 00:52:56,830 And here, the Hessian is actually that S, 765 00:52:56,830 --> 00:53:00,700 because the second derivatives will produce-- 766 00:53:00,700 --> 00:53:02,220 I'll put a 1/2 in there-- 767 00:53:02,220 --> 00:53:06,940 the second derivatives will produce S equal the Hessian H. 768 00:53:06,940 --> 00:53:08,590 So here, the S-- 769 00:53:08,590 --> 00:53:11,800 so positive semidefinite, Hessian 770 00:53:11,800 --> 00:53:17,920 in general, second derivative matrix for a quadratic. 771 00:53:17,920 --> 00:53:19,150 OK. 772 00:53:19,150 --> 00:53:23,950 So its convex problems that we're 773 00:53:23,950 --> 00:53:28,150 going to get farther with. 774 00:53:28,150 --> 00:53:31,040 We run into no saddle points. 775 00:53:31,040 --> 00:53:34,220 We run into no local minimum. 776 00:53:34,220 --> 00:53:36,950 Once we found the minimum, it's the global minimum. 777 00:53:36,950 --> 00:53:38,520 These are the good problems. 778 00:53:38,520 --> 00:53:41,180 OK, again, happy to see you today, 779 00:53:41,180 --> 00:53:43,822 and I look forward to Wednesday.