The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

GILBERT STRANG: OK. So what I promised, and now I'm going to do it, is to talk about gradient descent and its descendants. So from the basic gradient descent formula, which we all know, let me just write that down: the new point is the old point, and we're going downwards, so with a minus sign, that's the step size, and we compute the gradient at x_k. So x_{k+1} = x_k minus s_k times grad f(x_k). We're descending in the direction of the negative gradient. And that's the basic formula, and it is studied in every book. So my main reference for some of these lectures is the book by Stephen Boyd and Lieven Vandenberghe. And I mention again, Professor Boyd is talking, in this room, next week Wednesday and Thursday, and he's speaking somewhere on Friday at 4:30, and of course, about optimization. And he's a good lecturer, yeah, very good.
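That basic update can be sketched in a few lines of code. This is a minimal illustration, not code from the course; the function names and the toy objective f(x, y) = x^2 + y^2 are my own choices.

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size, num_steps):
    """Basic gradient descent: x_{k+1} = x_k - s * grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - step_size * grad_f(x)
    return x

# Toy example: minimize f(x, y) = x^2 + y^2, whose gradient is (2x, 2y).
x_min = gradient_descent(lambda x: 2 * x, np.array([4.0, -3.0]),
                         step_size=0.1, num_steps=100)
```

With a fixed step size 0.1, each coordinate shrinks by the factor 1 - 0.2 = 0.8 per step, so the iterates converge to the minimizer at the origin.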
OK. So there's steepest descent, and I've redrawn my picture from last time. Now I'll go over there and look at that picture, but let me say what's coming. So that's pretty standard, very standard, you could say. Then this is the improvement that is widely used: adding in something called momentum, to avoid the zigzag that we're going to see over there. And there's another way to do it. There's a Russian mathematician named Nesterov. His papers are not easy to read, but they've got serious content. And one thing he did was find an alternative to momentum that also accelerated the descent. So these both produce faster descent than the ordinary one. OK. And then, you know, looking ahead, for problems of machine learning, they're so large that the gradient becomes a big computation: we have so many variables, all those weights are variables, and hundreds of thousands is not uncommon. So the gradient becomes a pretty big calculation, and we just don't have to do it all at once.
So x_k is a vector of all the weights, and our equations are matching the training data. But we don't have to use all the training data at once, and we don't. We could take a batch of training data as small as one sample, but that's sort of inefficient in the opposite direction, to do them one at a time. So we don't want to do them one at a time, but we don't want to do all million at a time. So the compromise is a mini-batch. Stochastic gradient descent does a mini-batch at a time: a mini-batch of training samples at each step. And it can choose them stochastically, meaning randomly, or more systematically, but we do a batch at a time. And that will come next week, after a marathon, of course, on Monday. OK. So let me just go back to that picture for a moment, but then the real content of today is this one, with momentum added. OK. I probably haven't got the picture perfect yet. I'm just not an artist, but I think I'm closer. So those are the level sets.
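A mini-batch step averages the per-sample gradients over a small random subset instead of all the data. Here is a minimal sketch of that idea on a made-up least-squares problem; the function name, batch size, and data are my own illustrative choices, not from the lecture.

```python
import numpy as np

def sgd_minibatch(grad_fi, n_samples, x0, step_size, batch_size, num_epochs, rng):
    """Stochastic gradient descent: each step uses the averaged gradient
    over a random mini-batch of samples instead of all n_samples."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_epochs):
        order = rng.permutation(n_samples)          # shuffle once per epoch
        for start in range(0, n_samples, batch_size):
            batch = order[start:start + batch_size]
            g = np.mean([grad_fi(x, i) for i in batch], axis=0)
            x = x - step_size * g
    return x

# Toy least squares: f_i(x) = (a_i . x - b_i)^2 / 2, so grad f_i = a_i (a_i . x - b_i).
rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true                                      # consistent system, no noise
x_hat = sgd_minibatch(lambda x, i: A[i] * (A[i] @ x - b[i]),
                      1000, np.zeros(3), 0.05, batch_size=32,
                      num_epochs=20, rng=rng)
```

Because the system is consistent, the mini-batch iterates settle down close to x_true even though each step sees only 32 of the 1000 samples.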
Those are the sets f(x) = constant. And in our model problem, f is x squared plus b y squared, so the level sets are x squared plus b y squared equal constant, with small b: b below 1, and maybe far below 1. So those are ellipses. That is the equation of an ellipse, and that's what I tried to draw. And if b is small, then the ellipses are long and thin like that. And now, what's the picture? You start with a point x0, and you descend in the steepest direction. So the steepest direction is perpendicular to the level set, right? Perpendicular to the ellipse. So you go down, down, down. You're passing through more ellipses, more ellipses, more ellipses. Eventually you're tangent to one; it seems to me it has to be tangent. I didn't read this, but it looks reasonable to me that the innermost level set, the innermost ellipse, is the one you're tangent to, and then you would start going up again. So that's the optimal point at which to end that step. And then where does the next step go? Well, you're here. You're on an ellipse. That's a level set.
You want to move in the gradient direction. That's perpendicular to the level set. So you're going down somewhere here, and you're passing again through more and more ellipses, until you're tangent to a smaller ellipse here. And you see the zigzag pattern. And that zigzag pattern is what we see, by formula, in Boyd's book, and many other places, too. The formula has those powers of the magic number. So we start at the point (b, 1) and follow this path. Then the x's are that same b times this quantity to the kth power, and here is that quantity: (b - 1)/(b + 1). So you see, for a small b, that's a negative number. So the sign is flipping in the x's, as we saw in the picture. At least that part of the picture is correct. The y's don't flip sign: y_k, I think, is ((1 - b)/(1 + b))^k, not flipping sign. So this was x_k = b((b - 1)/(b + 1))^k, and when k is 0, we got b. So that looks good. And then f_k, the value of f, also involves the same quantity: f_k is that same quantity to the kth power times f_0. So that quantity is all-important.
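Those closed-form zigzag formulas can be checked numerically. The sketch below runs exact-line-search steepest descent on the model problem f(x, y) = (1/2)(x^2 + b y^2) from the special start (b, 1); the choice b = 0.1 and the variable names are mine. For a quadratic (1/2) x^T S x, the exact line-search step length is (g.g)/(g.Sg) for gradient g.

```python
import numpy as np

# Model problem f(x, y) = 0.5*(x^2 + b*y^2), started at (b, 1).
# The claim: x_k = b*((b-1)/(b+1))^k (sign flips), y_k = ((1-b)/(1+b))^k (no flip).
b = 0.1
S = np.diag([1.0, b])
x = np.array([b, 1.0])
trajectory = [x.copy()]
for _ in range(10):
    g = S @ x                      # gradient of 0.5 * x^T S x
    s = (g @ g) / (g @ S @ g)      # exact line-search step length
    x = x - s * g
    trajectory.append(x.copy())
```

Each iterate lands exactly on the predicted point, with the x-coordinate alternating in sign and the y-coordinate shrinking steadily: the zigzag in formulas.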
And so the purpose of today's lecture is to tell you what the momentum term, what improvement, what change it brings to the basic steepest descent formula. I'm going to add on another term, which is going to give us some memory of the previous step. And when I do that, I want to track that kind of descent for the new, accelerated descent, and see what improvement the momentum term brings. And so the final result will be to tell you the improvement produced by the momentum term. Maybe while I have your attention, I'll tell you what it is now. And then will come the details, the algebra. And to me, and this is my own thought, it's a miracle that the algebra, which is straightforward, really shows you the value of eigenvectors. We explained eigenvectors in class, but here you see how to use them. That is really a good exercise. But to me it's a miracle that the expression with momentum is very much like that expression, but different, of course.
The decay term, the term that tells you how fast the decay is, is smaller. So you're taking its kth power. So let me write that down, if that's all right. I didn't plan to reveal the final result at the beginning of the lecture, but I think you want to see where we're going. So with momentum, and we have to see what that means, this term (1 - b)/(1 + b) changes to (1 - sqrt(b))/(1 + sqrt(b)). So I mentioned that before, but I don't think I wrote it down as clearly. So the miracle to me is to get such a nice expression, because you'll see the algebra works, but it involves more terms because of momentum; it involves doing a minimization over eigenvalues, and yet it comes out nicely. And then you have to see the importance of that. So let me just take the same example that I mentioned before. If b is 1/100, then this is 0.99/1.01. And I think that, for f, there's a square here: the exponent is 2k.
So I'll just keep the square there, no big change, but I'm looking now at the square; maybe squares are everywhere. OK. So (0.99/1.01)^{2k}: that ratio is close to 1. And now let's compare that with what we have with momentum. So if b is 1/100, then the square root of b is 1/10, so this is (0.9/1.1)^{2k}. And there's a tremendous difference: that one is a lot smaller than this one. Right: 9/11, compared to 99/101. This reduction factor is well below that one. So it's a good thing. It's worth doing. And now what does it involve? So I'll write down the expression for the descent with momentum. Here we go. OK. So here's one way to see it. The new x is the old x, minus a step. And now comes an extra term, which gives us a little memory. Well, the algebra is slightly nicer if I write it a little bit differently. I'll create a new quantity, z_k, multiplied by the step size: x_{k+1} = x_k - s z_k. OK.
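That gap between 99/101 and 9/11 per step is worth quantifying. A quick back-of-the-envelope calculation (my own framing of the lecture's b = 1/100 example): how many steps does each factor need to shrink the error by a million?

```python
import math

b = 1 / 100
plain = (1 - b) / (1 + b)                           # 0.99/1.01, ordinary descent
momentum = (1 - math.sqrt(b)) / (1 + math.sqrt(b))  # 0.9/1.1, with momentum

# Steps k needed so that factor**k <= 1e-6:
steps_plain = math.ceil(math.log(1e-6) / math.log(plain))
steps_momentum = math.ceil(math.log(1e-6) / math.log(momentum))
```

Ordinary descent needs roughly 691 steps to gain six digits; the momentum factor needs only about 69, a tenfold speedup that matches replacing b by its square root.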
So if I took z_k to be just the gradient, that would be steepest descent; nothing has changed. But instead, I'm going to take z_k whose leading term will be the gradient, but here comes the momentum term: I add on a multiple beta of the previous z. So z_k = grad f(x_k) + beta z_{k-1}. So z is the search direction; z is the direction you're moving. So it's different from that direction there. That direction was the gradient. This direction is the gradient corrected by a memory term, a momentum term. And one way to interpret that is to think of a heavy ball, instead of just a point. I think of a heavy ball. It, instead of bouncing back and forth as uselessly as this one, still bounces, of course, off the sides of the level sets, but it comes down the valley faster. And that's the effect of this term. So you could play with different adjustment terms, different corrections. So I'll follow through this one.
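The heavy-ball update can be sketched directly. This is a minimal illustration on the model problem, not course code; the parameter choices s = (2/(1 + sqrt(b)))^2 and beta = ((1 - sqrt(b))/(1 + sqrt(b)))^2 are the standard optimal heavy-ball values for eigenvalues in [b, 1], stated here as an assumption rather than derived.

```python
import numpy as np

def heavy_ball(grad_f, x0, s, beta, num_steps):
    """Gradient descent with momentum (heavy ball):
       x_{k+1} = x_k - s * z_k,   z_{k+1} = grad f(x_{k+1}) + beta * z_k."""
    x = np.asarray(x0, dtype=float)
    z = grad_f(x)                        # z_0 is just the gradient
    for _ in range(num_steps):
        x = x - s * z
        z = grad_f(x) + beta * z
    return x

# Model problem f = 0.5*(x^2 + b*y^2) with b = 0.01, gradient S x.
b = 0.01
S = np.diag([1.0, b])
x_mom = heavy_ball(lambda x: S @ x, np.array([b, 1.0]),
                   s=(2 / (1 + np.sqrt(b))) ** 2,
                   beta=((1 - np.sqrt(b)) / (1 + np.sqrt(b))) ** 2,
                   num_steps=200)
```

With these parameters the error contracts roughly like (0.9/1.1)^k per step, so after 200 steps the iterate is extremely close to the minimizer at the origin.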
Nesterov had another way to make a change in the formula, and there are certainly others beyond that. OK, so how do we analyze that one? Well, the real point is that, by involving the previous step, we now have a three-level method instead of a two-level method, you could say. Steepest descent involves only level k+1 and level k. The formulas now involve k+1, k, and k-1. It's just like going from a first-order differential equation to a second-order differential equation. I'm not really thinking that k is a time variable, but in the analogy, k could be a time variable. So here we had a first-order equation: if I wanted to model that, there's sort of a dx/dt coming in there, equal to minus the gradient. And these models are highly useful, and developed, as a sort of continuous model of steepest descent: a continuous motion instead of the discrete motion. OK. So that continuous model, for that guy, would be first order in time. For this one, it'll be second order in time.
And second-order equations, of course, and there'd be constant coefficients in our model problem. And the thing about a second-order equation that we all know is, there is a momentum term, a damping term, you could say, multiplying the first derivative. So that's what a second-order equation offers: the inclusion of a damping term which isn't present in the original first order. OK. So how do we analyze this? So how do you analyze second-order differential equations? You write them as a system of two first-order equations. So that's exactly what we're going to do here, in the discrete case, because we have two equations, and they're first order. Let me play with them for a moment to make them good. OK. So this will go to two first-order equations, in which the first one I'm just going to copy: x_{k+1} is x_k minus the step size s times z_k. Yeah. OK.
The previous time step is on the right; the next time step is on the left. OK, so I just copied that. Now in the second equation I'm going to increase k by 1. So in order to have it match this one, I'll write it as z_{k+1}, and I'll bring the gradient over: z_{k+1} minus grad f_{k+1} equals beta z_k. Does that work for you? In this equation, instead of looking at it at step k, I went to k+1, and I put the k+1 terms on one side. OK. So now, let's remember the model we're doing: f equals one-half x transpose S x. So the gradient of f is Sx. So what I've written there for the gradient is really S x_{k+1}. OK. How to analyze that? What happens as k travels forward: 1, 2, 3, 4, 5? We have a constant-coefficient problem at every step. The (x, z) variable is getting multiplied by a matrix. So here's (x, z) at step k+1, and over here will be (x, z) at step k.
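Those two first-order equations can be checked per eigenvector of S. Writing x_k = c_k q and z_k = d_k q for an eigenvector q with eigenvalue lam, one momentum step should satisfy the matrix identity below. This is a small numerical sanity check with made-up values of lam, s, beta, c, d.

```python
import numpy as np

# One momentum step in the eigenvector coefficients (c_k, d_k):
#   c_{k+1} = c_k - s * d_k                (x_{k+1} = x_k - s z_k)
#   d_{k+1} - lam * c_{k+1} = beta * d_k   (z_{k+1} - grad f_{k+1} = beta z_k)
# i.e.  [[1, 0], [-lam, 1]] @ (c_{k+1}, d_{k+1}) = [[1, -s], [0, beta]] @ (c_k, d_k)
lam, s, beta = 0.25, 0.3, 0.5
c, d = 2.0, -1.0                           # coefficients at step k
c_next = c - s * d
d_next = lam * c_next + beta * d

left = np.array([[1.0, 0.0], [-lam, 1.0]]) @ np.array([c_next, d_next])
right = np.array([[1.0, -s], [0.0, beta]]) @ np.array([c, d])
```

The two sides agree, confirming that the three-level momentum method is exactly a two-level system in the pair (c, d).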
And I just have to figure out what matrix is multiplying here and here. OK. And I guess here I see it. The first equation has a 1 and a minus s, looks like, in the first row, and it has a beta in the second row. And here, on the left, the first equation has a 1, 0 in that row, and then a minus lambda; I'll put in the minus lambda multiplying x_{k+1}, and then the 1 that multiplies z_{k+1}. Is that all right? Sorry, I've got two different S's, the matrix S and the step size s, and I didn't write that one in large enough, and I'd planned to erase it anyway. This one is the step size; this one is the matrix. But it's not quite fitting its place. This is the point where I'm going to use eigenvalues. I'm going to follow each eigenvector; that's the whole point. When I follow each eigenvalue, each eigenvector, I should say, I'll follow each eigenvector of S. So let's do that. So, eigenvectors of S: what are we going to call those? Lambda, probably, for the eigenvalues. So Sx equal lambda x; I think that's what's coming. Or q.
To do things right, I want to remember that S is a positive definite symmetric matrix. That's why I call it S, instead of A. So I really should call the eigenvector something else; it doesn't matter, but to be on the ball, let me call the eigenvector q and the eigenvalue lambda. OK. So now I want to follow this eigenvector. So I'm supposing that x_k is some coefficient c_k times q; I'm assuming that x is tracking this eigenvector. And I'm going to assume that z_k is some other coefficient d_k times q. Everybody, do you see? That's a vector and that's a vector, and I want scalars. I want to track just the scalars c_k and d_k. So that's really what I have here. This was a little tricky, because x here is a vector with n components. I didn't want that. I really wanted just to track an eigenvector. Once I've settled on the direction q, all vectors are in the direction of q, so we just have numbers c and d to track. OK. So I'm going to rewrite this correctly. Well, let me keep going with this little formula.
Then what will S x_k be? I needed an S x. If x_k is in the direction of the eigenvector q, with coefficient c_k, what happens when I multiply by S? q was an eigenvector, so multiplying by S gives me a...

AUDIENCE: Eigenvalue.

GILBERT STRANG: Eigenvalue, right? So it's c_k lambda q. Everything is a multiple of q, and it's only those multiples I'm looking for, the c's and the d's. And then the lambda comes into the S term. Yeah, I think that's probably all I need to do this. And then the gradient, yeah. So that's the gradient, of course: this is the gradient of f at step k. OK. So instead of this, let me just write what's happening if I'm tracking the coefficients c_{k+1} and d_{k+1}. Then what I really meant to have there is 1, 0 in the first row, and the minus S becomes a minus lambda. Is that right? Yeah. When I multiply the eigenvector by S, I'm just getting, oh, it's a lambda times a c_k. Yeah. Lambda times the c_k; that's good.
I think that that's the left-hand side of my equation. And on the right-hand side, I have a 1 here, and this was the scalar, the step size, and this was the other coefficient, the beta. So I want to choose... what's my purpose now? That gives me what happens at every step to the c and d. So I want to choose the two things that I'm free to choose: s and beta. So that's my big job: choose s and beta. OK. Now, to make this cleaner, let me just shape this by multiplying by the inverse of that left-hand matrix, and get it over here. Then you'll see everything. So c_{k+1}, d_{k+1} equals... What's the inverse of that matrix? For a moment I thought it would have a tough time finding an inverse, but that entry was a 1, wasn't it? It's triangular with 1's on the diagonal. Yeah. OK. So I'm going to multiply by the inverse of that matrix to get it over here. And what's the inverse of the matrix with 1's on the diagonal and minus lambda below? It's the same thing with plus lambda below.
So the inverse brought it over here: that inverse, with 1's on the diagonal and lambda below, times this matrix with 1, minus s in the first row and 0, beta in the second. That's what multiplies (c_k, d_k). So we have these simple, beautiful steps, which come from tracking one eigenvector; that makes the whole problem scalar. So I multiply those two matrices, and I finally get the matrix that I really have to think about. The first row, 1, 0, times that gives 1 and minus s. The second row, lambda, 1, gives a lambda there, and then minus lambda s plus beta: beta minus lambda s. So the matrix is [[1, -s], [lambda, beta - lambda s]]. That's the matrix that we see at every step. Let me call that matrix R. So I've done some algebra, more than I would always do in a lecture, but it's really... I wouldn't do it if it wasn't nice algebra. What's the conclusion? The conclusion is that with the momentum term, with this number beta available to choose, as well as s, the step size, the coefficients of the eigenvector are multiplied at every step by that matrix R. And of course, that matrix involves the eigenvalue. So we have to think about: what do we want to do now?
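That step matrix R is easy to build and examine numerically. This sketch, with arbitrary illustrative values of lam, s, beta, checks the property that governs convergence: the coefficients (c, d) decay exactly when both eigenvalues of R have magnitude below 1. A nice side fact visible here is that det(R) = beta, since the diagonal product minus the off-diagonal product is (beta - lam*s) + lam*s.

```python
import numpy as np

def R_matrix(lam, s, beta):
    """One momentum step multiplies the eigenvector coefficients (c, d) by
       R = [[1, 0], [-lam, 1]]^{-1} @ [[1, -s], [0, beta]]."""
    return np.array([[1.0, -s],
                     [lam, beta - lam * s]])

# Convergence along this eigenvector requires spectral radius of R below 1.
lam, s, beta = 1.0, 0.25, 0.2
rho = max(abs(np.linalg.eigvals(R_matrix(lam, s, beta))))
```

For these values rho comes out below 1, so both coefficients decay geometrically at rate rho per step.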
472 00:29:17,110 --> 00:29:23,780 We want to choose beta and S to make 473 00:29:23,780 --> 00:29:26,740 R as small as possible, right? 474 00:29:26,740 --> 00:29:29,350 We want to make R as small as possible. 475 00:29:29,350 --> 00:29:34,055 And we are free to choose beta and S, but R depends on lambda. 476 00:29:36,780 --> 00:29:39,360 So I'm going to make it as small as possible 477 00:29:39,360 --> 00:29:42,240 over the whole range of possible lambdas. 478 00:29:42,240 --> 00:29:45,840 So let me-- so now here we really go. 479 00:29:49,410 --> 00:29:55,740 So we have lambda between some bounds. 480 00:29:55,740 --> 00:30:04,520 These are the eigenvalues of S. And what we know-- 481 00:30:04,520 --> 00:30:09,100 what's reasonable to know-- is a lower bound. 482 00:30:09,100 --> 00:30:10,160 It's positive. 483 00:30:10,160 --> 00:30:13,250 This is a symmetric positive definite matrix. 484 00:30:13,250 --> 00:30:20,880 A lower bound and an upper bound, for example, m was b, 485 00:30:20,880 --> 00:30:25,880 and M was 1, in that 2 by 2 problem. 486 00:30:25,880 --> 00:30:28,310 And this is what we know, that the eigenvalues 487 00:30:28,310 --> 00:30:38,850 are between m and M. And the ratio of M to m-- 488 00:30:38,850 --> 00:30:42,000 well, if I write-- 489 00:30:45,380 --> 00:30:50,880 this is the key quantity. 490 00:30:50,880 --> 00:30:53,020 And what's it called? 491 00:30:53,020 --> 00:30:55,675 Lambda max divided by lambda min is the-- 492 00:30:55,675 --> 00:30:56,800 AUDIENCE: Condition number. 493 00:30:56,800 --> 00:30:57,730 GILBERT STRANG: Condition number. 494 00:30:57,730 --> 00:30:58,230 Right. 495 00:30:58,230 --> 00:31:00,910 This is sometimes written kappa-- 496 00:31:00,910 --> 00:31:10,420 Greek letter kappa-- the condition number of S. 497 00:31:10,420 --> 00:31:14,830 And when that's big, then the problem is going to be harder. 
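The job "make R as small as possible over the whole range of lambdas" can be tried numerically. In this Python sketch (my own illustration; the grid resolution and the sample bounds m = 0.01, M = 1 are arbitrary choices, not from the lecture), a crude search picks the pair (s, beta) that minimizes the worst spectral radius of R over lambda in [m, M]:

```python
import cmath

def spectral_radius(lam, s, beta):
    # largest |eigenvalue| of R = [[1, -s], [lam, beta - lam*s]]
    trace = 1.0 + beta - lam * s
    disc = cmath.sqrt(trace * trace - 4.0 * beta)  # det(R) = beta
    return max(abs((trace + disc) / 2), abs((trace - disc) / 2))

def worst_rate(s, beta, m, M, samples=21):
    """Worst factor per step over a sample of lambdas in [m, M]."""
    lams = [m + (M - m) * i / (samples - 1) for i in range(samples)]
    return max(spectral_radius(lam, s, beta) for lam in lams)

# crude grid search: sample eigenvalue bounds give kappa = 100
m, M = 0.01, 1.0
best_rate, best_s, best_beta = min(
    (worst_rate(0.2 * i, 0.05 * j, m, M), 0.2 * i, 0.05 * j)
    for i in range(1, 21) for j in range(19))
```

A fine enough grid lands near the optimal pair written down a little later in the lecture, and the best rate it finds is already well below what plain gradient descent can achieve for this condition number.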
498 00:31:14,830 --> 00:31:19,780 When that's 1, then my matrix is just a multiple 499 00:31:19,780 --> 00:31:21,260 of the identity matrix. 500 00:31:21,260 --> 00:31:22,480 And the problem is trivial. 501 00:31:22,480 --> 00:31:27,710 When capital M and small m are the same, 502 00:31:27,710 --> 00:31:31,810 then that's saying that the largest and smallest 503 00:31:31,810 --> 00:31:34,840 eigenvalues are identical, that the matrix is 504 00:31:34,840 --> 00:31:36,730 a multiple of the identity. 505 00:31:36,730 --> 00:31:39,310 That's condition number 1. 506 00:31:39,310 --> 00:31:47,980 But the bad one is when it's 1 over b, in our example, 507 00:31:47,980 --> 00:31:51,790 and that could be very large. 508 00:31:51,790 --> 00:31:52,540 OK. 509 00:31:52,540 --> 00:31:56,680 That's where we have our problem. 510 00:31:56,680 --> 00:32:05,830 Let me just insert a word about the ordinary gradient descent. 511 00:32:05,830 --> 00:32:11,470 Of course, the textbooks find an estimate for how fast that is. 512 00:32:11,470 --> 00:32:15,590 And of course, it depends on that number. 513 00:32:15,590 --> 00:32:16,090 Yeah. 514 00:32:16,090 --> 00:32:19,810 So it depends on that number, and you exactly 515 00:32:19,810 --> 00:32:23,070 saw how it depended on that number. 516 00:32:23,070 --> 00:32:25,210 Right. 517 00:32:25,210 --> 00:32:27,070 But now we have a different problem. 518 00:32:27,070 --> 00:32:29,570 And we're going to finish it. 519 00:32:29,570 --> 00:32:30,070 OK. 520 00:32:30,070 --> 00:32:31,000 So what's my job? 521 00:32:31,000 --> 00:32:38,650 I'm going to choose S and beta to keep the eigenvalues of R small. 522 00:32:38,650 --> 00:32:42,450 So let's give the eigenvalues of R a name. 523 00:32:42,450 --> 00:32:50,490 So R-- let's say R has eigenvalues e1-- that 524 00:32:50,490 --> 00:32:56,840 depends on the lambda and the S and the beta-- and e2. 
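For a symmetric positive definite 2 by 2 matrix the condition number kappa = lambda_max / lambda_min comes out in closed form. A small Python check (my illustration; the value b = 0.01 is just a sample) using the lecture's model matrix S = [[b, 0], [0, 1]]:

```python
import math

def sym2x2_eigs(a, b, d):
    """Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, d]]."""
    mid = (a + d) / 2.0
    r = math.hypot((a - d) / 2.0, b)
    return mid - r, mid + r          # (lambda_min, lambda_max)

def condition_number(a, b, d):
    lo, hi = sym2x2_eigs(a, b, d)
    return hi / lo

# model problem from the lecture: S = [[b, 0], [0, 1]] with b = 0.01,
# so kappa = 1/b = 100 -- large kappa means a hard, elongated bowl
kappa = condition_number(0.01, 0.0, 1.0)
```

When the matrix is a multiple of the identity, both eigenvalues agree and kappa is exactly 1, the trivial case from the lecture.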
525 00:33:00,400 --> 00:33:03,700 So those are the eigenvalues of R-- 526 00:33:03,700 --> 00:33:07,210 just giving a letter to them. 527 00:33:07,210 --> 00:33:09,430 So what's our job? 528 00:33:09,430 --> 00:33:14,680 We want to choose S and beta to make those eigenvalues as 529 00:33:14,680 --> 00:33:16,900 small as possible. 530 00:33:16,900 --> 00:33:17,680 Right? 531 00:33:17,680 --> 00:33:24,770 Small eigenvalues-- if R has small eigenvalues, its powers-- 532 00:33:24,770 --> 00:33:29,930 every step multiplies by R. So the convergence rate 533 00:33:29,930 --> 00:33:32,450 with momentum is-- 534 00:33:32,450 --> 00:33:36,410 depends on the powers of R getting small fast. 535 00:33:36,410 --> 00:33:39,350 It depends on the eigenvalues being small. 536 00:33:39,350 --> 00:33:48,500 We want to minimize the largest eigenvalue. 537 00:33:48,500 --> 00:33:56,000 So I'll say the maximum of e1 and e2-- 538 00:33:56,000 --> 00:33:57,650 that's our job. 539 00:33:57,650 --> 00:34:01,430 Minimize-- we want to choose S and beta to minimize 540 00:34:01,430 --> 00:34:03,550 the largest eigenvalue. 541 00:34:03,550 --> 00:34:05,560 Because if there's one small eigenvalue, 542 00:34:05,560 --> 00:34:08,679 but the other is big, then the other one is going to kill us. 543 00:34:08,679 --> 00:34:12,670 So we have to get both eigenvalues down. 544 00:34:12,670 --> 00:34:16,239 And of course, those depend on lambda. 545 00:34:16,239 --> 00:34:18,050 E1 depends on lambda. 546 00:34:18,050 --> 00:34:20,620 So we have a little algebra problem. 547 00:34:20,620 --> 00:34:23,679 And this is what I described as a miracle-- 548 00:34:23,679 --> 00:34:26,770 the fact that this little algebra problem-- 549 00:34:26,770 --> 00:34:30,969 the eigenvalues of that matrix, e1 and e2, which 550 00:34:30,969 --> 00:34:35,080 depend on lambda in some way. 
551 00:34:35,080 --> 00:34:39,159 And we want to make both e1 and e2 small-- 552 00:34:39,159 --> 00:34:42,040 the maximum of those-- of them. 553 00:34:42,040 --> 00:34:47,050 And we have to do it for all the eigenvalues lambda, 554 00:34:47,050 --> 00:34:48,639 because we have to-- 555 00:34:48,639 --> 00:34:54,370 we're now thinking-- we've been tracking each eigenvector. 556 00:34:54,370 --> 00:34:56,020 So that gave us 1-- 557 00:34:56,020 --> 00:34:59,930 so this is for all possible lambda. 558 00:34:59,930 --> 00:35:03,350 So we have to decide, what do I mean by all possible lambda? 559 00:35:03,350 --> 00:35:12,910 And I mean all lambda that are between some m and M. 560 00:35:12,910 --> 00:35:17,200 There is a beautiful problem. 561 00:35:17,200 --> 00:35:18,790 You have a 2 by 2 matrix. 562 00:35:18,790 --> 00:35:22,960 You can find its eigenvalues. 563 00:35:22,960 --> 00:35:24,610 They depend on lambda. 564 00:35:24,610 --> 00:35:27,815 And what we-- all we know about lambda is it's between m 565 00:35:27,815 --> 00:35:32,920 and cap M. And also, they also depend on S and beta-- 566 00:35:32,920 --> 00:35:35,380 the two parameters we can choose. 567 00:35:35,380 --> 00:35:37,780 And we want to choose those parameters, 568 00:35:37,780 --> 00:35:43,060 so that for all the possible eigenvalues, 569 00:35:43,060 --> 00:35:45,910 the larger of the two eigenvalues 570 00:35:45,910 --> 00:35:47,490 will be as small as possible. 571 00:35:47,490 --> 00:35:51,040 That's-- it's a little bit of algebra, 572 00:35:51,040 --> 00:35:54,730 but do you see that that's the tricky-- 573 00:35:54,730 --> 00:35:59,680 that-- I shouldn't say tricky, because it comes out-- 574 00:35:59,680 --> 00:36:03,760 this is the one that is a miracle in the simplicity 575 00:36:03,760 --> 00:36:05,270 of the solution. 576 00:36:05,270 --> 00:36:05,930 OK. 
577 00:36:05,930 --> 00:36:07,150 And I'm going to-- 578 00:36:07,150 --> 00:36:10,120 in fact, maybe I'll move over here to write the answer. 579 00:36:13,930 --> 00:36:16,570 OK. 580 00:36:16,570 --> 00:36:19,690 And I just want to say that miracles 581 00:36:19,690 --> 00:36:22,440 don't happen so often in math. 582 00:36:22,440 --> 00:36:26,470 There is-- all of mathematics-- the whole point of math 583 00:36:26,470 --> 00:36:28,810 is to explain miracles. 584 00:36:28,810 --> 00:36:33,850 So there is something to explain here, 585 00:36:33,850 --> 00:36:37,390 and I don't have my finger on it yet. 586 00:36:37,390 --> 00:36:41,230 Because-- anyway, it happens. 587 00:36:41,230 --> 00:36:45,550 So let me tell you what the right S, and the right beta, 588 00:36:45,550 --> 00:36:53,500 and the resulting minimum eigenvalue are. 589 00:36:53,500 --> 00:37:00,300 So again, they depend on little m and big M. 590 00:37:00,300 --> 00:37:05,230 That's a very nice feature, which we expect. 591 00:37:05,230 --> 00:37:07,680 And they depend on the ratio. 592 00:37:07,680 --> 00:37:08,190 OK. 593 00:37:08,190 --> 00:37:09,540 So that ratio-- all right. 594 00:37:09,540 --> 00:37:11,340 Let's see it. 595 00:37:11,340 --> 00:37:12,300 OK. 596 00:37:12,300 --> 00:37:13,275 So the best S-- 597 00:37:18,750 --> 00:37:29,470 the S optimal has the formula 2 over the square root of lambda max 598 00:37:29,470 --> 00:37:37,290 plus the square root of lambda min-- that's the square root of M plus the square root of m-- all squared. 599 00:37:37,290 --> 00:37:38,730 Amazing. OK. 600 00:37:38,730 --> 00:37:49,020 And beta optimal turns out to be the square root of M 601 00:37:49,020 --> 00:37:53,760 minus the square root of little m, over the square root of M 602 00:37:53,760 --> 00:37:57,592 plus the square root of little m, all squared. 603 00:37:57,592 --> 00:37:59,550 And of course, we know what these numbers are-- 604 00:37:59,550 --> 00:38:02,430 1 and b, in our model problem. 
605 00:38:02,430 --> 00:38:06,720 That's where I'm going to get this square root of-- 606 00:38:06,720 --> 00:38:09,660 this is 1 minus the square root-- oh sorry, b. 607 00:38:09,660 --> 00:38:13,050 This is 1 minus the square root of b. 608 00:38:13,050 --> 00:38:17,520 In fact, for our example-- 609 00:38:17,520 --> 00:38:19,670 well, let me just write what they would be. 610 00:38:19,670 --> 00:38:25,080 2 over 1 plus square root of b, all squared, 611 00:38:25,080 --> 00:38:29,700 and 1 minus square root of b over 1 plus square root of b, squared-- 612 00:38:29,700 --> 00:38:33,530 you see where this is-- 613 00:38:33,530 --> 00:38:36,510 1 minus square root of b is beginning to appear in that. 614 00:38:36,510 --> 00:38:38,910 It appears in this solution to this problem. 615 00:38:38,910 --> 00:38:41,775 And then I have to tell you what the-- 616 00:38:45,090 --> 00:38:49,700 how small do these optimal choices 617 00:38:49,700 --> 00:38:52,520 make the eigenvalues of R, right? 618 00:38:52,520 --> 00:38:57,600 This is what we're really paying attention to, because 619 00:38:57,600 --> 00:38:59,210 if the eigenvalues-- 620 00:38:59,210 --> 00:39:02,600 that matrix tells us what happens at every step. 621 00:39:02,600 --> 00:39:06,860 And its eigenvalues have to be small to get fast convergence. 622 00:39:06,860 --> 00:39:08,570 So how small are they? 623 00:39:08,570 --> 00:39:09,830 Well they involve this-- 624 00:39:13,480 --> 00:39:13,980 yeah. 625 00:39:13,980 --> 00:39:17,150 So it's the number that I've seen. 626 00:39:17,150 --> 00:39:21,630 So in this case, the e's-- 627 00:39:21,630 --> 00:39:29,300 the eigenvalues of R-- 628 00:39:29,300 --> 00:39:32,090 that's the iterating matrix-- 629 00:39:32,090 --> 00:39:36,560 are below-- now you're going to see the 1 minus square root 630 00:39:36,560 --> 00:39:41,060 of b over 1 plus square root of b-- 631 00:39:41,060 --> 00:39:43,220 the square root of beta, in fact. 632 00:39:43,220 --> 00:39:44,480 Let me just see. 
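The formulas on the board can be checked numerically. This Python sketch (mine; the lecture only states the formulas) computes the optimal s and beta for eigenvalue bounds [m, M] and confirms that every lambda in the range then sees the same factor per step, (sqrt(M) - sqrt(m)) / (sqrt(M) + sqrt(m)) -- which is 1 minus root b over 1 plus root b in the model problem:

```python
import cmath
import math

def optimal_momentum(m, M):
    """The board's optimal choices for eigenvalues in [m, M]."""
    s = (2.0 / (math.sqrt(M) + math.sqrt(m))) ** 2
    beta = ((math.sqrt(M) - math.sqrt(m)) /
            (math.sqrt(M) + math.sqrt(m))) ** 2
    return s, beta

def spectral_radius(lam, s, beta):
    # largest |eigenvalue| of R = [[1, -s], [lam, beta - lam*s]]
    trace = 1.0 + beta - lam * s
    disc = cmath.sqrt(trace * trace - 4.0 * beta)
    return max(abs((trace + disc) / 2), abs((trace - disc) / 2))

# model problem: m = b = 0.01, M = 1
s_opt, beta_opt = optimal_momentum(0.01, 1.0)
rate = (1.0 - math.sqrt(0.01)) / (1.0 + math.sqrt(0.01))  # = 0.9 / 1.1
```

Compare with plain gradient descent, whose best factor for this problem is (1 - b)/(1 + b), about 0.98: momentum brings it down to about 0.82, the square root of beta.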
633 00:39:44,480 --> 00:39:45,590 Yeah. 634 00:39:45,590 --> 00:39:50,450 It happens to come out that number again. 635 00:39:50,450 --> 00:39:53,470 So that's the conclusion. 636 00:39:53,470 --> 00:39:57,790 That with the right choice of S and beta, 637 00:39:57,790 --> 00:40:03,490 by adding this look back term-- look back one step-- 638 00:40:03,490 --> 00:40:05,920 you get this improvement. 639 00:40:05,920 --> 00:40:13,490 And it happens, and you see it in practice, of course. 640 00:40:13,490 --> 00:40:15,650 You'll see it exactly. 641 00:40:15,650 --> 00:40:26,310 And so momentum does the job. 642 00:40:26,310 --> 00:40:30,290 Now I'm going to mention what the Nesterov-- 643 00:40:30,290 --> 00:40:33,600 Nesterov had a slightly different way to do it, 644 00:40:33,600 --> 00:40:37,170 and I'll tell you what that is. 645 00:40:37,170 --> 00:40:40,320 But it's the same idea-- get a second thing. 646 00:40:40,320 --> 00:40:42,540 So let's see if I can find that. 647 00:40:42,540 --> 00:40:44,300 Yeah, Nesterov. 648 00:40:44,300 --> 00:40:44,800 OK. 649 00:40:51,250 --> 00:40:53,040 Here we go. 650 00:40:53,040 --> 00:40:55,770 So let me bring Nesterov's name down. 651 00:41:01,740 --> 00:41:07,320 So that's basically what I wanted to say about number 1. 652 00:41:07,320 --> 00:41:09,300 And when you see Nesterov, you'll 653 00:41:09,300 --> 00:41:14,910 see that it's a similar idea of involving the previous time 654 00:41:14,910 --> 00:41:16,140 value. 655 00:41:16,140 --> 00:41:17,550 OK. 656 00:41:17,550 --> 00:41:24,720 There are very popular methods in use now 657 00:41:24,720 --> 00:41:28,500 for machine learning that involve-- 658 00:41:28,500 --> 00:41:29,940 by a simple formula-- 659 00:41:29,940 --> 00:41:34,020 all the previous values, by sort of a-- 660 00:41:34,020 --> 00:41:36,970 just by an addition of a bunch of terms. 
661 00:41:36,970 --> 00:41:44,160 So it's really-- so it goes under the names 662 00:41:44,160 --> 00:41:50,970 adagrad, or others. 663 00:41:50,970 --> 00:41:54,510 Those of you who already know about machine learning 664 00:41:54,510 --> 00:41:55,980 will know what I'm speaking about. 665 00:41:55,980 --> 00:41:58,020 And I'll say more about those. 666 00:41:58,020 --> 00:41:59,910 Yeah. 667 00:41:59,910 --> 00:42:02,790 But it doesn't involve a separate coefficient 668 00:42:02,790 --> 00:42:05,490 for each previous value, or that would 669 00:42:05,490 --> 00:42:08,880 be a momentous amount of work. 670 00:42:08,880 --> 00:42:12,120 So now I just want to tell you what Nesterov is, and then 671 00:42:12,120 --> 00:42:13,240 we're good. 672 00:42:13,240 --> 00:42:13,740 OK. 673 00:42:13,740 --> 00:42:14,880 Nesterov's idea. 674 00:42:18,366 --> 00:42:20,820 Let me bring that down. 675 00:42:20,820 --> 00:42:22,660 Shoot this up. 676 00:42:22,660 --> 00:42:23,972 Bring down Nesterov. 677 00:42:31,060 --> 00:42:35,170 Because he had an idea that you might not have thought of. 678 00:42:35,170 --> 00:42:38,790 Somehow the momentum idea was pretty natural-- 679 00:42:38,790 --> 00:42:41,770 to use that previous value. 680 00:42:41,770 --> 00:42:43,780 And actually, I would like to know 681 00:42:43,780 --> 00:42:46,810 what happens if you use two previous values, or three 682 00:42:46,810 --> 00:42:47,890 previous values. 683 00:42:47,890 --> 00:42:57,310 Can you then get improvements on this convergence rate 684 00:42:57,310 --> 00:43:00,550 by going back two steps or three steps? 685 00:43:00,550 --> 00:43:05,170 If I'd use the analogy with ordinary differential 686 00:43:05,170 --> 00:43:07,870 equations, maybe you know. 687 00:43:07,870 --> 00:43:12,720 So there are backward difference formulas. 
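The lecture only names these methods; as a point of reference, here is a minimal sketch of the ADAGRAD-style idea (the details are standard, but they come from outside this lecture): each coordinate gets its own effective step size, scaled by a running sum of its squared gradients, so there is one accumulator per weight rather than a separate coefficient for every previous value:

```python
import math

def adagrad_step(x, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad-style update (a sketch, not the lecture's method).
    Each coordinate is divided by the square root of the running
    sum of its squared gradients; lr and eps are illustrative."""
    new_accum = [a + g * g for a, g in zip(accum, grad)]
    new_x = [xi - lr * g / (math.sqrt(a) + eps)
             for xi, g, a in zip(x, grad, new_accum)]
    return new_x, new_accum
```

On a toy problem like f(x) = x squared, repeating this step shrinks x toward the minimum, with the effective step size shrinking automatically as gradients accumulate.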
688 00:43:12,720 --> 00:43:14,800 Do you know about those for-- 689 00:43:14,800 --> 00:43:18,380 those would be in MATLAB software, 690 00:43:18,380 --> 00:43:20,440 and all other software. 691 00:43:20,440 --> 00:43:22,750 Backward differences-- so maybe you 692 00:43:22,750 --> 00:43:27,040 go back two steps or four steps. 693 00:43:27,040 --> 00:43:29,800 If you're doing planetary calculations, 694 00:43:29,800 --> 00:43:33,460 if you're an astronomer, you go back maybe seven or eight steps 695 00:43:33,460 --> 00:43:35,950 to get super high accuracy. 696 00:43:35,950 --> 00:43:40,050 So that doesn't seem to have happened yet, 697 00:43:40,050 --> 00:43:42,110 but it should happen here-- 698 00:43:42,110 --> 00:43:43,150 to go back more. 699 00:43:43,150 --> 00:43:48,010 But Nesterov has this different way to go back. 700 00:43:48,010 --> 00:43:52,870 So his formula is XK plus 1-- the new X-- 701 00:43:52,870 --> 00:43:58,360 is YK-- so he's introducing something a little different-- 702 00:43:58,360 --> 00:44:03,790 minus S gradient f at YK. 703 00:44:09,100 --> 00:44:10,930 I'm a little surprised about that YK, 704 00:44:10,930 --> 00:44:13,330 but this is the point, here-- 705 00:44:13,330 --> 00:44:15,940 that the gradient is being evaluated 706 00:44:15,940 --> 00:44:18,010 at some different point. 707 00:44:18,010 --> 00:44:22,750 And then he has to give a formula for that to track those 708 00:44:22,750 --> 00:44:23,950 Y's. 709 00:44:23,950 --> 00:44:27,760 So the Y's are like the X's, but they 710 00:44:27,760 --> 00:44:33,230 are shifted a little bit by some term-- and beta would be fine. 711 00:44:33,230 --> 00:44:35,830 Oh no. 712 00:44:35,830 --> 00:44:39,830 Yeah-- beta-- have we got Nesterov here? 713 00:44:39,830 --> 00:44:40,330 Yes. 714 00:44:40,330 --> 00:44:45,150 Nesterov has a factor gamma in. 715 00:44:45,150 --> 00:44:45,650 Yeah. 716 00:44:45,650 --> 00:44:47,240 So all right. 
717 00:44:47,240 --> 00:44:50,170 Let me try to get this right. 718 00:44:50,170 --> 00:44:52,870 OK. 719 00:44:52,870 --> 00:44:53,540 All right. 720 00:44:53,540 --> 00:44:56,890 On a previous line, I've written the whole Nesterov thing. 721 00:44:56,890 --> 00:44:59,240 Here, let's see a Nesterov completely. 722 00:44:59,240 --> 00:45:00,230 And then it'll break-- 723 00:45:00,230 --> 00:45:04,010 then this is the step that breaks it into two first order steps. 724 00:45:04,010 --> 00:45:06,780 But you'll see the main formula here. 725 00:45:06,780 --> 00:45:08,230 XK plus 1 is XK. 726 00:45:10,750 --> 00:45:19,600 And then a beta times XK minus XK minus 1. 727 00:45:19,600 --> 00:45:22,570 So that's a momentum term. 728 00:45:22,570 --> 00:45:26,560 And then a typical gradient. 729 00:45:26,560 --> 00:45:29,950 But now here is Nesterov speaking up. 730 00:45:29,950 --> 00:45:35,710 Nesterov evaluates the gradient not at XK, not at XK minus 1. 731 00:45:35,710 --> 00:45:38,650 But at his own Nesterov point. 732 00:45:38,650 --> 00:45:41,950 So this is Nesterov's favorite point. 733 00:45:41,950 --> 00:45:46,210 XK plus gamma times XK minus XK minus 1. 734 00:45:46,210 --> 00:45:54,950 Some point, part way along that step. 735 00:45:54,950 --> 00:46:01,190 So this point-- because gamma is going to be some non-integer-- 736 00:46:01,190 --> 00:46:04,900 this evaluation point for the gradient of f 737 00:46:04,900 --> 00:46:07,570 is a little unexpected and weird, 738 00:46:07,570 --> 00:46:09,970 because it's not a mesh point. 739 00:46:09,970 --> 00:46:13,470 It's somewhere between. 740 00:46:13,470 --> 00:46:15,190 OK. 741 00:46:15,190 --> 00:46:17,170 Yeah. 742 00:46:17,170 --> 00:46:29,410 And then that-- so that involves XK plus 1, XK, and XK minus 1. 743 00:46:29,410 --> 00:46:33,260 So it's a second order-- 744 00:46:33,260 --> 00:46:35,580 there's a second order method here. 
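The main Nesterov formula, in one line: XK plus 1 = XK + beta (XK - XK minus 1) - s grad f(XK + gamma (XK - XK minus 1)). A Python sketch of that single step (mine; the particular numbers for s, beta, gamma in the usage below are only illustrative, not Nesterov's optimal choices):

```python
def nesterov_step(x_k, x_prev, grad_f, s, beta, gamma):
    """One step of the board's Nesterov form:
    x_{k+1} = x_k + beta*(x_k - x_{k-1})
              - s * grad_f( x_k + gamma*(x_k - x_{k-1}) ),
    for a scalar x.  s is the step size, beta the momentum
    coefficient, and gamma places the gradient evaluation point."""
    y = x_k + gamma * (x_k - x_prev)   # Nesterov's evaluation point
    return x_k + beta * (x_k - x_prev) - s * grad_f(y)
```

For a quick try, minimizing f(x) = x squared over 2 (so grad f(y) = y) with sample values s = 0.5, beta = gamma = 0.5 drives x to zero rapidly.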
745 00:46:35,580 --> 00:46:39,350 We're going to-- to analyze it, we're going to go through this 746 00:46:39,350 --> 00:46:45,260 same process of writing it as two first order steps-- 747 00:46:45,260 --> 00:46:48,590 two first-- two single step-- 748 00:46:48,590 --> 00:46:58,460 two one-step recursions from K to K plus 1, coupled together. 749 00:46:58,460 --> 00:47:03,230 Follow that same thing through, and then the result 750 00:47:03,230 --> 00:47:08,280 is, the same factor appears for him. 751 00:47:08,280 --> 00:47:11,810 The same factor-- this is also-- 752 00:47:11,810 --> 00:47:24,140 so the point is, this is for momentum and Nesterov, 753 00:47:24,140 --> 00:47:33,530 with some constant-- different by some constant. 754 00:47:33,530 --> 00:47:41,840 But the key quantity is that one, and it appears in both. 755 00:47:41,840 --> 00:47:49,550 So I don't propose, of course, to repeat these steps 756 00:47:49,550 --> 00:47:50,660 for Nesterov. 757 00:47:50,660 --> 00:47:54,770 But you see what you could do. 758 00:47:54,770 --> 00:47:59,720 You see that it involves K minus 1, K, and K plus 1. 759 00:47:59,720 --> 00:48:01,550 You write it as-- 760 00:48:01,550 --> 00:48:03,890 you follow an eigenvector. 761 00:48:03,890 --> 00:48:08,900 You write it as a coupled system of-- that's a one step. 762 00:48:08,900 --> 00:48:10,570 That has a matrix. 763 00:48:10,570 --> 00:48:12,320 You find the matrix. 764 00:48:12,320 --> 00:48:14,840 You find the eigenvalues of the matrix. 765 00:48:14,840 --> 00:48:17,210 You make those eigenvalues as small as possible. 766 00:48:17,210 --> 00:48:22,320 And you have optimized the coefficients in Nesterov. 767 00:48:22,320 --> 00:48:22,820 OK. 768 00:48:22,820 --> 00:48:27,800 That's sort of a lot of algebra that's 769 00:48:27,800 --> 00:48:32,840 at the heart of accelerated gradient descent. 
770 00:48:32,840 --> 00:48:37,670 And of course, it's worth doing because it's 771 00:48:37,670 --> 00:48:42,590 a tremendous saving in the convergence rate. 772 00:48:42,590 --> 00:48:44,630 OK. 773 00:48:44,630 --> 00:48:49,640 Anybody running in the marathon or just watching? 774 00:48:49,640 --> 00:48:53,480 It's possible to run, you know. 775 00:48:53,480 --> 00:48:57,350 Anyway, I'll see you after the marathon, next Wednesday. 776 00:48:57,350 --> 00:49:01,300 And Professor Boyd will also see you.